RL Bugs and Suggested Improvements


Critical bugs


146. (?) Attributes absent from the attribute definition file are ignored.

65. Excel misunderstands the rules file.
Will: In its infinite wisdom, Excel believes that when you have "=>" in a cell, you're editing a function. When that function doesn't make sense, Excel shows "#NAME".
Phil: A file containing  A <= 100 -> C = low  works for me!
140. (?) Rendering results consumes too much memory and takes ages.
This is easy to check by putting a print statement before rendering starts and after it finishes. We can also try using the Java profiler (see the Java documentation). Possible solution:
  1. Produce results on demand, rather than storing them in memory.
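The print-statement check described above could be sketched as follows. This is only a minimal illustration, assuming Java 8 or later; RenderProbe and its methods are hypothetical names, not part of RL. It prints a line before and after a task and reports elapsed time and the change in used heap. The heap figure is rough, because garbage collection can run at any moment.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not part of RL): times a task and reports heap growth. */
final class RenderProbe {

    /** Outcome of one measured run. */
    static final class Measurement<T> {
        final T result;
        final long elapsedMillis;
        final long heapDeltaBytes;

        Measurement(T result, long elapsedMillis, long heapDeltaBytes) {
            this.result = result;
            this.elapsedMillis = elapsedMillis;
            this.heapDeltaBytes = heapDeltaBytes;
        }
    }

    /** Approximate heap currently in use by this JVM. */
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Prints a line before and after the task, with elapsed time and heap change. */
    static <T> Measurement<T> measure(String label, Supplier<T> task) {
        long heapBefore = usedHeap();
        long start = System.nanoTime();
        System.out.println(label + ": started");
        T result = task.get();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        long heapDelta = usedHeap() - heapBefore;
        System.out.println(label + ": finished in " + elapsedMillis
                + " ms, heap change " + heapDelta + " bytes");
        return new Measurement<>(result, elapsedMillis, heapDelta);
    }
}
```

The rendering call would be wrapped as `RenderProbe.measure("render", () -> ...)`, where the lambda body stands for whatever method actually renders the results.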


Needed functionality


42. (?) Extend use of prior rules during learning.
Prior rules are currently handled in the simplest possible way--see notes in SAL.processPriorRules().
(Will 2002/03/02)
Implement the following 9 options (one choice from each of the two groups, 3 x 3 combinations):
EFFECT OF PRIOR RULES ON LEARNING (CHOOSE ONE)
  1. No effect.
  2. Data they cover need not be covered by new rules.
  3. They seed the learning and may be superseded by better rules.
EFFECT ON PRIOR RULE STATISTICS (CHOOSE ONE)
  1. No effect.
  2. Update statistics, with current data as additional data.
  3. Re-evaluate statistics based only on the current data.
What should happen if a prior rule has an attribute that is not defined in the data or is defined differently (like different values or discretizations)? Discretize the data like the rule? Omit the rule or omit the attribute?
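The two "choose one" groups above could be modelled as two independent enums, which gives the 3 x 3 = 9 combinations. The following is only a sketch: the enum and class names are hypothetical and do not come from SAL.processPriorRules().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical: effect of prior rules on learning (choose one). */
enum LearningEffect {
    NONE,                  // 1. No effect.
    EXCLUDE_COVERED_DATA,  // 2. Data they cover need not be covered by new rules.
    SEED_SEARCH            // 3. They seed the learning and may be superseded.
}

/** Hypothetical: effect on prior rule statistics (choose one). */
enum StatisticsEffect {
    NONE,                       // 1. No effect.
    UPDATE_WITH_CURRENT_DATA,   // 2. Current data counts as additional data.
    REEVALUATE_ON_CURRENT_DATA  // 3. Statistics based only on the current data.
}

/** One of the nine possible settings. */
final class PriorRulePolicy {
    final LearningEffect learning;
    final StatisticsEffect statistics;

    PriorRulePolicy(LearningEffect learning, StatisticsEffect statistics) {
        this.learning = learning;
        this.statistics = statistics;
    }

    /** Enumerates every combination of the two choices. */
    static List<PriorRulePolicy> allCombinations() {
        List<PriorRulePolicy> all = new ArrayList<>();
        for (LearningEffect l : LearningEffect.values()) {
            for (StatisticsEffect s : StatisticsEffect.values()) {
                all.add(new PriorRulePolicy(l, s));
            }
        }
        return all;
    }

    @Override
    public String toString() {
        return learning + " / " + statistics;
    }
}
```

Keeping the two choices separate means the UI only needs two three-way selectors rather than one nine-way list.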

49. (?) Improve rule application.
(Will 2002/03/02)
144. (?) Invent model building (Phil).
Thagard has done lots of work.
103. Separate functionality from UI code.
  1. Done by Will - Separate IO from all else.
  2. Put learning-related code into a new class RL.
  3. Create Class RlRunner, and move the main() method from MainFrame there.
  4. Modify main() to parse the command line arguments and instantiate the appropriate UI class (initially MainFrame).
  5. Create Listeners monitoring the state of objects in the classes that contain functionality.
  6. Create interfaces implemented by RL, RlRunner and the UI classes.
Other interface classes can be added to allow RL to be used as a pre-compiled library and as a web resource.
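Steps 2 to 6 above might fit together roughly as in the sketch below. RL and RlRunner are the class names proposed in the steps; RlListener, RlUserInterface, ConsoleUi, the attach() method and the "console"/"gui" argument values are assumptions made only for illustration. The real MainFrame is left out so that the sketch stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical listener notified when the learner's state changes (step 5). */
interface RlListener {
    void stateChanged(String newState);
}

/** Learning-related code with no UI dependency (step 2). */
final class RL {
    private final List<RlListener> listeners = new ArrayList<>();

    void addListener(RlListener listener) {
        listeners.add(listener);
    }

    void findRules() {
        fire("learning started");
        // ... the actual rule learning would run here ...
        fire("learning finished");
    }

    private void fire(String state) {
        for (RlListener l : listeners) {
            l.stateChanged(state);
        }
    }
}

/** Hypothetical interface shared by all UI classes (step 6). */
interface RlUserInterface extends RlListener {
    void attach(RL rl);
}

/** Stand-in UI for this sketch; MainFrame would implement the same interface. */
final class ConsoleUi implements RlUserInterface {
    final List<String> seen = new ArrayList<>();

    @Override
    public void attach(RL rl) {
        rl.addListener(this);
    }

    @Override
    public void stateChanged(String newState) {
        seen.add(newState);
        System.out.println("[RL] " + newState);
    }
}

/** Entry point: parses the arguments and picks the UI (steps 3 and 4). */
final class RlRunner {
    static RlUserInterface createUi(String[] args) {
        // The real default would be MainFrame; this sketch defaults to the console.
        String kind = args.length > 0 ? args[0] : "console";
        if (kind.equals("console")) {
            return new ConsoleUi();
        }
        throw new UnsupportedOperationException("UI '" + kind + "' is not part of this sketch");
    }

    public static void main(String[] args) {
        RL rl = new RL();
        RlUserInterface ui = createUi(args);
        ui.attach(rl);
        rl.findRules();
    }
}
```

Because RL only knows about RlListener, a command line UI (bug 104) or a web front end could be added without touching the learning code.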
104. Create a separate class, implementing a command line interface.
Only try this after Bug 103 is fixed.
105. Ability to store current state (value hierarchy, search parameters).
Will has done a lot of work here, and is charging on.
106. Ability to import previous states.
107. Ability to import rules.
Will almost has this done!
127. (?) How should cross-validation be handled?
Currently, only rules from the last run are saved.
Bruce wants to see: Eric wants to see the intersection of the rule sets but wants all rule sets to be saved.
Will has exotic ideas.
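One possible reading of Eric's request (keep every fold's rule set, and also show the rules common to all of them) is sketched below. Rules are represented as plain strings purely for illustration; FoldRuleSets is a hypothetical name, and RL's real rule class would need suitable equals() and hashCode() methods for this to work.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: combines the rule sets learned in each fold. */
final class FoldRuleSets {

    /** Returns the rules that appear in every fold; the inputs are left unchanged. */
    static Set<String> intersection(List<Set<String>> ruleSetsPerFold) {
        if (ruleSetsPerFold.isEmpty()) {
            return new LinkedHashSet<>();
        }
        // Start from a copy of the first fold's rules and drop anything
        // that is missing from a later fold.
        Set<String> common = new LinkedHashSet<>(ruleSetsPerFold.get(0));
        for (Set<String> rules : ruleSetsPerFold.subList(1, ruleSetsPerFold.size())) {
            common.retainAll(rules);
        }
        return common;
    }
}
```

Since the per-fold sets are copied rather than modified, all of them remain available for saving alongside the intersection.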
112. After importing the data, make a frequency table and plot distributions of:
117. Separate the errors from the NoPredictions in the ***Predictions*** table.
122. Rule chaining
123. Be able to specify min and max for continuous attributes without editing the value hierarchy
124. Be able to input constraints from a file
126. (?) Discretize continuous values better.
Equal interval and equal frequency are already implemented by Will.
MinSplits (Wang & Goh 2002) and ConMerge (Wang & Liu 1998) are smart algorithms that discretize hierarchically. See also "Discretization" in Eric's page of RL-related papers.
Bruce: "The reason I like to be able to control discretization for each attribute separately is that we sometimes know something about the data (!) E.g., for medical records, pt. temperature has ranges that are dictated by convention, not by distributions -- and an investigator with unusual data may wish to introduce new ranges. Among meningitis patients, a fever of 101 is "normal", for example. It's always a good idea to have a default method that makes good sense for all attributes unless otherwise specified. For that, I think looking at clusters of values to determine intervals would be reasonable. The UI would need to ask something like (A) How many intervals do you want to use? AND What is the minimum number of cases in any one interval? or (B) What degree of separation do you want between intervals (% or number)? (A) could reasonably default to 5 [very high, high, middle, low, v.low] -- although I'd want to be able to override this for any attribute. Finding the N clusters with greatest separation is standard stuff. An assumption of Euclidean distance is reasonable -- maybe there's a reason to introduce other options but I doubt if it makes much difference. (B) would be harder to explain, and to implement. It's probably overkill, at least until someone wants to do a thorough investigation of interval setting by hierarchical clustering. Whether or not intervals are put into a hierarchy seems to be a separate question. I think it is a very nice feature to have, which increases RL's power, and the existing code seems to be compatible with (A)."
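For reference, the two methods already mentioned (equal interval and equal frequency) can be sketched as below. This is not Will's implementation; SimpleDiscretizer and its method names are hypothetical, and the sketch covers neither the cluster-based default that Bruce suggests nor MinSplits or ConMerge. Each method returns k-1 cut points for k intervals.

```java
import java.util.Arrays;

/** Hypothetical sketch of two basic discretization methods. */
final class SimpleDiscretizer {

    /** Equal interval: k bins of equal width between the minimum and maximum. */
    static double[] equalIntervalCuts(double[] values, int k) {
        if (k < 2 || values.length == 0) {
            return new double[0];
        }
        double min = Arrays.stream(values).min().getAsDouble();
        double max = Arrays.stream(values).max().getAsDouble();
        double width = (max - min) / k;
        double[] cuts = new double[k - 1];
        for (int i = 1; i < k; i++) {
            cuts[i - 1] = min + i * width;
        }
        return cuts;
    }

    /** Equal frequency: cuts taken from the sorted values so each bin holds about n/k cases. */
    static double[] equalFrequencyCuts(double[] values, int k) {
        if (k < 2 || values.length == 0) {
            return new double[0];
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] cuts = new double[k - 1];
        for (int i = 1; i < k; i++) {
            int index = (int) ((long) i * sorted.length / k);
            cuts[i - 1] = sorted[index];
        }
        return cuts;
    }

    /** Index of the bin a value falls into; a value equal to a cut goes to the upper bin. */
    static int binOf(double value, double[] cuts) {
        int bin = 0;
        while (bin < cuts.length && value >= cuts[bin]) {
            bin++;
        }
        return bin;
    }
}
```

With heavily repeated values, equal frequency can produce duplicate cut points and therefore empty bins. A minimum-cases-per-interval check, as in Bruce's option (A), would have to be added on top.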
139. Improve RL's guess about value types in data files.
Read the value into a string. Try to parse it as an integer; if that works, it's an integer. If not, try to parse it as a double; if that works, it's a double. If not, does the string contain a non-leading, non-trailing space? If so, it's a set value. If not, is it the missing-value marker "?" or the empty string? If so, it's a missing value. If not, it's a symbolic value. This should be modified for "<= N values implies discrete".
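The decision sequence above could look roughly like this in Java. ValueType and ValueTypeGuesser are hypothetical names. The sketch trims the string first, so any remaining space is by definition non-leading and non-trailing. It does not implement the "<= N values implies discrete" rule, which needs a whole column rather than a single value.

```java
/** Hypothetical value categories, following the description above. */
enum ValueType { INTEGER, DOUBLE, SET, MISSING, SYMBOLIC }

/** Hypothetical sketch of the per-value type guess. */
final class ValueTypeGuesser {

    static ValueType guess(String raw) {
        String s = raw.trim();
        try {
            Integer.parseInt(s);
            return ValueType.INTEGER;
        } catch (NumberFormatException notAnInteger) {
            // fall through to the next test
        }
        try {
            // Note: Double.parseDouble also accepts strings such as "NaN" and "1e3".
            Double.parseDouble(s);
            return ValueType.DOUBLE;
        } catch (NumberFormatException notADouble) {
            // fall through to the next test
        }
        if (s.indexOf(' ') >= 0) {
            return ValueType.SET;      // inner space: several values in one field
        }
        if (s.isEmpty() || s.equals("?")) {
            return ValueType.MISSING;  // missing-value marker or empty string
        }
        return ValueType.SYMBOLIC;
    }
}
```

Because Double.parseDouble is lenient, a symbolic value spelled "NaN" or "Infinity" would be classified as a double. A stricter check may be needed if such symbols occur in real data.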
128. (?) For HTML output, flag cases that have contradictory predictions.
Huh?
129. For HTML output, show rules by predicted class.


User Interface Bugs

147. File dialog field needs to be cleared before showing ...or after processing.
131. (?) Rename "Export Rules" and "Export Results" to "Save Rules" and "Save Results". Similarly rename Import to Load.
132. Add label "Folds" after the "Cross-Validation" text field for the number of folds.
102. (?) Redesign the output. (Phil)
This is inextricably tied to model building. Make output interactive. Show the parameters, rules and evaluation in a tab. Clicking on a rule or set of rules shows the train and test data matched and correctly predicted by the selected rules. Should probably be done with a context menu and/or buttons, and pop up a window showing the data. This relates to memory considerations.
Another useful output would be a nicely laid-out graph relating rules (nodes) to data (nodes). This can also be interactive, memory permitting. Allow the user to remove from the data all attributes that appear in the current ruleset.
109. (?) Attribute import and export should be in Attribute dialog. (Will) Add buttons for "Save" and "Exit".
110. Button "Find rules" should be disabled or pop up a message box, if there are no output variables or no data.
115. Add a button to the Attributes dialog to sort attributes in the original order.
121. Total progress bar for bias space search.
141. Produce up-to-date user help (Bruce)
142. Costs: change the input format so that +ve is good and -ve is bad.
145. Default import directory should be the directory where RL was run from. (Mark)
Phil: Are you sure?
137. Put tooltips on all labels and controls to explain what they are for.
It may not be obvious to a new user. For example:
Control: Tooltip
Minimum CF: The found rules will have at least this confidence.
Beam Width: A wider beam makes the search for rules more exhaustive, but also slower.
Maximum Conjuncts: The greatest number of conjuncts that any rule will have in its condition. Rules with more conjuncts are more specific. For example the rule A=a & B=b -> C=c has two conjuncts.
Minimum Conjuncts: The fewest conjuncts that any rule will have in its condition. Rules with fewer conjuncts are more general. For example the rule A=a & B=b -> C=c has two conjuncts.
Minimum TP: Minimum True-Positive rate is the smallest fraction of correct predictions to total predictions that any rule is allowed to make during training.
Maximum FP: Maximum False-Positive rate is the greatest fraction of false predictions to total predictions that any rule is allowed to make during training.
Minimum Coverage: The smallest number of data that any rule must cover.
Inductive Strengthening: The smallest number of data that any rule must cover, that are not covered by more general rules.
Use Prior Rules: Prior rules are rules that you know already. If you check this box, RL will look for other rules during its learning process.
Prune Specialized Rules: ????
Chain Rules: Does not work at the moment.
Rule Scoring Method: Specifies what data should be used for evaluating each rule during training.
Cross-Validation: The training data is divided into N (almost) equal partitions; RL uses each one in turn for evaluating rules learned using the remaining N-1 partitions.
Training Data: ????
Test Data: Rules are evaluated on the test data that you specified in the Import Data dialog.
Validation Data: ????
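The "(almost) equal partitions" mentioned in the Cross-Validation tooltip can be illustrated with the sketch below. It is not RL's actual partitioning code; FoldAssigner, assignFolds and the seed parameter are assumptions made for the example. Cases are shuffled and then dealt out round-robin, so fold sizes differ by at most one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: assigns each case to one of N cross-validation folds. */
final class FoldAssigner {

    /** Returns an array where element i is the fold (0..folds-1) of case i. */
    static int[] assignFolds(int caseCount, int folds, long seed) {
        if (folds < 2 || folds > caseCount) {
            throw new IllegalArgumentException("folds must be between 2 and the number of cases");
        }
        // Shuffle the case indices so folds do not follow the file order.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < caseCount; i++) {
            order.add(i);
        }
        Collections.shuffle(order, new Random(seed));

        // Deal the shuffled cases out round-robin.
        int[] foldOf = new int[caseCount];
        for (int pos = 0; pos < caseCount; pos++) {
            foldOf[order.get(pos)] = pos % folds;
        }
        return foldOf;
    }
}
```

For fold f, the cases with foldOf[i] == f are held out for evaluation and the remaining cases are used for learning.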