
Feature Subset Selection (FSS)

The case for FSS

Repeated result: throwing out features rarely damages a theory

Feature removal enables the processing of very large data sets (e.g. see the text mining example, below).

And, finally, feature removal is sometimes very useful:

Excess attributes

Why FSS?

Problem

FSS types:


INFO GAIN

RELIEF

CBS (consistency-based evaluation)

WRAPPER

PRINCIPAL COMPONENTS ANALYSIS (PCA)

(The traditional way to do FSS.)

Latent Semantic Indexing

CFS (correlation-based feature selection)

And the winner is:

Other Methods

Other methods not explored by Hall and Holmes...

FastMap

When the data is too large for (say) LDA or PCA, use FastMap.

To map N points in D dimensions to D-1:

To map N points in D dimensions down to D=3 (so you can visualize it), repeat the above process D-3 times.

Heuristic: does not do as well as (say) PCA.
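
For concreteness, here is a minimal sketch of the pivot-and-project step (following Faloutsos and Lin's description of FastMap: each pass picks two far-apart pivots, projects every point onto the line between them to get one derived coordinate, and leaves a residual distance for the next pass). The Euclidean distance function and the random pivot seed are assumptions, not part of the notes above.

import math, random

def dist(p, q):
    # Euclidean distance; an assumption -- FastMap only needs *some* distance measure
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def furthest(points, p, d):
    # the point furthest from p under distance d
    return max(points, key=lambda q: d(p, q))

def fastmap(points, k, d=dist):
    # map each point (a tuple of numbers) to k derived coordinates;
    # each pass does one pivot-and-project step, then works with a residual distance
    coords = [[] for _ in points]
    for _ in range(k):
        a = furthest(points, random.choice(points), d)   # heuristically find two
        b = furthest(points, a, d)                       # far-apart pivot points
        dab = d(a, b)
        if dab == 0:                                     # no spread left to project out
            for c in coords:
                c.append(0.0)
            continue
        # cosine rule: project every point onto the line running through the pivots
        xs = [(d(a, p) ** 2 + dab ** 2 - d(b, p) ** 2) / (2 * dab) for p in points]
        for c, x in zip(coords, xs):
            c.append(x)
        # residual distance for the next pass: subtract out the component just extracted
        x_of = dict(zip(map(id, points), xs))
        d = lambda p, q, d=d, x_of=x_of: math.sqrt(
                max(d(p, q) ** 2 - (x_of[id(p)] - x_of[id(q)]) ** 2, 0.0))
    return coords

For example, fastmap(points, 3) returns three derived coordinates per point, which is enough for plotting.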

IMPORTANT NOTE: PCA, LSI, FastMap, etc ...:

Tf*IDF

FSS is essential in text mining:

Before doing anything else, remove all the words you can.

To begin with:

But what makes a word interesting:

Build a symbol table of all the remaining words

IMPORTANT: all the above are linear-time transforms requiring three passes (at most) over the data. For text mining, such efficiency is vital.
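
For example, here is a minimal sketch of that kind of pipeline (the tiny stop list, the tokenizer, and the particular weighting tf * log(N/df) are standard choices assumed here, not necessarily the exact ones used above):

import math, re
from collections import Counter

STOP = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}   # tiny illustrative stop list

def words(doc):
    # tokenize, lower-case, and drop stop words
    return [w for w in re.findall(r"[a-z]+", doc.lower()) if w not in STOP]

def tfidf(docs):
    # score each surviving word by tf * log(N / df):
    #   tf = how often the word appears, df = how many documents contain it
    n = len(docs)
    tf, df = Counter(), Counter()
    for doc in docs:
        ws = words(doc)
        tf.update(ws)
        df.update(set(ws))
    return {w: tf[w] * math.log(n / df[w]) for w in tf}

def top_words(docs, k=100):
    # keep only the k highest-scoring words (e.g. the top 100, as in the ICSM'08 study)
    scores = tfidf(docs)
    return sorted(scores, key=scores.get, reverse=True)[:k]

Each step is a linear scan over the text, so the whole pipeline stays cheap even on very large corpora.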

Note: At ICSM'08, Menzies and Marcus reduced 10,000 words to the top 100 Tf*IDF terms, then sorted those by InfoGain.

Nomograms

[Mozina04] argue that what really matters is the effect of a cell on the output variables. With knowledge of this goal, it is possible to design a visualization, called a nomogram, of that data. Nomograms do not confuse the user with needless detail. For example, here's a nomogram describing who survived and who died on the Titanic:

Of 2201 passengers on Titanic, 711 (32.3%) survived. To predict who will survive, the contribution of each attribute is measured as a point score (topmost axis in the nomogram), and the individual point scores are summed to determine the probability of survival (bottom two axes of the nomogram).

The nomogram shows the case when we know that the passenger is a child; this score is slightly less than 50 points, and increases the posterior probability to about 52%. If we further know that the child traveled in first class (about 70 points), the points would sum to about 120, with a corresponding probability of survival of about 80%.

It is simple to calculate nomogram values for single ranges. All we want is something that is far more probable in a goal class than in other classes.

We use logs since products can be visualized via simple addition. This addition can be converted back to a probability as follows. If the sum is "f" and the target class occurs "N" times out of "I" instances, then the probability of that class is "p=N/I" and the sum's probability is:

function points2p(f,p) { return 1 / (1 + exp(-1*log(p/(1 - p)) - f)) }
(For the derivation of this expression, see equation 7 of [Mozina04]. Note that their equation has a one-bracket typo.)
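
In Python, the same idea looks roughly like this (the log-odds-ratio scoring of a single range is my reading of [Mozina04]; the counts passed in are placeholders):

import math

def lor(x_and_c, c, x_and_not_c, not_c):
    # log odds ratio of range x for class c: log( P(x|c) / P(x|not c) )
    return math.log((x_and_c / c) / (x_and_not_c / not_c))

def points2p(f, p):
    # convert a sum of log odds ratios f back into a probability,
    # given the prior probability p of the target class
    return 1 / (1 + math.exp(-math.log(p / (1 - p)) - f))

p = 711 / 2201                 # Titanic prior: 711 of 2201 passengers survived
print(points2p(0.0, p))        # no evidence: just the prior, about 0.32
print(points2p(0.8, p))        # a point sum of 0.8 log-odds units lifts the posterior to about 0.52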

Besides enabling prediction, the nomogram reveals the structure of the model and the relative influences of attribute values on the chances of surviving. For the Titanic data set:

Therefore, with nomograms, we can play cost-benefit games. Consider the survival benefits of buying a first-class ticket, wearing a dress, or pretending to be a child.

Note that pretending to be a child has much less benefit than the other two steps. This is important since the more elaborate the model, the more demanding it becomes. Simpler models are faster to apply, use less resources (sensors and actuators), etc. So when the iceberg hits, get into a dress and buy a first-class ticket. But forget the lollipop.

I've applied some nomogram visualization to the output of Fenton's Bayes nets. The results found some simple rules that predicted for worst case defects:

B-squared

We've had some problems with nomograms where something is relatively more common in one class than in another but, in absolute terms, is still rare.

E.g. in 10,000 examples, if Martians appear twice as often in the low-IQ sample as in the high-IQ sample, then nomograms would say that being from Mars predicts for being dumb.

But the number of Martians on Earth is very small (100, or even fewer), so compared to the size of the total population, this is not an interesting finding.

Hence, we've modified Nomograms as follows.

The squared "b" value penalizes any X that occurs with a low frequency in C.

We can generate the same nomogram graphs, as above, using B-squared instead of the log-odds ratio (LOR).
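
A sketch of that score (the exact form below, b*b/(b+r) with b = P(x|c) and r = P(x|rest), follows the support-based ranking used in the WHICH work referenced later; treat the details as assumptions):

def b_squared(x_and_c, c, x_and_rest, rest):
    # b = P(x|c), r = P(x|rest); score = b*b / (b + r).
    # Squaring b rewards ranges that are genuinely frequent in c;
    # dividing by b + r punishes ranges that are also common elsewhere.
    b = x_and_c / c
    r = x_and_rest / rest
    return b * b / (b + r) if b > r else 0    # only reward ranges that favor c (an assumption)

# the Martian example: 10,000 people, 100 Martians, twice as common in the low-IQ half
print(b_squared(67, 5000, 33, 5000))       # about 0.009: Mars is rare, so it earns almost no points
print(b_squared(4000, 5000, 500, 5000))    # about 0.71: a common *and* distinctive range scores high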

Beyond Feature Selection

Note that the LOR and B-squared are range selectors, not feature selectors. That is, they don't just comment on which attributes are important, but on which ranges within those attributes are important.

This means we can use "feature selection" for other applications.

Instance Selection

Just as it is useful to prune features (columns), it is also useful to either prune instances (rows) or replace a larger number of rows with a smaller number of artificially generated representative examples.

How to perform instance selection?
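
One simple possibility, shown here only as an illustration (a single centroid row per class; real schemes, such as clustering within each class or pruning rows near those centroids, do more):

from collections import defaultdict

def one_prototype_per_class(rows, classes):
    # replace all rows of each class with a single "average" row (its centroid);
    # rows are numeric feature vectors, classes are the matching class labels
    byclass = defaultdict(list)
    for row, c in zip(rows, classes):
        byclass[c].append(row)
    return [([sum(col) / len(group) for col in zip(*group)], c)
            for c, group in byclass.items()]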

Rule Generation

A rule condition is a collection of ranges, combined using disjunction, conjunction, and negation.

If we have a range selector, why not run it as a pre-processor to a rule generator?
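
For example, a range selector's output drops straight into a condition representation like the following (the encoding, a conjunction over attributes where each attribute allows a disjunction of ranges, and the Titanic-style attribute names, are illustrative assumptions):

# a condition: attribute -> set of acceptable ranges; attributes are ANDed together,
# the ranges listed for one attribute are ORed, and negation can be expressed by
# listing the complement of a range set
def matches(condition, row):
    return all(row[a] in ranges for a, ranges in condition.items())

rule = {"class_of_travel": {"first", "second"}, "age": {"child"}}    # hypothetical ranges
print(matches(rule, {"class_of_travel": "first", "age": "child", "sex": "female"}))   # True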

A Better PRISM

Possible rule set for class b:

The PRISM Algorithm

A simple covering algorithm developed by Cendrowska in 1987.

Available in WEKA.

The algorithm is hardly profound but it does illustrate a simple way to handle an interesting class of covering learners.

Pseudo-code:

For each class C
   Initialize E to the instance set
   While E contains instances in class C
      Create a rule R with an empty left-hand side that predicts class C
      Until R is perfect (or there are no more attributes to use) do
         For each attribute A not mentioned in R, and each value v,
            Consider adding the condition A = v to the left-hand side of R
            Select A and v to maximize the accuracy p/t
            (break ties by choosing the condition with the largest p)
         Add A = v to R
      Remove the instances covered by R from E
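
Here is a minimal runnable sketch of that pseudo-code (discrete attributes only; the data layout, rows as dictionaries with a "class" field, is an assumption):

def prism(instances, cls, attributes):
    # learn rules (lists of attribute=value tests) that predict class cls
    E, rules = list(instances), []
    while any(i["class"] == cls for i in E):
        rule, covered = [], E
        # grow the rule until it is perfect, or no attributes remain
        while any(i["class"] != cls for i in covered) and len(rule) < len(attributes):
            used = {a for a, _ in rule}
            best, best_acc, best_p = None, -1.0, -1
            for a in attributes:
                if a in used:
                    continue
                for v in {i[a] for i in covered}:
                    sub = [i for i in covered if i[a] == v]
                    p = sum(i["class"] == cls for i in sub)   # correctly covered
                    t = len(sub)                              # total covered
                    if p / t > best_acc or (p / t == best_acc and p > best_p):
                        best, best_acc, best_p = (a, v), p / t, p
            rule.append(best)
            covered = [i for i in covered if i[best[0]] == best[1]]
        rules.append(rule)
        E = [i for i in E if i not in covered]                # remove what the rule covers
    return rules

# usage (hypothetical data): prism(rows, "yes", ["outlook", "humidity", "windy"])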

Methods like PRISM (for dealing with one class) are separate-and-conquer algorithms: find a rule that covers (conquers) some instances of the target class, separate out the instances it covers, then repeat on the instances that remain.

(See also INDUCT, RIPPER, etc.)

Over-fitting avoidance

Standard PRISM has no over-fitting avoidance strategy

Other options

Generalizing PRISM

PRISM is really a simple example of a whole class of separate-and-conquer algorithms. The basic algorithm supports a wide range of experimentation (e.g. INDUCT).

WHICH

Reference: Menzies et al., "Defect prediction from static code features: current results, limitations, new approaches", Automated Software Engineering journal, May 2010.

Sort the ranges (scored using, say, B-squared)

Repeat until top-of-stack's score stabilizes:

Top of stack stabilizes remarkably early:
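
Roughly, in code (this is a paraphrase of the stacking loop in the ASE journal paper above; the score-biased random sampling and the set-union combination of ranges are my assumptions):

import random

def which(candidates, score, lives=20):
    # candidates: list of conditions, each a frozenset of (attribute, range) pairs;
    # score: function scoring a condition (e.g. the B-squared of the rows it selects)
    stack = sorted(candidates, key=score, reverse=True)     # best-scoring candidate on top
    best, patience = score(stack[0]), lives
    while patience > 0:
        # pick two candidates, biased towards the top of the stack, and combine their ranges
        a = stack[int(random.random() ** 2 * len(stack))]
        b = stack[int(random.random() ** 2 * len(stack))]
        new = a | b
        if new not in stack:
            stack.append(new)
            stack.sort(key=score, reverse=True)
        top = score(stack[0])
        patience = lives if top > best else patience - 1    # stop once the top score stabilizes
        best = max(best, top)
    return stack[0]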

  • Special cases:
  • Applications: