Feature selection

Pearson Coefficient:

Measures the linear correlation between two variables. The resulting value lies in [-1, 1], with -1 meaning perfect negative correlation (as one variable increases, the other decreases), +1 meaning perfect positive correlation, and 0 meaning no linear correlation between the two variables.

    import numpy as np
    from scipy.stats import pearsonr

    np.random.seed(0)
    size = 300
    x = np.random.normal(0, 1, size)

    # pearsonr returns the correlation coefficient and its p-value
    print("Lower noise", pearsonr(x, x + np.random.normal(0, 1, size)))
    print("Higher noise", pearsonr(x, x + np.random.normal(0, 10, size)))

`Lower noise (0.71824836862138386, 7.3240173129992273e-49)`

`Higher noise (0.057964292079338148, 0.31700993885324746)`
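Each result above is a pair: the correlation coefficient followed by a p-value. To make the coefficient itself less of a black box, here is a minimal sketch that computes it by hand as the covariance divided by the product of the standard deviations and compares it with `pearsonr`. The variable names and the sample data are illustrative assumptions, not part of the original example.

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative data (assumed): y is x plus unit-variance noise
rng = np.random.RandomState(0)
x = rng.normal(0, 1, 300)
y = x + rng.normal(0, 1, 300)

# Center both variables
x_c = x - x.mean()
y_c = y - y.mean()

# Pearson r = sum of co-deviations / sqrt(product of summed squared deviations)
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# pearsonr returns the coefficient and a two-sided p-value
r_scipy, p_value = pearsonr(x, y)

print("manual:", r_manual)
print("scipy: ", r_scipy, "p-value:", p_value)
```

The two coefficients should agree up to floating-point error, and `np.corrcoef(x, y)[0, 1]` gives the same number.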

Sklearn: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

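scikit-learn's `Pipeline` chains preprocessing steps such as feature selection with a final estimator, so the whole workflow can be fit and applied in one call.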
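A minimal sketch of such a pipeline, assuming a regression task: `SelectKBest` with `f_regression` (a univariate score derived from each feature's correlation with the target) keeps the top features, and a `LinearRegression` is fit on what remains. The synthetic data, the choice of `k=2`, and the step names are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical data: 5 features, but the target depends only on features 0 and 1
rng = np.random.RandomState(0)
X = rng.normal(0, 1, (300, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 300)

# Step 1 scores each feature against y and keeps the best k;
# step 2 fits a linear model on the selected features only
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=2)),
    ("model", LinearRegression()),
])
pipe.fit(X, y)

# Indices of the features that survived the selection step
selected = pipe.named_steps["select"].get_support(indices=True)
print("Selected feature indices:", selected)
print("R^2 on training data:", pipe.score(X, y))
```

Because selection happens inside the pipeline, it is refit on each training fold when the pipeline is passed to cross-validation, which avoids leaking information from held-out data into the feature ranking.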

A major drawback of Pearson correlation as a feature ranking mechanism is that it is only sensitive to linear relationships. If the relationship is non-linear, Pearson correlation can be close to zero even when one variable is fully determined by the other. For example, the correlation between x and x² is close to zero when x is distributed symmetrically around 0.

    x = np.random.uniform(-1, 1, 100000)
    print(pearsonr(x, x**2)[0])

` -0.00230804707612`
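To see what this means for feature ranking, the sketch below ranks three candidate features by the absolute value of their Pearson correlation with a target. The feature names, the synthetic target, and the noise level are all assumptions chosen for illustration: `x_quad` genuinely drives the target (through its square), yet it should score about as low as the unrelated `x_noise`.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.RandomState(0)
n = 10000

# Hypothetical features, each uniform on [-1, 1]
x_lin = rng.uniform(-1, 1, n)    # linear effect on the target
x_quad = rng.uniform(-1, 1, n)   # quadratic effect on the target
x_noise = rng.uniform(-1, 1, n)  # no effect on the target

# Assumed target: depends on x_lin linearly and on x_quad through its square
y = x_lin + x_quad ** 2 + rng.normal(0, 0.1, n)

features = {"x_lin": x_lin, "x_quad": x_quad, "x_noise": x_noise}

# Score each feature by |r|; the sign does not matter for ranking
scores = {}
for name, col in features.items():
    r, _ = pearsonr(col, y)
    scores[name] = abs(r)

# Print features from highest to lowest score
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

A ranking based purely on this score would discard `x_quad` along with the noise feature, which is the drawback described above.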

Pearson Correlation Chart

https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/506px-Correlation_examples2.svg.png

Source: http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/