data science notes2
Feature selection
Pearson Coefficient:
Measures the linear correlation between two variables. The resulting value lies in [-1, 1]: -1 means perfect negative correlation (as one variable increases, the other decreases), +1 means perfect positive correlation, and 0 means no linear correlation between the two variables.
import numpy as np
from scipy.stats import pearsonr

# Fix the seed so the results are reproducible
np.random.seed(0)
size = 300
x = np.random.normal(0, 1, size)
# pearsonr returns the correlation coefficient and its p-value
print("Lower noise", pearsonr(x, x + np.random.normal(0, 1, size)))
print("Higher noise", pearsonr(x, x + np.random.normal(0, 10, size)))
`Lower noise (0.71824836862138386, 7.3240173129992273e-49)`
`Higher noise (0.057964292079338148, 0.31700993885324746)`
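To make the definition concrete, here is a minimal sketch that computes the coefficient by hand as the covariance divided by the product of the standard deviations, then checks it against `np.corrcoef`. The helper name `pearson_r` is made up for illustration and is not part of any library.

```python
import numpy as np

def pearson_r(a, b):
    """Hypothetical helper: Pearson correlation computed from its definition."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Center both variables on their means
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    # Sum of cross products over the product of the root sums of squares
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

rng = np.random.RandomState(0)
x = rng.normal(0, 1, 300)
y = x + rng.normal(0, 1, 300)

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

The two printed values should agree up to floating-point rounding, and `pearson_r(x, -x)` should come out at -1.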
Sklearn: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
Use the scikit-learn Pipeline to get the job done faster.
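As a rough sketch of what this could look like, the example below chains univariate feature selection and a linear model in one `Pipeline`. The synthetic data, the choice of `SelectKBest` with `f_regression` (a univariate test based on the correlation between each feature and the target), `k=2`, and the step names are all assumptions made for illustration, not something these notes prescribe.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Synthetic data (assumption): 5 features, only columns 0 and 3 drive the target
rng = np.random.RandomState(0)
n_samples = 300
X = rng.normal(0, 1, (n_samples, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(0, 0.5, n_samples)

# Step 1 keeps the 2 features with the highest univariate scores,
# step 2 fits a linear model on the selected features only
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=2)),
    ("model", LinearRegression()),
])
pipeline.fit(X, y)

selected = pipeline.named_steps["select"].get_support(indices=True)
train_r2 = pipeline.score(X, y)
print("Selected feature indices:", selected)
print("R^2 on training data:", train_r2)
```

Fitting the selector inside the pipeline means the same selection is reapplied automatically whenever `predict` or `score` is called on new data.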
A major drawback of Pearson correlation as a feature-ranking mechanism is that it is only sensitive to linear relationships. If the relationship is non-linear, the Pearson correlation can be close to zero even when one variable is completely determined by the other. For example, the correlation between x and x**2 is zero when x is centered on 0.
x = np.random.uniform(-1, 1, 100000)
print(pearsonr(x, x**2)[0])
` -0.00230804707612`
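The condition "centered on 0" matters here. The sketch below compares the same quadratic relationship for x drawn uniformly from [-1, 1] and for a shifted copy on [0, 2]; the variable names and the shift of 1.0 are chosen only for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.RandomState(0)
x_centered = rng.uniform(-1, 1, 100000)   # symmetric around 0
x_shifted = x_centered + 1.0              # same shape, now on [0, 2]

# Index [0] picks the correlation coefficient and drops the p-value
r_centered = pearsonr(x_centered, x_centered ** 2)[0]
r_shifted = pearsonr(x_shifted, x_shifted ** 2)[0]

print("centered on 0:", r_centered)
print("shifted to [0, 2]:", r_shifted)
```

The centered case should land very close to 0, while the shifted case should be strongly positive (about 0.97 in theory for a uniform variable on a non-negative interval starting at 0), because x**2 is monotonically increasing there and a straight line approximates it well.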
Pearson Correlation Chart
https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/506px-Correlation_examples2.svg.png
Source: http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/