Note: This is also still draft state.

Goal: To learn Feature Engg and other extremely cools techniques that had been shared on the Kaggle.com.

Note: This Page is (not) a copy paste or replication but a summary of things I have noticed from these Kagglers. Do not assume.

  • using Name Title for predicting Age - Master, Mr, Mrs, Miss, Captain, Officer,..

  • use of Title, Sex, Pclass for predicting - Age

Tribute to those awesome programmers.

  • https://www.kaggle.com/creepykoala/titanic/study-of-tree-and-forest-algorithms/run/237275

    • Kaggler : https://www.kaggle.com/creepykoala
  • a

    • Kaggler:

Decision Tree Visualisation and Submission

Here you can see how

https://www.kaggle.com/yildirimarda/titanic/titanic-test3/output

https://www.kaggle.io/svf/134152/3c521ead07195a4add31513c96b51631/Rplot001.png

How to visualise Tree Graphs

>> from IPython.display import Image >>> dot_data = StringIO() >>> tree.export_graphviz(clf, out_file=dot_data, feature_names=iris.feature_names, class_names=iris.target_names, filled=True, rounded=True, special_characters=True) >>> graph = pydot.graph_from_dot_data(dot_data.getvalue()) >>> Image(graph.create_png()) http://scikit-learn.org/stable/modules/tree.html

How to check correlation between columns with respect to Survival

<code class="  language-python">train <span class="token operator">=</span> pd<span class="token punctuation">.</span>read_csv<span class="token punctuation">(</span><span class="token string">"../input/train.csv"</span><span class="token punctuation">,</span> dtype<span class="token operator">=</span><span class="token punctuation">{</span><span class="token string">"Age"</span><span class="token punctuation">:</span> np<span class="token punctuation">.</span>float64<span class="token punctuation">}</span><span class="token punctuation">,</span> <span class="token punctuation">)</span>

<span class="token comment"># Replacing missing ages with median</span>
train<span class="token punctuation">[</span><span class="token string">"Age"</span><span class="token punctuation">]</span><span class="token punctuation">[</span>np<span class="token punctuation">.</span>isnan<span class="token punctuation">(</span>train<span class="token punctuation">[</span><span class="token string">"Age"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">]</span> <span class="token operator">=</span> np<span class="token punctuation">.</span>median<span class="token punctuation">(</span>train<span class="token punctuation">[</span><span class="token string">"Age"</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
train<span class="token punctuation">[</span><span class="token string">"Survived"</span><span class="token punctuation">]</span><span class="token punctuation">[</span>train<span class="token punctuation">[</span><span class="token string">"Survived"</span><span class="token punctuation">]</span><span class="token operator">==</span><span class="token number">1</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token string">"Survived"</span>
train<span class="token punctuation">[</span><span class="token string">"Survived"</span><span class="token punctuation">]</span><span class="token punctuation">[</span>train<span class="token punctuation">[</span><span class="token string">"Survived"</span><span class="token punctuation">]</span><span class="token operator">==</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token string">"Died"</span>
train<span class="token punctuation">[</span><span class="token string">"ParentsAndChildren"</span><span class="token punctuation">]</span> <span class="token operator">=</span> train<span class="token punctuation">[</span><span class="token string">"Parch"</span><span class="token punctuation">]</span>
train<span class="token punctuation">[</span><span class="token string">"SiblingsAndSpouses"</span><span class="token punctuation">]</span> <span class="token operator">=</span> train<span class="token punctuation">[</span><span class="token string">"SibSp"</span><span class="token punctuation">]</span>

plt<span class="token punctuation">.</span>figure<span class="token punctuation">(</span><span class="token punctuation">)</span>
sns<span class="token punctuation">.</span>pairplot<span class="token punctuation">(</span>data<span class="token operator">=</span>train<span class="token punctuation">[</span><span class="token punctuation">[</span><span class="token string">"Fare"</span><span class="token punctuation">,</span><span class="token string">"Survived"</span><span class="token punctuation">,</span><span class="token string">"Age"</span><span class="token punctuation">,</span><span class="token string">"ParentsAndChildren"</span><span class="token punctuation">,</span><span class="token string">"SiblingsAndSpouses"</span><span class="token punctuation">,</span><span class="token string">"Pclass"</span><span class="token punctuation">]</span><span class="token punctuation">]</span><span class="token punctuation">,</span>
             hue<span class="token operator">=</span><span class="token string">"Survived"</span><span class="token punctuation">,</span> dropna<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span></code>

https://www.kaggle.io/svf/1603/1db145f4c65249a8bc7b7090fb66369b/1_seaborn_pair_plot.png

Source: https://www.kaggle.com/benhamner/titanic/python-seaborn-pairplot-example/output

How lucky is your name?

Well sometimes, you happend to be of a high authority family and this reason that could save your life.

https://www.kaggle.com/anthonyg/titanic/lucky-names/code

how to use “is in alist” paramaeter in pandas

<code class=" language-python">#pull out the passengers that have popular names (> 10 occurances)

top10_popular_firstname = dfTitanic['FirstName'].value_counts()[dfTitanic['FirstName'].value_counts() > 10].index

dfPassengersWithPopularNames = dfTitanic[dfTitanic['FirstName'].isin( top10_popular_firstname )]
</code>

How to XGboost your solution?

https://www.kaggle.com/cbrogan/titanic/xgboost-example-python/code

Suggestion:

I see there are lots of interesting questions & interesting finding to ask and figure out from data.

  • distplot/hist of features values play important role.

  • sometimes few columns can be inter-dependent and we can use that for guessing missing values.

  • No values can also mean something like a new feature called no.of_nulls feature.

  • check for hidden data in Object type features.

  • if you see combinaiton of cols can change show some important, then create a new feaure.

  • all new features not necesarly adds signification values.

  • too many feaures are good but having feaures which contribute is more important.

  • Failures are stepping stone of sucess. Kill logics the relate to failure. Keep trying.

  • ASK WHY for everything.

    • What & Why these features are good?

    • What story to make up ?

    • What more we can cook up ?