{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"gradient": {
"editing": false
}
},
"source": [
"# RF (Random Forest)\n",
"\n",
"We will break timeline a bit and jump to random forests instead of introducing perceptrons. The general method of random decision forests was first proposed by Ho only in 1995. RF became really popular approach for tabular dataset due to their simplicity and quite robust behaviour. They are still widely used and for some problems can outperform neural nets.\n",
"\n",
"Before defining random forests we need to explore it's main building block - entropy.\n",
"\n",
"## Entropy\n",
"\n",
"Coined by Claude Shannon in “A Mathematical Theory of Communication” 1948 ([link](http://www.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf)).\n",
"Main question: how much useful information are we transmitting?\n",
"When we transmit one bit of information we reduce recipients uncertainty by the factor of two.\n",
"\n",
"P.S.: same paper introduced term bit = **bi**nary uni**t**.\n",
"\n",
"Look at the following video:\n",
"\n",
"\n",
"\n",
"Some key points:\n",
"- How much information are we getting measured in bits? $- \\log_2 (prob) = bits$\n",
"- Entropy: $H(p)=-\\sum_i p_i \\log_2 (p_i)$\n",
"- Cross-Entropy: let $p$ - true distribution, $q$ - predicted distribution, then Cross-entropy is\n",
"$H(p,q) = -\\sum_i p_i \\log_2(q_i)$\n",
"- If predictions are perfect Cross-entropy is equal to entropy, but usually it is greater than entropy.\n",
"- Cross-entropy is often used for ML as a cost function (log-loss) comparing $p$ with $q$."
]
},
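{
"cell_type": "markdown",
"metadata": {},
"source": [
"The definitions above are easy to check numerically. A minimal sketch (the example distributions below are made up for illustration):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def entropy(p):\n",
"    # H(p) = -sum_i p_i * log2(p_i); zero-probability terms contribute 0\n",
"    p = np.asarray(p, dtype=float)\n",
"    p = p[p > 0]\n",
"    return -np.sum(p * np.log2(p))\n",
"\n",
"def cross_entropy(p, q):\n",
"    # H(p, q) = -sum_i p_i * log2(q_i)\n",
"    p = np.asarray(p, dtype=float)\n",
"    q = np.asarray(q, dtype=float)\n",
"    mask = p > 0\n",
"    return -np.sum(p[mask] * np.log2(q[mask]))\n",
"\n",
"p = np.array([0.5, 0.5])    # true distribution: a fair coin\n",
"q = np.array([0.8, 0.2])    # an imperfect prediction of p\n",
"print(entropy(p))           # 1.0 bit\n",
"print(cross_entropy(p, p))  # equals entropy when q == p\n",
"print(cross_entropy(p, q))  # greater than entropy otherwise\n",
"```"
]
},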
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"gradient": {
"editing": false
}
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.datasets import load_iris\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.tree import DecisionTreeClassifier, export_graphviz\n",
"from graphviz import Source\n",
"from IPython.display import display, SVG"
]
},
{
"cell_type": "markdown",
"metadata": {
"gradient": {
"editing": false
}
},
"source": [
"## Iris dataset\n",
"\n",
"Bellow we will work with iris dataset. The idea of the dataset is to find out iris class based on given measurements.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"gradient": {
"editing": false
}
},
"outputs": [],
"source": [
"# Load dataset\n",
"iris = load_iris()\n",
"df = pd.DataFrame(iris.data, columns = iris.feature_names)\n",
"df['class'] = iris.target_names[iris.target]"
]
},
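{
"cell_type": "markdown",
"metadata": {},
"source": [
"Entropy also quantifies the uncertainty of the class labels themselves: iris has 50 samples of each of the 3 species, so the label distribution is uniform and its entropy is maximal. A minimal sketch:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.datasets import load_iris\n",
"\n",
"# Class-label distribution: 50 samples per class -> uniform over 3 classes\n",
"target = load_iris().target\n",
"counts = np.bincount(target)\n",
"p = counts / counts.sum()\n",
"\n",
"H = -np.sum(p * np.log2(p))\n",
"print(H)  # log2(3), the maximum possible entropy for 3 classes\n",
"```\n",
"\n",
"Decision trees can use exactly this quantity to pick splits (`criterion='entropy'` in scikit-learn)."
]
},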
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"gradient": {}
},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | sepal length (cm) | \n", "sepal width (cm) | \n", "petal length (cm) | \n", "petal width (cm) | \n", "class | \n", "
---|---|---|---|---|---|
0 | \n", "5.1 | \n", "3.5 | \n", "1.4 | \n", "0.2 | \n", "setosa | \n", "
1 | \n", "4.9 | \n", "3.0 | \n", "1.4 | \n", "0.2 | \n", "setosa | \n", "
2 | \n", "4.7 | \n", "3.2 | \n", "1.3 | \n", "0.2 | \n", "setosa | \n", "
3 | \n", "4.6 | \n", "3.1 | \n", "1.5 | \n", "0.2 | \n", "setosa | \n", "
4 | \n", "5.0 | \n", "3.6 | \n", "1.4 | \n", "0.2 | \n", "setosa | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
145 | \n", "6.7 | \n", "3.0 | \n", "5.2 | \n", "2.3 | \n", "virginica | \n", "
146 | \n", "6.3 | \n", "2.5 | \n", "5.0 | \n", "1.9 | \n", "virginica | \n", "
147 | \n", "6.5 | \n", "3.0 | \n", "5.2 | \n", "2.0 | \n", "virginica | \n", "
148 | \n", "6.2 | \n", "3.4 | \n", "5.4 | \n", "2.3 | \n", "virginica | \n", "
149 | \n", "5.9 | \n", "3.0 | \n", "5.1 | \n", "1.8 | \n", "virginica | \n", "
150 rows × 5 columns
\n", "\n", " | 0 | \n", "1 | \n", "2 | \n", "3 | \n", "4 | \n", "5 | \n", "6 | \n", "7 | \n", "8 | \n", "9 | \n", "
---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "970 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "3 | \n", "3 | \n", "1 | \n", "2 | \n", "1 | \n", "
1 | \n", "0 | \n", "1124 | \n", "2 | \n", "4 | \n", "0 | \n", "1 | \n", "3 | \n", "0 | \n", "1 | \n", "0 | \n", "
2 | \n", "6 | \n", "0 | \n", "1002 | \n", "5 | \n", "2 | \n", "0 | \n", "2 | \n", "10 | \n", "5 | \n", "0 | \n", "
3 | \n", "0 | \n", "0 | \n", "10 | \n", "972 | \n", "0 | \n", "7 | \n", "0 | \n", "10 | \n", "9 | \n", "2 | \n", "
4 | \n", "1 | \n", "0 | \n", "2 | \n", "0 | \n", "958 | \n", "0 | \n", "5 | \n", "0 | \n", "3 | \n", "13 | \n", "
5 | \n", "2 | \n", "1 | \n", "1 | \n", "17 | \n", "2 | \n", "856 | \n", "4 | \n", "2 | \n", "5 | \n", "2 | \n", "
6 | \n", "5 | \n", "3 | \n", "0 | \n", "0 | \n", "5 | \n", "5 | \n", "936 | \n", "0 | \n", "4 | \n", "0 | \n", "
7 | \n", "1 | \n", "2 | \n", "20 | \n", "3 | \n", "2 | \n", "0 | \n", "0 | \n", "988 | \n", "2 | \n", "10 | \n", "
8 | \n", "4 | \n", "0 | \n", "5 | \n", "6 | \n", "6 | \n", "7 | \n", "4 | \n", "4 | \n", "927 | \n", "11 | \n", "
9 | \n", "7 | \n", "5 | \n", "3 | \n", "9 | \n", "13 | \n", "1 | \n", "1 | \n", "5 | \n", "4 | \n", "961 | \n", "