Demo
Yun-ri opened this issue · 0 comments
Yun-ri commented
"#Some basic statistics\n",
"print(np.mean(tips['total_bill'])) #mean for the 'total_bill'\n",
"print(np.std(tips['total_bill'])) #standard deviation for the 'total_bill' \n",
"print(np.var(tips['total_bill'])) #variance for the 'total_bill'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Your turn\n",
"\n",
"- Load the "tips.csv" data set from the datasets folder (Seaborn_Datasets).\n",
"- Calculate the mean of the "tip" column for dinner and for lunch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Your answer\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Simple Plots\n",
"In this section, you will learn how to draw some simple plots to start data analysis."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bar Plot\n",
"A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. It can show the relationship between a numerical variable and a categorical variable. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import seaborn as sns \n",
"import numpy as np\n",
"\n",
"def read_dataset(dataset):\n",
"    folder = "Seaborn_Datasets/"\n",
"    data = pd.read_csv(folder + dataset)\n",
"    return data\n",
"\n",
"tips = read_dataset("tips.csv")\n",
"\n",
"sns.barplot(x="day", y="tip", data=tips)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#changing estimator to median\n",
"sns.barplot(x="day", y="total_bill", data=tips, estimator=np.median)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stacked Plot\n",
"A stacked bar graph (or stacked bar chart) is a chart that uses bars to compare categories of data, with each bar divided into segments that show how much each sub-category contributes to the total."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"#generating data set\n",
"df = pd.DataFrame(columns=["Language","Scripting", "Cross Platform","Fast",\n",
" "Data Science","Easy"], \n",
" data=[["Python",1,1,1,1,1],\n",
" ["Java",0,1,1,1,0],\n",
" ["PHP",1,1,0,0,1],\n",
" ["Perl",1,1,1,0,1],\n",
" ["C#",0,0,0,0,0]])\n",
"#drawing stacked bar plot\n",
"df.set_index('Language').plot(kind='bar', stacked=True, figsize=(10, 10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Box Plot\n",
"A box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending from the boxes (whiskers) indicating variability outside the upper and lower quartiles."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generating Sample Data Via Random (Numpy)\n",
"Numpy can be used to generate sample data with different distributions. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"sample1 = np.random.rand(50) * 100 #Generates 50 random values in [0, 100)\n",
"print("Sample1:\n")\n",
"print(sample1)\n",
"sample2 = np.ones(25) * 50 #Generates an array of size 25 in which every value is 50\n",
"print("\nSample2:\n")\n",
"print(sample2)\n",
"sample3 = np.random.rand(10) * 100 + 100 #Values in [100, 200)\n",
"sample4 = np.random.rand(10) * -100 #Values in (-100, 0]\n",
"data = np.concatenate((sample1, sample2, sample3, sample4), 0) #Concatenates the 1-D arrays end to end (axis 0)\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.boxplot(data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Color and shape of outliers ('gd' = green diamonds)\n",
"plt.boxplot(data, notch=False, sym='gd')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Change the orientation to horizontal (vert=False); outliers are drawn as red squares ('rs')\n",
"plt.boxplot(data, notch=False, sym='rs', vert=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Multiple box plots together\n",
"#A separate name keeps the original array intact, so the cell can be re-run safely\n",
"datasets = [data, data[:50], sample1]\n",
"plt.boxplot(datasets)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Your turn\n",
"\n",
"- Load the "flights.csv" data set from the datasets folder (Seaborn_Datasets).\n",
"- Calculate the mean, median, and standard deviation of "passengers" for the first 100 rows.\n",
"- Show the average number of passengers per month (bar plot) for the whole data set.\n",
"- Explore outliers of "passengers" (box plot) in the whole data set. Are there any outliers?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Your answer\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Distributions\n",
"The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur.\n",
"### Plotting Univariate Distributions\n",
"A univariate analysis explores the distribution of a single variable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"from scipy import stats\n",
" \n",
"sns.set(color_codes=True)\n",
" \n",
"x = np.random.normal(10,1,size=100) #100 samples from a normal distribution with mean 10 and standard deviation 1\n",
"#Note: distplot is deprecated since seaborn 0.11 (histplot and displot replace it)\n",
"sns.distplot(x); #default distribution with histogram and kernel density"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Without kernel density, with rug plot (small vertical lines show the observations in each bin)\n",
"sns.distplot(x, kde=False, rug=True);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Specifying the number of bins \n",
"sns.distplot(x, bins=20, kde=False, rug=True); "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Without histogram, with rug plot\n",
"sns.distplot(x, hist=False, rug=True);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Additional features for rug, kde, and hist\n",
"sns.distplot(x, rug=True, \n",
" rug_kws={"color": "r"}, \n",
" kde_kws={"color": "k", "lw": 3, "label": "KDE"}, \n",
" hist_kws={"histtype": "step", "linewidth": 3, "alpha": 1, "color": "g"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plotting Bivariate Distributions\n",
"In this section, we will explore distribution involving two variables."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"mean = [10, 20] \n",
"cov = [(1, .5), (.5, 1)]\n",
"#Generate 100 random samples from a bivariate normal distribution with the predefined mean and covariance\n",
"data = np.random.multivariate_normal(mean, cov, 100)\n",
"\n",
"\n",
"#convert Numpy to Dataframe with specific names for columns \n",
"df = pd.DataFrame(data, columns=["x", "y"])\n",
"#print(df.corr())\n",
"\n",
"sns.jointplot(x="x", y="y", data=df, kind="kde"); #kind= scatter, hex, reg,kde \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Changing type to Scatter\n",
"scatter = sns.jointplot(x="x", y="y", data=df, kind="scatter");"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Changing type to Hexagons\n",
"sns.jointplot(x="x", y="y", data=df, kind="hex"); "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Changing type to Regression\n",
"sns.jointplot(x="x", y="y", data=df, kind="reg"); "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pair Plot\n",
"A pair plot creates a grid of Axes such that each variable in the data is shared on the y-axis across a single row and on the x-axis across a single column. The diagonal Axes show the univariate distribution of the corresponding variable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"def read_dataset(dataset):\n",
"    folder = "Seaborn_Datasets/"\n",
"    data = pd.read_csv(folder + dataset)\n",
"    return data\n",
"\n",
"iris = read_dataset("iris.csv")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#By default all numeric variables are used\n",
"sns.pairplot(iris); "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Selecting specific variables\n",
"sns.pairplot(iris, vars = ['petal_length','sepal_length']);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Adding Color\n",
"sns.pairplot(iris, hue = 'species');"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Adding markers\n",
"sns.pairplot(iris, hue = 'species', markers=["o", "s", "D"]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Your Turn :)\n",
"\n",
"- Load the "mpg" data set with seaborn.\n",
"- Show the distribution of "horsepower" and "acceleration" together (by a joint plot). Interpret the correlation between "horsepower" and "acceleration".\n",
"- Compare the correlation between "horsepower", "weight", and "acceleration" for cars grouped by where they were produced ("origin")."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Your answer\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Decision Tree\n",
"In this part, we will use the "p_decision_tree" library to make a decision tree based on categorical descriptive attributes and the "scikit-learn" library to make a decision tree based on numerical descriptive attributes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Decision Tree (Categorical Descriptive Attributes)\n",
"We use the "p_decision_tree" library to make a decision tree based on categorical descriptive attributes (make sure that you have installed the "p_decision_tree" library). This library cannot build a decision tree from numerical descriptive attributes, so you have to convert numerical descriptive attributes to categorical ones first. \n",
"\n",
"Note that in order to see a visual tree, you need to install the Graphviz package. Choose the package that matches your operating system. \n",
"### Features\n",
"The main algorithm used by the library is ID3 with the following features:\n",
"\n",
"* Information gain based on entropy\n",
"* Information gain based on gini\n",
"* Some pruning capabilities like:\n",
"\t* Minimum number of samples\n",
"\t* Minimum information gain\n",
"* The resulting tree is not binary\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading Dataset\n",
"As mentioned earlier, you can load “csv” or “excel” data with the corresponding Pandas methods (“read_csv” and “read_excel”, respectively). Make sure that you have uploaded the "DT_Datasets" folder to the home page of your Jupyter notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from p_decision_tree.DecisionTree import DecisionTree\n",
"import pandas as pd\n",
"\n",
"def read_dataset(dataset):\n",
"    folder = "DT_Datasets/"\n",
"    data = pd.read_csv(folder + dataset)\n",
"    return data\n",
"\n",
"data = read_dataset('playtennis.csv')\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Identifying Descriptive and Target Attributes (Features)\n",
"To build a decision tree, the descriptive features and the target feature must be specified. The descriptive features are the inputs the tree uses to predict the target feature."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"columns = data.columns\n",
"\n",
"#All columns except the last one are descriptive by default\n",
"descriptive_features = columns[:-1]\n",
"#The last column is considered as label\n",
"label = columns[-1]\n",
"\n",
"#Converting all the columns to string\n",
"for column in columns:\n",
"    data[column] = data[column].astype(str)\n",
"\n",
"data_descriptive = data[descriptive_features].values\n",
"data_label = data[label].values\n",
"\n",
"print("descriptive features:")\n",
"print(descriptive_features)\n",
"print("\ntarget feature:\n" + label)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Making the Tree\n",
"The "id3" method is used to make the decision tree. You can pass the minimum gain and the minimum number of samples to this method to prune the tree."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Calling DecisionTree constructor (the last parameter is criterion which can also be "gini")\n",
"decisionTree = DecisionTree(data_descriptive.tolist(), descriptive_features.tolist(), data_label.tolist(), "entropy")\n",
"\n",
"#Here you can pass pruning features (gain_threshold and minimum_samples)\n",
"decisionTree.id3(0,0)\n",
"\n",
"#Visualizing decision tree by Graphviz\n",
"dot = decisionTree.print_visualTree( render=True )\n",
"\n",
"#print(dot)\n",
"\n",
"print("System entropy: ", format(decisionTree.entropy))\n",
"print("System gini: ", format(decisionTree.gini))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Decision Tree (Numerical Descriptive Attributes)\n",
"The "scikit-learn" library is used to make a decision tree based on numerical descriptive attributes. Note that "scikit-learn", the main library for data science in Python, is not able to make a decision tree based on categorical descriptive attributes, so you have to convert categorical attributes to numerical ones before passing them to the classifier. Also, the decision tree produced by this library is a binary tree.\n",
"Below is sample code that makes a decision tree based on numerical descriptive attributes using the "scikit-learn" library.\n",
"\n",
"The “DecisionTreeClassifier” class of “sklearn” is used to generate the tree classifier. You can set its parameters based on what you need. Some of the most important parameters are:\n",
"- Main parameters to specify the algorithm\n",
"    - criterion: The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. (Default = "gini")\n",
"    - splitter: The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split. (Default = "best")\n",
"- Parameters to control growth of the tree (Pruning)\n",
"    - min_samples_split: The minimum number of samples required to split an internal node. (Default = 2)\n",
"    - min_samples_leaf: The minimum number of samples required to be at a leaf node. (Default = 1)\n",
"    - max_depth: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than “min_samples_split” samples. (Default = None)\n",
"    - max_leaf_nodes: Grow a tree with “max_leaf_nodes” in best-first fashion. Best nodes are defined by their relative reduction in impurity. If None, the number of leaf nodes is unlimited. (Default = None)\n",
"    - min_impurity_decrease: A node will be split if this split induces a decrease of the impurity greater than or equal to this value. (Default = 0.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn import tree\n",
"from subprocess import check_output\n",
"\n",
"#loading dataset\n",
"def read_dataset(dataset):\n",
"    folder = "DT_Datasets/"\n",
"    data = pd.read_csv(folder + dataset)\n",
"    return data\n",
"\n",
"data = read_dataset('ManWoman.csv')\n",
"\n",
"#descriptive features\n",
"X = data[['height','weight']]\n",
"#target feature (a 1-D Series, which is the shape fit() expects)\n",
"Y = data["Class"]\n",
"\n",
"\n",
"job_classifier = tree.DecisionTreeClassifier(criterion="entropy")\n",
"job_classifier.fit(X, Y)\n",
"\n",
"\n",
"#feature names must match the columns the classifier was trained on\n",
"column_names = list(X.columns)\n",
"dot_file = "Classification.dot"\n",
"pdf_file = "Classification.pdf"\n",
"with open(dot_file, "w") as f:\n",
"    tree.export_graphviz(job_classifier, out_file=f,\n",
"                         feature_names=column_names,\n",
"                         class_names=["Man", "Woman"],\n",
"                         filled=True, rounded=True)\n",
"try:\n",
"    check_output("dot -Tpdf "+ dot_file + " -o " + pdf_file , shell=True)\n",
"    print("Find Classification.dot (description) and Classification.pdf (visual tree) in the home page of your Jupyter.")\n",
"except Exception:\n",
"    print("Make sure that you have installed Graphviz, otherwise you can not see the visual tree. But you can find descriptions in a dot file")\n"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"©PADS-RWTH (use only with permission & acknowledgements)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
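
For reference, the notebook's decision-tree section lists entropy and Gini as the two impurity measures behind information gain. The sketch below shows how those quantities can be computed in plain Python. It is a minimal illustration only, not the implementation used by "p_decision_tree"; the function names (`entropy`, `gini`, `information_gain`) and the toy rows are hypothetical and are not the contents of playtennis.csv.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def information_gain(rows, labels, column, impurity=entropy):
    """Impurity reduction obtained by splitting on one categorical column."""
    # Group the labels by the value each row has in the chosen column.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[column], []).append(label)
    # Weighted impurity that remains after the split.
    remainder = sum(len(part) / len(labels) * impurity(part)
                    for part in partitions.values())
    return impurity(labels) - remainder


# Hypothetical toy data with a single descriptive column ("outlook").
rows = [["sunny"], ["sunny"], ["overcast"], ["rain"], ["rain"], ["overcast"]]
labels = ["no", "no", "yes", "yes", "no", "yes"]

print("entropy:", entropy(labels))
print("gini:", gini(labels))
print("gain(outlook):", information_gain(rows, labels, 0))
```

Passing `impurity=gini` to `information_gain` gives the Gini-based variant of the same split score. ID3 would evaluate this score for every descriptive column and split on the one with the highest value.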