Complete-Data-Science-Toolkits

This toolkit is a free collection of data analysis and machine learning examples for data science work. It is meant to get you started in a matter of minutes. You can run the collection either in a Jupyter notebook or as plain Python scripts.

Features

Machine Learning

Cross-Validation
Evaluating Classification Metrics
Evaluating Clustering Metrics
Evaluating Regression Metrics
Grid Search
Preprocessing Encoding Categorical Features
Preprocessing Binarization
Preprocessing Imputing Missing Values
Preprocessing Normalization
Preprocessing StandardScaler
Randomized Parameter Optimization

Numpy

Adding, Removing, and Splitting Arrays
Sorting arrays
Matrix object
Statistics Vector Math
Structured Arrays
Import, Export, Slicing, Indexing
Data to/from string

Pandas

Complete pandas
Groupby in Pandas
Mapping
Filtering
Applying

Visualization

BarPlots
Customization Matplotlib
Working with Image
Working with text

Naming Conventions

The naming convention I follow is:
[yyyy-mm-dd-in-project-name-library].extension
yyyy = year
mm = month
dd = day
in = my initials, for example: Saleban Olow = so
project-name = the name of each project
library = numpy, pandas, sklearn, matplotlib
extension = .ipynb, .py, .html
Example: 2017-11-25-so-cross-validation-sklearn.ipynb
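As a minimal sketch, a file name following the yyyy-mm-dd pattern above could be generated like this. The helper name notebook_filename and its parameters are hypothetical and are not part of the toolkit.

```python
from datetime import date


def notebook_filename(day, initials, project_name, library, extension):
    """Build a name of the form yyyy-mm-dd-in-project-name-library.extension."""
    # Lower-case the project name and join its words with hyphens.
    slug = "-".join(project_name.lower().split())
    # Accept the extension with or without a leading dot.
    ext = extension.lstrip(".")
    return f"{day.isoformat()}-{initials.lower()}-{slug}-{library.lower()}.{ext}"


name = notebook_filename(date(2017, 11, 25), "so", "Cross Validation", "sklearn", ".ipynb")
print(name)  # 2017-11-25-so-cross-validation-sklearn.ipynb
```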

Code Samples:

Cross Validation

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
model = SVC(kernel='linear', C=1)
# evaluate the model with 5-fold cross-validation (X and y are your features and labels)
scores = cross_val_score(model, X, y, cv=5)
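To try the snippet above end to end, here is a self-contained sketch. It assumes the iris dataset bundled with scikit-learn as stand-in data; the toolkit itself does not prescribe a dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Assumption: iris is used only as example data.
X, y = load_iris(return_X_y=True)

model = SVC(kernel="linear", C=1)
# cv=5 splits the data into 5 folds and returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)

print(scores)
print(scores.mean())
```

The mean of the per-fold scores is the usual single-number summary of the cross-validated accuracy.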

Grid Search

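import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
# try every combination of neighbor count and distance metric
params = {"n_neighbors": np.arange(1,5), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn, param_grid=params)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)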
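A runnable version of the grid search might look like the sketch below. The iris data, the train/test split, and cv=5 are assumptions added for illustration. Note that in current scikit-learn releases GridSearchCV is imported from sklearn.model_selection.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris and this split are used only as example data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

knn = KNeighborsClassifier()
# 4 neighbor counts x 2 metrics = 8 candidate settings.
params = {"n_neighbors": np.arange(1, 5), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn, param_grid=params, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)            # best combination found
print(grid.best_score_)             # its mean cross-validated accuracy
print(grid.score(X_test, y_test))   # accuracy on the held-out test set
```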

Preprocessing Imputing Missing Values

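from sklearn.impute import SimpleImputer
# Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it and always works column-wise
# here zeros are treated as missing and replaced by the column mean
impute = SimpleImputer(missing_values=0, strategy='mean')
impute.fit_transform(X_train)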
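The sketch below shows mean imputation on a tiny hand-made array, using SimpleImputer from sklearn.impute, which is the class available in current scikit-learn releases. The array values and the choice of NaN as the missing-value marker are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Assumption: a small toy matrix with NaN marking the missing entries.
X_train = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [7.0, np.nan],
])

# Each missing entry is replaced by the mean of the observed values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_train)

print(X_filled)
# Column 0 mean is (1 + 7) / 2 = 4, column 1 mean is (2 + 4) / 2 = 3.
```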

Randomized Parameter Optimization

from sklearn.model_selection import RandomizedSearchCV
# knn is a KNeighborsClassifier instance; n_iter sets how many parameter settings are sampled
params = {"n_neighbors" : range(1,5), "weights": ["uniform", "distance"]}
rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params, cv=4, n_iter=8, random_state=5)
rsearch.fit(X_train, y_train)
print(rsearch.best_score_)
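Randomized search pays off when the parameter space is larger than the number of settings you can afford to try. The sketch below widens n_neighbors to 1..10 so that only 8 of the 20 combinations are sampled. The iris data, the split, and the wider range are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris and this split are used only as example data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

knn = KNeighborsClassifier()
# 10 neighbor counts x 2 weighting schemes = 20 possible settings.
params = {"n_neighbors": list(range(1, 11)), "weights": ["uniform", "distance"]}
# n_iter=8 samples 8 of those settings; random_state makes the sample repeatable.
rsearch = RandomizedSearchCV(
    estimator=knn, param_distributions=params, cv=4, n_iter=8, random_state=5
)
rsearch.fit(X_train, y_train)

print(rsearch.best_params_)
print(rsearch.best_score_)
```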

Model fitting supervised and unsupervised learning

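# supervised learning: fit a classifier on features and labels
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# unsupervised learning: no labels needed
from sklearn.decomposition import PCA
# keep as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95)
# fit_transform returns the transformed data, not the model
X_reduced = pca.fit_transform(X_train)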
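Both fits can be run together as in the sketch below. The iris data and the train/test split are assumptions; the variable name X_reduced is chosen here to make clear that fit_transform returns the projected data.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris and this split are used only as example data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Supervised: the classifier learns from features and labels.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)

# Unsupervised: PCA only sees the features.
# A float n_components keeps enough components to explain that share of variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)

print(accuracy)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```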

Working with numpy arrays

import numpy as np 
#appends values to end of arr
np.append(arr, values)
#inserts values into arr before index 2
np.insert(arr, 2, values)
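With concrete values (chosen here purely for illustration), the two calls behave as follows. Both return a new array and leave the original unchanged.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])
values = np.array([10, 20])

# np.append returns a new array with the values added at the end.
appended = np.append(arr, values)
print(appended)  # [ 1  2  3  4 10 20]

# np.insert returns a new array with the values placed before index 2.
inserted = np.insert(arr, 2, values)
print(inserted)  # [ 1  2 10 20  3  4]

print(arr)  # [1 2 3 4]
```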

Indexing and Slicing arrays

import numpy as np
# 1D array: return the element at index 5
arr = np.array([1,2,3,4,5,6,7])
arr[5]
# assign the value 4 to the element at index 1
arr[1] = 4
# 2D array with 3 rows and 6 columns
arr2d = np.arange(18).reshape(3,6)
# return the element at row 2, column 5
arr2d[2,5]
# assign the value 10 to the element at row 1, column 3
arr2d[1,3] = 10
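Slicing works along the same lines as indexing. The sketch below uses a small 3x4 array (an assumption for illustration) and shows that a basic slice is a view onto the original data.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

row = grid[1]             # second row: [4 5 6 7]
col = grid[:, 2]          # third column: [ 2  6 10]
block = grid[0:2, 1:3]    # rows 0-1, columns 1-2: [[1 2] [5 6]]

# Basic slices are views, so writing to the slice changes the original array.
block[0, 0] = 99
print(grid[0, 1])  # 99
```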

Creating DataFrame

import pandas as pd 
# specify the values row by row, plus the index and column labels
df = pd.DataFrame(
	[[4,7,10],
	 [5,8,11],
	 [6,9,12]],
	 index=[1,2,3],
	 columns=['a','b','c'])

groupby pandas

import pandas as pd 
# return a GroupBy object, grouped by the values in the column named 'Cities'
df.groupby(by="Cities")
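A GroupBy object does nothing visible until an aggregation is applied to it. The sketch below assumes a small made-up table with a 'Cities' column and a hypothetical 'Sales' column.

```python
import pandas as pd

# Assumption: toy data for illustration only.
df = pd.DataFrame({
    "Cities": ["Oslo", "Oslo", "Rome", "Rome", "Rome"],
    "Sales": [10, 20, 5, 15, 25],
})

grouped = df.groupby(by="Cities")

# Aggregate each group: total and mean sales per city.
totals = grouped["Sales"].sum()
means = grouped["Sales"].mean()

print(totals)
print(means)
```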

handling missing values

import pandas as pd 
#drop rows with any column having NA/null data.
df.dropna()
#replace all NA/null data with value
df.fillna(value)
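On a small frame with two missing cells (made-up data for illustration), the two methods behave as sketched below. Both return a new DataFrame and leave the original untouched.

```python
import numpy as np
import pandas as pd

# Assumption: toy data for illustration only.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, np.nan],
})

# Keep only the rows without any missing value (here: the first row).
dropped = df.dropna()

# Replace every missing value with 0 instead of dropping rows.
filled = df.fillna(0)

print(dropped)
print(filled)
```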

Melt function

import pandas as pd 
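# most pandas methods return a DataFrame, so calls can be chained,
# which improves the readability of the code
# pd.melt produces the columns 'variable' and 'value'; the new names are placeholders
df = (pd.melt(df)
	  .rename(columns={'variable':'new_var', 'value':'new_val'})
	  .query('new_val >= 200')
)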
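To see what the chain does, here is a sketch on a two-column frame. The data and the names 'series' and 'amount' are assumptions for illustration.

```python
import pandas as pd

# Assumption: toy wide-format data for illustration only.
wide = pd.DataFrame({"a": [100, 250], "b": [300, 150]})

tidy = (
    pd.melt(wide)  # wide -> long: one row per cell, columns 'variable' and 'value'
    .rename(columns={"variable": "series", "value": "amount"})
    .query("amount >= 200")  # keep only the rows with amount of at least 200
)

print(tidy)
```

After melting, the four cells become four rows; the filter keeps the two rows holding 250 and 300.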

Save plot

import matplotlib.pyplot as plt 
#saves plot/figure to image
plt.savefig('pic_name.png')

Marker, lines

import matplotlib.pyplot as plt 
#add * for every data point
plt.plot(x,y, marker='*')
#adds dot for every data point
plt.plot(x,y, marker='.')

Figures, Axis

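import matplotlib.pyplot as plt 
# a figure is the container that holds all plot elements
fig = plt.figure()
# add an axes at [left, bottom, width, height], given as fractions of the figure
fig.add_axes([0.1, 0.1, 0.8, 0.8])
# a subplot is an axes on a grid: 222 means 2 rows, 2 columns, position 2
a = fig.add_subplot(222)
# create a new figure together with a 3x2 grid of subplots
fig, b = plt.subplots(nrows=3, ncols=2)
# plt.subplots returns a (figure, axes) pair, so unpack both
fig, ax = plt.subplots(2,2)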
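The relationship between figures and axes can be checked with a short sketch. The Agg backend and the plotted numbers are assumptions so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption so no display is needed
import matplotlib.pyplot as plt

# plt.subplots returns the figure and an array of axes, here with shape (2, 2).
fig, axes = plt.subplots(nrows=2, ncols=2)
axes[0, 0].plot([1, 2, 3], [1, 4, 9], marker="*")
axes[1, 1].plot([1, 2, 3], [3, 2, 1], marker=".")

# Alternatively, start from an empty figure and add axes one by one.
fig2 = plt.figure()
ax = fig2.add_subplot(2, 2, 2)  # same as add_subplot(222): 2x2 grid, position 2

print(axes.shape)       # (2, 2)
print(len(fig.axes))    # 4
print(len(fig2.axes))   # 1
```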

Working with text plot

import matplotlib.pyplot as plt 
# place text at the data coordinates (1, 1)
plt.text(1,1, 'Example text', style='italic')
# annotate the point at coordinates xy with text
plt.annotate('some annotation', xy=(10,10))
# use a math formula (TeX syntax) in the title
plt.title(r'$\delta_i=20$',fontsize=10)
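The text helpers can be combined on a single axes and the result saved as an image, as in the sketch below. The Agg backend, the plotted line, the arrow settings, and the in-memory buffer are assumptions for illustration; passing a file name such as 'pic_name.png' to savefig works the same way.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption so no display is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 5, 10], [0, 5, 10], marker=".")

# Place italic text at the data coordinates (1, 1).
ax.text(1, 1, "Example text", style="italic")
# Annotate the point (10, 10); the label sits at xytext and an arrow points to xy.
ax.annotate("some annotation", xy=(10, 10), xytext=(6, 9),
            arrowprops={"arrowstyle": "->"})
# A raw string with TeX syntax renders as a math formula.
ax.set_title(r"$\delta_i=20$", fontsize=10)

# Save the figure as PNG into an in-memory buffer.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
print(len(png_bytes) > 0)
```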
