
Eloisa Elias - Data Scientist • Six Sigma • Open Source AI • Women Techmakers Ambassador

Primary Language: Jupyter Notebook

Advanced statistics and optimization. Six Sigma certified, 8+ years of experience improving KPIs at Fortune 500 companies @eloeliasds

I’m passionate about using data science, programming, and statistical analysis to solve Big Data challenges and deliver valuable business insights.

I've created this repository as a personal reference and for mentoring PyLadies and PyData members.

Summary: Utilizing ML tools
Goal: MLlib

Dataset: http://qwone.com/~jason/20Newsgroups/

The fun part: PySpark df rocks!


Summary: Spark basics with partitions (clusters)
Goal: RDDs and operations

Dataset: s3n://mortar-example-data/airline-data

The fun part: Updating Spark


Summary: Spark basics
Goal: RDDs and operations

Dataset: Sklearn.datasets

The fun part: Lambdas :) hohohohooo


Summary: Utilizing boto to connect and edit S3 buckets 
Goal: AWS storage 

Dataset: Cancer rates.csv

The fun part: boto


Summary: The singular value decomposition (SVD) is a factorization of a real or complex matrix A into the product U Σ V*, where Σ holds the singular values.
Goal: Apply SVD 

Dataset: The International Standard Book Number (ISBN) & voteview.com

The fun part: Visualizing High Dimensional Data
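As a rough illustration of the factorization, the sketch below applies NumPy's SVD to a small made-up matrix (not the ISBN or voteview.com data) and builds a rank-2 approximation. The matrix values and variable names are assumptions for demonstration only.

```python
import numpy as np

# Toy 4x3 matrix standing in for a real dataset (illustrative values only).
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 0.0],
              [4.0, 0.0, 2.0]])

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation keeps only the k largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius norm of what the truncation discarded.
reconstruction_error = np.linalg.norm(A - A_k)
```

Keeping only the largest singular values is what makes SVD useful for projecting high-dimensional data down to two or three dimensions for plotting.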


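Summary: Principal Component Analysis (PCA) is a dimensionality reduction technique that projects the data onto the orthogonal directions of maximum variance (the principal components).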
Goal: Apply PCA

Dataset: In process

The fun part: In process
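Since this notebook is still in process, here is only a minimal sketch of one common way to compute PCA: center the data, then take the SVD. The function name `pca` and the synthetic data are assumptions, not code from the repository.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (s ** 2) / (X.shape[0] - 1)
    return X_centered @ components.T, components, explained_variance[:n_components]

# Synthetic 2-D data lying close to the line y = 2x (made up for the demo).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200)])

scores, components, variance = pca(X, n_components=1)
```

The first principal direction should come out close to (1, 2) normalized, up to sign.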


Summary: k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
Goal: Develop a k-means algorithm for a newspaper dataset

Dataset: News_paper

The fun part: Cluster of related topics
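The assign-then-update loop described in the summary can be sketched as below. This is a plain Lloyd's-algorithm illustration on four made-up points, not the repository's implementation; `kmeans` and its parameters are hypothetical names.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distances[i, j] = Euclidean distance from observation i to centroid j.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Each centroid moves to the mean of its cluster (kept as-is if empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans(X, k=2)
```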


Summary: k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
Goal: Visualizing

Dataset: Sklearn.dataset - Iris

The fun part: Visualization of the Sklearn vs K-mean_elo.


Summary: MongoDB and Requests
Goal: Create a NoSQL MongoDB database from our own scraped information

Dataset: Wikipedia

The fun part: MongoDB


Summary: Profit curves allow us to compare models and select the one that will maximize profit for a specified cost-benefit matrix.
Goal: Obtain the maximum profit per transaction for each classifier.

Dataset: churn.csv

The fun part: Calculating the highest profit per transaction.
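One way to picture a profit curve: for each score threshold, count the confusion-matrix cells, weight them by the cost-benefit values, and divide by the number of cases. The sketch below assumes made-up labels, scores, and cost-benefit values (not churn.csv); `profit_curve` is a hypothetical helper.

```python
def profit_curve(y_true, y_score, benefit_tp, cost_fp, cost_fn=0.0, benefit_tn=0.0):
    """Return (threshold, average profit per case) pairs, one per distinct score."""
    n = len(y_true)
    curve = []
    for threshold in sorted(set(y_score), reverse=True):
        tp = fp = fn = tn = 0
        for label, score in zip(y_true, y_score):
            predicted = score >= threshold
            if predicted and label:
                tp += 1
            elif predicted:
                fp += 1
            elif label:
                fn += 1
            else:
                tn += 1
        profit = (tp * benefit_tp + tn * benefit_tn - fp * cost_fp - fn * cost_fn) / n
        curve.append((threshold, profit))
    return curve

# Illustrative data: 1 = churner, scores are a classifier's predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

curve = profit_curve(y_true, y_score, benefit_tp=10.0, cost_fp=4.0)
best_threshold, best_profit = max(curve, key=lambda point: point[1])
```

Running the same function on each classifier's scores and comparing the maxima gives the per-classifier comparison the goal describes.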


Summary: Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. 
Goal: Use scikit-learn boosting algorithms with grid search (GridSearchCV).

Dataset: spam.csv

The fun part: Grid search


Summary: Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. 
Goal: Use scikit-learn boosting algorithms with grid search (GridSearchCV).

Dataset: sklearn.dataset

The fun part: Grid search


Summary: Support Vector Machine 
Goal: Use of hyperparameter C and grid search optimization 

Tradeoffs: SVMs have a tradeoff between maximizing the margin and minimizing the classification error.
	- Advantages: GridSearchCV() tunes the hyperparameters and reports the best parameter set found.
	- Downside: Grid search takes a long time with many cross-validation folds.
	- Solution: Train on a sample of the data, or reduce the number of folds.

Dataset: sklearn load_digits() & other CSV files.

The fun part: Grid search
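A minimal sketch of tuning C with GridSearchCV on load_digits(), applying the "train on a sample, use fewer folds" workaround from the list above. The sample size, the C grid, and the RBF kernel are assumptions chosen to keep the run short, not the notebook's settings.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Work on a sample and use only 3 folds to keep the search fast.
X_small, y_small = X[:500], y[:500]

search = GridSearchCV(
    SVC(kernel="rbf", gamma="scale"),
    param_grid={"C": [0.1, 1, 10]},  # small C = wider margin, large C = fewer training errors
    cv=3,
)
search.fit(X_small, y_small)

best_c = search.best_params_["C"]
best_score = search.best_score_  # mean cross-validated accuracy for best_c
```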


Summary: Support Vector Machine - Linear SVM
Goal: Logistic Regression boundary and SVM boundary comparison

	- Advantages: a) Linear and non-linear classification, b) Usage of soft margins, c) Kernel transformation, d) Regularisation parameter to avoid over-fitting, e) SVM is defined by a convex optimisation problem
	- Downside: Over-fitting can occur during model selection. See Journal of Machine Learning Research 11 (2010) 2079-2107, On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
	- Solution: Over-fitting in model selection can be overcome using methods that have already been effective in preventing over-fitting during training, such as regularisation, as in the Kernel Ridge Regression (KRR) classifier

Dataset: CSV files.

The fun part: SVM boundaries


Summary: Receiver Operating Characteristics (ROC) graphs are a useful technique for organizing classifiers and visualizing their performance. 

	- Advantages: Cross validation is not strictly necessary. 
	- Downside: Confidence scores (thresholds) used to build ROC curves may be difficult to assign.
	- Solution: Alternatives to ROC graphs: DET curves, Cost curves.
Dataset: Churn.csv
The fun part: The confusion matrix, the receiver operating characteristic, and feature importance.
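To make the threshold idea concrete, the sketch below sweeps the score thresholds, records one (false positive rate, true positive rate) point per threshold, and integrates the curve with the trapezoid rule. The labels and scores are made up (not Churn.csv), and `roc_points` / `auc` are hypothetical helper names.

```python
def roc_points(y_true, y_score):
    """Return ROC points (FPR, TPR), sweeping the threshold from high to low."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the piecewise-linear curve (trapezoid rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

points = roc_points(y_true, y_score)
area = auc(points)
```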


Summary: Random Forest (RF) is a non-parametric, non-linear supervised learning method used for classification (Nominal/Discrete data) and regression (Continuous data).
Goal: Step by step manual RF

	- Advantages: Cross validation is not strictly necessary
	- Downside: RF is a predictive modelling tool that is slow to create predictions once trained; more accurate ensembles require more trees.
	- Solution: RF is a highly parallel algorithm, so if you have multiple cores, you can get a significant speedup.

Dataset: https://archive.ics.uci.edu/ml/datasets
Related Programs: RF
The fun part: The RF class and nodes.


Summary: Decision Trees (DTs) are a non-parametric, non-linear supervised learning method used for classification (Nominal/Discrete data) and regression (Continuous data).
Goal: Step by step manual DT

	- Advantages: No complex data preparation, discrete and continuous data usage, good performance in large datasets 
	- Downside: Overfitting, computationally expensive to train.
	- Solution: Prepruning, Pruning, Random Forests

Dataset: playgolf.csv
Related Programs: DecisionTree_elo.py, DecisionTree_run.py, TreeNode_elo.py

The fun part: The concept of Entropy in terms of information theory.
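The entropy idea can be sketched in a few lines: a split is scored by how much it reduces the Shannon entropy of the labels (information gain). The tiny "play" / "windy" lists below are invented in the spirit of playgolf.csv and are not its actual rows; the function names are assumptions, not those in DecisionTree_elo.py.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy reduction obtained by splitting the labels on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - weighted

play = ["yes", "yes", "no", "no"]
windy = [False, False, True, True]   # separates the classes perfectly
humid = [True, False, True, False]   # tells us nothing about the class

gain_windy = information_gain(play, windy)
gain_humid = information_gain(play, humid)
```

A tree builder would pick the feature with the larger gain (here `windy`) for the split at the current node.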


Summary: Nearest neighbor search (NNS), also known as proximity search, similarity search or closest point search, is an optimization problem for finding closest (or most similar) points

Goal: Step by step manual KNN

	- Advantages: a) Simple implementation
	- Downside: a) The value of the parameter K must be determined, b) Computationally intensive, c) It doesn't handle categorical variables very well, d) It is sensitive to the local structure of the data
	- Solution: Two classical algorithms can be used to speed up the NN search: 1) Bucketing (a.k.a. Elias’s algorithm) [Welch 1971], 2) k-d trees [Bentley, 1975; Friedman et al, 1977]

Dataset: from sklearn.datasets import make_classification
Related Programs: Knn.py

The fun part: Data needs no preparation for the algorithm
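A brute-force k-NN classifier fits in a few lines, which also shows why it is computationally intensive: every query is compared against every training point. This is an illustrative sketch with made-up points, not the contents of Knn.py; `knn_predict` is a hypothetical name.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Brute force: sort all training points by Euclidean distance to the query.
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]

near_origin = knn_predict(train_points, train_labels, (0.5, 0.5))
near_five = knn_predict(train_points, train_labels, (5.5, 5.5))
```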


Summary: Machine learning - Optimization algorithm
Goal: Gradient descent minimizes a function, here the cost function J(θ) of the hypothesis h(θ) (for example, its squared errors). In this case the hypothesis is the logistic regression function, and the goal is to obtain the parameters that minimize its cost: h(θ) --> J(θ) --> min_θ J(θ).

	- Advantages: The use of vectorization.
	- Downside: Overfitting
	- Solution: Feature scaling, manual selection of features, Ridge-Lasso regularization.

Related optimization algorithms: Conjugate gradient, BFGS, L-BFGS.
Dataset: from sklearn.datasets import make_classification
Related Programs: Gradient.py

The fun part: The math and the gradient class function
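The h(θ) --> J(θ) --> min_θ J(θ) chain can be sketched with vectorized NumPy as follows. This is a minimal batch gradient descent for logistic regression on synthetic data, not the class in Gradient.py; the learning rate, iteration count, and function names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Average log-loss J(theta) of the logistic hypothesis h = sigmoid(X @ theta)."""
    h = sigmoid(X @ theta)
    eps = 1e-12  # guards against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

def gradient_descent(X, y, learning_rate=0.1, n_iter=1000):
    """Repeatedly step against the gradient of J, computed for all samples at once."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        gradient = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta -= learning_rate * gradient
    return theta

# Synthetic one-feature problem with an intercept column (made up for the demo).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
y = (x + 0.5 * rng.normal(size=200) > 0).astype(float)

theta = gradient_descent(X, y)
```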


Summary: Classifier algorithm, ROC, Kfold and AUC
Goal: Obtain ROC curve

	- Advantages: a) Logistic regression works better if there is a single decision boundary, b) Logistic regression is intrinsically simple, c) Regularization is important to consider
	- Downside: a) The explanatory variables should not be highly correlated with one another, because this could cause problems with estimation.
	- Solution: Correct for multicollinearity among features.
Dataset: from sklearn.datasets import make_classification

The fun part: the ROC curve


Summary: Ridge and Lasso   	
Goal: Addressing overfitting

	- Advantages: Works well when we have a lot of features each of which contributes a bit to predicting y. Keep all features, but reduce magnitude/values of parameters θj. 
	- Downside:
		- LASSO: a) In the n<<p case (high-dimensional case), LASSO can select at most n features. b) In the usual case for real-world datasets, where features are correlated, LASSO will select only one feature from a group of correlated features. c) In the n>>p case with correlated features, Ridge (Tikhonov regularization) regression has better prediction power than LASSO.
		- RIDGE: a) Compared to ordinary least squares, ridge regression is not unbiased. It accepts a little bias to reduce variance and the mean squared error, which helps to improve prediction accuracy. The ridge estimator thus yields more stable solutions by shrinking coefficients, but it suffers from a lack of sensitivity to the data.
		- LASSO & RIDGE: a) LASSO regularization can occasionally produce non-unique solutions, for example when the space of possible solutions lies on a 45 degree line. This can be problematic for certain applications, and is overcome by combining LASSO and RIDGE regularization in elastic net regularization.
	- Solution: Model selection algorithm

Dataset: sklearn.datasets - load_diabetes()
The fun part: Visualizing the best alpha for the model.
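The "keep all features, but reduce the magnitude of the parameters θj" behaviour can be seen from the closed-form ridge solution θ = (XᵀX + αI)⁻¹ Xᵀy. The sketch below uses synthetic data rather than load_diabetes(), omits the intercept for brevity, and `ridge_fit` is a hypothetical helper.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients (no intercept): (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic regression problem with known coefficients (made up for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# Larger alpha means stronger shrinkage of the coefficient vector.
alphas = [0.0, 1.0, 10.0, 100.0]
coefficient_norms = [np.linalg.norm(ridge_fit(X, y, alpha)) for alpha in alphas]
```

With alpha = 0 this reduces to ordinary least squares; plotting a validation error against alpha is one way to pick the best value.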


Summary: Comparing models - sklearn dataset
Goal: Evaluate the model on the metric I'm interested in.

	- Advantages: The error value will plateau out after a certain m, or training set size.
	- Downside: If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.
	- Solution: a) With high variance, the errors relate to the training set size as follows: a1) Small training set: Jtrain(Θ) will be low and JCV(Θ) will be high. a2) Large training set: Jtrain(Θ) increases with training set size and JCV(Θ) continues to decrease without leveling off. Also, Jtrain(Θ) < JCV(Θ), but the difference between them remains significant. b) If a learning algorithm is suffering from high variance, getting more training data is likely to help.
Dataset: from sklearn.datasets import load_boston

The fun part: The training test size estimator


Summary: Credit card analysis - multivariate regression
Goal: Predict an individual's balance based on various variables and feature engineering, using adjusted R^2 or F-tests, and VIF to detect multicollinearity.

Dataset: csv

The fun part: Feature engineering
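For reference, adjusted R^2 and the variance inflation factor can both be built on an ordinary least-squares fit: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the other features. The sketch below uses synthetic features rather than the credit card CSV, and all function and variable names are assumptions.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept added here)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ coef
    centered = y - y.mean()
    return 1.0 - (residuals @ residuals) / (centered @ centered)

def adjusted_r_squared(X, y):
    """Penalize R^2 for the number of predictors p given n observations."""
    n, p = X.shape
    return 1.0 - (1.0 - r_squared(X, y)) * (n - 1) / (n - p - 1)

def vif(X):
    """Variance inflation factor of each column, regressed on the other columns."""
    return [
        1.0 / (1.0 - r_squared(np.delete(X, j, axis=1), X[:, j]))
        for j in range(X.shape[1])
    ]

# Synthetic features: c is almost a copy of b, a is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = b + 0.1 * rng.normal(size=200)
features = np.column_stack([a, b, c])
target = 2 * a + b + rng.normal(size=200)

vifs = vif(features)
adj_r2 = adjusted_r_squared(features, target)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity; here the near-duplicate columns b and c should stand out.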

###13_Multivariate_linear_regression
Summary: Linear regression
Goal: Develop multivariate linear regression

	- Advantages: Linear regression implements a statistical model that, when relationships between the independent variables and the dependent variable are almost linear, shows optimal results. 
	- Downside: When regression analysis is used to generate predictions, prediction intervals are calculated based on the assumption that the residuals are normally distributed. If the residuals are non-normal, the prediction intervals may be inaccurate.
	- Solution: Normalize the dataset, ensure independence of the observations, avoid multicollinearity among features, and compare the heteroscedasticity of residuals before and after taking the log.

Dataset: csv
The fun part: Using plotly for graphics


Summary: Business analysis - Bike rental 
Goal: Develop an exploratory data analysis and apply a linear regression algorithm in order to recommend the specific date that maximizes the reach of a promotional campaign for a (rental) business. The goal is to find the coefficients β which fit the equations "best," in the sense of solving the quadratic minimization problem.

	- Advantages: The numerical methods for linear least squares are important because linear regression models are among the most important types of model, both as formal statistical models and for exploration of data-sets. The majority of statistical computer packages contain facilities for regression analysis that make use of linear least squares computations. Hence it is appropriate that considerable effort has been devoted to the task of ensuring that these computations are undertaken efficiently and with due regard to round-off error.
	- Downside: When the problem is ill-conditioned, the least squares estimate amplifies the measurement noise and may be grossly inaccurate.
	- Solution: Various regularization techniques can be applied, e.g. LASSO or RIDGE.

Dataset: CSV
The fun part: Obtaining the normal behavior of the rental business and the use of the basemap


Summary: Markov chain and linear algebra
Goal: Implementing the PageRank algorithm

	- Advantages: No complex ranking algorithm

Dataset: Sklearn dataset - Iris.csv

The fun part: Page ranking using basic linear algebra, as in one of the first Google PageRank algorithms
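The Markov-chain view of PageRank can be sketched with power iteration: build a row-stochastic transition matrix from the link graph, mix in a uniform "random jump", and iterate until the rank vector stops changing. The 3-page graph, the damping factor 0.85, and the function name are assumptions for illustration.

```python
import numpy as np

def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=1000):
    """Stationary distribution of the damped random-surfer Markov chain."""
    n = adjacency.shape[0]
    out_degree = adjacency.sum(axis=1)
    # Row-stochastic transition matrix; pages without links jump uniformly.
    transition = np.where(
        out_degree[:, None] > 0,
        adjacency / np.maximum(out_degree[:, None], 1),
        1.0 / n,
    )
    google = damping * transition + (1 - damping) / n
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = rank @ google
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return new_rank

# adjacency[i, j] = 1 when page i links to page j (toy graph).
adjacency = np.array([[0.0, 1.0, 1.0],
                      [0.0, 0.0, 1.0],
                      [1.0, 0.0, 0.0]])
ranks = pagerank(adjacency)
```

Page 2 receives links from both other pages, so it should end up with the highest rank in this toy graph.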


Summary: Analyzing a university dataset, obtaining a potential threshold for admission
Goal: Covariance and other statistics functions

Dataset: Admisiones.csv
The fun part: visualizing the pdf for admission vs income


Summary: Probability exercises
Goal: Basic probability using python

The fun part: Obtaining interesting inferences from basic probability


Summary: Visualizing Bayes step by step

Related Programs: Bayes_elo.py
The fun part: The transformation of the probabilities into a distribution


Summary: Hypothesis test and power calculation
Goal: Power in Python
The fun part: From the frequentist point of view Power is everything. 
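As a small illustration of a power calculation, the sketch below computes the power of a one-sided, one-sample z-test using only the standard library. The choice of test, the effect size, and the function name are assumptions; the notebook may use a different setup.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided, one-sample z-test for a mean shift of `effect`."""
    standard_normal = NormalDist()
    critical = standard_normal.inv_cdf(1 - alpha)  # rejection cutoff under H0
    shift = effect * sqrt(n) / sigma               # mean of the z statistic under H1
    return 1 - standard_normal.cdf(critical - shift)

power_small_sample = z_test_power(effect=0.5, sigma=1.0, n=10)
power_large_sample = z_test_power(effect=0.5, sigma=1.0, n=30)
```

With no true effect the power equals alpha, and it grows with the sample size, which is why power analysis is typically used to choose n before running a test.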


Summary: Selection of a slot machine strategy 
Goal: Visualizing the trial proportions

Dataset: Discrete
Related Programs: Multiarm_elo
The fun part: Visualizing that the house always wins
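One simple slot-machine strategy is epsilon-greedy: usually pull the arm with the best observed win rate, but explore a random arm a fraction epsilon of the time. The sketch below is an assumed example of such a strategy, not necessarily the one in Multiarm_elo; the payout rates are made up.

```python
import random

def epsilon_greedy(true_rates, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy player; return pulls and wins per arm."""
    rng = random.Random(seed)
    pulls = [0] * len(true_rates)
    wins = [0] * len(true_rates)
    for _ in range(n_rounds):
        if rng.random() < epsilon or 0 in pulls:
            # Explore: random arm (also used until every arm has been tried).
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: arm with the best observed win rate so far.
            arm = max(range(len(true_rates)), key=lambda a: wins[a] / pulls[a])
        pulls[arm] += 1
        wins[arm] += rng.random() < true_rates[arm]
    return pulls, wins

pulls, wins = epsilon_greedy(true_rates=[0.2, 0.5, 0.7])
proportions = [count / sum(pulls) for count in pulls]
```

Plotting `proportions` over time is one way to visualize how the trials concentrate on the best arm.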


Summary: Comparing websites
Goal: AB test using Bayes

Dataset: AB test
The fun part: Visualizing Bayes
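A Bayesian A/B test for two websites is often sketched with Beta posteriors over each conversion rate and Monte Carlo sampling to estimate the probability that B beats A. The uniform Beta(1, 1) prior, the click counts, and the function name below are assumptions for illustration.

```python
import random

def prob_b_beats_a(clicks_a, views_a, clicks_b, views_b, n_samples=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        # Posterior of a conversion rate: Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + clicks_a, 1 + views_a - clicks_a)
        rate_b = rng.betavariate(1 + clicks_b, 1 + views_b - clicks_b)
        wins += rate_b > rate_a
    return wins / n_samples

probability = prob_b_beats_a(clicks_a=50, views_a=1000, clicks_b=80, views_b=1000)
```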


Summary: Data distribution
Goal: Use of different distributions

Dataset: rainfall.csv
The fun part: the Gamma vs Normal distributions


Summary: Z-test 
Goal: Test statistic of A/B test

Dataset: csv
The fun part: the z-test function
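A z-test function for an A/B test is commonly written as a two-proportion test with a pooled standard error. The sketch below is an assumed version of such a function with made-up counts, not the notebook's own code.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z statistic, two-sided p-value) for H0: both rates are equal."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = two_proportion_z_test(successes_a=50, n_a=1000, successes_b=80, n_b=1000)
```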


Summary: Click through rate methodology and t-test function

	- Advantages: The purpose of click-through rates is to measure the ratio of clicks to impressions of an online ad or email marketing campaign. 

Dataset: NYT dataset 

The fun part: T-tests


Summary: Encapsulation programs
Goal: Understanding classes

	- Advantages: Object Oriented programming
The fun part: Learning another way to program


Summary: Exploratory data analysis (EDA) & logistic regression model
Goal: Visualizing the data

	- Advantages: Visualize the shape of our data
	- Downside: Takes some valuable time
	- Solution: It's worth visualizing our dataset before starting to do stats

Dataset: https://archive.ics.uci.edu/ml/machine-learning-databases & Iris.csv


Summary: DDL & DML
The fun part: DB


Summary: Dimensionality Reduction and ML 
The fun part: Winning the 'Honorable Mention' for the best algorithm presented at the event