/netcom_dataScience_dataAnalytics

This is the private repository for the course taught at the USPTO



Day 1


  • Lab And Intro to Python
  • Intermediate Python
  • Intro to Machine Learning: Team Data Science Life Cycle
  • Question on model leakage (see the Towards Data Science example)

In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment.
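A minimal sketch of target leakage (hypothetical data, not the course notebooks): a feature that is essentially a copy of the label sneaks into training, so the held-out score looks near-perfect even though that feature would not exist at prediction time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Outcome depends only weakly on X[:, 0], plus noise
y = (X[:, 0] + rng.normal(scale=2.0, size=500) > 0).astype(int)

# Leaky feature: a near-copy of the target -- information only known
# after the outcome, so it will not be available in production
leak = y + rng.normal(scale=0.01, size=500)
X_leaky = np.column_stack([X, leak])

X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
score = clf.score(X_te, y_te)  # near-perfect -- but only because of the leak
```

Dropping the leaky column brings the score back down to what the genuine features support, which is the honest estimate of production performance.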

  • R Code Example
> x <- c(1, 5, 4, 9, 0)
> typeof(x)
[1] "double"
> length(x)
[1] 5
> x <- c(1, 5.4, TRUE, "hello")
> x
[1] "1"     "5.4"   "TRUE"  "hello"
> typeof(x)
[1] "character"
  • Reversing a Python String
# reversing a string; use ''.join(reversed(s)) or negative-step slicing s[::-1]
'a string'[::-1]
''.join(reversed('a string'))

Day 2


References for Questions Asked:

Pandas and Numpy Notebooks

Numpy

Seaborn notebook

Day 3


References Related to Questions Asked:

Intro to ML in Python - Categorical / Numerical


SKLearn Con Notebooks:


[TODO] The figure module is out of date and needs to be updated; [patch 1] find ./ -type f -exec sed -i -e 's/from-missing-library/import fix/g' {} \;

  • Recap of the SKLearn API and which Estimator has which output

    • Some models gain a model.transform method once they are fit; the sklearn.preprocessing estimators have this, which lets you build processing pipelines and clean data at scale. Other models also expose .transform to surface findings that can help with visualization, etc.
    • Other models provide density estimation, a measure of how closely new data follows the structure the model learned. i.e. I see a new computer, it's missing installed security services and has out-of-date apps -> flagged as malicious
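The fit-then-transform pattern above can be sketched with a preprocessing estimator and a pipeline (iris is used here only as a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# After fit, preprocessing estimators expose .transform
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)  # zero mean, unit variance per column

# Chaining the fit/transform steps into one pipeline object
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)
```

The pipeline applies the scaler's transform automatically before the classifier, both at fit time and at predict time, which is what makes it safe to reuse at scale.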
  • stratified split explainer
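A quick sketch of what stratification buys you (toy imbalanced labels, not course data): without `stratify`, a random split can over- or under-sample the rare class; with it, both splits keep the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced labels: 10% positives
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the class ratio identical in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# y_tr.mean() and y_te.mean() are both 0.1, matching the full set
```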

  • Supervised Learning 1 - Classification

  • Includes Logistic Regression, a breakdown of which can be seen in 3 - Logistic Regression and Naive Bayes

  • There is a bug in the data for these notebooks. I'm either going to write a DataFrame generator that builds the data sets with NumPy so the missing data is no longer an issue, or rewrite every example using sklearn.datasets.make_* to generate sets with matching columns.

  • Where does the weight value come from in the OLS formula?

    • The weight referred to here is the coefficient each feature is multiplied by; OLS fits these coefficients (plus an intercept) to minimize the squared error.
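A small sketch on synthetic data (noiseless, so the fit is exact): generate y from known weights, then let OLS recover them as `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Generate y from known weights 3 and -2, with intercept 5
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

ols = LinearRegression().fit(X, y)
# ols.coef_ recovers [3, -2]; ols.intercept_ recovers 5
```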
  • Unsupervised Learning part one: PCA Dimension Reduction

    • How do I find the n_components for my data? The value can take several types: an int (exact number of components), a float between 0 and 1 (keep enough components to explain that fraction of the variance), or 'mle'; passing no argument keeps all components. Changing the value passed to PCA() and checking pca.explained_variance_ratio_ gives some insight into how much of the data the PCA model believes is captured by Component 1 and Component 2.
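The two ways of choosing n_components can be sketched side by side (iris is just a stand-in dataset here):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Explicit component count; inspect how much variance each one captures
pca = PCA(n_components=2).fit(X)
ratio = pca.explained_variance_ratio_  # fraction of variance per component

# A float keeps however many components are needed to reach that fraction
pca95 = PCA(n_components=0.95).fit(X)
# pca95.n_components_ reports the count PCA actually kept
```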
  • Clustering: why is the computed accuracy 0.0 and not 1.0, and how do we fix it?

    • What this question is getting at is that accuracy isn't the right score for clustering: none of the labels correspond in the example. Checking the confusion matrix shows every class lands in some slice [class0, class1, class2] that matches the data, but the cluster ids don't match the label ids. The model is colorblind to the label, but can see the hue.

    • How to fix this? Numerous ways: checking distance measures between clusters or rotating clusters can help, but some other model would have to come before this one. Instead of focusing on the labels, focus on the data points and which pairs of points stay grouped together after processing. The adjusted_rand_score in the example follows this methodology, and the score goes to 1.0.
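A toy illustration of the point above: the clusters are exactly right but the ids came out permuted, so accuracy is 0.0 while the pair-based adjusted Rand score is 1.0.

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score

y_true = np.array([0, 0, 1, 1, 2, 2])
# Same grouping of points, but the arbitrary cluster ids are permuted
y_pred = np.array([1, 1, 2, 2, 0, 0])

acc = accuracy_score(y_true, y_pred)        # 0.0 -- no ids line up
ari = adjusted_rand_score(y_true, y_pred)   # 1.0 -- every pairing preserved
```

adjusted_rand_score only asks whether pairs of points that share a cluster in `y_true` also share one in `y_pred`, so it is invariant to relabeling.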

    • Validation and Model Selection - pyData2015

    • Advanced SciKit Learn

    • SkLearn ROC Curve Visualization API

    • ROC Curves Explained

    • How to evaluate K-Modes Cluster

    • What does the '5' mean in CountVectorizer.vocabulary_?

      • This is a dictionary, as one of the students pointed out while helping me: the 5 is the positional index used to access the word's column. The .vocabulary_ value maps each unique word found to a unique token id (5 here); a valuable piece of information is len(.vocabulary_), the vocabulary size. It is the link between the data structure and the tokenized form: each word is assigned an arbitrary dimension (5).
      • In our example I was wrong: the value is the assigned position, not the count.
      • tokenized -> the whitespace and punctuation have been stripped.
      • Why use a sparse matrix? With a large vocabulary, most of the entries in the result are 0.
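The vocabulary-index point can be checked directly on a tiny corpus (made-up sentences, not the course data): the dictionary values are column positions assigned in sorted word order, not counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
vec = CountVectorizer().fit(docs)

# word -> column index (not a count); indices follow sorted word order,
# e.g. {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
vocab = vec.vocabulary_

# The transformed matrix is sparse: len(vocab) columns, mostly zeros
X = vec.transform(docs)
```

The counts themselves live in the matrix: `X[1, vocab['the']]` is 2, because "the" appears twice in the second document.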
  • Data Processing and Regressions - Titanic Case Study

    • Encoding data categorically with pd.get_dummies(data, columns=['list','to','mask'])
    • sklearn.impute is used here to fill in missing values before passing the data into the Random Forest
    • DummyClassifier counts how often it sees each class in y and predicts the majority class, just looking at y (a constant classifier).
    • scikit-learn's ColumnTransformer (stable since 0.20) handles datasets with multiple data types, cutting out some of the pandas lifting here.
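The get_dummies call above can be sketched on a made-up Titanic-like frame (these are illustrative rows, not the actual dataset): each category becomes its own indicator column, while numeric columns pass through untouched.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "embarked": ["S", "C", "S"],
    "fare": [7.25, 71.28, 8.05],
})

# One 0/1 indicator column per category; 'fare' is left as-is
encoded = pd.get_dummies(df, columns=["sex", "embarked"])
# columns: fare, sex_female, sex_male, embarked_C, embarked_S
```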
  • Logistic Regression and Naive Bayes

    • Here is where, on one run, the text example failed; it shows off a Bayesian-type model on the 20_newsgroups dataset, predicting labels for new text.
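The same idea in miniature, without downloading 20_newsgroups (the documents and labels below are invented for illustration): vectorize text, fit multinomial Naive Bayes, then predict the label of unseen text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["free money now", "cheap meds online", "meeting at noon", "lunch tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

# Vectorizer + NB chained, so raw strings go in and labels come out
model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, labels)
pred = model.predict(["free meds"])  # both words only occur in spam docs
```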
  • KNN

  • Cross Validation -> splitting the data multiple ways, with different percentages held out, to see how that affects training and test results.
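Cross-validation in one call (iris as a stand-in dataset): `cv=5` fits the model on five different train/test splits and returns five scores instead of one.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Five different train/test splits -> five scores, not one
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
# scores.mean() estimates generalization; scores.std() shows split sensitivity
```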

Day 4


  • Question Recap
  • 14-model-complexity and grid search
  • 15-Pipelining Estimators
  • 16-Performance Metrics and Model Eval
  • 17-In Depth Linear
  • 18-In Depth Tree and Forests
  • 19-Feature Selection
  • 20-Hierarchical and Density Clustering

  • Ensemble Models

  • NLTK ( 2-1 -> maybe 5-1 )
  • Association Rules

Day 5