netcom_dataScience_dataAnalytics
Day 1
- Lab And Intro to Python
- Intermediate Python
- Intro to Machine Learning: Team Data Science Life Cycle
- Question: Pertaining to model leakage:
In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment.
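A minimal sketch of the most common leakage pattern (synthetic data; all names and values here are mine, not from the course): fitting preprocessing on the full dataset before splitting lets test-set statistics leak into training. The fix is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# Leaky pattern: the scaler sees every row before the split,
# so train-time features depend on test-set statistics
X_leaky = StandardScaler().fit_transform(X)

# Leak-free pattern: split first, fit the scaler on train only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
print(clf.score(scaler.transform(X_te), y_te))
```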
- R Code Example
> x <- c(1, 5, 4, 9, 0)
> typeof(x)
[1] "double"
> length(x)
[1] 5
> x <- c(1, 5.4, TRUE, "hello")
> x
[1] "1" "5.4" "TRUE" "hello"
> typeof(x)
[1] "character"
- Reversing a Python String
# reversing a string; I forgot you can use ''.join(reversed(s)) or negative-step slicing with [::-1]
'a string'[::-1]
''.join(reversed('a string'))
Day 2
References for Questions Asked:
- Pandas Apply Function
- PyCon on youtube: Hacking Nintendo Game
- Anatomy of Matplotlib Youtube SciPy 2018
- Python's Infamous GIL
- The Gilectomy
- The Gilectomy: How It's Going
- Thinking Outside the GIL with AsyncIO
Pandas and Numpy Notebooks
Numpy
- Numpy Array Ops Docs
- Numpy Tips and Tricks
- 3Blue1Brown YouTube
- Cleaning Data in Pandas Daniel Chen PyData 2018
- Advanced Numpy
- Pandas vs Koalas
Seaborn notebook
Day 3
References Related to Questions Asked:
- Virtual Env Management - Stack Overflow Dependency Hell
- Virtual Env Management - Reddit Answers
- A quote from my friend who is a python instructor as well:
I use venv for everything I'm going to show off, but my poor global state is massive and conflicts all the time
- Crash Course in Applied Linear Algebra
- Time Series Focused PostgreSQL TimeScaleDB
- Blow the server up and still survive CockroachDB
- Distributed Databases on a Raspberry Pi Stalking a City for Frivolity
- Combine Data Lineage with End-to-End Pipelines Pachyderm
- Container Metrics Prometheus
- Microservices Consideration Orchestrating Chaos - QCon
- AWS SAM Templates for Microservices Bootstrap Testing SAM template github
- CookieCutter Templating for Project Distributing Cookie Cutter github
- Consuming Models in PowerBi Azure ML -> PowerBI
- Create R based visuals in PowerBi
Intro to ML in Python - Categorical / Numerical
- How To Lie With Statistics Darrell Huff:
@google: how to lie with statistics
- Winning With Simple, Even Linear Models
- Statistics Done Wrong - Alex Reinhart
- Stats for Hackers: Vanderplas Youtube
- Statistical Thinking For Data Scientists
- All About That Bayes
- Everything Wrong with Statistics and How To Fix It
SKLearn Con Notebooks:
[TODO] The figure modules are out of date and need to be updated;
[patch 1] find ./ -type f -exec sed -i -e 's/from-missing-library/import fix/g' {} \;
- Recap of the SKLearn API and which Estimator has which output
  - Some models do have the `model.transform` method attached to them once they are fit; `sklearn.preprocessing` has this, enabling you to create processing pipelines and clean data at scale. Other models will also have `.transform`, exposing more findings by the model that may help for visualization, etc.
  - Other models will give you density estimation, which is a measure of how closely the data follows the structure the model learned. i.e. I see a new computer, it's missing installed security services, has out-of-date apps -> evil.
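The fit-then-transform pattern above is what makes pipelines composable; a minimal sketch with generated data (dataset and parameters are mine, not from the course notebooks):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

scaler = StandardScaler().fit(X)   # fit learns the column means/stds
Xs = scaler.transform(X)           # transform applies them to any data

# the same estimators chain into a pipeline: each step's transform
# feeds the next step, and the final estimator does the predicting
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)
print(pipe.score(X, y))
```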
- Stratified split explainer: includes Logistic Regression, a breakdown of which can be seen in 3 - Logistic Regression and Naive Bayes
- There is a bug in the data for these notebooks. I'm either going to write a DF generator to handle making the datasets with NP so the missing data is no longer an issue, or write every example using `sklearn.datasets.make_*` to generate the sets with matching columns.
- Where does the weight value come from in the OLS formula?
  - The weight we are referring to here is the coefficient the data is multiplied by.
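The OLS weight can be recovered directly with NumPy's least-squares solver; a minimal sketch on synthetic data (the true slope 3.0 and intercept 2.0 are values I made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=50)  # y = w*x + b + noise

# lstsq solves for [w, b] against the design matrix [x, 1],
# minimizing the squared error — this is where the weight comes from
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b)  # close to the true 3.0 and 2.0
```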
- Unsupervised Learning part one: PCA Dimension Reduction
  - How do I find the `n_components` for my data? The value can take many Python data types, but passing in no arguments uses the baked-in logic from sklearn to figure out what's optimal. Changing the values used in `PCA()` and checking `pca.explained_variance_ratio_` will let you gather some insight into how much of the data the PCA model believes is captured by Component 1 and Component 2.
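A minimal sketch of that inspection loop (data is synthetic and mine; note the trailing underscore on the fitted attribute `explained_variance_ratio_`):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# three features where most variance lives along one direction:
# the second column is a scaled copy of the first plus small noise
base = rng.normal(size=(200, 1))
X = np.hstack([base,
               base * 2 + rng.normal(scale=0.1, size=(200, 1)),
               rng.normal(scale=0.1, size=(200, 1))])

pca = PCA().fit(X)                    # no n_components: keep them all
print(pca.explained_variance_ratio_)  # fraction of variance per component
```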
- Clustering: Why is the computed accuracy 0.0 and not 1.0, and how to fix it?
  - What this question is trying to get across is that accuracy isn't the score to use for clustering; none of the labels correspond in the example. Checking the confusion matrix shows all the class labels are classified by the slice `[class0, class1, class2]`, which matches the data; however, cluster membership doesn't match the labels. The model is colorblind to the label, but can see the hue.
  - How to fix this? Numerous distance measures between clusters can be checked, and rotating clusters can help, but some other model would have to come before this one. Instead of focusing on the labels, focus on the data points and pairs of data points that are preserved post-processing. The `adjusted_rand_score` in the example follows this methodology, and the score goes to `1.0`.
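A tiny sketch of the label-permutation problem (toy labels of my own, not the course data): the partitions are identical, but the cluster IDs are arbitrary, so accuracy fails where a pair-based score succeeds.

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 2, 2, 0, 0]  # same grouping, permuted cluster IDs

print(accuracy_score(y_true, y_pred))       # 0.0 — every label "wrong"
print(adjusted_rand_score(y_true, y_pred))  # 1.0 — identical partition
```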
- What does the '5' mean in `CountVectorizer.vocabulary_`?
  - This is a dictionary structure, as one of the students pointed out while trying to help me; the 5 is the position used to access the word in the dictionary. The `.vocabulary_` value represents the unique list of words found, each assigned a unique token (5 here). Valuable information here is `len(.vocabulary_)`: it is the link between the data structure and the tokenized form. Each word is assigned an arbitrary dimension (5). In our example, I was wrong, and this is the assigned position, not the count.
  - tokenized -> the white space and punctuation have been stripped.
  - Why use a sparse matrix? A lot of the vocabulary results will have 0's as results given a large dictionary.
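A small sketch of this (my own toy documents): `vocabulary_` maps each token to a column index, while the counts live in the sparse matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
vec = CountVectorizer()
X = vec.fit_transform(docs)   # sparse matrix: documents x vocabulary

print(vec.vocabulary_)        # token -> column position, NOT a count
print(len(vec.vocabulary_))   # vocabulary size == number of columns
print(X.toarray())            # the actual counts live here
```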
- Data Processing and Regressions - Titanic Case Study
  - Encoding data categorically with `pd.get_dummies(data, columns=['list','to','mask'])`
  - `sklearn.impute` is used here to fill in missing values in the data before passing it into the Random Forest.
  - `DummyClassifier` will count the number of times it sees [0, 1] and predict the majority class, just looking at `y`; baccarat, or a Constant Classifier.
  - The development branch of SKLearn has a ColumnTransformer that handles multiple-data-type datasets, cutting out some of the `pandas` lift here.
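A minimal sketch of the encoding-plus-baseline idea (a toy frame of my own, not the real Titanic data): one-hot encode the categoricals with `get_dummies`, then sanity-check against a majority-class `DummyClassifier`.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

df = pd.DataFrame({"sex": ["male", "female", "female", "male", "male"],
                   "pclass": [3, 1, 2, 3, 2],
                   "survived": [0, 1, 1, 0, 0]})

# one column per category level; the target is dropped from the features
X = pd.get_dummies(df, columns=["sex", "pclass"]).drop(columns="survived")
y = df["survived"]

# baseline: ignores X entirely and always predicts the majority class of y
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.predict(X))
```

Any real model should beat this baseline; if it doesn't, the features are adding nothing.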
- Logistic Regression and Naive Bayes
  - Here is where, the one time I ran it, the text example failed; it shows off a Bayesian-type model and the `20_newsgroups` dataset to predict the labels based on new text.
- KNN
- Cross Validation -> splitting in multiple ways, with different percentages of the data, to see how that affects training and test results.
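That splitting idea can be sketched in a few lines with `cross_val_score` (iris and KNN are my stand-ins for the course data and model):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# cv=5: five different train/test splits, one accuracy score per fold;
# the spread across folds shows how sensitive the model is to the split
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores)
print(scores.mean())
```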
Day 4
- Question Recap
- 14-model-complexity and grid search
- 15-Pipelining Estimators
- 16-Performance Metrics and Model Eval
- 17-In Depth Linear
- 18-In Depth Tree and Forests
- 19-Feature Selection
- 20-Hierarchical and Density Clustering
- Ensemble Models
- NLTK ( 2-1 -> maybe 5-1 )
- Association Rules
Day 5
- ANN / Perceptron Build
- Hadoop / Spark-PySpark
- Demo
- References and Resources Mentioned:
- python main website
- Keynote: Guido van Rossum
- Keynote: Perry Greenfield How Python Found its way into Astronomy
- violent python in python 3
- Interesting Google Dork
(@google:(github:violent python) & (filetype:pdf))
- Regex By Al Sweigart
- Engineer Man on Youtube's Python Series
- Uncle Bob Martin: The Future of Programming YouTube
- Probabilistic Programming and Bayesian Modeling with PyMC3
- Ten Ways To Fizz Buzz Joel Grus
- RaspberryPi Python Games