/netcom_dataScience_dataAnalytics

This is the private repository for the course taught at the USPTO



Day 1


  • Lab And Intro to Python
  • Intermediate Python
  • Intro to Machine Learning: Team Data Science Life Cycle
  • Question on model leakage (see the Towards Data Science example)

In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment.
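A minimal sketch of target leakage (hypothetical data, not the course notebooks): a feature that is essentially a copy of the label sneaks into training, so the held-out score looks near-perfect even though that feature would not exist at prediction time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Outcome depends only weakly on X[:, 0], plus noise
y = (X[:, 0] + rng.normal(scale=2.0, size=500) > 0).astype(int)

# Leaky feature: a near-copy of the target -- information only known
# after the outcome, so it will not be available in production
leak = y + rng.normal(scale=0.01, size=500)
X_leaky = np.column_stack([X, leak])

X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
score = clf.score(X_te, y_te)  # near-perfect -- but only because of the leak
```

Dropping the leaky column brings the score back down to what the genuine features support, which is the honest estimate of production performance.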

  • R Code Example
> x <- c(1, 5, 4, 9, 0)
> typeof(x)
[1] "double"
> length(x)
[1] 5
> x <- c(1, 5.4, TRUE, "hello")
> x
[1] "1"     "5.4"   "TRUE"  "hello"
> typeof(x)
[1] "character"
  • Reversing a Python String
# reversing a string; use ''.join(reversed(s)) or negative-step slicing s[::-1]
'a string'[::-1]
''.join(reversed('a string'))

Day 2


References for Questions Asked:

Pandas and Numpy Notebooks

Numpy

Seaborn notebook

Day 3


References Related to Questions Asked:

Intro to ML in Python - Categorical / Numerical


SKLearn Con Notebooks:


[TODO] The figure module is out of date and needs to be updated; [patch 1] find ./ -type f -exec sed -i -e 's/from-missing-library/import fix/g' {} \;

  • Recap of the SKLearn API and which Estimator has which output

    • Some models gain a model.transform method once they are fit; the sklearn.preprocessing estimators have this, which lets you build processing pipelines and clean data at scale. Other models also expose .transform to surface findings that can help with visualization, etc.
    • Other models provide density estimation, a measure of how closely new data follows the structure the model learned. i.e. I see a new computer, it's missing installed security services and has out-of-date apps -> flagged as malicious
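The fit-then-transform pattern above can be sketched with a preprocessing estimator and a pipeline (iris is used here only as a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# After fit, preprocessing estimators expose .transform
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)  # zero mean, unit variance per column

# Chaining the fit/transform steps into one pipeline object
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)
```

The pipeline applies the scaler's transform automatically before the classifier, both at fit time and at predict time, which is what makes it safe to reuse at scale.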
  • stratified split explainer
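A quick sketch of what stratification buys you (toy imbalanced labels, not course data): without `stratify`, a random split can over- or under-sample the rare class; with it, both splits keep the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced labels: 10% positives
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the class ratio identical in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# y_tr.mean() and y_te.mean() are both 0.1, matching the full set
```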

  • Supervised Learning 1 - Classification

  • Includes Logistic Regression, a breakdown of which can be seen in 3 - Logistic Regression and Naive Bayes

  • There is a bug in the data for these notebooks. I'm either going to write a DataFrame generator that builds the data sets with NumPy so the missing data is no longer an issue, or rewrite every example using sklearn.datasets.make_* to generate sets with matching columns.

  • Where does the weight value come from in the OLS formula?

    • The weight referred to here is the coefficient each feature is multiplied by; OLS fits these coefficients (plus an intercept) to minimize the squared error.
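A small sketch on synthetic data (noiseless, so the fit is exact): generate y from known weights, then let OLS recover them as `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Generate y from known weights 3 and -2, with intercept 5
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

ols = LinearRegression().fit(X, y)
# ols.coef_ recovers [3, -2]; ols.intercept_ recovers 5
```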
  • Unsupervised Learning part one: PCA Dimension Reduction

    • How do I find the n_components for my data? The value can take several types: an int (exact number of components), a float between 0 and 1 (keep enough components to explain that fraction of the variance), or 'mle'; passing no argument keeps all components. Changing the value passed to PCA() and checking pca.explained_variance_ratio_ gives some insight into how much of the data the PCA model believes is captured by Component 1 and Component 2.
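The two ways of choosing n_components can be sketched side by side (iris is just a stand-in dataset here):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Explicit component count; inspect how much variance each one captures
pca = PCA(n_components=2).fit(X)
ratio = pca.explained_variance_ratio_  # fraction of variance per component

# A float keeps however many components are needed to reach that fraction
pca95 = PCA(n_components=0.95).fit(X)
# pca95.n_components_ reports the count PCA actually kept
```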
  • Clustering: why is the computed accuracy 0.0 and not 1.0, and how do we fix it?

    • What this question is getting at is that accuracy isn't the right score for clustering: none of the labels correspond in the example. Checking the confusion matrix shows every class lands in some slice [class0, class1, class2] that matches the data, but the cluster ids don't match the label ids. The model is colorblind to the label, but can see the hue.

    • How to fix this? Numerous ways: checking distance measures between clusters or rotating clusters can help, but some other model would have to come before this one. Instead of focusing on the labels, focus on the data points and which pairs of points stay grouped together after processing. The adjusted_rand_score in the example follows this methodology, and the score goes to 1.0.
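A toy illustration of the point above: the clusters are exactly right but the ids came out permuted, so accuracy is 0.0 while the pair-based adjusted Rand score is 1.0.

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score

y_true = np.array([0, 0, 1, 1, 2, 2])
# Same grouping of points, but the arbitrary cluster ids are permuted
y_pred = np.array([1, 1, 2, 2, 0, 0])

acc = accuracy_score(y_true, y_pred)        # 0.0 -- no ids line up
ari = adjusted_rand_score(y_true, y_pred)   # 1.0 -- every pairing preserved
```

adjusted_rand_score only asks whether pairs of points that share a cluster in `y_true` also share one in `y_pred`, so it is invariant to relabeling.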

    • Validation and Model Selection - pyData2015

    • Advanced SciKit Learn

    • SkLearn ROC Curve Visualization API

    • ROC Curves Explained

    • How to evaluate K-Modes Cluster

    • What does the '5' mean in CountVectorizer.vocabulary_?

      • This is a dictionary, as one of the students pointed out while helping me: the 5 is the positional index used to access the word's column. The .vocabulary_ value maps each unique word found to a unique token id (5 here); a valuable piece of information is len(.vocabulary_), the vocabulary size. It is the link between the data structure and the tokenized form: each word is assigned an arbitrary dimension (5).
      • In our example I was wrong: the value is the assigned position, not the count.
      • tokenized -> the whitespace and punctuation have been stripped.
      • Why use a sparse matrix? With a large vocabulary, most of the entries in the result are 0.
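The vocabulary-index point can be checked directly on a tiny corpus (made-up sentences, not the course data): the dictionary values are column positions assigned in sorted word order, not counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
vec = CountVectorizer().fit(docs)

# word -> column index (not a count); indices follow sorted word order,
# e.g. {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
vocab = vec.vocabulary_

# The transformed matrix is sparse: len(vocab) columns, mostly zeros
X = vec.transform(docs)
```

The counts themselves live in the matrix: `X[1, vocab['the']]` is 2, because "the" appears twice in the second document.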
  • Data Processing and Regressions - Titanic Case Study

    • Encoding data categorically with pd.get_dummies(data, columns=['list','to','mask'])
    • sklearn.impute is used here to fill in missing values before passing the data into the Random Forest
    • DummyClassifier counts how often it sees each class in y and predicts the majority class, just looking at y (a constant classifier).
    • scikit-learn's ColumnTransformer (stable since 0.20) handles datasets with multiple data types, cutting out some of the pandas lifting here.
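The get_dummies call above can be sketched on a made-up Titanic-like frame (these are illustrative rows, not the actual dataset): each category becomes its own indicator column, while numeric columns pass through untouched.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "embarked": ["S", "C", "S"],
    "fare": [7.25, 71.28, 8.05],
})

# One 0/1 indicator column per category; 'fare' is left as-is
encoded = pd.get_dummies(df, columns=["sex", "embarked"])
# columns: fare, sex_female, sex_male, embarked_C, embarked_S
```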
  • Logistic Regression and Naive Bayes

    • Here is where, on one run, the text example failed; it shows off a Bayesian-type model on the 20_newsgroups dataset, predicting labels for new text.
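The same idea in miniature, without downloading 20_newsgroups (the documents and labels below are invented for illustration): vectorize text, fit multinomial Naive Bayes, then predict the label of unseen text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["free money now", "cheap meds online", "meeting at noon", "lunch tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

# Vectorizer + NB chained, so raw strings go in and labels come out
model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, labels)
pred = model.predict(["free meds"])  # both words only occur in spam docs
```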
  • KNN

  • Cross Validation -> splitting the data multiple ways, with different percentages held out, to see how that affects training and test results.
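Cross-validation in one call (iris as a stand-in dataset): `cv=5` fits the model on five different train/test splits and returns five scores instead of one.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Five different train/test splits -> five scores, not one
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
# scores.mean() estimates generalization; scores.std() shows split sensitivity
```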

Day 4


  • Question Recap
  • 14-model-complexity and grid search
  • 15-Pipelining Estimators
  • 16-Performance Metrics and Model Eval
  • 17-In Depth Linear
  • 18-In Depth Tree and Forests
  • 19-Feature Selection
  • 20-Hierarchical and Density Clustering

  • Ensemble Models

  • NLTK ( 2-1 -> maybe 5-1 )
  • Association Rules

Day 5