DataScienceMaterial

Data Science with Python

Data Generation

https://drawdata.xyz/
drawdata module
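
You can sketch a toy dataset at https://drawdata.xyz/ (or with the drawdata module inside a notebook) and download it as a CSV. A minimal sketch of loading such an export with pandas follows; the file name and the column names ("x", "y", "z") are assumptions, adjust them to match your export.

```python
import pandas as pd

# Load a dataset drawn on https://drawdata.xyz/ and exported as CSV.
# "drawn.csv" and the column names are assumptions for illustration.
df = pd.read_csv("drawn.csv")
print(df.head())
print(df["z"].value_counts())  # "z" is assumed to hold the drawn group/label
```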

DataSets

https://files.zillowstatic.com/research/public_csvs/zhvi/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv
https://github.com/CodeSolid/CodeSolid.github.io/raw/main/booksource/data/AnalyticsSnapshot.xlsx
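
Both datasets can be read straight from their URLs with pandas, as in the sketch below (reading the Excel file requires the openpyxl package).

```python
import pandas as pd

# Zillow Home Value Index (ZHVI) by metro area, read directly from the public CSV URL.
zhvi = pd.read_csv(
    "https://files.zillowstatic.com/research/public_csvs/zhvi/"
    "Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv"
)

# Sample analytics workbook from the CodeSolid book source (needs openpyxl installed).
snapshot = pd.read_excel(
    "https://github.com/CodeSolid/CodeSolid.github.io/raw/main/booksource/data/AnalyticsSnapshot.xlsx"
)

print(zhvi.shape, snapshot.shape)
```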

Python for Data Science

  1. Python Fundamentals & Jupyter notebook

  2. Data Science Libraries

    • Data Mining
      • Scrapy
      • BeautifulSoup
    • Data Processing & Modelling
      • NumPy
      • SciPy
      • Pandas
      • Keras
      • Scikit-learn
      • PyTorch
      • TensorFlow
      • XGBoost
      • NLTK
      • Gensim
    • Data Visualization
      • Matplotlib
      • Seaborn
      • Bokeh
      • Plotly
      • pydot
  3. Machine Learning

    a. Supervised Learning

    1. Classification is used to predict the outcome of a given sample when the output variable is in the form of categories. A classification model might look at the input data and try to predict labels like “sick” or “healthy.”

    2. Regression is used to predict the outcome of a given sample when the output variable is in the form of real values. For example, a regression model might process input data to predict the amount of rainfall or the height of a person. Linear Regression, Logistic Regression, CART, Naïve Bayes, and K-Nearest Neighbors (KNN) are all examples of supervised learning algorithms.

    3. Ensembling is another type of supervised learning. It means combining the predictions of multiple machine learning models that are individually weak to produce a more accurate prediction on a new sample. Bagging with Random Forests and Boosting with XGBoost are examples of ensemble techniques (see the scikit-learn sketches after this outline).

    b. Unsupervised Learning

    1. Association is used to discover the probability of the co-occurrence of items in a collection. It is used extensively in market-basket analysis. For example, an association model might discover that if a customer purchases bread, they are 80% likely to also purchase eggs.

    2. Clustering is used to group samples such that objects within the same cluster are more similar to each other than to the objects from another cluster.

    3. Dimensionality Reduction is used to reduce the number of variables in a data set while ensuring that important information is still conveyed. It can be done with Feature Extraction methods and Feature Selection methods. Feature Selection selects a subset of the original variables; Feature Extraction transforms the data from a high-dimensional space to a low-dimensional space. For example, Principal Component Analysis (PCA) is a Feature Extraction approach (see the unsupervised sketch after this outline).

    c. Reinforcement Learning

  4. Mini-Projects

    a. Data Cleaning Project

    b. Data Visualization Project

    c. Machine Learning Project
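
A minimal scikit-learn sketch of the supervised ideas above (classification, regression, and a Random Forest ensemble), using built-in toy datasets so it runs as-is; the dataset choices and model parameters are illustrative, not prescriptive.

```python
from sklearn.datasets import load_iris, load_diabetes
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Classification: predict a categorical label (iris species) from measurements.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("KNN accuracy:", knn.score(X_test, y_test))

# Ensembling: many individually weak trees bagged into a stronger Random Forest.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("Random Forest accuracy:", forest.score(X_test, y_test))

# Regression: predict a real-valued target (disease progression) from features.
Xr, yr = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, random_state=0)
reg = LinearRegression().fit(Xr_train, yr_train)
print("Linear regression R^2:", reg.score(Xr_test, yr_test))
```

And a matching unsupervised sketch: k-means clustering and PCA for dimensionality reduction on the same iris features (the number of clusters and components are assumptions chosen for illustration).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Clustering: group samples so points within a cluster are more similar to each other.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])

# Dimensionality reduction: project 4 features onto 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```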

Data Processing Pipeline

CSV -> load into a pandas DataFrame -> data cleansing -> transformations -> golden record in a database
-> data visualization
	- Python         --> Matplotlib / Seaborn / Plotly
	- JavaScript     --> Flask / Django / FastAPI  --->  D3.js / Highcharts
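
A minimal sketch of the Python side of this pipeline: load a CSV into pandas, clean and transform it, write the golden record to a database, and plot it. The file name, column names, and the SQLite database are assumptions used only for illustration.

```python
import sqlite3

import matplotlib.pyplot as plt
import pandas as pd

# Load: read the raw CSV into a pandas DataFrame (file name is an assumption).
df = pd.read_csv("sales_raw.csv")

# Cleanse: drop duplicates and rows missing the assumed "amount" column.
df = df.drop_duplicates().dropna(subset=["amount"])

# Transform: aggregate to one "golden record" per month (column names are assumptions).
df["date"] = pd.to_datetime(df["date"])
golden = df.groupby(df["date"].dt.to_period("M"))["amount"].sum().reset_index()
golden["date"] = golden["date"].astype(str)

# Persist: write the golden record to a database (SQLite here, for simplicity).
with sqlite3.connect("warehouse.db") as conn:
    golden.to_sql("monthly_sales", conn, if_exists="replace", index=False)

# Visualize: a quick Matplotlib chart; Seaborn/Plotly or a JS front end could replace this.
golden.plot(x="date", y="amount", kind="bar", legend=False)
plt.ylabel("amount")
plt.tight_layout()
plt.show()
```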

Kaggle

Classification preprocessing
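
A minimal sketch of typical preprocessing for a classification problem (imputing missing values, scaling numeric columns, one-hot encoding categoricals) wired into a scikit-learn Pipeline; the toy data, column names, and the logistic regression model are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed toy data: numeric "age"/"fare", categorical "sex", binary target "survived".
df = pd.DataFrame({
    "age": [22, 38, None, 35, 54, 2],
    "fare": [7.25, 71.3, 7.9, 53.1, 51.9, 21.1],
    "sex": ["male", "female", "female", "female", "male", "male"],
    "survived": [0, 1, 1, 1, 0, 0],
})
X, y = df.drop(columns="survived"), df["survived"]

# Preprocess numeric and categorical columns differently, then fit a classifier.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age", "fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```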

Learning Data Science

ML version control system (DVC): https://dvc.org/