DeepDataMiningLearning

Data mining, machine learning, and deep learning sample code for SJSU CMPE255 Data Mining (Fall 2023 SJSU Official Syllabus) and CMPE258 Deep Learning (Fall 2023 SJSU Official Syllabus).

  • Some Google Colab examples require an SJSU Google account to view
  • The Large Language Models (LLMs) part is newly added
  • You can also view the documents in: readthedocs

Edits made

For inference, I created the following two files. I focused mainly on TorchScript, but I also wanted to try out TensorRT.

DeepDataMiningLearning/detection/torchscript_model.py
DeepDataMiningLearning/detection/tensorrt_model.py
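
As a rough illustration of the TorchScript workflow mentioned above, here is a minimal sketch that scripts a small module, saves it, reloads it, and compares its output with the eager model. It is not the code in torchscript_model.py: `TinyNet` and `roundtrip` are hypothetical names, and the real YOLOv8 modules are far larger and may need changes before `torch.jit.script` accepts them.

```python
import io

import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in model; NOT the repository's YOLOv8 network."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.fc(x))


def roundtrip(model: nn.Module, example: torch.Tensor):
    """Script the model, save and reload it in memory, and run both versions."""
    model.eval()
    scripted = torch.jit.script(model)   # compile the module to TorchScript
    buffer = io.BytesIO()
    torch.jit.save(scripted, buffer)     # a file path would also work here
    buffer.seek(0)
    loaded = torch.jit.load(buffer)      # reload without the Python class
    with torch.no_grad():
        return model(example), loaded(example)


if __name__ == "__main__":
    torch.manual_seed(0)
    eager_out, scripted_out = roundtrip(TinyNet(), torch.randn(3, 4))
    print(torch.allclose(eager_out, scripted_out))
```

Comparing the eager and reloaded outputs on the same input is a cheap sanity check that the export did not change the model's behavior.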

I edited the following YOLOv8 module files to try to get TorchScript to work:

DeepDataMiningLearning/detection/modules/block.py
DeepDataMiningLearning/detection/modules/head.py
DeepDataMiningLearning/detection/modules/tal.py

I also made some minor bug fixes to the following files, mainly pertaining to file paths:

DeepDataMiningLearning/detection/dataset.py
DeepDataMiningLearning/detection/modules/yolomodels.py
DeepDataMiningLearning/detection/utils.py

Setup

Optionally, install this Python package via:

% python3 -m pip install flit
% flit install --symlink

See "docs/python.rst" for a detailed description of the Python package.

Start Jupyter Lab on the local machine:

jupyter lab --ip 0.0.0.0 --no-browser --allow-root

Sphinx docs

After activating the Python virtual environment, use the 'sphinx-build' command to build the documentation:

   % pip install -r requirements.txt
   % sphinx-build docs ./docs/build
   # check the integrity of all internal and external links:
   % sphinx-build docs -W -b linkcheck -d docs/build/doctrees docs/build/html

The generated HTML files are in the "docs/build" folder. You can also view the documents in: readthedocs

Python Data Analytics

Basic python tutorials, numpy, Pandas, data visualization and EDA

Python data apps based on streamlit: streamlittest

Cloud Data Analytics

  • Data Mining based on Google Cloud:
    • Google Cloud access via Colab: colablink
      • Configure Gcloud, Google Cloud Storage, Compute Engine, Colab Terminal
    • Google BigQuery with Colab/Jupyter introduction BigQuery-intro.ipynb -- colablink
      • Natality dataset and Weather data from Google BigQuery
    • COVID19 Data EDA and Visualization based on Google BigQuery (Fall 2022 updated): colablink
      • COVID NYT data, COVID-19 JHU data
    • Additional Google BigQuery examples: colablink
      • Chicago Crime Dataset, Austin Waste Dataset, COVID Racial Dataset (race graph)
    • BigQuery ML examples: colablink
      • COVID, CREDIT_CARD_FRAUD, Predict penguin weight, Natality, US Census Dataset Classification, time-series forecasting from Google Analytics data

Machine Learning Algorithm

  • Machine Learning introduction:
    • MLIntro-Regression -- colablink
    • MLIntro-RegressionSKLearn -- colablink
    • MLIntro2-classification.ipynb -- colablink
      • Breast Cancer Dataset, Iris Dataset, BigQuery US Census Income Dataset, multiple classifiers.
    • DecisionTree -- colablink
      • SKlearn DecisionTree algorithm on the Iris dataset, Breast Cancer Dataset, Make moons dataset, and DecisionTreeRegressor. A brief discussion of Gini Impurity.
    • GradientBoosting -- colablink
      • Gradient boosting process, Gradient boosting regressor with scikit-learn, Gradient boosting classifier with scikit-learn
    • XGBoost -- colablink
      • XGBoost introduction, US Census Income Dataset from BigQuery, UCI Dermatology dataset
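
The Gini impurity mentioned in the DecisionTree notebook can be sketched in a few lines of plain Python. This is an illustrative sketch, not the notebook's code: the function names `gini_impurity` and `weighted_split_impurity` are made up here, and scikit-learn computes the same quantity internally in a more optimized way.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_split_impurity(left, right):
    """Impurity of a candidate split: size-weighted average of both sides."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


if __name__ == "__main__":
    mixed = ["a", "a", "b", "b"]
    print(gini_impurity(mixed))                                # evenly mixed node
    print(weighted_split_impurity(["a", "a"], ["b", "b"]))     # perfectly pure split
```

A decision tree picks, at each node, the split with the lowest weighted impurity; a value of 0 means every child node contains a single class.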

Deep Learning

Deep learning notebooks (viewing them through the Colab link is recommended)

New deep learning sample code based on PyTorch (under the "DeepDataMiningLearning" folder)

  • PyTorch single-GPU image classification with/without automatic mixed precision (AMP) training: singleGPU
  • PyTorch multi-GPU DDP test: testTorchDDP
  • PyTorch multi-GPU image classification: multiGPU
  • PyTorch Torchvision image classification (EfficientNet) notebook on HPC: torchvisionHPC.ipynb
  • PyTorch Torchvision vision transformer (ViT) notebook on HPC: torchvisionvitHPC.ipynb
  • PyTorch ViT implementation from scratch on HPC: ViTHPC.ipynb
  • PyTorch ImageNet classification example: imagenet
  • PyTorch inference example for top-k classes: inference.py
  • TIMM models: testtimm.ipynb
  • Huggingface Images via Transformers: huggingfaceimage.ipynb
  • Siamese network: siamese_network
  • TensorRT example: tensorrt.ipynb
  • Advanced Image Classification: githubrepo
    • General-purpose, all-in-one image classification framework for TensorFlow and PyTorch
    • Support for multiple datasets: imagenet_blurred, tiny-imagenet-200, hymenoptera_data, CIFAR10, MNIST, flower_photos
    • Support for multiple custom models ('mlpmodel1', 'lenet', 'alexnet', 'resnetmodel1', 'customresnet', 'vggmodel1', 'vggcustom', 'cnnmodel1'), all models from Torchvision and TorchHub
    • Support HPC training and evaluation
  • Object detection (other repo)
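
The top-k inference step listed above (inference.py) boils down to turning raw model scores into probabilities and keeping the k largest. The sketch below shows that idea with the standard library only; `softmax` and `top_k` are illustrative names, the logits and class names are invented, and the actual script presumably works on PyTorch tensors instead.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, class_names, k=3):
    """Return the k most probable (class_name, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(class_names[i], probs[i]) for i in ranked[:k]]


if __name__ == "__main__":
    # Made-up scores for four made-up classes.
    scores = [2.0, 0.5, 3.0, -1.0]
    names = ["cat", "dog", "bird", "fish"]
    for name, prob in top_k(scores, names, k=2):
        print(f"{name}: {prob:.3f}")
```

Subtracting the maximum score before exponentiating does not change the result but avoids overflow for large logits.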

Unsupervised Learning

  • Unsupervised Learning Jupyter notebooks
    • PCA: colablink
      • Numpy/SKlearn SVD, PCA for digits and noise filtering, eigenfaces, PCA vs LDA vs NCA
    • Manifold Learning: colablink
      • Multidimensional Scaling (MDS), Locally Linear Embedding (LLE), Isomap Embedding, T-distributed Stochastic Neighbor Embedding for HELLO, S-Curve, and Swiss roll dataset; Isomap on Faces; Regression with Manifold Learning
    • Clustering: colablink
      • K-Means, Gaussian Mixture Models, Spectral Clustering, DBSCAN
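
For readers new to the clustering notebook, the K-Means loop (assign each point to its nearest centroid, then recompute the centroids) can be sketched in plain Python. This is a simplified sketch, not the notebook's code: the function `kmeans` is hypothetical, it initializes the centroids from the first k points for reproducibility, and scikit-learn's KMeans uses a smarter initialization (k-means++) by default.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def kmeans(points, k, iters=20):
    """Plain K-Means; assumption: centroids start at the first k points."""
    centroids = [tuple(points[i]) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        updated = [mean(cl) if cl else centroids[i] for i, cl in enumerate(clusters)]
        if updated == centroids:  # converged: assignments can no longer change
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


if __name__ == "__main__":
    data = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
    centers, assignment = kmeans(data, k=2)
    print(centers)
    print(assignment)
```

The result depends on the initial centroids, which is why library implementations usually run several initializations and keep the best one.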

NLP and Text Mining

  • Text Mining Jupyter notebooks
    • Text Representations: colablink
      • One-Hot encoding, Bag-of-Words, TF-IDF, and Word2Vec (based on gensim); Word2Vec Wiki and Shakespeare examples; Gather data from Google and WordCloud
    • Textract and NLTK: colablink
      • Text Extraction via textract; NLTK text preprocessing
    • Text Mining via Tensorflow-text: colablink
      • Using Keras embedding layer; sentiment classification example; prepare positive and negative samples and create a Skip-gram Word2Vec model
    • Text Classification via Tensorflow: colablink
      • RNN, LSTM, Transformer, BERT
    • Twitter NLP all-in-one example: colablink
      • NLTK, LSTM, Bi-LSTM, GRU, BERT
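
To make the Bag-of-Words and TF-IDF representations from the Text Representations notebook concrete, here is a small standard-library sketch. It is not taken from the notebook: the helper names are hypothetical, tokenization is a naive whitespace split, and the weighting uses the textbook form tf * log(N / df), which differs from the smoothed, normalized variant that scikit-learn applies by default.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def bag_of_words(docs):
    """Return the sorted vocabulary and one term-count row per document."""
    vocab = sorted({tok for d in docs for tok in tokenize(d)})
    counts = [Counter(tokenize(d)) for d in docs]
    matrix = [[c[w] for w in vocab] for c in counts]
    return vocab, matrix


def tf_idf(docs):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    vocab, matrix = bag_of_words(docs)
    n_docs = len(docs)
    # Every vocabulary word occurs in at least one document, so df >= 1.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weights = []
    for row in matrix:
        total = sum(row)
        weights.append([(cnt / total) * idf[j] if total else 0.0
                        for j, cnt in enumerate(row)])
    return vocab, weights


if __name__ == "__main__":
    corpus = ["the cat sat", "the dog sat", "the cat ran"]
    words, scores = tf_idf(corpus)
    print(words)
    for row in scores:
        print([round(v, 3) for v in row])
```

A word that appears in every document (here "the") gets an IDF of zero, so it contributes nothing to the representation, while rarer words are weighted up.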

Recommendation

  • Recommendation
    • Recommendation via Python Surprise and Neural Collaborative Filtering (Tensorflow): colablink
    • Tensorflow Recommender: colab

Large Language Models (LLMs) and Apps

NLP models based on Huggingface Transformer libraries

PyTorch Transformer

Open Source LLMs

LLM Apps based on the OpenAI API

LLM Apps based on LangChain