Data mining, machine learning, and deep learning sample code for SJSU CMPE255 Data Mining (Fall 2023 SJSU Official Syllabus) and CMPE258 Deep Learning (Fall 2023 SJSU Official Syllabus).
- Some Google Colab examples need an SJSU Google account to view.
- The Large Language Models (LLMs) part is newly added.
- You can also view the documents in: readthedocs
For inference, I created the following two files. I focused mainly on TorchScript, but I also wanted to try out TensorRT.
DeepDataMiningLearning/detection/torchscript_model.py
DeepDataMiningLearning/detection/tensorrt_model.py
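As a rough illustration of the TorchScript workflow mentioned above, the sketch below traces a small model, saves it, reloads it, and compares its output with the eager model. `TinyNet` is a hypothetical stand-in, not the YOLOv8 model from this repo, and this is not the code in `torchscript_model.py`; it only assumes that PyTorch is installed.

```python
import io

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical toy classifier used only for this illustration."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, 4)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)


torch.manual_seed(0)
model = TinyNet().eval()
example = torch.randn(1, 3, 32, 32)

# Trace the model with an example input to obtain a TorchScript module.
scripted = torch.jit.trace(model, example)

# Save to an in-memory buffer (a file path works the same way) and reload.
buffer = io.BytesIO()
torch.jit.save(scripted, buffer)
buffer.seek(0)
loaded = torch.jit.load(buffer)

# The reloaded TorchScript module should reproduce the eager output.
with torch.no_grad():
    eager_out = model(example)
    script_out = loaded(example)

max_diff = (eager_out - script_out).abs().max().item()
print("output shape:", tuple(script_out.shape))
print("max abs difference:", max_diff)
```

Tracing records only the operations executed for the example input, so models with data-dependent control flow may need `torch.jit.script` or code changes instead.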
I edited the following YOLOv8 module files to try to get TorchScript to work:
DeepDataMiningLearning/detection/modules/block.py
DeepDataMiningLearning/detection/modules/head.py
DeepDataMiningLearning/detection/modules/tal.py
I also made some minor bug fixes to the following files, mainly pertaining to file paths:
DeepDataMiningLearning/detection/dataset.py
DeepDataMiningLearning/detection/modules/yolomodels.py
DeepDataMiningLearning/detection/utils.py
Optionally, install this Python package via:
% python3 -m pip install flit
% flit install --symlink
See "docs/python.rst" for a detailed description of the Python package.
Open the Jupyter notebook on your local machine:
jupyter lab --ip 0.0.0.0 --no-browser --allow-root
Activate the Python virtual environment and install the requirements; you can then use the 'sphinx-build' command to build the documentation:
% pip install -r requirements.txt
% sphinx-build docs ./docs/build
# Check the integrity of all internal and external links:
% sphinx-build docs -W -b linkcheck -d docs/build/doctrees docs/build/html
The generated HTML files are in the "docs/build" folder. You can also view the documents on readthedocs.
Basic python tutorials, numpy, Pandas, data visualization and EDA
- Python tutorial code: Python_tutorial.ipynb--colablink
- Python NumPy tutorial code: Python NumPy tutorial--colablink
- Data Mining introduction code:
Python data apps based on streamlit: streamlittest
- Data Mining based on Google Cloud:
- Google Cloud access via Colab: colablink
- Configure Gcloud, Google Cloud Storage, Compute Engine, Colab Terminal
- Google BigQuery with Colab/Jupyter introduction BigQuery-intro.ipynb -- colablink
- Natality dataset and Weather data from Google BigQuery
- COVID19 Data EDA and Visualization based on Google BigQuery (Fall 2022 updated): colablink
- COVID NYT data, COVID-19 JHU data
- Additional Google BigQuery examples: colablink
- Chicago Crime Dataset, Austin Waste Dataset, COVID Racial Dataset (race graph)
- BigQuery ML examples: colablink
- COVID, CREDIT_CARD_FRAUD, Predict penguin weight, Natality, US Census Dataset Classification, time-series forecasting from Google Analytics data
- Machine Learning introduction:
- MLIntro-Regression -- colablink
- MLIntro-RegressionSKLearn -- colablink
- MLIntro2-classification.ipynb --colablink
- Breast Cancer Dataset, iris Dataset, BigQuery US Census Income Dataset, multiple classifiers.
- DecisionTree -- colablink
- SKlearn DecisionTree algorithm on the Iris dataset, Breast Cancer dataset, make_moons dataset, and DecisionTreeRegressor. A brief discussion of Gini impurity.
- GradientBoosting -- colablink
- Gradient boosting process, Gradient boosting regressor with scikit-learn, Gradient boosting classifier with scikit-learn
- XGBoost -- colablink
- XGBoost introduction, US Census Income Dataset from BigQuery, UCI Dermatology dataset
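The DecisionTree notebook listed above discusses Gini impurity. As a quick reference, here is a minimal standard-library sketch of the usual definition, 1 minus the sum of squared class proportions. The function names are illustrative and are not taken from the notebook.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_split_impurity(left, right):
    """Size-weighted Gini impurity of a candidate binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


pure = ["a", "a", "a", "a"]
mixed = ["a", "a", "b", "b"]

print(gini_impurity(pure))   # 0.0
print(gini_impurity(mixed))  # 0.5
print(weighted_split_impurity(["a", "a"], ["b", "b"]))  # 0.0
```

A decision tree that uses this criterion picks, at each node, the split with the lowest weighted impurity.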
Deep learning notebooks (colab link is better)
- Tensorflow introduction code: CMPE-Tensorflow1.ipynb -- colablink
- Pytorch introduction code: CMPE-pytorch1.ipynb -- colablink
- Tensorflow image classification:
- Road sign data from Kaggle example: Tensorflow-Roadsignclassification.ipynb, colablink
- Flower dataset example with TF Dataset, TFRecord, Google Cloud Storage, TPU/GPU acceleration: colablink
- Pytorch image classification sample: CMPE-pytorch2.ipynb, colablink
New Deep Learning sample code based on Pytorch (under the folder of "DeepDataMiningLearning")
- Pytorch Single GPU image classification with/without automatic mixed precision (AMP) training: singleGPU
- Pytorch Multi-GPU DDP test: testTorchDDP
- Pytorch Multi-GPU image classification: multiGPU
- Pytorch Torchvision image classification (Efficientnet) notebook on HPC: torchvisionHPC.ipynb
- Pytorch Torchvision vision transformer (ViT) notebook on HPC: torchvisionvitHPC.ipynb
- Pytorch ViT implementation from scratch on HPC: ViTHPC.ipynb
- Pytorch ImageNet classification example: imagenet
- Pytorch inference example for top-k class: inference.py
- TIMM models: testtimm.ipynb
- Huggingface Images via Transformers: huggingfaceimage.ipynb
- Siamese network: siamese_network
- TensorRT example: tensorrt.ipynb
- Advanced Image Classification: githubrepo
- General purpose framework for all-in-one image classification for Tensorflow and Pytorch
- Support for multiple datasets: imagenet_blurred, tiny-imagenet-200, hymenoptera_data, CIFAR10, MNIST, flower_photos
- Support for multiple custom models ('mlpmodel1', 'lenet', 'alexnet', 'resnetmodel1', 'customresnet', 'vggmodel1', 'vggcustom', 'cnnmodel1'), all models from Torchvision and TorchHub
- Support HPC training and evaluation
- Object detection (other repo)
- MultiModalDetector
- myyolov7: Adds YOLOv5 models with YOLOv7; trained on the COCO and WaymoCOCO datasets.
- myyolov5: My fork of YOLOv5; converts COCO to YOLO format and reworks the code to serve as the base code for YOLOv4, YOLOv5, and ScaledYOLOv4; trained on the COCO and WaymoCOCO datasets.
- WaymoObjectDetection
- Waymo Dataset Conversion to COCO format: WaymoCOCO
- torchvision_waymococo_train.py: performs Pytorch FasterRCNN training based on the converted Waymo COCO format data. This version can be applied to any dataset with COCO-format annotations.
- WaymoCOCODetectron2train.py: WaymoCOCO training based on Detectron2
- mymmdetection2dtrain.py: Object Detection training and evaluation based on MMdetection2D
- CustomDetectron2
- Unsupervised Learning Jupyter notebooks
- PCA: colablink
- Numpy/SKlearn SVD, PCA for digits and noise filtering, eigenfaces, PCA vs LDA vs NCA
- Manifold Learning: colablink
- Multidimensional Scaling (MDS), Locally Linear Embedding (LLE), Isomap Embedding, and T-distributed Stochastic Neighbor Embedding (t-SNE) for the HELLO, S-Curve, and Swiss roll datasets; Isomap on Faces; Regression with Manifold Learning
- Clustering: colablink
- K-Means, Gaussian Mixture Models, Spectral Clustering, DBSCAN
- Text Mining Jupyter notebooks
- Text Representations: colablink
- One-Hot encoding, Bag-of-Words, TF-IDF, and Word2Vec (based on gensim); Word2Vec Wiki and Shakespeare examples; gather data from Google and WordCloud
- Textract and NLTK: colablink
- Text Extraction via textract; NLTK text preprocessing
- Text Mining via Tensorflow-text: colablink
- Using Keras embedding layer; sentiment classification example; prepare positive and negative samples and create a Skip-gram Word2Vec model
- Text Classification via Tensorflow: colablink
- RNN, LSTM, Transformer, BERT
- Twitter NLP all-in-one example: colablink
- NLTK, LSTM, Bi-LSTM, GRU, BERT
- Recommendation
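The Text Representations notebook listed above covers Bag-of-Words and TF-IDF. The sketch below shows one common way to compute both with only the standard library. It assumes whitespace tokenization and the plain `log(N / df)` form of IDF; libraries such as scikit-learn use smoothed variants, so their numbers will differ. The function names and the toy corpus are illustrative, not taken from the notebook.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def bag_of_words(docs):
    """Return the sorted vocabulary and one term-count row per document."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    counts = [Counter(tokenize(doc)) for doc in docs]
    return vocab, [[c[t] for t in vocab] for c in counts]


def tf_idf(docs):
    """Return the vocabulary and TF-IDF rows using idf = log(N / df)."""
    vocab, bow = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weights = []
    for row in bow:
        total = sum(row)
        weights.append(
            [(cnt / total) * idf[j] if total else 0.0 for j, cnt in enumerate(row)]
        )
    return vocab, weights


docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, bow = bag_of_words(docs)
_, weights = tf_idf(docs)

print(vocab)
print(bow[0])
for row in weights:
    print([round(w, 3) for w in row])
```

A term that appears in every document (here "the") gets an IDF of zero, while a term unique to one document (here "dog") gets the highest weight in that document.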
NLP models based on Huggingface Transformer libraries
- Starting
- Classification application
- Multi-modal Classifier: huggingfaceclassifier2, huggingfaceclassifier
- Sequence related application, e.g., translation, summary
- Question and Answer (Q&A)
- Chatbot
Pytorch Transformer
Open Source LLMs
- BERTLM.ipynb
- Masked Language Modeling: huggingfaceLM.ipynb
- llama2
LLM Apps based on the OpenAI API
LLM Apps based on LangChain