RAPIDS Notebooks-Contrib
Intro
Welcome to the community contributed notebooks repo! (formerly known as Notebooks-Extended)
The purpose of this collection of notebooks is to help users understand what RAPIDS has to offer; learn why, how, and when including RAPIDS in a data science pipeline makes sense; and gather community contributions of RAPIDS knowledge. The differences between this repo and the Notebooks Repo are:
- These are vetted, community-contributed notebooks (includes RAPIDS team member contributions).
- These notebooks won't run on air-gapped systems, which is one of our container requirements. Many RAPIDS notebooks use additional PyData ecosystem packages and include code for downloading datasets, so they require network connectivity. If running on a system with no network access, please download all the data that you plan to use ahead of time or simply use the core notebooks repo.
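For the offline case, one possible way to stage data ahead of time is sketched below. It is only an illustration, not part of this repo: the helper name `fetch_if_missing`, the default `data` destination folder, and the URL in the usage comment are all assumptions, and it relies only on the Python standard library.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_if_missing(url: str, dest_dir: str = "data") -> Path:
    """Download `url` into `dest_dir` unless a file of that name is already there.

    Hypothetical helper for illustration; it is not provided by this repo.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Name the local copy after the last path segment of the URL.
    target = dest / url.rstrip("/").rsplit("/", 1)[-1]
    if not target.exists():
        # Only touch the network when the file has not been staged yet.
        urlretrieve(url, target)
    return target


# Usage (placeholder URL; substitute the dataset link used by your notebook):
# local_path = fetch_if_missing("https://example.com/datasets/sample.csv")
```

Run this on a machine with network access, copy the destination folder to the air-gapped system, and point the notebook at the local files instead of its download cells.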
Installation
Please see BUILD.md for the prerequisite packages and installation steps.
Contributing
Please see our guide for contributing to notebooks-contrib.
Once you've followed our guide, please don't forget to test your notebooks before making a PR.
Exploring the Repo
Folders
- getting_started_notebooks - “how to start using RAPIDS”. Contains notebooks showing "hello worlds", getting started with RAPIDS libraries, and tutorials around RAPIDS concepts.
- intermediate_notebooks - “how to accomplish your workflows with RAPIDS”. Contains notebooks showing algorithm and workflow examples, benchmarking tools, and some complete end-to-end (E2E) workflows.
- advanced_notebooks - "how to master RAPIDS". Contains notebooks showing kernel customization and advanced end-to-end workflows.
- blog notebooks - contains shared notebooks mentioned and used in blogs that showcase RAPIDS workflows and capabilities
- conference notebooks - contains notebooks used in conferences, such as GTC
- data - contains small data samples used for purely functional demonstrations. Some notebooks include cells that download larger datasets from external websites.
Lists
- multimedia_links.md - a list of videos by RAPIDS or our community talking about or showing how to use RAPIDS. Feel free to contribute your videos and RAPIDS themed playlists as well!
- competition_notebooks.md - contains archived notebooks that were used in competitions, such as Kaggle. Some of these notebooks were blogged about and can also be found in our blog notebooks folder.
Our Notebooks
Below is a listing of the notebooks in this repository. Each row tells you the notebook's:
- Location, in Folder
- Title and direct link, in Notebook Title
- Description, in Description
- Design target, either a Single GPU (SG) or Multiple GPUs (MG), in GPU (don't worry, you can still run the multi-GPU notebooks with a single GPU)
- Data source, in Dataset Used
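If you keep a local copy of these tables, the GPU column can be used to shortlist notebooks. The sketch below is only an illustration and is not a tool shipped with this repo: the function name `rows_for_gpu` is hypothetical, and the sample rows are taken from the Getting Started table with their descriptions shortened.

```python
def rows_for_gpu(table_text: str, gpu: str) -> list[dict[str, str]]:
    """Return rows of a pipe-delimited notebook table whose GPU column equals `gpu`.

    Hypothetical helper for illustration; it is not provided by this repo.
    """
    lines = [ln.strip() for ln in table_text.strip().splitlines() if ln.strip()]
    # The first line holds the column names; drop the single trailing pipe.
    header = [c.strip() for c in lines[0].removesuffix("|").split("|")]
    rows = []
    for ln in lines[1:]:
        if set(ln) <= set("-|: "):
            continue  # skip the ---|---| separator row
        cells = [c.strip() for c in ln.removesuffix("|").split("|")]
        row = dict(zip(header, cells))
        if row.get("GPU") == gpu:
            rows.append(row)
    return rows


# Two rows from the Getting Started table, descriptions shortened here.
sample = """
Folder | Notebook Title | Description | GPU | Dataset Used |
---|---|---|---|---|
basics | Getting_Started_with_cuDF | GPU DataFrames with cuDF | SG | Self Generated |
basics | Dask_Hello_World | Dask "Hello World" example | MG | Self Generated |
"""

print([row["Notebook Title"] for row in rows_for_gpu(sample, "SG")])
```

Passing `"MG"` instead selects the multi-GPU rows.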
Getting Started Notebooks:
Folder | Notebook Title | Description | GPU | Dataset Used |
---|---|---|---|---|
basics | Getting_Started_with_cuDF | This notebook shows how to get started with GPU DataFrames (single GPU only) using cuDF in RAPIDS. | SG | Self Generated |
basics | Dask_Hello_World | This notebook shows how to quickly set up Dask and run a "Hello World" example. | MG | Self Generated |
basics | Getting_Started_with_Dask | This notebook shows how to get started with multi-GPU DataFrames using Dask and cuDF in RAPIDS. | MG | Self Generated |
basics | hello_streamz | This notebook demonstrates use of cuDF to perform streaming word-count using a small portion of the Streamz API. | SG | Self Generated |
basics -> blazingsql | Getting Started with BlazingSQL | How to set up and get started with BlazingSQL and the RAPIDS AI suite. | SG | Music Dataset |
basics -> blazingsql | Federated Query Demo | In a single query, join an Apache Parquet file, a CSV file, and a GPU DataFrame (GDF) in GPU memory. | SG | Breast Cancer Diagnostic |
intro_tutorials | 01_Introduction_to_RAPIDS | This notebook shows at a high level what each of the packages in RAPIDS is and what it does. | MG | Self Generated |
intro_tutorials | 02_Introduction_to_cuDF | This notebook shows how to work with cuDF DataFrames in RAPIDS. | SG | Self Generated |
intro_tutorials | 03_Introduction_to_Dask | This notebook shows how to work with Dask using basic Python primitives like integers and strings. | MG | Self Generated |
intro_tutorials | 04_Introduction_to_Dask_using_cuDF_DataFrames | This notebook shows how to work with cuDF DataFrames using Dask. | MG | Self Generated |
intro_tutorials | 06_Introduction_to_Supervised_Learning | This notebook shows how to do GPU accelerated Supervised Learning in RAPIDS. | SG | Self Generated |
intro_tutorials | 07_Introduction_to_XGBoost | This notebook shows how to work with GPU accelerated XGBoost in RAPIDS. | SG | Self Generated |
intro_tutorials | 08_Introduction_to_Dask_XGBoost | This notebook shows how to work with Dask XGBoost in RAPIDS. | MG | Self Generated |
intro_tutorials | 09_Introduction_to_Dimensionality_Reduction | This notebook shows how to do GPU accelerated Dimensionality Reduction in RAPIDS. | SG | Self Generated |
intro_tutorials | 10_Introduction_to_Clustering | This notebook shows how to do GPU accelerated Clustering in RAPIDS. | SG | Self Generated |
Intermediate Notebooks:
Folder | Notebook Title | Description | GPU | Dataset Used |
---|---|---|---|---|
examples | linear_regression_demo.ipynb | This notebook demos how to implement simple and multiple linear regression with cuML to predict median housing price on sklearn's Boston Housing dataset. Includes a corresponding Medium story. | SG | SKLearn Boston Housing |
examples | umap_demo_full | In this notebook we will show how to use UMAP and its GPU accelerated implementation present in RAPIDS. | SG | Fashion MNIST |
examples | rf_demo | Demonstration of using both cuml and sklearn to train a RandomForestClassifier on the Higgs dataset. | SG | Higgs Boson |
examples | weather | Demonstration of using Dask and cuDF to process and analyze weather history | MG | NOAA Annual Weather Data |
examples -> blazingsql | BlazingSQL vs Spark | Analyze 73 million rows of net flow data. Compare BlazingSQL and Apache Spark timings for the same workload. | SG | University of New South Wales LanL Dataset |
examples -> blazingsql | Taxi Fare Prediction | Build & test a cuML Linear Regression model to predict the cost of a ride from 20 million rows of NYC Taxi data. | SG | NYC Taxi Dataset |
examples-> custreamz | parsing_haproxy_logs | This notebook builds upon the weblogs streaming notebook and demonstrates more advanced features for parsing HAProxy logs. | SG | Self Generated |
examples->cugraph | MG Pagerank | Analyze a Twitter dataset (26GB on disk) with 41.7 million users with 1.47 billion social relations (edges) to find out the most influential profiles. | MG | |
E2E-> taxi | NYCTaxi | Demonstrates multi-node ETL for cleanup of raw data into cleaned train and test dataframes. Shows how to run multi-node XGBoost training with dask-xgboost. Please Note: requires Google Dataproc to run! Blog | MG | Google Dataproc Hosted NYC Taxi Data |
E2E-> synthetic_3D | rapids_ml_workflow_demo | A 3D visual showcase of a machine learning workflow with RAPIDS (load data, transform/normalize, train XGBoost model, evaluate accuracy, use model for inference). Along the way we compare the performance gains of RAPIDS [GPU] vs sklearn/pandas methods [CPU]. | SG | SciKit-Learn's demo datasets |
E2E-> census | census_education2income_demo | In this notebook we use 50 years of census data to see how education affects income. | SG | Custom IPUMS Data pull |
benchmarks | cuml_benchmarks | The purpose of this notebook is to extensively benchmark all of the single GPU cuML algorithms against their skLearn counterparts, while also providing the ability to find and verify upper bounds. Note: Best on large memory GPUs | SG | Self Generated |
benchmarks | rapids_decomposition | This notebook benchmarks and visualizes RAPIDS decomposition methods against each other. You can also compare the results yourself against CPU speeds and methods. | SG | SciKit-Learn's demo datasets |
benchmarks-> cugraph_benchmarks | louvain_benchmark | This notebook benchmarks performance improvement of running the Louvain clustering algorithm within cuGraph against NetworkX. | SG | Sparse collection |
benchmarks-> cugraph_benchmarks | pagerank_benchmark | This notebook benchmarks performance improvement of running PageRank within cuGraph against NetworkX. | SG | Sparse collection |
benchmarks-> cugraph_benchmarks | BFS benchmark | This notebook benchmarks performance improvement of running BFS within cuGraph against NetworkX. | SG | Sparse collection |
benchmarks-> cugraph_benchmarks | SSSP_benchmark | This notebook benchmarks performance improvement of running SSSP within cuGraph against NetworkX. | SG | Sparse collection |
benchmarks-> cugraph_mg_hibench | MG pagerank_benchmark | This notebook runs cuGraph's multi-GPU PageRank on a dataset of 300GB. It is designed for DGX-2 machines. | MG | HiBench |
Advanced Notebooks:
Folder | Notebook Title | Description | GPU | Dataset Used |
---|---|---|---|---|
tutorials | rapids_customized_kernels | Archive Only. This notebook shows how to create customized kernels using CUDA to make your workflow in RAPIDS even faster. | SG | Self Generated |
Blog Notebooks:
Folder | Notebook Title | Description | GPU | Dataset Used |
---|---|---|---|---|
cyber -> flow_classification | flow_classification_rapids | Archive Only. The cyber folder contains the associated companion files for the blog GPU Accelerated Cyber Log Parsing with RAPIDS, by Bianca Rhodes US, Bhargav Suryadevara, and Nick Becker. This notebook demonstrates how to load netflow data into cuDF and create a multiclass classification model using XGBoost. Uses run_raw_data_generator | SG | University of New South Wales LanL Dataset |
cyber -> network_mapping | lanl_network_mapping_using_rapids | Archive Only. The cyber folder contains the associated companion files for the blog GPU Accelerated Cyber Log Parsing with RAPIDS, by Bianca Rhodes US, Bhargav Suryadevara, and Nick Becker. This notebook demonstrates how to parse raw Windows event logs using cuDF and uses cuGraph's PageRank model to build a network graph. Uses run_raw_data_generator | SG | University of New South Wales LanL Dataset |
databricks | RAPIDS_PCA_demo_avro_read | The databricks folder is the companion file repository to the blog RAPIDS can now be accessed on Databricks Unified Analytics Platform by Ikroop Dhillon, Karthikeyan Rajendran, and Taurean Dyer. This notebook's purpose is to showcase RAPIDS on Databricks using their sample datasets and to show the CPU vs GPU comparison for the PCA algorithm. There is also an accompanying HTML file for easy Databricks import. This notebook is for illustrative purposes only! Do not expect this notebook to successfully run on its own: its code replicates a workflow meant to run on a specific platform, Databricks. | SG | RAPIDS Toy Data |
plasticc-> notebooks | rapids_lsst_full_demo | Archive Only. This notebook demos the full CPU and GPU implementation of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. Updated notebooks found here | MG | Kaggle PLAsTiCC-2018 dataset |
plasticc-> notebooks | rapids_lsst_gpu_only_demo | Archive Only. This GPU only based notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. Updated notebooks found here | MG | Kaggle PLAsTiCC-2018 dataset |
santander | cudf_tf_demo | Archive Only. This financial industry facing notebook is the cudf-tensorflow approach from the RAPIDS.ai team for Santander Customer Transaction Prediction. Placed 17/8808. Blog | SG | Kaggle Santander Customer Transaction Prediction Dataset |
santander | E2E_santander_pandas | Archive Only. This financial data modelling notebook is the Pandas based version of the RAPIDS.ai team's best single model for Santander Customer Transaction Prediction competition. Placed 17/8808. Blog | SG | Kaggle Santander Customer Transaction Prediction Dataset |
santander | E2E_santander | Archive Only. This financial data modelling notebook is the cuDF based version of the RAPIDS.ai team's best single model for Santander Customer Transaction Prediction competition. It allows you to compare cuDF performance to the Pandas version. Placed 17/8808. Blog. | SG | Kaggle Santander Customer Transaction Prediction Dataset |
regression | regression_blog_notebook | This is the companion notebook for the blog Essential Machine Learning with Linear Models in RAPIDS: part 1 of a series by Paul Mahler. It showcases an end to end notebook using the Bike Share dataset and cuML's implementation of ridge regression. | SG | Bike Share Dataset |
regression | regression_2_blog | This is the companion notebook for the blog Regression Blog 2: We’re Practically Giving These Regressions Away by Paul Mahler. It showcases an end to end notebook using the Black Friday dataset and cuML's implementations of L1 and L2 regularizations using Ridge, Lasso, and ElasticNet regression techniques. | SG | Analytics Vidhya Black Friday Hackathon Dataset |
nlp -> show_me_the_word_count_gutenberg | show_me_the_word_count_gutenberg | This is the notebook for blog Show Me The Word Count by Vibhu Jawa, Nick Becker, David Wendt, and Randy Gelhausen. This notebook showcases NLP pre-processing capabilities of nvstrings+cudf on the Gutenberg dataset. | SG | Gutenberg Dataset |
cuspatial -> accelerate_geospatial_processing | accelerate_geospatial_processing | This is the notebook for blog cuSpatial Accelerates Geospatial and Spatiotemporal Processing by Milind Naphade, Jianting Zhang, Shuo Wang, Thomson Comer, Josh Paterson, Keith Kraus, Mark Harris, and Sujit Biswas. This notebook showcases cuSpatial benchmarking of directed Hausdorff distance for computing trajectory clustering on a large dataset. | SG | Trajectories Data and target_intersection.png |
randomforest | fruits_rf_notebook | This is the notebook for blog GPU-accelerated Random Forest by Vishal Mehta, Myrto Papadopoulou, Thejaswi Rao. This notebook showcases how to use GPU accelerated Random Forest Classification in cuML. The fruit dataset is self generated and is used as an example in the blog. | SG | Self Generated |
mortgage_deep_learning | mortgage_e2e_deep_learning | Archive Only. This end to end notebook for the blog, Using RAPIDS with PyTorch, by Even Oldridge, combines the RAPIDS GPU data processing with a PyTorch deep learning neural network to predict mortgage loan delinquency. | MG | Fannie Mae Mortgage Dataset |
Conference Notebooks:
Folder | Notebook Title | Description | GPU | Dataset Used |
---|---|---|---|---|
GTC_SJ_2019 | GTC_tutorial_instructor | This is the instructor notebook for the hands on RAPIDS tutorial presented at San Jose's GTC 2019. It contains all the demonstrated solutions. | SG | Analytics Vidhya Black Friday Hackathon Dataset |
GTC_SJ_2019 | GTC_tutorial_student | This is the exercise-filled student notebook for the hands on RAPIDS tutorial presented at San Jose's GTC 2019 | SG | Analytics Vidhya Black Friday Hackathon Dataset |
KDD_2019 -> cyber | Cybersecurity_KDD | Using RAPIDS on network traffic and metadata, we demonstrate how to: 1. Triage and perform data exploration, 2. Model network data as a graph, 3. Perform graph analytics on the graph representation of the cyber network data, and 4. Prepare the results in a way that is suitable for visualization. | SG | IDS 2018 dataset |
KDD_2019 -> graph_pattern_mining | MiningFrequentPatternsFromGraphs | This notebook uses PC failure metadata, turns it into a coordinate list, and uses cugraph to find frequent patterns about the population that has failed | SG | Microsoft PC Failure Metadata Graph |
KDD_2019 -> plasticc | Part 1.1 RNN Feature Engineering | Part 1.1 of this GPU only based notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. - Introduction found here. - Exercise Answers found here - Original submission found here | MG | Kaggle PLAsTiCC-2018 dataset |
KDD_2019 -> plasticc | Part 1.2 RNN Extract Bottleneck | Part 1.2 of this GPU only based notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. - Introduction found here. - Exercise Answers found here - Original submission found here | MG | Kaggle PLAsTiCC-2018 dataset |
KDD_2019 -> plasticc | Part 2.1 Feature Engineering | Part 2.1 of this GPU only based notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. - Introduction found here. - Exercise Answers found here - Original submission found here | MG | Kaggle PLAsTiCC-2018 dataset |
KDD_2019 -> plasticc | Part 2.2 Train XGBoost & MLP | Part 2.2 of this GPU only based notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. - Introduction found here. - Exercise Answers found here - Original submission found here | MG | Kaggle PLAsTiCC-2018 dataset |
SCIPY_2019 | SCIPY_2019 Tutorial Index | This index outlines the "getting started" style tutorials within the folder. The tutorials cover cudf, cuml, and cugraph. These tutorials were presented at SCIPY 2019 | SG | Various Self Generated datasets and Zachary Karate Club Data Set |
ASONAM 2019 | Cyber | Example notebook using RAPIDS to let an organization's security and forensics experts collect vast amounts of network traffic and network metadata and perform fast triage, processing, modeling, and visualization. | MG | IDS 2018 dataset from the Canadian Institute for Cybersecurity |
ASONAM 2019 | Spotify Playlist | Shows how you can quickly use RAPIDS to explore the Spotify Million Playlist Dataset, which was created for the RecSys 2018 competition, and build a playlist recommender. Note: this dataset requires an independent user download and cannot be pulled from the notebook | MG | RecSys 2018 competition |
ASONAM 2019 | Weighted Link Prediction | This notebook uses cuGraph for Weighted Link Prediction to mitigate uncertainty on the Epinions Trust Network Dataset to predict the likelihood of trust or distrust between vertices. Note: this dataset requires an independent user download and cannot be pulled from the notebook | SG | Epinions Trust Network Dataset |
Additional Information
- The data folder also includes the full image set from the Fashion MNIST dataset.
- utils: contains a set of useful scripts for interacting with RAPIDS Notebooks-Contrib
- For our notebook examples and tutorials found in our standard containers, please see the Notebooks Repo