RAPIDS Community Notebooks

Primary language: Jupyter Notebook. License: Apache License 2.0 (Apache-2.0).

RAPIDS Notebooks-Contrib

Intro

Welcome to the community-contributed notebooks repo (formerly known as Notebooks-Extended)!

The purpose of this collection of notebooks is to help users understand what RAPIDS has to offer; learn why, how, and when including RAPIDS in a data science pipeline makes sense; and share community contributions of RAPIDS knowledge. The differences between this repo and the Notebooks Repo are:

  1. These are vetted, community-contributed notebooks (including contributions from RAPIDS team members).
  2. These notebooks won't run on air-gapped systems, whereas running air-gapped is one of our container requirements. Many RAPIDS notebooks use additional PyData ecosystem packages and include code for downloading datasets, so they require network connectivity. If you are running on a system with no network access, please download all the data that you plan to use ahead of time, or simply use the core Notebooks Repo.
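As a minimal sketch of downloading data ahead of time, the helper below fetches a file into a local folder only if it is not already there, so it can be run once on a networked machine and the folder copied to the offline system. The function name `fetch_if_missing` and the URL in the usage comment are illustrative assumptions, not part of this repo.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_if_missing(url: str, dest_dir: str) -> Path:
    """Download `url` into `dest_dir` unless a file with that name already exists.

    Hypothetical helper for illustration; the notebooks use their own download cells.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the last path component of the URL as the local file name.
    target = dest / url.rstrip("/").split("/")[-1]
    if not target.exists():
        urlretrieve(url, str(target))
    return target


# Example usage (placeholder URL, replace with the dataset a notebook needs):
# path = fetch_if_missing("https://example.com/datasets/sample.csv", "data")
```

Because the helper skips files that already exist, re-running it on the offline machine does not attempt any network access once the folder has been populated.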

Installation

Please see BUILD.md for the prerequisite packages and installation steps.

Contributing

Please see our guide for contributing to notebooks-contrib.

Once you've followed our guide, please don't forget to test your notebooks before making a PR!

Exploring the Repo

Folders

  • getting_started_notebooks - “how to start using RAPIDS”. Contains notebooks showing "hello worlds", getting started with RAPIDS libraries, and tutorials around RAPIDS concepts.
  • intermediate_notebooks - “how to accomplish your workflows with RAPIDS”. Contains notebooks showing algorithm and workflow examples, benchmarking tools, and some complete end-to-end (E2E) workflows.
  • advanced_notebooks - “how to master RAPIDS”. Contains notebooks showing kernel customization and advanced end-to-end workflows.
  • blog notebooks - contains shared notebooks mentioned and used in blogs that showcase RAPIDS workflows and capabilities
  • conference notebooks - contains notebooks used in conferences, such as GTC
  • data - contains small data samples used for purely functional demonstrations. Some notebooks include cells that download larger datasets from external websites.

Lists

  • multimedia_links.md - a list of videos, by RAPIDS or our community, that talk about or show how to use RAPIDS. Feel free to contribute your videos and RAPIDS-themed playlists as well!
  • competition_notebooks.md - contains archived notebooks that were used in competitions, such as Kaggle. Some of these notebooks were blogged about and can also be found in our blog notebooks folder.

Our Notebooks

Below is a listing of the notebooks in this repository. Each row gives the following information about a notebook:

  • Folder: its location in the repo
  • Notebook Title: its title, which is also a direct link to the notebook
  • Description: what the notebook covers
  • GPU: whether it is designed for a single GPU (SG) or multiple GPUs (MG). Don't worry, you can still run the multi-GPU notebooks with a single GPU.
  • Dataset Used: where its data can be found

Getting Started Notebooks:

| Folder | Notebook Title | Description | GPU | Dataset Used |
|---|---|---|---|---|
| basics | Getting_Started_with_cuDF | This notebook shows how to get started with GPU DataFrames (single GPU only) using cuDF in RAPIDS. | SG | Self Generated |
| basics | Dask_Hello_World | This notebook shows how to quickly set up Dask and run a "Hello World" example. | MG | Self Generated |
| basics | Getting_Started_with_Dask | This notebook shows how to get started with multi-GPU DataFrames using Dask and cuDF in RAPIDS. | MG | Self Generated |
| basics | hello_streamz | This notebook demonstrates use of cuDF to perform streaming word-count using a small portion of the Streamz API. | SG | Self Generated |
| basics -> blazingsql | Getting Started with BlazingSQL | How to set up and get started with BlazingSQL and the RAPIDS AI suite. | SG | Music Dataset |
| basics -> blazingsql | Federated Query Demo | In a single query, join an Apache Parquet file, a CSV file, and a GPU DataFrame (GDF) in GPU memory. | SG | Breast Cancer Diagnostic |
| intro_tutorials | 01_Introduction_to_RAPIDS | This notebook shows at a high level what each of the packages in RAPIDS is and what it does. | MG | Self Generated |
| intro_tutorials | 02_Introduction_to_cuDF | This notebook shows how to work with cuDF DataFrames in RAPIDS. | SG | Self Generated |
| intro_tutorials | 03_Introduction_to_Dask | This notebook shows how to work with Dask using basic Python primitives like integers and strings. | MG | Self Generated |
| intro_tutorials | 04_Introduction_to_Dask_using_cuDF_DataFrames | This notebook shows how to work with cuDF DataFrames using Dask. | MG | Self Generated |
| intro_tutorials | 06_Introduction_to_Supervised_Learning | This notebook shows how to do GPU accelerated Supervised Learning in RAPIDS. | SG | Self Generated |
| intro_tutorials | 07_Introduction_to_XGBoost | This notebook shows how to work with GPU accelerated XGBoost in RAPIDS. | SG | Self Generated |
| intro_tutorials | 08_Introduction_to_Dask_XGBoost | This notebook shows how to work with Dask XGBoost in RAPIDS. | MG | Self Generated |
| intro_tutorials | 09_Introduction_to_Dimensionality_Reduction | This notebook shows how to do GPU accelerated Dimensionality Reduction in RAPIDS. | SG | Self Generated |
| intro_tutorials | 10_Introduction_to_Clustering | This notebook shows how to do GPU accelerated Clustering in RAPIDS. | SG | Self Generated |
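To give a flavor of what the getting-started notebooks cover, here is a minimal sketch of a DataFrame workflow. It is not taken from any notebook in this repo. It assumes cuDF is installed and a GPU is available; because cuDF mirrors much of the pandas API, the sketch falls back to pandas (or does nothing) when cuDF cannot be imported. The name `summarize` is hypothetical.

```python
# Prefer cuDF (GPU); fall back to pandas (CPU) so the sketch still runs without a GPU.
try:
    import cudf as xdf
    BACKEND = "cudf"
except Exception:  # cuDF is missing or no usable GPU was found
    try:
        import pandas as xdf
        BACKEND = "pandas"
    except ImportError:
        xdf = None
        BACKEND = "none"


def summarize(frame_lib):
    """Build a tiny DataFrame and compute two simple aggregates."""
    df = frame_lib.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
    total = int(df["value"].sum())
    # Boolean-mask filtering uses the same syntax in cuDF and pandas.
    total_a = int(df[df["key"] == "a"]["value"].sum())
    return total, total_a


if xdf is not None:
    print(BACKEND, summarize(xdf))
```

The same `summarize` body runs unchanged on either library, which is the main point the introductory notebooks build on.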

Intermediate Notebooks:

| Folder | Notebook Title | Description | GPU | Dataset Used |
|---|---|---|---|---|
| examples | linear_regression_demo.ipynb | This notebook demos how to implement simple and multiple linear regression with cuML to predict median housing price on sklearn's Boston Housing dataset. With corresponding Medium Story. | SG | SKLearn Boston Housing |
| examples | umap_demo_full | In this notebook we will show how to use UMAP and its GPU accelerated implementation present in RAPIDS. | SG | Fashion MNIST |
| examples | rf_demo | Demonstration of using both cuml and sklearn to train a RandomForestClassifier on the Higgs dataset. | SG | Higgs Boson |
| examples | weather | Demonstration of using Dask and cuDF to process and analyze weather history. | MG | NOAA Annual Weather Data |
| examples -> blazingsql | BlazingSQL vs Spark | Analyze 73 million rows of net flow data. Compare BlazingSQL and Apache Spark timings for the same workload. | SG | University of New South Wales LanL Dataset |
| examples -> blazingsql | Taxi Fare Prediction | Build & test a cuML Linear Regression model to predict the cost of a ride from 20 million rows of NYC Taxi data. | SG | NYC Taxi Dataset |
| examples -> custreamz | parsing_haproxy_logs | This notebook builds upon the weblogs streaming notebook and demonstrates more advanced features for parsing HAProxy logs. | SG | Self Generated |
| examples -> cugraph | MG Pagerank | Analyze a Twitter dataset (26GB on disk) with 41.7 million users and 1.47 billion social relations (edges) to find the most influential profiles. | MG | Twitter |
| E2E -> taxi | NYCTaxi | Demonstrates multi-node ETL for cleanup of raw data into cleaned train and test dataframes. Shows how to run multi-node XGBoost training with dask-xgboost. Please Note: requires Google Dataproc to run! Blog | MG | Google Dataproc Hosted NYC Taxi Data |
| E2E -> synthetic_3D | rapids_ml_workflow_demo | A 3D visual showcase of a machine learning workflow with RAPIDS (load data, transform/normalize, train XGBoost model, evaluate accuracy, use model for inference). Along the way we compare the performance gains of RAPIDS [GPU] vs sklearn/pandas methods [CPU]. | SG | SciKit-Learn's demo datasets |
| E2E -> census | census_education2income_demo | In this notebook we use 50 years of census data to see how education affects income. | SG | Custom IPUMS Data pull |
| benchmarks | cuml_benchmarks | The purpose of this notebook is to extensively benchmark all of the single GPU cuML algorithms against their skLearn counterparts, while also providing the ability to find and verify upper bounds. Note: Best on large memory GPUs. | SG | Self Generated |
| benchmarks | rapids_decomposition | This notebook benchmarks and visualizes RAPIDS decomposition methods against each other. You have the opportunity to self-compare them to CPU speeds and methods. | SG | SciKit-Learn's demo datasets |
| benchmarks -> cugraph_benchmarks | louvain_benchmark | This notebook benchmarks the performance improvement of running the Louvain clustering algorithm within cuGraph against NetworkX. | SG | Sparse collection |
| benchmarks -> cugraph_benchmarks | pagerank_benchmark | This notebook benchmarks the performance improvement of running PageRank within cuGraph against NetworkX. | SG | Sparse collection |
| benchmarks -> cugraph_benchmarks | BFS benchmark | This notebook benchmarks the performance improvement of running BFS within cuGraph against NetworkX. | SG | Sparse collection |
| benchmarks -> cugraph_benchmarks | SSSP_benchmark | This notebook benchmarks the performance improvement of running SSSP within cuGraph against NetworkX. | SG | Sparse collection |
| benchmarks -> cugraph_mg_hibench | MG pagerank_benchmark | This notebook runs cuGraph's multi-GPU PageRank on a dataset of 300GB. It is designed for DGX-2 machines. | MG | HiBench |
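The benchmark notebooks compare a GPU implementation against a CPU counterpart. A minimal sketch of that kind of comparison, using only the Python standard library, might look like the following. The helper names `best_time` and `speedup` are hypothetical and are not taken from the notebooks, which use their own timing code.

```python
import time
from typing import Callable


def best_time(fn: Callable[[], object], repeats: int = 3) -> float:
    """Run `fn` several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    # Taking the minimum reduces the influence of one-off warm-up costs.
    return min(timings)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Example usage with two stand-in workloads (placeholders for CPU and GPU calls):
cpu_seconds = best_time(lambda: sum(i * i for i in range(200_000)))
gpu_seconds = best_time(lambda: sum(i * i for i in range(20_000)))
print(f"speedup: {speedup(cpu_seconds, gpu_seconds):.1f}x")
```

When timing real GPU code, keep in mind that the first call often includes one-time setup costs, so repeating the measurement, as done here, gives a fairer comparison.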

Advanced Notebooks:

| Folder | Notebook Title | Description | GPU | Dataset Used |
|---|---|---|---|---|
| tutorials | rapids_customized_kernels | Archive Only. This notebook shows how to create customized kernels using CUDA to make your workflow in RAPIDS even faster. | SG | Self Generated |

Blog Notebooks:

| Folder | Notebook Title | Description | GPU | Dataset Used |
|---|---|---|---|---|
| cyber -> flow_classification | flow_classification_rapids | Archive Only. The cyber folder contains the associated companion files for the blog GPU Accelerated Cyber Log Parsing with RAPIDS, by Bianca Rhodes US, Bhargav Suryadevara, and Nick Becker. This notebook demonstrates how to load netflow data into cuDF and create a multiclass classification model using XGBoost. Uses run_raw_data_generator. | SG | University of New South Wales LanL Dataset |
| cyber -> network_mapping | lanl_network_mapping_using_rapids | Archive Only. The cyber folder contains the associated companion files for the blog GPU Accelerated Cyber Log Parsing with RAPIDS, by Bianca Rhodes US, Bhargav Suryadevara, and Nick Becker. This notebook demonstrates how to parse raw Windows event logs using cudf and uses cuGraph's pagerank model to build a network graph. Uses run_raw_data_generator. | SG | University of New South Wales LanL Dataset |
| databricks | RAPIDS_PCA_demo_avro_read | The databricks folder is the companion file repository to the blog RAPIDS can now be accessed on Databricks Unified Analytics Platform by Ikroop Dhillon, Karthikeyan Rajendran, and Taurean Dyer. This notebook's purpose is to showcase RAPIDS on Databricks using their sample datasets and to show the CPU vs GPU comparison for the PCA algorithm. There is also an accompanying HTML file for easy Databricks import. This notebook is for illustrative purposes only! Do not expect it to run successfully on its own: its code replicates a workflow meant to run on a specific platform, Databricks. | SG | RAPIDS Toy Data |
| plasticc -> notebooks | rapids_lsst_full_demo | Archive Only. This notebook demos the full CPU and GPU implementation of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. Updated notebooks found here. | MG | Kaggle PLAsTiCC-2018 dataset |
| plasticc -> notebooks | rapids_lsst_gpu_only_demo | Archive Only. This GPU-only notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. Updated notebooks found here. | MG | Kaggle PLAsTiCC-2018 dataset |
| santander | cudf_tf_demo | Archive Only. This financial-industry-facing notebook is the cudf-tensorflow approach from the RAPIDS.ai team for Santander Customer Transaction Prediction. Placed 17/8808. Blog | SG | Kaggle Santander Customer Transaction Prediction Dataset |
| santander | E2E_santander_pandas | Archive Only. This financial data modelling notebook is the Pandas based version of the RAPIDS.ai team's best single model for the Santander Customer Transaction Prediction competition. Placed 17/8808. Blog | SG | Kaggle Santander Customer Transaction Prediction Dataset |
| santander | E2E_santander | Archive Only. This financial data modelling notebook is the cuDF based version of the RAPIDS.ai team's best single model for the Santander Customer Transaction Prediction competition. It allows you to compare cuDF performance to the Pandas version. Placed 17/8808. Blog. | SG | Kaggle Santander Customer Transaction Prediction Dataset |
| regression | regression_blog_notebook | This is the companion notebook for the blog Essential Machine Learning with Linear Models in RAPIDS: part 1 of a series by Paul Mahler. It showcases an end-to-end notebook using the Bike Share dataset and cuML's implementation of ridge regression. | SG | Bike Share Dataset |
| regression | regression_2_blog | This is the companion notebook for the blog Regression Blog 2: We’re Practically Giving These Regressions Away by Paul Mahler. It showcases an end-to-end notebook using the Black Friday dataset and cuML's implementations of L1 and L2 regularizations using Ridge, Lasso, and ElasticNet regression techniques. | SG | Analytics Vidhya Black Friday Hackathon Dataset |
| nlp -> show_me_the_word_count_gutenberg | show_me_the_word_count_gutenberg | This is the notebook for the blog Show Me The Word Count by Vibhu Jawa, Nick Becker, David Wendt, and Randy Gelhausen. This notebook showcases the NLP pre-processing capabilities of nvstrings+cudf on the Gutenberg dataset. | SG | Gutenberg Dataset |
| cuspatial -> accelerate_geospatial_processing | accelerate_geospatial_processing | This is the notebook for the blog cuSpatial Accelerates Geospatial and Spatiotemporal Processing by Milind Naphade, Jianting Zhang, Shuo Wang, Thomson Comer, Josh Paterson, Keith Kraus, Mark Harris, and Sujit Biswas. This notebook showcases cuSpatial benchmarking of directed Hausdorff distance for computing trajectory clustering on a large dataset. | SG | Trajectories Data and target_intersection.png |
| randomforest | fruits_rf_notebook | This is the notebook for the blog GPU-accelerated Random Forest by Vishal Mehta, Myrto Papadopoulou, and Thejaswi Rao. This notebook showcases how to use GPU accelerated Random Forest Classification in cuML. The fruit dataset is self-generated and is used as an example in the blog. | SG | Self Generated |
| mortgage_deep_learning | mortgage_e2e_deep_learning | Archive Only. This end-to-end notebook for the blog Using RAPIDS with PyTorch, by Even Oldridge, combines RAPIDS GPU data processing with a PyTorch deep learning neural network to predict mortgage loan delinquency. | MG | Fannie Mae Mortgage Dataset |

Conference Notebooks:

| Folder | Notebook Title | Description | GPU | Dataset Used |
|---|---|---|---|---|
| GTC_SJ_2019 | GTC_tutorial_instructor | This is the instructor notebook for the hands-on RAPIDS tutorial presented at San Jose's GTC 2019. It contains all the demonstrated solutions. | SG | Analytics Vidhya Black Friday Hackathon Dataset |
| GTC_SJ_2019 | GTC_tutorial_student | This is the exercise-filled student notebook for the hands-on RAPIDS tutorial presented at San Jose's GTC 2019. | SG | Analytics Vidhya Black Friday Hackathon Dataset |
| KDD_2019 -> cyber | Cybersecurity_KDD | Using RAPIDS on network traffic and metadata, we demonstrate how to: 1. Triage and perform data exploration, 2. Model network data as a graph, 3. Perform graph analytics on the graph representation of the cyber network data, and 4. Prepare the results in a way that is suitable for visualization. | SG | IDS 2018 dataset |
| KDD_2019 -> graph_pattern_mining | MiningFrequentPatternsFromGraphs | This notebook uses PC failure metadata, turns it into a coordinate list, and uses cugraph to find frequent patterns about the population that has failed. | SG | Microsoft PC Failure Metadata Graph |
| KDD_2019 -> plasticc | Part 1.1 RNN Feature Engineering | Part 1.1 of this GPU-only notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. - Introduction found here. - Exercise Answers found here. - Original submission found here. | MG | Kaggle PLAsTiCC-2018 dataset |
| KDD_2019 -> plasticc | Part 1.2 RNN Extract Bottleneck | Part 1.2 of this GPU-only notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. - Introduction found here. - Exercise Answers found here. - Original submission found here. | MG | Kaggle PLAsTiCC-2018 dataset |
| KDD_2019 -> plasticc | Part 2.1 Feature Engineering | Part 2.1 of this GPU-only notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. - Introduction found here. - Exercise Answers found here. - Original submission found here. | MG | Kaggle PLAsTiCC-2018 dataset |
| KDD_2019 -> plasticc | Part 2.2 Train XGBoost & MLP | Part 2.2 of this GPU-only notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. - Introduction found here. - Exercise Answers found here. - Original submission found here. | MG | Kaggle PLAsTiCC-2018 dataset |
| SCIPY_2019 | SCIPY_2019 Tutorial Index | This index outlines the "getting started" style tutorials within the folder. The tutorials cover cudf, cuml, and cugraph. These tutorials were presented at SCIPY 2019. | SG | Various Self Generated datasets and Zachary Karate Club Data Set |
| ASONAM 2019 | Cyber | Example notebook using RAPIDS to let an organization's security and forensics experts collect vast amounts of network traffic and network metadata and perform fast triage, processing, modeling, and visualization. | MG | IDS 2018 dataset from the Canadian Institute for Cybersecurity |
| ASONAM 2019 | Spotify Playlist | Shows how you can quickly use RAPIDS to explore the Spotify Million Playlist Dataset, which was created for the RecSys 2018 competition, and build a playlist recommender. Note: this dataset requires an independent user download and cannot be pulled from the notebook. | MG | RecSys 2018 competition |
| ASONAM 2019 | Weighted Link Prediction | This notebook uses cuGraph for Weighted Link Prediction to mitigate uncertainty on the Epinions Trust Network Dataset and to predict the likelihood of trust or distrust between vertices. Note: this dataset requires an independent user download and cannot be pulled from the notebook. | SG | Epinions Trust Network Dataset |

Additional Information

  • The data folder also includes the full image set from the Fashion MNIST dataset.

  • utils: contains a set of useful scripts for interacting with RAPIDS Notebooks-Contrib

  • For our notebook examples and tutorials found in our standard containers, please see the Notebooks Repo