Practice ETL with Rust and Polars
This repository walks you through an example for each step of ETL (extract, transform, load) so that you can apply Rust and Polars to these operations using a sample CSV dataset.
You will be using a sample dataset that contains wines from all over the world. Explore the wine dataset and familiarize yourself with the data before you start the ETL process.
Each example is a separate Cargo project that is meant to be run independently. To run an example, navigate to its project directory and run the following command:
cargo run ../../top-rated-wines.csv
In this lesson, you will learn how to read a CSV file and load it into a Polars DataFrame. You will run a few basic checks to confirm that the data was loaded correctly and is in the expected format.
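The kind of basic check described above can be sketched in plain Rust. This is a minimal illustration using only the standard library, not the Polars API: the helper `check_shape` and the sample column names are hypothetical. With a real Polars DataFrame, `DataFrame::shape()` reports the same (rows, columns) pair.

```rust
/// Hypothetical helper (not part of Polars): confirms that every row has as
/// many fields as the header and returns (row_count, column_count).
fn check_shape(header: &[&str], rows: &[Vec<&str>]) -> Result<(usize, usize), String> {
    if header.is_empty() {
        return Err("header has no columns".to_string());
    }
    for (i, row) in rows.iter().enumerate() {
        if row.len() != header.len() {
            // Report the 1-based row number of the first mismatch.
            return Err(format!(
                "row {} has {} fields, expected {}",
                i + 1,
                row.len(),
                header.len()
            ));
        }
    }
    Ok((rows.len(), header.len()))
}
```

Calling `check_shape` with a three-column header and two well-formed rows returns `Ok((2, 3))`; a row with a missing field produces an `Err` that names the offending row.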
In this lesson, you will learn how to transform the data by filtering out unnecessary columns and rows. You will use one-hot encoding to convert categorical columns into numeric indicator columns. There are two examples in this lesson: one applies one-hot encoding to all columns, and the other applies it to selected columns only.
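To make one-hot encoding concrete, here is a minimal standard-library sketch of the idea. It is not the Polars implementation: the function `one_hot`, the `column_value` naming scheme, and the `color` column used in the note below are assumptions made for illustration.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: returns one indicator column name per distinct value,
/// plus one row of 0/1 flags for each input value.
fn one_hot(column: &str, values: &[&str]) -> (Vec<String>, Vec<Vec<u8>>) {
    // Collect the distinct categories in sorted order so the output is deterministic.
    let categories: BTreeSet<&str> = values.iter().copied().collect();

    // Name each indicator column "<column>_<category>".
    let names: Vec<String> = categories
        .iter()
        .map(|c| format!("{}_{}", column, c))
        .collect();

    // For each value, emit 1 in the matching category's position and 0 elsewhere.
    let rows: Vec<Vec<u8>> = values
        .iter()
        .map(|v| categories.iter().map(|c| u8::from(c == v)).collect())
        .collect();

    (names, rows)
}
```

For a `color` column holding `red`, `white`, `red`, this yields the columns `color_red` and `color_white` with rows `[1, 0]`, `[0, 1]`, `[1, 0]`. Encoding every column this way can create a very wide table, which is one reason to encode only selected columns.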
Finally, in this lesson, you will learn how to save the transformed data to a Parquet file. Parquet is a columnar storage format that compresses well and is efficient to read when a query needs only some of the columns.
- Verify the Parquet file: Save the transformed data to a Parquet file, then read it back to confirm that the data was saved correctly, using the Load project as a reference.
- Add options for saving: Currently, none of the projects saves CSV data back to the file system. Add an option to save the transformed data back to the file system.
- Add more transformations: Apply further transformations to the data, such as sorting, grouping, and aggregating.
- Implement schema validation: Use the Polars schema to verify that the data is in the expected format before transforming it.
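As a starting point for the "Add more transformations" item, the sketch below shows grouping, aggregating, and sorting in plain Rust with only the standard library. It illustrates the logic rather than the Polars API, and the `region` and `rating` fields are hypothetical column names, not confirmed columns of the wine dataset.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: groups (region, rating) pairs by region, averages each
/// group, and sorts the result by mean rating, highest first.
fn mean_rating_by_region(rows: &[(&str, f64)]) -> Vec<(String, f64)> {
    // Group: accumulate a running (sum, count) per region.
    let mut groups: BTreeMap<&str, (f64, usize)> = BTreeMap::new();
    for (region, rating) in rows {
        let entry = groups.entry(*region).or_insert((0.0, 0));
        entry.0 += *rating;
        entry.1 += 1;
    }

    // Aggregate: turn each (sum, count) into a mean.
    let mut means: Vec<(String, f64)> = groups
        .into_iter()
        .map(|(region, (sum, count))| (region.to_string(), sum / count as f64))
        .collect();

    // Sort: descending by mean; ties keep alphabetical region order.
    means.sort_by(|a, b| b.1.total_cmp(&a.1));
    means
}
```

Given the pairs `("Napa", 4.0)`, `("Rioja", 5.0)`, `("Napa", 5.0)`, the function returns `Rioja` with a mean of 5.0 followed by `Napa` with a mean of 4.5.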