Rust Data Engineering
Projects for the Rust Data Engineering Coursera course. Project website: https://nogibjj.github.io/rust-data-engineering/
Environments
- Works with both AWS CodeCatalyst and GitHub Codespaces
Feedback
- Any suggestions or feedback? Feel free to file a ticket.
Labs (in sequential order)
Week 1- Rust Data Structures: Collections
Sequences
- Print Rust data structures:
cd print-data-structs && cargo run
- Vector Fruit Salad:
cd vector-fruit-salad && cargo run
- VecDeque Fruit Salad:
cd vecdeque-fruit-salad && cargo run
- Linked List Fruit Salad:
cd linked-list-fruit-salad && cargo run
- Fruit Salad CLI:
cd cli-salad && cargo run -- --number 3
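As a rough illustration of how the three sequence types above differ, here is a minimal sketch using only the standard library. The function names (vec_salad, vecdeque_salad, linked_list_salad) are hypothetical and are not taken from the lab projects.

```rust
use std::collections::{LinkedList, VecDeque};

/// Builds a salad in a Vec: efficient pushes at the back only.
fn vec_salad() -> Vec<&'static str> {
    let mut salad = vec!["apple", "banana"];
    salad.push("cherry"); // amortized O(1) push at the back
    salad
}

/// Builds a salad in a VecDeque: cheap pushes at both ends.
fn vecdeque_salad() -> VecDeque<&'static str> {
    let mut salad: VecDeque<&str> = VecDeque::new();
    salad.push_back("apple");
    salad.push_back("banana");
    salad.push_front("fig"); // O(1) push at the front
    salad
}

/// Builds a salad in a LinkedList and splices a second list onto it.
fn linked_list_salad() -> LinkedList<&'static str> {
    let mut salad: LinkedList<&str> = LinkedList::new();
    salad.push_back("apple");
    let mut extra: LinkedList<&str> = LinkedList::new();
    extra.push_back("date");
    salad.append(&mut extra); // moves all nodes over, leaving `extra` empty
    salad
}

fn main() {
    println!("{:?}", vec_salad());
    println!("{:?}", vecdeque_salad());
    println!("{:?}", linked_list_salad());
}
```

In practice Vec is the usual default; VecDeque is worth considering when items are added or removed at the front.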
Maps
- HashMap frequency counter:
cd hashmap-count && cargo run
- HashMap language comparison:
cd hashmap-language && cargo run
- BTreeMap language comparison:
cd BTreeMap-language && cargo run
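A frequency counter along the lines of the HashMap lab can be sketched as follows. This is an assumption about the general shape, not the lab's actual code; count_frequencies is a hypothetical name. Collecting into a BTreeMap afterwards shows the main practical difference between the two maps: key-ordered iteration.

```rust
use std::collections::{BTreeMap, HashMap};

/// Counts how often each number appears in the slice.
fn count_frequencies(numbers: &[i32]) -> HashMap<i32, u32> {
    let mut counts = HashMap::new();
    for &n in numbers {
        // Insert 0 the first time a key is seen, then increment.
        *counts.entry(n).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = count_frequencies(&[1, 2, 3, 1, 2, 1]);
    // HashMap iteration order is unspecified; a BTreeMap iterates in key order.
    let sorted: BTreeMap<i32, u32> = counts.into_iter().collect();
    for (number, count) in &sorted {
        println!("{number}: {count}");
    }
}
```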
Sets
- HashSet fruits:
cd hashset-fruit && cargo run
- BTreeSet fruits:
cd btreeset-fruit && cargo run
Misc
- Binary Heap Fruit Salad with Fig Priority:
cd binaryheap-fruit && cargo run
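One possible way to give figs priority in a BinaryHeap is a custom Ord implementation, sketched below. The Fruit enum and the fig_first function are hypothetical names chosen for illustration; the lab project may order its fruits differently.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

#[derive(Debug, PartialEq, Eq)]
enum Fruit {
    Fig,
    Other(String),
}

impl Ord for Fruit {
    // Fig compares greater than every other fruit, so the max-heap pops it first.
    fn cmp(&self, other: &Self) -> Ordering {
        match (self, other) {
            (Fruit::Fig, Fruit::Fig) => Ordering::Equal,
            (Fruit::Fig, Fruit::Other(_)) => Ordering::Greater,
            (Fruit::Other(_), Fruit::Fig) => Ordering::Less,
            (Fruit::Other(a), Fruit::Other(b)) => a.cmp(b),
        }
    }
}

impl PartialOrd for Fruit {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Pushes every name onto a heap and pops them back in priority order.
fn fig_first(names: &[&str]) -> Vec<Fruit> {
    let mut heap = BinaryHeap::new();
    for &name in names {
        if name == "fig" {
            heap.push(Fruit::Fig);
        } else {
            heap.push(Fruit::Other(name.to_string()));
        }
    }
    let mut ordered = Vec::new();
    while let Some(fruit) = heap.pop() {
        ordered.push(fruit);
    }
    ordered
}

fn main() {
    println!("{:?}", fig_first(&["apple", "fig", "pear"]));
}
```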
Week 2-Safety, Security, and Concurrency with Rust
- mutable fruit salad:
cd mutable-fruit-salad && cargo run
- cli customize fruit salad:
cd cli-customize-fruit-salad && cargo run -- fruits.csv
or cargo run -- --fruits "apple, pear"
- data race example:
cd data-race && cargo run
(will not compile because of data race)
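For contrast, one common way to make shared mutation compile is to wrap the value in Arc<Mutex<T>>. The sketch below assumes a simple shared counter; parallel_count is a hypothetical name and the data-race lab may use a different example.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Spawns `threads` workers that each add 1 to a shared counter `increments` times.
fn parallel_count(threads: usize, increments: usize) -> usize {
    // Arc provides shared ownership across threads; Mutex serializes access.
    let counter = Arc::new(Mutex::new(0usize));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..increments {
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let total = *counter.lock().unwrap();
    total
}

fn main() {
    println!("total = {}", parallel_count(4, 1000));
}
```

Removing the Mutex and mutating a plain integer from several threads is rejected by the compiler, which is the behavior the lab demonstrates.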
Ciphers vs Encryption
The main differences between ciphers and encryption algorithms:
- Ciphers operate directly on the plaintext, substituting or transposing the letters mathematically. Encryption algorithms operate on the binary data representations of the plaintext.
- Ciphers typically have a small key space based on simple operations like letter mappings or transposition rules. Encryption algorithms use complex math and very large key sizes.
- Ciphers provide security through obscuring letter frequencies but are still vulnerable to cryptanalysis. Encryption algorithms rely on computational hardness assumptions.
- Ciphers only handle textual data well. Encryption algorithms can handle all binary data like images, video, etc.
In summary:
- Ciphers like homophonic substitution operate directly on textual plaintext with simple math operations and fixed small key spaces.
- Encryption algorithms like AES operate on any binary data with complex math and very large key sizes.
- Ciphers are considered obsolete for serious encryption use today due to vulnerabilities.
- Modern encryption bases its security on mathematical problems that are assumed to be computationally infeasible to solve.
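To make the "small key space" point concrete, here is a minimal sketch of a Caesar shift, a simpler classical cipher than the homophonic substitution mentioned above, chosen only because it fits in a few lines. The function names caesar_shift and brute_force are hypothetical. With only 26 possible keys, trying every one recovers the plaintext immediately.

```rust
/// Shifts ASCII letters by `shift` positions, leaving other characters unchanged.
fn caesar_shift(text: &str, shift: u8) -> String {
    let shift = shift % 26;
    text.chars()
        .map(|c| {
            if c.is_ascii_lowercase() {
                ((c as u8 - b'a' + shift) % 26 + b'a') as char
            } else if c.is_ascii_uppercase() {
                ((c as u8 - b'A' + shift) % 26 + b'A') as char
            } else {
                c
            }
        })
        .collect()
}

/// Returns all 26 candidate decryptions; the whole key space fits in a Vec.
fn brute_force(ciphertext: &str) -> Vec<String> {
    (0..26u8).map(|key| caesar_shift(ciphertext, 26 - key)).collect()
}

fn main() {
    let secret = caesar_shift("Hello, World!", 3);
    println!("{secret}");
    for candidate in brute_force(&secret) {
        println!("{candidate}");
    }
}
```

By comparison, AES-128 has 2^128 possible keys, so the same exhaustive search is not practical.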
Suggested Exercises
- Data Race Detector: Create a multi-threaded application that attempts to produce a data race, then show how Rust's ownership rules prevent this from occurring.
- Memory Leak Preventer: Build an application that would typically suffer from memory leaks in other languages, such as a complex tree structure. Rust's ownership model and RAII (Resource Acquisition Is Initialization) free each node when its owner goes out of scope, which prevents most leaks; reference cycles built with Rc can still leak, so use Weak for back-references such as parent links.
- Null Pointer Safety: Create a project demonstrating how Rust's Option<T> and Result<T, E> types are used to handle potentially missing or error-producing cases safely.
- Immutable by Default: Design a system where mutability would cause bugs (for instance, a simulation with entities that should not be able to change once created) and show how Rust's immutability by default prevents these issues.
- System with Lifetimes: Show how lifetimes can prevent use-after-free bugs by building an application where objects have distinct lifetimes that must be enforced.
- No Segfault System: Create a project that would usually segfault in other languages, and demonstrate how Rust prevents this.
- Web Server: Build a small multi-threaded web server. Show how Rust's safety features prevent common bugs in concurrent programming.
- Safe FFI: Create a project that uses Rust's Foreign Function Interface (FFI) to interoperate with C libraries safely.
- Safe Transmute: Write a program that demonstrates the use of safe transmutes in Rust. This could be a good way to show how Rust can avoid undefined behavior that's common in languages like C or C++.
- Bounds Checking: Design a system that would typically have a lot of array bounds errors, then show how Rust's automatic bounds checking prevents these kinds of errors.
- Immutable Concurrency: Create a project that takes advantage of Rust's ability to share immutable data among threads without data races.
- Command Line Application: Build a command-line application that processes user input. Use Rust's strong type system and pattern matching to handle different types of input safely and cleanly.
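As a starting point for the Null Pointer Safety and Bounds Checking exercises, the sketch below shows Option and Result standing in for null values and error codes. The helper names find_fruit and parse_quantity are hypothetical.

```rust
use std::num::ParseIntError;

/// Returns the fruit at `index`, or None when the index is out of bounds.
fn find_fruit<'a>(fruits: &[&'a str], index: usize) -> Option<&'a str> {
    // `get` performs the bounds check and never panics.
    fruits.get(index).copied()
}

/// Parses a quantity, surfacing bad input as an Err instead of a sentinel value.
fn parse_quantity(input: &str) -> Result<u32, ParseIntError> {
    input.trim().parse::<u32>()
}

fn main() {
    let fruits = ["apple", "pear"];

    // The caller must handle both cases; there is no null to dereference.
    match find_fruit(&fruits, 5) {
        Some(fruit) => println!("found {fruit}"),
        None => println!("no fruit at that index"),
    }

    match parse_quantity("many") {
        Ok(n) => println!("quantity {n}"),
        Err(e) => println!("invalid quantity: {e}"),
    }
}
```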
Week 3-Rust Data Engineering Libraries and Tools
Suggested Exercises
- CSV Data Processing: A tool for processing large CSV files, showcasing efficient data reading, filtering, and aggregation capabilities of Rust.
- Database Interaction: An application that interacts with a SQL database (like PostgreSQL) using Diesel, demonstrating CRUD operations, migrations, and complex queries.
- Data Visualization: A CLI tool that generates graphs and charts from input data using plotters.
- Web Scraper: A multi-threaded web scraper that fetches and parses data from several web pages concurrently.
- REST API Consumer: An application that interacts with a REST API to fetch, process, and visualize data.
- Log Parser: A tool to parse and analyze server log files. It can extract meaningful information and statistics and provide insights about the server performance.
- File System Analyzer: An application that provides insights about disk usage, like the du command in Unix.
- Real-Time Twitter Analysis: A real-time tweet analysis tool that uses the Twitter Stream API to fetch tweets and analyze them (for example, performing sentiment analysis).
- Stock Market Analyzer: An application that fetches stock market data from a free API and performs various analyses.
- Text Analytics: A text analytics library that provides functionalities like sentiment analysis, named entity recognition, etc.
- Delta Lake Interaction: A project demonstrating interaction with Delta Lake for processing large amounts of data.
- AWS SDK usage: A project demonstrating the use of AWS SDK in Rust for tasks such as accessing S3 buckets, performing operations on DynamoDB, etc.
- Data Processing with Polars: A project demonstrating how to perform large-scale data processing with the Polars library in Rust.
- Kafka Producer/Consumer: An application that produces and consumes messages from Kafka.
- gRPC Microservices: A basic microservices setup using gRPC, demonstrating how Rust can be used for backend development.
- Apache Arrow usage: A project demonstrating how to use Apache Arrow for columnar data processing in Rust.
- Parquet File Processing: An application that reads and writes Parquet files, demonstrating how Rust can be used for efficient data engineering tasks.
- Data Engineering with TiKV: A project demonstrating how to use TiKV, a distributed transactional key-value database built in Rust.
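For the CSV Data Processing exercise, the read, filter, and aggregate steps can be sketched with the standard library alone. This is a simplification under stated assumptions: the two-column layout (category,amount) and the name sum_by_category are hypothetical, and quoted fields are not handled; a real tool would more likely use a dedicated CSV crate.

```rust
use std::collections::BTreeMap;

/// Sums the `amount` column per `category` for rows shaped like `category,amount`.
/// Rows with a missing or non-numeric amount are skipped.
fn sum_by_category(csv: &str) -> BTreeMap<String, f64> {
    let mut totals = BTreeMap::new();
    for line in csv.lines().skip(1) {
        // skip(1) drops the header row
        let mut fields = line.split(',');
        let (Some(category), Some(amount)) = (fields.next(), fields.next()) else {
            continue;
        };
        if let Ok(value) = amount.trim().parse::<f64>() {
            *totals.entry(category.trim().to_string()).or_insert(0.0) += value;
        }
    }
    totals
}

fn main() {
    let data = "category,amount\nfruit,1.5\nveg,2.0\nfruit,2.5\n";
    for (category, total) in sum_by_category(data) {
        println!("{category}: {total}");
    }
}
```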
Week 4-Rust
Suggested Exercises
- Rust-based ETL Pipeline: Develop an ETL (Extract, Transform, Load) pipeline using various Rust libraries to process and transfer data between different storage systems.
- Web Scraper with Rust: Build a concurrent web scraper that can efficiently scrape large amounts of data from web pages.
- Rust REST API Server: Design a REST API server in Rust that serves data from a database. Use the Diesel ORM for database interactions.
- Real-time Data Streaming with Rust: Implement a real-time data streaming application, processing streams of data in a concurrent manner.
- Rust-based Data Lake: Use the Delta Lake Rust API to create a data lake solution. Implement CRUD operations on the data lake.
- Big Data Processing with Rust and Apache Arrow: Use Apache Arrow to perform efficient in-memory big data processing.
- Rust and AWS SDK: Use the AWS SDK for Rust to interact with AWS services such as S3 and DynamoDB.
- gRPC Service in Rust: Implement a gRPC service in Rust that performs CRUD operations on a database.
- Log Analyzer: Create a log analyzer that can process large log files concurrently and provide useful insights from logs.
- Distributed Systems with Rust: Create a simple distributed system using Rust's concurrency features. This could be a simple key-value store or a message-passing system.
- Rust and GraphQL: Implement a GraphQL API in Rust using libraries like Juniper.
- Data Serialization with Rust: Use libraries like serde to perform data serialization and deserialization in various formats (JSON, XML, etc.).
- Rust and Kafka: Use Rust to interact with Kafka, implementing a producer and consumer system.
- Data Validation Service: Create a service that validates input data based on predefined rules. This could be a web service or a library that other services can use.
- Rust and Machine Learning: Use Rust machine learning libraries to implement a simple prediction model. You could use the data processed in the ETL pipeline or the data lake for this.
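The Data Validation Service exercise could start from a small rule table like the one sketched here. Everything in it (the Rule struct, the rule names, and the specific checks) is a hypothetical illustration of "predefined rules", not a prescribed design.

```rust
/// A named validation rule; `check` returns true when the input passes.
struct Rule {
    name: &'static str,
    check: fn(&str) -> bool,
}

/// Example rule set; the rules themselves are arbitrary placeholders.
fn default_rules() -> Vec<Rule> {
    vec![
        Rule { name: "non_empty", check: |s| !s.trim().is_empty() },
        Rule { name: "max_20_chars", check: |s| s.chars().count() <= 20 },
        Rule { name: "ascii_only", check: |s| s.is_ascii() },
    ]
}

/// Returns the names of every rule the input violates, in rule order.
fn validate(input: &str, rules: &[Rule]) -> Vec<&'static str> {
    rules
        .iter()
        .filter(|rule| !(rule.check)(input))
        .map(|rule| rule.name)
        .collect()
}

fn main() {
    let rules = default_rules();
    for input in ["apple", "", "crème brûlée with figs and pears"] {
        println!("{input:?} violates {:?}", validate(input, &rules));
    }
}
```

Keeping the rules as data makes it straightforward to expose the same validate function from a library or wrap it in a web handler later.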
Lab: Modifying a Rust Command-Line Tool
In this lab you will gain experience extending an existing Rust project by forking and modifying a simple command-line tool.
Steps
- Fork the repository at https://github.com/nogibjj/rust-data-engineering
- Clone your forked repository
- Navigate to one of the command-line tool projects
- Make a small modification to the tool such as:
  - Adding a new command line argument
  - Supporting additional input file formats
  - Adding more processing logic
  - Changing output formatting
- Run cargo build to compile your changes
- Run cargo run to test your modified tool
- Commit your changes and push to your forked repository
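As one example of "adding a new command line argument", the sketch below parses a --number flag by hand with std::env::args. It is an illustration under assumptions: parse_number is a hypothetical helper, and the existing tools in the repository may parse arguments differently (for example with an argument-parsing crate), in which case the change would follow that crate's conventions instead.

```rust
use std::env;

/// Looks for `--number <value>` in the arguments; falls back to `default` when absent.
fn parse_number(args: &[String], default: usize) -> Result<usize, String> {
    let mut iter = args.iter();
    while let Some(arg) = iter.next() {
        if arg == "--number" {
            let value = iter.next().ok_or("--number requires a value")?;
            return value
                .parse::<usize>()
                .map_err(|e| format!("invalid --number value: {e}"));
        }
    }
    Ok(default)
}

fn main() {
    // Skip the program name, which is always the first argument.
    let args: Vec<String> = env::args().skip(1).collect();
    match parse_number(&args, 3) {
        Ok(n) => println!("picking {n} fruits"),
        Err(message) => {
            eprintln!("error: {message}");
            std::process::exit(1);
        }
    }
}
```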
Deliverable
Submit a link to your forked repository showing the code changes.
Goals
This hands-on lab provides experience with:
- Forking and cloning a Rust project
- Modifying existing Rust code
- Running cargo build and cargo run
- Version control with git
- Making a pull request (optional)
Technical Notes
Makefile
Each subdirectory project uses this Makefile style to make it easy to test and run:
format:
	cargo fmt --quiet
lint:
	cargo clippy --quiet
test:
	cargo test --quiet
run:
	cargo run
all: format lint test run