StreamSets Data Collector Tutorials

The following tutorials demonstrate some StreamSets Data Collector features. Clone this repository to your machine to follow along and get familiar with using Data Collector.

Basic Tutorials

Log Shipping to Elasticsearch - Read weblog files from a local filesystem directory, decorate some of the fields (e.g. GeoIP Lookup), and write them to Elasticsearch.
Simple Kafka Enablement using StreamSets Data Collector
What’s the Biggest Lot in the City of San Francisco? - Read city lot data from JSON, calculate lot areas in JavaScript, and write them to Hive.
Ingesting Local Data into Azure Data Lake Store - Read records from a local CSV-formatted file, mask out PII (credit card numbers) and send them to a JSON-formatted file in Azure Data Lake Store.

Writing Custom Pipeline Stages

Creating a Custom StreamSets Origin - Build a simple custom origin that reads a Git repository's commit log and produces the corresponding records.
Creating a Custom Multithreaded StreamSets Origin - A more advanced tutorial focusing on building an origin that supports parallel execution, so the pipeline can run in multiple threads.
Creating a Custom StreamSets Processor - Build a simple custom processor that reads metadata tags from image files and writes them to the records as fields.
Creating a Custom StreamSets Destination - Build a simple custom destination that writes batches of records to a webhook.

Advanced Features

Ingesting Drifting Data into Hive and Impala - Build a pipeline that handles schema changes in MySQL, creating and altering Hive tables accordingly.
Creating a StreamSets Spark Transformer in Java - Build a simple Java Spark Transformer that computes a credit card's issuing network from its number.
Creating a StreamSets Spark Transformer in Scala - Build a simple Scala Spark Transformer that computes a credit card's issuing network from its number.
Creating a CRUD Microservice Pipeline - Build a microservice pipeline to implement a RESTful web service that reads from and writes to a database via JDBC.

The Data Collector documentation also includes an extended tutorial that walks through basic Data Collector functionality, including creating, previewing and running a pipeline, and creating alerts.

License

StreamSets Data Collector and its tutorials are built on open source technologies; the tutorials and accompanying code are licensed with the Apache License 2.0.

Contributing Tutorials