For HDFS-based ingestion, push the following file into HDFS:
'https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz'
- Download and extract the file onto one of the servers where the HDFS client is installed:
wget --no-verbose --continue 'https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz'
gzip -d hits.tsv.gz
- Upload the file to HDFS:
hdfs dfs -mkdir /data
hdfs dfs -put hits.tsv /data
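Optionally, verify that the file landed in HDFS before kicking off the ingestion:
hdfs dfs -ls /data
hdfs dfs -du -h /data/hits.tsv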
For file-based ingestion, just have the following file on one of the servers.
- Download and extract the file onto one of the servers:
wget --no-verbose --continue 'https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz'
gzip -d hits.tsv.gz
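As a quick sanity check, confirm the extracted file is present and readable:
ls -lh hits.tsv
head -1 hits.tsv | cut -f1-3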
For Kafka-based ingestion, you will need to spin up a Kafka cluster with one or more brokers. We will use implydata/druid-datagenerator to produce events.
To deploy the Kafka data generator, use the following command:
docker run -d -p 9999:9999 imply/datagen:latest
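You can confirm the generator container is up and, if your cluster does not auto-create topics, pre-create a topic for the generated events (the topic name and broker address below are placeholders, not values mandated by the tool):
docker ps --filter ancestor=imply/datagen:latest
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic <topic> --partitions 1 --replication-factor 1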
The tool expects the following inputs:
- Which type of ingestion job to perform, or all of them
- The number of iterations for the queries to be run against all the datasources
- How many replicas are needed for each datasource
ddgen ingest [kafka | hadoop | file | all]
This will perform the ingestion and keep the application state locally, so that if there is any issue the application will resume from the same point. All the prerequisites for the ingestion should already be in place; otherwise running this command will produce errors.
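For example, to run only the Kafka ingestion, or everything at once:
ddgen ingest kafka
ddgen ingest all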
ddgen query [--iterations=<number of iterations> --datasource=<datasource>]
This will execute the queries against the datasource and keep the application state locally, so that if there is any issue the application will resume from the same point.
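For example, to run ten iterations of the queries against a single datasource (the datasource name here is illustrative):
ddgen query --iterations=10 --datasource=hits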
ddgen cron [--datasource=<datasource>]
This will output a cron expression for you to use in your crontab.
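A minimal sketch of installing the output, assuming the tool emits a complete crontab line (the datasource name is illustrative):
( crontab -l; ddgen cron --datasource=hits ) | crontab -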