kafka-connect-spooldir

Kafka Connect connector for reading CSV files into Kafka.


Overview

This Kafka Connect connector provides the capability to watch a directory for files and read the data as new files are written to the input directory. Each of the records in the input file will be converted based on the user-supplied schema.

The CSVRecordProcessor supports reading CSV or TSV files. It converts each record on the fly to the strongly typed Kafka Connect data types and currently supports all of the schema types and logical types available in Kafka 0.10.x. If you couple this with Confluent's Avro converter and Schema Registry, you can process CSV files into strongly typed Avro data in real time.
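
As a minimal sketch of that setup, the standard Kafka Connect converter settings below point at Confluent's Avro converter; the Schema Registry URL is a placeholder for your environment.

# Standard Connect converter settings (worker or connector level); the
# Schema Registry URL below is a placeholder.
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081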

Schema Configuration

This connector requires you to specify the schema of the files to be read. This is controlled by the key.schema and value.schema configuration settings. Each setting is a schema in JSON form, stored as a string. The field schemas are used to parse the data in each row, which allows the user to strongly define the type of each field.
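
A minimal sketch of how a schema is supplied, using the key schema example shown later in this document collapsed onto a single line (value.schema is set the same way):

# Sketch only: the schema JSON is stored as a single-line string value.
key.schema={"name":"com.example.users.UserKey","type":"STRUCT","isOptional":false,"fieldSchemas":{"id":{"type":"INT64","isOptional":false}}}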

Generating Schemas

CSV

mvn clean package
export CLASSPATH="$(find target/kafka-connect-target/usr/share/java -type f -name '*.jar' | tr '\n' ':')"
kafka-run-class com.github.jcustenborder.kafka.connect.spooldir.SchemaGenerator -t csv -f src/test/resources/com/github/jcustenborder/kafka/connect/spooldir/csv/FieldsMatch.data -c config/CSVExample.properties -i id

JSON

mvn clean package
export CLASSPATH="$(find target/kafka-connect-target/usr/share/java -type f -name '*.jar' | tr '\n' ':')"
kafka-run-class com.github.jcustenborder.kafka.connect.spooldir.SchemaGenerator -t json -f src/test/resources/com/github/jcustenborder/kafka/connect/spooldir/json/FieldsMatch.data -c config/JsonExample.properties -i id

Key schema example

{
    "name" : "com.example.users.UserKey",
    "type" : "STRUCT",
    "isOptional" : false,
    "fieldSchemas" : {
        "id" : {
          "type" : "INT64",
          "isOptional" : false
        }
    }
}

Value schema example

{
  "name" : "com.example.users.User",
  "type" : "STRUCT",
  "isOptional" : false,
  "fieldSchemas" : {
    "id" : {
      "type" : "INT64",
      "isOptional" : false
    },
    "first_name" : {
      "type" : "STRING",
      "isOptional" : true
    },
    "last_name" : {
      "type" : "STRING",
      "isOptional" : true
    },
    "email" : {
      "type" : "STRING",
      "isOptional" : true
    },
    "gender" : {
      "type" : "STRING",
      "isOptional" : true
    },
    "ip_address" : {
      "type" : "STRING",
      "isOptional" : true
    },
    "last_login" : {
      "name" : "org.apache.kafka.connect.data.Timestamp",
      "type" : "INT64",
      "version" : 1,
      "isOptional" : true
    },
    "account_balance" : {
      "name" : "org.apache.kafka.connect.data.Decimal",
      "type" : "BYTES",
      "version" : 1,
      "parameters" : {
        "scale" : "2"
      },
      "isOptional" : true
    },
    "country" : {
      "type" : "STRING",
      "isOptional" : true
    },
    "favorite_color" : {
      "type" : "STRING",
      "isOptional" : true
    }
  }
}
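
A hypothetical input row matching this value schema (with csv.first.row.as.header=true) could look like the following; the data values are purely illustrative.

id,first_name,last_name,email,gender,ip_address,last_login,account_balance,country,favorite_color
1,Jack,Smith,jsmith@example.com,Male,192.168.1.10,2016-01-15T11:32:45,199.99,US,blue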

Logical Type Examples

Timestamp

{
  "name" : "org.apache.kafka.connect.data.Timestamp",
  "type" : "INT64",
  "version" : 1,
  "isOptional" : true
}

Decimal

{
  "name" : "org.apache.kafka.connect.data.Decimal",
  "type" : "BYTES",
  "version" : 1,
  "parameters" : {
    "scale" : "2"
  },
  "isOptional" : true
}

Date

{
  "name" : "org.apache.kafka.connect.data.Date",
  "type" : "INT32",
  "version" : 1,
  "isOptional" : true
}

Time

{
  "name" : "org.apache.kafka.connect.data.Time",
  "type" : "INT32",
  "version" : 1,
  "isOptional" : true
}
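
Fields that use the Timestamp logical type are parsed from the text in the file using the connector's date format settings. A sketch of the related settings (all documented in the configuration tables below) is:

# Sketch: formats are tried in order when parsing timestamp text.
parser.timestamp.date.formats=yyyy-MM-dd'T'HH:mm:ss,yyyy-MM-dd' 'HH:mm:ss
parser.timestamp.timezone=UTC
# Optionally drive the record timestamp from a (non-optional) Timestamp field:
timestamp.mode=FIELD
timestamp.field=last_login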

Configuration

SpoolDirCsvSourceConnector

The SpoolDirCsvSourceConnector will monitor the directory specified in input.path for files and read them as CSV, converting each of the records to the strongly typed equivalent specified in key.schema and value.schema.

name=connector1
tasks.max=1
connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirCsvSourceConnector

# Set these required values
finished.path=
input.file.pattern=
error.path=
topic=
input.path=
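
A filled-in example is sketched below; the paths, file pattern, and topic are placeholders, and schema generation is shown as an alternative to supplying key.schema and value.schema explicitly.

# Hypothetical values; every directory must already exist and be writable.
name=csv-users
tasks.max=1
connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirCsvSourceConnector
input.path=/data/spool/input
finished.path=/data/spool/finished
error.path=/data/spool/error
# Backslashes are doubled because this is a Java properties file.
input.file.pattern=^users.*\\.csv$
topic=users
csv.first.row.as.header=true
# Instead of key.schema / value.schema, let the connector generate them:
schema.generation.enabled=true
schema.generation.key.fields=id
schema.generation.key.name=com.example.users.UserKey
schema.generation.value.name=com.example.users.User
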
Name Description Type Default Valid Values Importance
error.path The directory in which to place files that have error(s). This directory must exist and be writable by the user running Kafka Connect. string high
finished.path The directory to place files that have been successfully processed. This directory must exist and be writable by the user running Kafka Connect. string high
input.file.pattern Regular expression to check input file names against. This expression must match the entire filename. The equivalent of Matcher.matches(). string high
input.path The directory to read files that will be processed. This directory must exist and be writable by the user running Kafka Connect. string high
topic The Kafka topic to write the data to. string high
halt.on.error Should the task halt when it encounters an error or continue to the next file. boolean true high
key.schema The schema for the key written to Kafka. string "" high
value.schema The schema for the value written to Kafka. string "" high
csv.first.row.as.header Flag to indicate if the first row of data contains the header of the file. If true, the position of the columns will be determined by the first row of the CSV and matched against the fields of the schema supplied in value.schema. If set to true, the number of columns must be greater than or equal to the number of fields in the schema. boolean false medium
schema.generation.enabled Flag to determine if schemas should be dynamically generated. If set to true, key.schema and value.schema can be omitted, but schema.generation.key.name and schema.generation.value.name must be set. boolean false medium
schema.generation.key.fields The field(s) to use to build a key schema. This is only used during schema generation. list [] medium
schema.generation.key.name The name of the generated key schema. string com.github.jcustenborder.kafka.connect.model.Key medium
schema.generation.value.name The name of the generated value schema. string com.github.jcustenborder.kafka.connect.model.Value medium
timestamp.field The field in the value schema that will contain the parsed timestamp for the record. This field cannot be marked as optional and must be a Timestamp. string "" medium
timestamp.mode Determines how the connector will set the timestamp for the ConnectRecord. If set to FIELD then the timestamp will be read from a field in the value. This field cannot be optional and must be a Timestamp. Specify the field in timestamp.field. If set to FILE_TIME then the last modified time of the file will be used. If set to PROCESS_TIME the time the record is read will be used. string PROCESS_TIME ValidEnum{enum=TimestampMode, allowed=[FIELD, FILE_TIME, PROCESS_TIME]} medium
batch.size The number of records that should be returned with each batch. int 1000 low
csv.case.sensitive.field.names Flag to determine if the field names in the header row should be treated as case sensitive. boolean false low
csv.escape.char Escape character. int 92 low
csv.file.charset Character set to read the file with. string UTF-8 Big5,Big5-HKSCS,CESU-8,EUC-JP,EUC-KR,GB18030,GB2312,GBK,IBM-Thai,IBM00858,IBM01140,IBM01141,IBM01142,IBM01143,IBM01144,IBM01145,IBM01146,IBM01147,IBM01148,IBM01149,IBM037,IBM1026,IBM1047,IBM273,IBM277,IBM278,IBM280,IBM284,IBM285,IBM290,IBM297,IBM420,IBM424,IBM437,IBM500,IBM775,IBM850,IBM852,IBM855,IBM857,IBM860,IBM861,IBM862,IBM863,IBM864,IBM865,IBM866,IBM868,IBM869,IBM870,IBM871,IBM918,ISO-2022-CN,ISO-2022-JP,ISO-2022-JP-2,ISO-2022-KR,ISO-8859-1,ISO-8859-13,ISO-8859-15,ISO-8859-2,ISO-8859-3,ISO-8859-4,ISO-8859-5,ISO-8859-6,ISO-8859-7,ISO-8859-8,ISO-8859-9,JIS_X0201,JIS_X0212-1990,KOI8-R,KOI8-U,Shift_JIS,TIS-620,US-ASCII,UTF-16,UTF-16BE,UTF-16LE,UTF-32,UTF-32BE,UTF-32LE,UTF-8,windows-1250,windows-1251,windows-1252,windows-1253,windows-1254,windows-1255,windows-1256,windows-1257,windows-1258,windows-31j,x-Big5-HKSCS-2001,x-Big5-Solaris,x-COMPOUND_TEXT,x-euc-jp-linux,x-EUC-TW,x-eucJP-Open,x-IBM1006,x-IBM1025,x-IBM1046,x-IBM1097,x-IBM1098,x-IBM1112,x-IBM1122,x-IBM1123,x-IBM1124,x-IBM1364,x-IBM1381,x-IBM1383,x-IBM300,x-IBM33722,x-IBM737,x-IBM833,x-IBM834,x-IBM856,x-IBM874,x-IBM875,x-IBM921,x-IBM922,x-IBM930,x-IBM933,x-IBM935,x-IBM937,x-IBM939,x-IBM942,x-IBM942C,x-IBM943,x-IBM943C,x-IBM948,x-IBM949,x-IBM949C,x-IBM950,x-IBM964,x-IBM970,x-ISCII91,x-ISO-2022-CN-CNS,x-ISO-2022-CN-GB,x-iso-8859-11,x-JIS0208,x-JISAutoDetect,x-Johab,x-MacArabic,x-MacCentralEurope,x-MacCroatian,x-MacCyrillic,x-MacDingbat,x-MacGreek,x-MacHebrew,x-MacIceland,x-MacRoman,x-MacRomania,x-MacSymbol,x-MacThai,x-MacTurkish,x-MacUkraine,x-MS932_0213,x-MS950-HKSCS,x-MS950-HKSCS-XP,x-mswin-936,x-PCK,x-SJIS_0213,x-UTF-16LE-BOM,X-UTF-32BE-BOM,X-UTF-32LE-BOM,x-windows-50220,x-windows-50221,x-windows-874,x-windows-949,x-windows-950,x-windows-iso2022jp low
csv.ignore.leading.whitespace Sets the ignore leading whitespace setting - if true, white space in front of a quote in a field is ignored. boolean true low
csv.ignore.quotations Sets the ignore quotations mode - if true, quotations are ignored. boolean false low
csv.keep.carriage.return Flag to determine if the carriage return at the end of the line should be maintained. boolean false low
csv.null.field.indicator Indicator used by the CSV reader to determine if a field is null. Valid values are EMPTY_SEPARATORS, EMPTY_QUOTES, BOTH, NEITHER. For more information see http://opencsv.sourceforge.net/apidocs/com/opencsv/enums/CSVReaderNullFieldIndicator.html. string NEITHER ValidEnum{enum=CSVReaderNullFieldIndicator, allowed=[EMPTY_SEPARATORS, EMPTY_QUOTES, BOTH, NEITHER]} low
csv.quote.char The character that is used to quote a field. This typically happens when the csv.separator.char character is within the data. int 34 low
csv.separator.char The character that separates each field. Typically in a CSV this is a , character. A TSV would use \t. int 44 low
csv.skip.lines Number of lines to skip in the beginning of the file. int 0 low
csv.strict.quotes Sets the strict quotes setting - if true, characters outside the quotes are ignored. boolean false low
csv.verify.reader Flag to determine if the reader should be verified. boolean true low
empty.poll.wait.ms The amount of time in milliseconds to wait if a poll returns an empty list of records. long 1000 [1,...,9223372036854775807] low
file.minimum.age.ms The amount of time in milliseconds after the file was last written to before the file can be processed. long 0 [0,...,9223372036854775807] low
parser.timestamp.date.formats The date formats that are expected in the file. This is a list of strings that will be used to parse the date fields in order. The most accurate date format should be the first in the list. Take a look at the Java documentation for more info. https://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html list [yyyy-MM-dd'T'HH:mm:ss, yyyy-MM-dd' 'HH:mm:ss] low
parser.timestamp.timezone The timezone that all of the dates will be parsed with. string UTC low
processing.file.extension Before a file is processed, it is renamed to indicate that it is currently being processed. This setting is appended to the end of the file name. string .PROCESSING ValidPattern{pattern=^.*\..+$} low

SpoolDirJsonSourceConnector

This connector is used to stream JSON files from a directory while converting the data based on the schema supplied in the configuration.

name=connector1
tasks.max=1
connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirJsonSourceConnector

# Set these required values
finished.path=
input.file.pattern=
topic=
error.path=
input.path=
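
As with the CSV connector, a filled-in configuration is sketched below with placeholder values.

# Hypothetical values; every directory must already exist and be writable.
name=json-users
tasks.max=1
connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirJsonSourceConnector
input.path=/data/spool/json/input
finished.path=/data/spool/json/finished
error.path=/data/spool/json/error
input.file.pattern=^users.*\\.json$
topic=users
# Supply key.schema / value.schema as single-line JSON strings, or enable
# schema generation as described for the CSV connector above.
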
Name Description Type Default Valid Values Importance
error.path The directory in which to place files that have error(s). This directory must exist and be writable by the user running Kafka Connect. string high
finished.path The directory to place files that have been successfully processed. This directory must exist and be writable by the user running Kafka Connect. string high
input.file.pattern Regular expression to check input file names against. This expression must match the entire filename. The equivalent of Matcher.matches(). string high
input.path The directory to read files that will be processed. This directory must exist and be writable by the user running Kafka Connect. string high
topic The Kafka topic to write the data to. string high
halt.on.error Should the task halt when it encounters an error or continue to the next file. boolean true high
key.schema The schema for the key written to Kafka. string "" high
value.schema The schema for the value written to Kafka. string "" high
schema.generation.enabled Flag to determine if schemas should be dynamically generated. If set to true, key.schema and value.schema can be omitted, but schema.generation.key.name and schema.generation.value.name must be set. boolean false medium
schema.generation.key.fields The field(s) to use to build a key schema. This is only used during schema generation. list [] medium
schema.generation.key.name The name of the generated key schema. string com.github.jcustenborder.kafka.connect.model.Key medium
schema.generation.value.name The name of the generated value schema. string com.github.jcustenborder.kafka.connect.model.Value medium
timestamp.field The field in the value schema that will contain the parsed timestamp for the record. This field cannot be marked as optional and must be a Timestamp. string "" medium
timestamp.mode Determines how the connector will set the timestamp for the ConnectRecord. If set to FIELD then the timestamp will be read from a field in the value. This field cannot be optional and must be a Timestamp. Specify the field in timestamp.field. If set to FILE_TIME then the last modified time of the file will be used. If set to PROCESS_TIME the time the record is read will be used. string PROCESS_TIME ValidEnum{enum=TimestampMode, allowed=[FIELD, FILE_TIME, PROCESS_TIME]} medium
batch.size The number of records that should be returned with each batch. int 1000 low
empty.poll.wait.ms The amount of time in milliseconds to wait if a poll returns an empty list of records. long 1000 [1,...,9223372036854775807] low
file.minimum.age.ms The amount of time in milliseconds after the file was last written to before the file can be processed. long 0 [0,...,9223372036854775807] low
parser.timestamp.date.formats The date formats that are expected in the file. This is a list of strings that will be used to parse the date fields in order. The most accurate date format should be the first in the list. Take a look at the Java documentation for more info. https://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html list [yyyy-MM-dd'T'HH:mm:ss, yyyy-MM-dd' 'HH:mm:ss] low
parser.timestamp.timezone The timezone that all of the dates will be parsed with. string UTC low
processing.file.extension Before a file is processed, it is renamed to indicate that it is currently being processed. This setting is appended to the end of the file name. string .PROCESSING ValidPattern{pattern=^.*\..+$} low

Building on your workstation

git clone git@github.com:jcustenborder/kafka-connect-spooldir.git
cd kafka-connect-spooldir
mvn clean package
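
The packaged connector and its dependencies end up under target/kafka-connect-target/usr/share/java (the same directory used for the CLASSPATH in the schema generation examples above). To run it with a local Connect worker, that directory can be added to the worker's plugin.path; the path below is a placeholder for wherever you cloned the repository.

# Worker configuration sketch; plugin.path is a standard Kafka Connect setting.
plugin.path=/path/to/kafka-connect-spooldir/target/kafka-connect-target/usr/share/java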

Debugging

Running with debugging

./bin/debug.sh

Running with debugging - suspending

export SUSPEND='y'
./bin/debug.sh