Aiven-Open/gcs-connector-for-apache-kafka

Add support for saving the raw value on GCS

Closed this issue · 4 comments

ndajr commented

I'm using the following configuration:

format.output.type: jsonl
format.output.fields: value

As a result, the connector is saving objects on GCS in this format:

{"value":{"id":"foo"}}
{"value":{"id":"bar"}}

What I'd like to have is:

{"id":"foo"}
{"id":"bar"}

This would be helpful, for example, when using GCS as a source for BigQuery: we could avoid excessive nesting both when defining the schema and when querying the data.

One solution would be adding a new output field, plainValue, and checking the output field here: if it is plainValue we could use ValuePlainWriter, otherwise JsonLinesOutputWriter as the default.

Any other suggestions are welcome too, if you like the idea I am willing to raise a PR. Thank you!

ivanyu commented

Hi @neemiasjnr
Looks like our configuration could be improved!

I believe there's a workaround to achieve what you want. Could you please try:

value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
format.output.type=csv
format.output.fields=value
format.output.fields.value.encoding=none

However, this will work only if record values are already single-line JSON strings.
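To illustrate why the encoding setting matters here, a minimal sketch (not the connector's actual code; the method name and the assumption that base64 is the default value encoding are illustrative): with encoding=none the raw record bytes are written verbatim, one record per line, whereas the base64 encoding would transform them first.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch of the pass-through behaviour the workaround relies on.
public class CsvValueSketch {
    // With base64=false (encoding "none"), the record value bytes are emitted
    // as-is; with base64=true they would be base64-encoded first.
    static String writeValue(byte[] value, boolean base64) {
        return base64
                ? Base64.getEncoder().encodeToString(value)
                : new String(value, StandardCharsets.UTF_8);
    }
}
```

Since each record becomes one output line, a value containing embedded newlines would be split across lines, which is why the workaround only holds for single-line JSON strings.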

Your contribution would be very much welcome!
I can suggest a slightly different idea, something like this: we could introduce a new configuration option, format.output.json.envelope (true by default to keep backward compatibility), which controls whether JSON output should be wrapped in the "key": ... "value": ... envelope. A check is needed so that the envelope can be disabled only when format.output.fields contains a single field. The option can be ignored when the output type is not json or jsonl.
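For the single-field "value" case, the proposed toggle could behave like this hedged sketch (class and method names are illustrative, not the connector's API; the value is assumed to already be a serialized JSON object):

```java
// Hypothetical sketch of the proposed format.output.json.envelope option
// for the case where format.output.fields contains only "value".
public class EnvelopeSketch {
    // envelope=true reproduces the current behaviour ({"value": ...});
    // envelope=false writes the raw JSON object unchanged.
    static String writeJsonlLine(String valueJson, boolean envelope) {
        return envelope ? "{\"value\":" + valueJson + "}" : valueJson;
    }
}
```

For example, writeJsonlLine("{\"id\":\"foo\"}", false) would yield the plain {"id":"foo"} line the issue asks for, while passing true keeps today's enveloped form.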
How does it sound?

@HelenMel would be happy to help you.

ndajr commented

Thank you for supporting this idea @ivanyu. I created two draft PRs in an attempt to implement what you suggested. I'm still improving the code and adding more tests; meanwhile, I'd appreciate any feedback from you!

Addressed in #92 and #96