apache/beam

[Feature Request]: Auto-detect Schema for Parquet Sink

gbsDojo opened this issue · 0 comments

What would you like to happen?

I would like to request a feature that auto-detects the schema when writing data to Parquet with Apache Beam. Currently, the Parquet sink requires an explicitly defined schema, which is limiting in scenarios where the input data may vary or the schema is not known in advance.

Justification:

  • Flexibility: The ability to auto-detect the schema would make the process more flexible, allowing Beam to dynamically adapt to different types of data without the need to pre-define a schema.
  • Ease of Use: It would reduce complexity for users, especially in scenarios where data is extracted from heterogeneous sources and the schema may change over time.
  • Compatibility: Many big data systems (such as Apache Spark) already offer schema auto-detection, and this functionality would bring Apache Beam to parity with those solutions.
  • Extended Use Cases: Although there is a Google template that reads data from BigQuery to Parquet, auto-detecting the schema would still be beneficial in other scenarios, particularly when reading from other Parquet files or from formats that are converted into dictionaries or key-value pairs. In these cases, automatic schema detection would streamline the pipeline and reduce the need for manual intervention.

Suggested Implementation:

  • Implement a function in the Parquet sink that inspects the input data and automatically infers the schema.
  • Add an option to the ParquetIO PTransform to enable/disable this functionality, allowing users to opt in to automatic schema detection or retain the current behavior where the schema must be explicitly defined.
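To illustrate what the inference step might look like, here is a minimal sketch assuming the sink samples a record from a PCollection of dicts. The `infer_schema` helper and the type-name mapping are hypothetical illustrations of the idea, not existing Beam or pyarrow APIs:

```python
# Hypothetical sketch: map a sample record's Python values to
# Arrow-style type names, as an auto-detection step could do.
import datetime

# Assumed mapping from Python types to Arrow type names (illustrative only).
_PY_TO_ARROW = {
    bool: "bool",
    int: "int64",
    float: "float64",
    str: "string",
    bytes: "binary",
    datetime.datetime: "timestamp[us]",
}

def infer_schema(record):
    """Infer a {field: arrow_type_name} schema from one sample dict."""
    schema = {}
    for name, value in record.items():
        if value is None:
            # Unresolved until a non-null value is seen in another record.
            schema[name] = "null"
        elif isinstance(value, list):
            inner = infer_schema({"x": value[0]})["x"] if value else "null"
            schema[name] = f"list<{inner}>"
        elif isinstance(value, dict):
            fields = ", ".join(f"{k}: {v}" for k, v in infer_schema(value).items())
            schema[name] = f"struct<{fields}>"
        else:
            schema[name] = _PY_TO_ARROW.get(type(value), "string")
    return schema
```

For example, `infer_schema({"id": 1, "tags": ["x"]})` yields `{"id": "int64", "tags": "list<string>"}`. In a real implementation the sink would likely build a `pyarrow.Schema` directly rather than type-name strings.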

Impact:
This feature would benefit users working with varied and dynamic schemas, simplifying pipelines and improving the usability of Apache Beam in big data environments.

Alternatives:
Currently, users must manually define the schema or use other tools to infer the schema before passing the data to Apache Beam.

Additional Considerations:

  • The implementation should ensure that schema auto-detection is efficient and does not introduce significant overhead in the pipeline.
  • Compatibility with different data types should be considered, and the functionality should be robust enough to handle complex schemas.
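One way to handle the robustness concern is to sample several records rather than one and merge their inferred schemas, so that fields that are null or absent in some records are still resolved. A hypothetical sketch, assuming each record's schema has already been inferred as a `{field: type_name}` dict; falling back to `string` on a type conflict is just one possible policy (a real implementation might promote `int64` to `float64` instead):

```python
def merge_schemas(schemas):
    """Merge per-record {field: type_name} dicts into one schema:
    resolve 'null' placeholders against later non-null values, mark
    type conflicts with a wide fallback type, and keep fields that
    are missing from some records (implicitly nullable)."""
    merged = {}
    for schema in schemas:
        for name, typ in schema.items():
            seen = merged.get(name)
            if seen is None or seen == "null":
                merged[name] = typ
            elif typ != "null" and typ != seen:
                merged[name] = "string"  # conflict: fall back to a wide type
    return merged
```

With this, `merge_schemas([{"a": "null"}, {"a": "int64"}])` resolves `a` to `int64`, which is the kind of behavior the considerations above call for.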

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner