Custom PySpark Data Sources

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

pyspark-data-sources

This repository showcases custom Spark data sources built with the new Python Data Source API for the upcoming Apache Spark 4.0 release. For details on the API, refer to the API source code.

Installation

pip install pyspark-data-sources

Usage

Note: The following code currently works only with the Apache Spark master branch.

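from pyspark.sql import SparkSession
from pyspark_datasources.github import GithubDataSource

# Reuse the active Spark session, or create one if none exists
spark = SparkSession.builder.getOrCreate()

# Register the data source
spark.dataSource.register(GithubDataSource)

# Load the apache/spark repository through the "github" format and print the rows
spark.read.format("github").load("apache/spark").show()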
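
To give a rough idea of what a data source built on the Python Data Source API looks like, here is a minimal sketch. It is not taken from this repository: `CounterDataSource`, `CounterReader`, the `counter` format name, and the `count` option are all hypothetical names chosen for illustration. The sketch assumes the `DataSource` and `DataSourceReader` base classes from `pyspark.sql.datasource`, and falls back to small stand-in classes when that module is unavailable so the reader logic can still be tried without Spark.

```python
try:
    # Python Data Source API base classes (assumed available in Spark 4.0 builds).
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Stand-in stubs (not part of Spark) so the sketch runs without PySpark 4.0.
    class DataSource:
        def __init__(self, options):
            self.options = options

    class DataSourceReader:
        pass


class CounterDataSource(DataSource):
    """Hypothetical source that emits the integers 0..count-1 and their squares."""

    @classmethod
    def name(cls):
        # Short name used with spark.read.format(...)
        return "counter"

    def schema(self):
        # Schema expressed as a DDL string
        return "id int, squared int"

    def reader(self, schema):
        return CounterReader(self.options)


class CounterReader(DataSourceReader):
    def __init__(self, options):
        # Options arrive as strings; "count" is a hypothetical option name.
        self.count = int(options.get("count", 3))

    def read(self, partition):
        # Yield one tuple per row, matching the schema's column order.
        for i in range(self.count):
            yield (i, i * i)
```

With a Spark build that includes the API, such a class could presumably be registered and queried the same way as the GitHub source above, for example with spark.dataSource.register(CounterDataSource) followed by spark.read.format("counter").option("count", "5").load().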

Contributing

We welcome and appreciate any contributions to enhance and expand the custom data sources. If you're interested in contributing:

  • Add New Data Sources: Want to add a new data source using the Python Data Source API? Submit a pull request or open an issue.
  • Suggest Enhancements: If you have ideas to improve a data source or the API, we'd love to hear them!
  • Report Bugs: Found something that doesn't work as expected? Let us know by opening an issue.

Need help or have questions? Don't hesitate to open a new issue, and we'll do our best to assist you.