pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. pandas allows us to focus more on research and less on programming. pandas is the perfect tool for bridging the gap between rapid iterations of ad-hoc analysis and production quality code [Source].
Apache Spark is a fast and general engine for large-scale data processing. It can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark runs on Hadoop, Mesos, standalone, or in the cloud, and it can access diverse data sources including HDFS, Cassandra, HBase, and S3 [Source].
This repo provides PySpark equivalents of common pandas queries, so you can look up a translation here instead of searching the web for each one.
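As a rough illustration of the kind of translation meant here, the sketch below pairs one pandas query with a possible PySpark counterpart. The sample data, column names, and function names (`pandas_mean_salary`, `pyspark_mean_salary`) are hypothetical and not taken from this repo; the PySpark import is deferred so the pandas half runs even without Spark installed.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
rows = [
    {"city": "Pune", "age": 34, "salary": 50.0},
    {"city": "Pune", "age": 41, "salary": 70.0},
    {"city": "Delhi", "age": 25, "salary": 40.0},
    {"city": "Delhi", "age": 38, "salary": 90.0},
]


def pandas_mean_salary(pdf):
    """pandas: keep rows with age > 30, then average salary per city."""
    filtered = pdf[pdf["age"] > 30]
    return filtered.groupby("city")["salary"].mean().to_dict()


def pyspark_mean_salary(sdf):
    """PySpark: the same filter and aggregation on a Spark DataFrame."""
    # Imported lazily so the pandas half works without PySpark installed.
    from pyspark.sql import functions as F

    result = (
        sdf.filter(F.col("age") > 30)
        .groupBy("city")
        .agg(F.avg("salary").alias("salary"))
    )
    # collect() pulls the (small) aggregated result back to the driver.
    return {row["city"]: row["salary"] for row in result.collect()}


pdf = pd.DataFrame(rows)
pandas_result = pandas_mean_salary(pdf)
print(pandas_result)
```

To try the Spark half, one option is to create a session with `SparkSession.builder.getOrCreate()`, convert the frame with `spark.createDataFrame(pdf)`, and pass the result to `pyspark_mean_salary`; it should return the same per-city averages as the pandas version.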
- Python 3.5.2
- Spark 2.2.0
- PySpark 2.2.0
- pandas 0.21.0
All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.
A detailed overview of how to contribute can be found in the contributing guide.
If you are simply looking to start working with the repo's codebase, navigate to the GitHub “issues” tab and start looking through interesting issues.