pyspark
There are 3925 repositories under the pyspark topic.
ibis-project/ibis
the portable Python dataframe library
microsoft/SynapseML
Simple and Distributed Machine Learning
JohnSnowLabs/spark-nlp
State of the Art Natural Language Processing
apache/linkis
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
AlexIoannides/pyspark-example-project
Implementing best practices for PySpark ETL jobs and applications.
uber/petastorm
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
awesome-spark/awesome-spark
A curated list of awesome Apache Spark packages and resources.
jadianes/spark-py-notebooks
Apache Spark & Python (PySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
ptyadana/SQL-Data-Analysis-and-Visualization-Projects
SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and PySpark.
jupyter-incubator/sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
logicalclocks/hopsworks
Hopsworks - Data-Intensive AI platform with a Feature Store
mahmoudparsian/pyspark-tutorial
PySpark-Tutorial provides basic algorithms using PySpark
mahmoudparsian/data-algorithms-book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
h2oai/sparkling-water
Sparkling Water provides H2O functionality inside a Spark cluster
narwhals-dev/narwhals
Lightweight and extensible compatibility layer between dataframe libraries!
WeBankFinTech/Scriptis
Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.
lyhue1991/eat_pyspark_in_10_days
pyspark 🍒🥭 is delicious, just eat it! 😋😋
HariSekhon/DevOps-Python-tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
kuwala-io/kuwala
Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition, we provide third-party data for data science models and products, with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) High-resolution demographics data b) Points of Interest from OpenStreetMap c) Google Popular Times
lakehq/sail
LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive (AI) workloads.
MrPowers/chispa
PySpark test helper methods with beautiful error messages
ankurchavda/SparkLearning
A comprehensive Spark guide collated from multiple sources that can be used to learn more about Spark or as an interview refresher.
mrpowers-io/quinn
PySpark methods to enhance developer productivity 📣 👯 🎉
Nike-Inc/koheesio
Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
capitalone/datacompy
Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!
kevinschaich/pyspark-cheatsheet
🐍 Quick reference guide to common patterns & functions in PySpark.
cluster-apps-on-docker/spark-standalone-cluster-on-docker
Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. ⚡
ericxiao251/spark-syntax
This is a repo documenting the best practices in PySpark.
cartershanklin/pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
commoncrawl/cc-pyspark
Process Common Crawl data with Python and Spark
databrickslabs/dbldatagen
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines
ekampf/PySpark-Boilerplate
A boilerplate for writing PySpark Jobs
CamDavidsonPilon/tdigest
t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark
awesome-spark/spark-gotchas
Spark Gotchas. A subjective compilation of Apache Spark tips and tricks
huseinzol05/Gather-Deployment
Gathers Python deployment, infrastructure and practices.