pyspark

There are 4,432 repositories under the pyspark topic.

ibis-project/ibis
the portable Python dataframe library
Language: Python 6.2k 86 3.5k 682
microsoft/SynapseML
Simple and Distributed Machine Learning
Language: Scala 5.2k 137 764852
JohnSnowLabs/spark-nlp
State of the Art Natural Language Processing
Language: Scala 4.1k 98 908733
apache/linkis
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
Language: Java 3.4k 262 2.6k 1.2k
AlexIoannides/pyspark-example-project
Implementing best practices for PySpark ETL jobs and applications.
Language: Python 2k 57 18776
uber/petastorm
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Language: Python 1.9k 37 308287
awesome-spark/awesome-spark
A curated list of awesome Apache Spark packages and resources.
Language: Shell 1.8k 83 75341
jadianes/spark-py-notebooks
Apache Spark & Python (PySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Language: Jupyter Notebook 1.7k 97 10917
ptyadana/SQL-Data-Analysis-and-Visualization-Projects
SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and PySpark.
Language: Jupyter Notebook 1.6k 22 0 572
hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Language: Python 1.5k 36 219233
jupyter-incubator/sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
Language: Python 1.4k 44 442455
narwhals-dev/narwhals
Lightweight and extensible compatibility layer between dataframe libraries!
Language: Python 1.4k 11 865171
logicalclocks/hopsworks
Hopsworks - Data-Intensive AI platform with a Feature Store
Language: Java 1.3k 33 22153
mahmoudparsian/pyspark-tutorial
PySpark-Tutorial provides basic algorithms using PySpark
Language: Jupyter Notebook 1.3k 53 3477
graphframes/graphframes
GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs
Language: Scala 1.1k 53 407255
mahmoudparsian/data-algorithms-book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
Language: Java 1.1k 126 26659
lakehq/sail
LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive AI workloads.
Language: Rust 1.1k 11 23263
h2oai/sparkling-water
Sparkling Water provides H2O functionality inside Spark cluster
Language: Scala 977 175 3.1k 361
lyhue1991/eat_pyspark_in_10_days
pyspark🍒🥭 is delicious, just eat it!😋😋
Language: Python 821 9 1222
WeBankFinTech/Scriptis
Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, and resource management, and intelligent diagnosis.
Language: Vue 813 69 40266
HariSekhon/DevOps-Python-tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Language: Python 807 39 6349
kuwala-io/kuwala
Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition, we provide third-party data for data science models and products, with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) high-resolution demographics data, b) points of interest from OpenStreetMap, c) Google Popular Times.
Language: JavaScript 805 12 7255
MrPowers/chispa
PySpark test helper methods with beautiful error messages
Language: Python 724 3 6775
ankurchavda/SparkLearning
A comprehensive Spark guide collated from multiple sources that can be used to learn more about Spark or as an interview refresher.
681 18 0 80
mrpowers-io/quinn
pyspark methods to enhance developer productivity 📣 👯 🎉
Language: Python 675 18 11599
Nike-Inc/koheesio
Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
Language: Python 648 14 8837
kevinschaich/pyspark-cheatsheet
🐍 Quick reference guide to common patterns & functions in PySpark.
627 6 0 196
capitalone/datacompy
Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!
Language: Python 611 21 167148
cluster-apps-on-docker/spark-standalone-cluster-on-docker
Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. ⚡
Language: Jupyter Notebook 497 9 25197
cartershanklin/pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Language: Python 479 11 0 229
ericxiao251/spark-syntax
This is a repo documenting the best practices in PySpark.
Language: Jupyter Notebook 462 14 1078
commoncrawl/cc-pyspark
Process Common Crawl data with Python and Spark
Language: Python 443 20 2990
databrickslabs/dbldatagen
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for tests, POCs, and other uses in Databricks environments, including in Delta Live Tables pipelines.
Language: Python 432 13 10183
CamDavidsonPilon/tdigest
t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark
Language: Python 403 9 3654
ekampf/PySpark-Boilerplate
A boilerplate for writing PySpark Jobs
Language: Python 394 18 2154
typedef-ai/fenic
Build reliable AI and agentic applications with DataFrames
Language: Python 378 0 1023
