pyspark

There are 4432 repositories under the pyspark topic.

  • ibis-project/ibis

    the portable Python dataframe library

    Language:Python6.2k863.5k682
  • microsoft/SynapseML

    Simple and Distributed Machine Learning

    Language:Scala5.2k137764852
  • JohnSnowLabs/spark-nlp

    State of the Art Natural Language Processing

    Language:Scala4.1k98908733
  • apache/linkis

    Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.

    Language:Java3.4k2622.6k1.2k
  • AlexIoannides/pyspark-example-project

    Implementing best practices for PySpark ETL jobs and applications.

    Language:Python2k5718776
  • uber/petastorm

    The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

    Language:Python1.9k37308287
  • awesome-spark/awesome-spark

    A curated list of awesome Apache Spark packages and resources.

    Language:Shell1.8k8375341
  • jadianes/spark-py-notebooks

    Apache Spark & Python (PySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

    Language:Jupyter Notebook1.7k9710917
  • ptyadana/SQL-Data-Analysis-and-Visualization-Projects

    SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and PySpark.

    Language:Jupyter Notebook1.6k220572
  • hi-primus/optimus

    Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

    Language:Python1.5k36219233
  • jupyter-incubator/sparkmagic

    Jupyter magics and kernels for working with remote Spark clusters

    Language:Python1.4k44442455
  • narwhals-dev/narwhals

    Lightweight and extensible compatibility layer between dataframe libraries!

    Language:Python1.4k11865171
  • logicalclocks/hopsworks

    Hopsworks - Data-Intensive AI platform with a Feature Store

    Language:Java1.3k3322153
  • mahmoudparsian/pyspark-tutorial

    PySpark-Tutorial provides basic algorithms using PySpark

    Language:Jupyter Notebook1.3k533477
  • graphframes/graphframes

    GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

    Language:Scala1.1k53407255
  • mahmoudparsian/data-algorithms-book

    MapReduce, Spark, Java, and Scala for Data Algorithms Book

    Language:Java1.1k12626659
  • lakehq/sail

    LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive AI workloads.

    Language:Rust1.1k1123263
  • h2oai/sparkling-water

    Sparkling Water provides H2O functionality inside Spark cluster

    Language:Scala9771753.1k361
  • lyhue1991/eat_pyspark_in_10_days

    pyspark🍒🥭 is delicious, just eat it!😋😋

    Language:Python82191222
  • WeBankFinTech/Scriptis

    Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.

    Language:Vue8136940266
  • HariSekhon/DevOps-Python-tools

    80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

    Language:Python807396349
  • kuwala-io/kuwala

    Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We have set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition, we provide third-party data for data science models and products, with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) High-resolution demographics data b) Points of Interest from OpenStreetMap c) Google Popular Times

    Language:JavaScript805127255
  • MrPowers/chispa

    PySpark test helper methods with beautiful error messages

    Language:Python72436775
  • ankurchavda/SparkLearning

    A comprehensive Spark guide collated from multiple sources that can be used to learn more about Spark or as an interview refresher.

  • mrpowers-io/quinn

    pyspark methods to enhance developer productivity 📣 👯 🎉

    Language:Python6751811599
  • Nike-Inc/koheesio

    Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.

    Language:Python648148837
  • kevinschaich/pyspark-cheatsheet

    🐍 Quick reference guide to common patterns & functions in PySpark.

  • capitalone/datacompy

    Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!

    Language:Python61121167148
  • cluster-apps-on-docker/spark-standalone-cluster-on-docker

    Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker.

    Language:Jupyter Notebook497925197
  • cartershanklin/pyspark-cheatsheet

    PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster

    Language:Python479110229
  • ericxiao251/spark-syntax

    This is a repo documenting the best practices in PySpark.

    Language:Jupyter Notebook462141078
  • commoncrawl/cc-pyspark

    Process Common Crawl data with Python and Spark

    Language:Python443202990
  • databrickslabs/dbldatagen

    Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for testing, POCs, and other uses in Databricks environments, including Delta Live Tables pipelines.

    Language:Python4321310183
  • CamDavidsonPilon/tdigest

    t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark

    Language:Python40393654
  • ekampf/PySpark-Boilerplate

    A boilerplate for writing PySpark Jobs

    Language:Python394182154
  • typedef-ai/fenic

    Build reliable AI and agentic applications with DataFrames

    Language:Python37801023