Koalas: pandas APIs on Apache Spark
The Koalas project makes data scientists more productive when interacting with big data, by augmenting Apache Spark's Python DataFrame API to be compatible with pandas'.
pandas is the de facto standard (single-node) dataframe implementation in Python, while Spark is the de facto standard for big data processing. With this package, data scientists can:
- Be immediately productive with Spark, with no learning curve, if one is already familiar with pandas.
- Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets).
Table of Contents
Dependencies
- cmake for building pyarrow
- Spark 2.4. Some older versions of Spark may work too but they are not officially supported.
- A recent version of pandas. It is officially developed against 0.23+ but some other versions may work too.
- Python 3.5+ if you want to use type hints in UDFs. Work is ongoing to also support Python 2.
Get Started
Koalas is available at the Python package index:
pip install koalas
If this fails to install the pyarrow dependency, you may want to try installing with Python 3.6.x, as pip install arrow
does not work out of the box for 3.7 apache/arrow#1125.
After installing the package, you can import the package:
import databricks.koalas as ks
Now you can turn a pandas DataFrame into a Koalas DataFrame that is API-compliant with the former:
import pandas as pd
pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})
# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)
# Rename the columns
df.columns = ['x', 'y', 'z1']
# Do some operations in place:
df['x2'] = df.x * df.x
Documentation
Project docs are published here: https://koalas.readthedocs.io
Mailing List
We use Google Groups for mailling list: https://groups.google.com/forum/#!forum/koalas-dev
Development Guide
Environment Setup
We recommend setting up a Conda environment for development:
conda create --name koalas-dev-env python=3.6
conda activate koalas-dev-env
conda install -c conda-forge pyspark=2.4
conda install -c conda-forge --yes --file requirements-dev.txt
pip install -e . # installs koalas from current checkout
Once setup, make sure you switch to koalas-dev-env
before development:
conda activate koalas-dev-env
Running Tests
There is a script ./dev/pytest
which is exactly same as pytest
but with some default settings to run Koalas tests easily.
To run all the tests, similar to our CI pipeline:
# Run all unittest and doctest
./dev/pytest
To run a specific test file:
# Run unittest
./dev/pytest -k test_dataframe.py
# Run doctest
./dev/pytest -k series.py --doctest-modules databricks
To run a specific doctest/unittest:
# Run unittest
./dev/pytest -k "DataFrameTest and test_Dataframe"
# Run doctest
./dev/pytest -k DataFrame.corr --doctest-modules databricks
Note that -k
is used for simplicity although it takes an expression. You can use --verbose
to check what to filter. See pytest --help
for more details.
Building Documentation
To build documentation via Sphinx:
cd docs && make clean html
It generates HTMLs under docs/build/html
directory. Open docs/build/html/index.html
to check if documentation is built properly.
Coding Conventions
We follow PEP 8 with one exception: lines can be up to 100 characters in length, not 79.
Release Instructions
Only project maintainers can do the following.
Step 1. Make sure the build is green.
Step 2. Create a new release on GitHub. Tag it as the same version as the setup.py. If the version is "0.1.0", tag the commit as "v0.1.0".
Step 3. Upload the package to PyPi:
rm -rf dist/koalas*
python setup.py bdist_wheel
export package_version=$(python setup.py --version)
echo $package_version
python3 -m pip install --user --upgrade twine
python3 -m twine upload --repository-url https://test.pypi.org/legacy/ dist/koalas-$package_version-py3-none-any.whl
python3 -m twine upload --repository-url https://upload.pypi.org/legacy/ dist/koalas-$package_version-py3-none-any.whl
FAQ
What's the project's status?
This project is currently in beta and is rapidly evolving. We plan to do weekly releases at this stage. You should expect the following differences:
-
some functions may be missing (see the Contributions section)
-
some behavior may be different, in particular in the treatment of nulls: Pandas uses Not a Number (NaN) special constants to indicate missing values, while Spark has a special flag on each value to indicate missing values. We would love to hear from you if you come across any discrepancies
-
because Spark is lazy in nature, some operations like creating new columns only get performed when Spark needs to print or write the dataframe.
Should I use PySpark's DataFrame API or Koalas?
If you are already familiar with pandas and want to leverage Spark for big data, we recommend using Koalas. If you are learning Spark from ground up, we recommend you start with PySpark's API.
How can I request support for a method?
File a GitHub issue: https://github.com/databricks/koalas/issues
Databricks customers are also welcome to file a support ticket to request a new feature.
How is Koalas different from Dask?
Different projects have different focuses. Spark is already deployed in virtually every organization, and often is the primary interface to the massive amount of data stored in data lakes. Koalas was inspired by Dask, and aims to make the transition from pandas to Spark easy for data scientists.
How can I contribute to Koalas?
Please create a GitHub issue if your favorite function is not yet supported.
Make sure the name also reflects precisely which function you want to implement, such as
DataFrame.fillna
or Series.dropna
. If an open issue already exists and you want do add
missing parameters, consider contributing to that issue instead.
We also document all the functions that are not yet supported in the missing directory. In most cases, it is very easy to add new functions by simply wrapping the existing pandas or Spark functions. Pull requests welcome!
Why a new project (instead of putting this in Apache Spark itself)?
Two reasons:
-
We want a venue in which we can rapidly iterate and make new releases. The overhead of making a release as a separate project is minuscule (in the order of minutes). A release on Spark takes a lot longer (in the order of days)
-
Koalas takes a different approach that might contradict Spark's API design principles, and those principles cannot be changed lightly given the large user base of Spark. A new, separate project provides an opportunity for us to experiment with new design principles.
How do I use this on Databricks?
Databricks Runtime for Machine Learning has the right versions of dependencies setup already, so you just need to install Koalas from PyPi when creating a cluster.
For the regular Databricks Runtime, you will need to upgrade pandas and NumPy versions to the required list.
In the future, we will package Koalas out-of-the-box in both the regular Databricks Runtime and Databricks Runtime for Machine Learning.