/sparkit-learn

PySpark + Scikit-learn = Sparkit-learn

Primary LanguagePythonApache License 2.0Apache-2.0

Sparkit-learn

Build Status PyPi Join the chat at https://gitter.im/lensacom/sparkit-learn Gitential Coding Hours

PySpark + Scikit-learn = Sparkit-learn

GitHub: https://github.com/lensacom/sparkit-learn

About

Sparkit-learn aims to provide scikit-learn functionality and API on PySpark. The main goal of the library is to create an API that stays close to sklearn's.

The driving principle was to "Think locally, execute distributively." To accomodate this concept, the basic data block is always an array or a (sparse) matrix and the operations are executed on block level.

Requirements

  • Python 2.7.x or 3.4.x
  • Spark[>=1.3.0]
  • NumPy[>=1.9.0]
  • SciPy[>=0.14.0]
  • Scikit-learn[>=0.16]

Run IPython from notebooks directory

Run tests with

Quick start

Sparkit-learn introduces three important distributed data format:

Basic workflow

With the use of the described data structures, the basic workflow is almost identical to sklearn's.

Distributed vectorizing of texts

SparkCountVectorizer

SparkHashingVectorizer

SparkTfidfTransformer

Distributed Classifiers

Distributed Model Selection

ROADMAP

  • [ ] Transparent API to support plain numpy and scipy objects (partially done in the transparent_api branch)
  • [ ] Update all dependencies
  • [ ] Use Mllib and ML packages more extensively (since it becames more mature)
  • [ ] Support Spark DataFrames

Special thanks

  • scikit-learn community
  • spylearn community
  • pyspark community

Similar Projects ===============