
Powerful machine learning library for Node.js – uses Python's scikit-learn under the hood.



scikit-learn-ts


Intro

This project enables Node.js devs to use Python's powerful scikit-learn machine learning library – without having to know any Python. 🤯

See the full docs for more info.

Note: This project is new and experimental. It works great for local development, but I wouldn't recommend using it in production just yet. You can follow the progress on Twitter at @transitive_bs.

Features

  • All TS classes are auto-generated from the official Python scikit-learn docs!
  • All 257 classes are supported along with proper TS types and docs
    • KMeans
    • TSNE
    • PCA
    • LinearRegression
    • LogisticRegression
    • DecisionTreeClassifier
    • RandomForestClassifier
    • XGBClassifier
    • DBSCAN
    • StandardScaler
    • MinMaxScaler
    • ... all of them 💯
  • Generally much faster and more robust than JS-based alternatives
    • (benchmarks & comparisons coming soon)

Prerequisites

This project is meant for Node.js users, so don't worry if you're not familiar with Python. This is the only step where you'll need to touch Python, and it should be pretty straightforward.

Make sure you have Node.js and Python 3 installed and in your PATH.

  • node >= 14
  • python >= 3.7

In Python land, install numpy and scikit-learn either globally via pip or via your favorite virtualenv manager. The shell running your Node.js program will need access to these Python modules, so if you're using a virtualenv, make sure it's activated.

If you're not sure what this means, it's okay. First install Python, which will also install pip, Python's package manager. Then run:

pip install numpy scikit-learn

Congratulations! You've safely navigated Python land, and from here on out, we'll be using Node.js / JS / TS. The sklearn NPM package will use your Python installation under the hood.

Install

npm install sklearn

Usage

See the full docs for more info.

import * as sklearn from 'sklearn'

const data = [
  [0, 0, 0],
  [0, 1, 1],
  [1, 0, 1],
  [1, 1, 1]
]

const py = await sklearn.createPythonBridge()

const model = new sklearn.TSNE({
  n_components: 2,
  perplexity: 2
})
await model.init(py)

const x = await model.fit_transform({ X: data })
console.log(x)

await model.dispose()
await py.disconnect()

Since the TS classes are auto-generated from the Python docs, the code will look almost identical to the Python version, so use scikit-learn's excellent API docs as a reference.

All class names, method names, attribute (accessor) names and types are the same as the official Python version.

The main differences are:

  • You need to call createPythonBridge() before using any sklearn classes
    • This spawns a Python child process and validates all of the Python dependencies
    • You can pass a custom python path via createPythonBridge({ python: '/path/to/your/python3' })
  • You need to pass this bridge to a class's async init method before using it
    • This creates an underlying Python variable representing your class instance
  • Instead of using numpy or pandas, we're just using plain JavaScript arrays
    • Anywhere the Python version would input or output a numpy array, we instead just use number[], number[][], etc.
    • We take care of converting to and from numpy arrays automatically where necessary
  • Whenever you're done using an instance, call dispose() to free the underlying Python resources
  • Whenever you're done using your Python bridge, call disconnect() on the bridge to cleanly exit the Python child process
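The lifecycle described above (create the bridge, init, use, dispose, disconnect) can be wrapped so that cleanup still runs when something throws. The following is only a sketch: withModel, BridgeLike, and ModelLike are hypothetical names that are not part of the sklearn package, and the local interfaces merely mirror the method names listed above.

```typescript
// Minimal structural types for this sketch. The real sklearn package ships
// its own types; these local names are assumptions made for illustration.
interface BridgeLike {
  disconnect(): Promise<void>
}

interface ModelLike<B> {
  init(py: B): Promise<void>
  dispose(): Promise<void>
}

// Hypothetical helper (not part of sklearn): runs `fn` with an initialized
// model and always releases the model and the bridge afterwards.
async function withModel<B extends BridgeLike, M extends ModelLike<B>, T>(
  createBridge: () => Promise<B>,
  model: M,
  fn: (model: M) => Promise<T>
): Promise<T> {
  const py = await createBridge()
  try {
    await model.init(py)
    try {
      return await fn(model)
    } finally {
      // Free the Python-side variable even if `fn` throws.
      await model.dispose()
    }
  } finally {
    // Exit the Python child process even if init, fn, or dispose throws.
    await py.disconnect()
  }
}
```

With the real package, a call would look roughly like withModel(() => sklearn.createPythonBridge(), new sklearn.TSNE({ n_components: 2, perplexity: 2 }), (m) => m.fit_transform({ X: data })). For several models sharing one bridge, it is probably simpler to manage the bridge yourself and only wrap the init/dispose pair.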

Restrictions

  • We don't currently support positional arguments; only keyword-based arguments:
// this works (keyword args)
const x = await model.fit_transform({ X: data })

// this doesn't work yet (positional args)
const y = await model.fit_transform(data)
  • We don't currently generate TS code for scikit-learn's built-in datasets
  • We don't currently generate TS code for scikit-learn's top-level function exports (only classes right now)
  • There are basic unit tests for a handful of the auto-generated TS classes, and they work well, but there are probably edge cases and bugs in other auto-generated classes
    • Please create an issue on GitHub if you run into any weird behavior and include as much detail as possible, including code snippets

Examples

Here are some paired examples: each one shows the official Python scikit-learn version first, followed by the equivalent using the TS sklearn package.

StandardScaler

StandardScaler Python docs

Python:

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([
  [0, 0, 0],
  [0, 1, 1],
  [1, 0, 1],
  [1, 1, 1]
])

s = StandardScaler()

x = s.fit_transform(data)

TypeScript:

import * as sklearn from 'sklearn'

const data = [
  [0, 0, 0],
  [0, 1, 1],
  [1, 0, 1],
  [1, 1, 1]
]

const py = await sklearn.createPythonBridge()

const s = new sklearn.StandardScaler()
await s.init(py)

const x = await s.fit_transform({ X: data })

KMeans

KMeans Python docs

Python:

import numpy as np
from sklearn.cluster import KMeans

data = np.array([
  [0, 0, 0],
  [0, 1, 1],
  [1, 0, 1],
  [1, 1, 1]
])

model = KMeans(
  n_clusters=2,
  random_state=42,
  n_init='auto'
)

x = model.fit_predict(data)

TypeScript:

import * as sklearn from 'sklearn'

const data = [
  [0, 0, 0],
  [0, 1, 1],
  [1, 0, 1],
  [1, 1, 1]
]

const py = await sklearn.createPythonBridge()

const model = new sklearn.KMeans({
  n_clusters: 2,
  random_state: 42,
  n_init: 'auto'
})
await model.init(py)

const x = await model.fit_predict({ X: data })

TSNE

TSNE Python docs

Python:

import numpy as np
from sklearn.manifold import TSNE

data = np.array([
  [0, 0, 0],
  [0, 1, 1],
  [1, 0, 1],
  [1, 1, 1]
])

model = TSNE(
  n_components=2,
  perplexity=2,
  learning_rate='auto',
  init='random'
)

x = model.fit_transform(data)

TypeScript:

import * as sklearn from 'sklearn'

const data = [
  [0, 0, 0],
  [0, 1, 1],
  [1, 0, 1],
  [1, 1, 1]
]

const py = await sklearn.createPythonBridge()

const model = new sklearn.TSNE({
  n_components: 2,
  perplexity: 2,
  learning_rate: 'auto',
  init: 'random'
})
await model.init(py)

const x = await model.fit_transform({ X: data })

See the full docs for more examples.

Why?

The Python ML ecosystem is generally a lot more mature than the Node.js ML ecosystem. Most ML research happens in Python, and many common ML tasks that Python devs take for granted are much more difficult to accomplish in Node.js.

For example, I was recently working on a data viz project using full-stack TypeScript, and I needed to use k-means and t-SNE on some text embeddings. I tested 6 different t-SNE JS packages and several k-means packages. None of the t-SNE packages worked for medium-sized inputs, and in many cases they were 1000x slower. I also kept running into NaN city with the JS-based versions.

Case in point: it's incredibly difficult to compete with the robustness, speed, and maturity of proven Python ML libraries like scikit-learn in JS/TS land.

So instead of trying to build a Rust-based version from scratch or using ad hoc NPM packages like above, I decided to create an experiment to see how practical it would be to just use scikit-learn from Node.js.

And that's how scikit-learn-ts was born.

How it works

This project uses a fork of python-bridge to spawn a Python interpreter as a subprocess and communicates back and forth via standard Unix pipes. The IPC pipes don't interfere with stdout/stderr/stdin, so your Node.js code and the underlying Python code can print things normally.

The TS library is auto-generated from the Python scikit-learn API docs. By using the official Python docs as a source of truth, we can guarantee a certain level of compatibility and upgradeability.

For each scikit-learn HTML page that belongs to an exported Python class or function, we first parse its metadata, params, methods, attributes, etc. using cheerio, then we convert the Python types into equivalent TypeScript types. We then generate a corresponding TypeScript file which wraps an instance of that Python declaration via a PythonBridge.
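To give a feel for the type-conversion step, here is a rough sketch of mapping type strings as they appear in the scikit-learn docs to TypeScript types. The function name pyTypeToTs and every rule in it are assumptions made for illustration; the actual generator's rules aren't shown in this README and are likely more involved.

```typescript
// Hypothetical mapping from doc type strings to TypeScript types.
// Every rule here is an illustrative assumption, not the generator's logic.
function pyTypeToTs(pyType: string): string {
  const t = pyType.trim().toLowerCase()

  // Primitive scalar types.
  if (t === 'int' || t === 'float') return 'number'
  if (t === 'bool') return 'boolean'
  if (t === 'str') return 'string'

  // Array-likes: count the dimensions listed in "of shape (a, b, ...)".
  const shape = t.match(/of shape \(([^)]*)\)/)
  if (shape) {
    const dims = (shape[1] ?? '')
      .split(',')
      .filter((d) => d.trim() !== '').length
    // At least one dimension, so "(n_samples,)" becomes number[].
    return 'number' + '[]'.repeat(Math.max(dims, 1))
  }

  // Fall back to `any` for anything this sketch doesn't recognize.
  return 'any'
}

console.log(pyTypeToTs('array-like of shape (n_samples, n_features)'))
```

Falling back to any keeps the sketch total over unknown inputs; a real generator would presumably want to flag unrecognized types instead so they can be covered by tests.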

For each TypeScript wrapper class or function, we take special care to handle serializing values back and forth between Node.js and Python as JSON, including converting between primitive arrays and numpy arrays where necessary. All numpy array conversions should be handled automatically for you since we only support serializing primitive JSON types over the PythonBridge. There may be some edge cases where the automatic numpy inference fails, but we have a regression test suite for parsing these cases, so as long as the official Python docs are correct for a given type, then our implicit numpy conversion logic should "just work".
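As an illustration of why plain JSON is enough to carry array data, the sketch below infers the shape of a nested JavaScript array and serializes it alongside the data. The names inferShape and serializeArray and the { shape, data } payload are hypothetical; the format the library actually sends over the bridge isn't described here.

```typescript
// Nested number arrays of any depth, e.g. number[] or number[][].
type NestedArray = number | NestedArray[]

// Infers the shape of a rectangular nested array, e.g. [[1, 2], [3, 4]] -> [2, 2].
// Throws on ragged input, which could not become a regular numpy array.
function inferShape(value: NestedArray): number[] {
  if (!Array.isArray(value)) return [] // a scalar has an empty shape
  const shapes = value.map((item) => inferShape(item))
  const first = shapes[0] ?? []
  if (shapes.some((s) => s.join(',') !== first.join(','))) {
    throw new Error('ragged array: rows have different shapes')
  }
  return [value.length, ...first]
}

// Hypothetical wire format: the data plus its inferred shape, as a JSON string.
function serializeArray(value: NestedArray): string {
  return JSON.stringify({ shape: inferShape(value), data: value })
}

console.log(serializeArray([[1, 2], [3, 4]]))
// prints {"shape":[2,2],"data":[[1,2],[3,4]]}
```

Checking for ragged rows on the Node.js side is a design choice in this sketch: it surfaces a clear error before anything crosses the process boundary.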

Credit

This project is not affiliated with the official Python scikit-learn project. Hopefully it will be one day. 😄

All of the difficult machine learning work happens under the hood via the official Python scikit-learn project, with full credit given to their absolutely amazing team. This project is just a small open source experiment to try and leverage the existing scikit-learn ecosystem for the Node.js community.

See the full docs for more info.

License

The official Python scikit-learn project is licensed under the BSD 3-Clause license.

This project is licensed under MIT © Travis Fischer.

If you found this project helpful, please consider following me on Twitter.