fsspark

Feature selection in Spark

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Description

fsspark is a Python module for feature selection and machine learning on Spark. Pipelines written with fsspark can be divided roughly into four major stages: 1) data pre-processing, 2) univariate filters, 3) multivariate filters, and 4) machine learning wrapped with cross-validation (Figure 1).

Figure 1. Feature selection workflow example implemented in fsspark.
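
To make the four stages concrete, here is a minimal sketch in plain Python (standard library only, no Spark). It is not fsspark's API: every function name, threshold, and the toy data below are assumptions made up for illustration, and the filters shown (variance threshold, pairwise Pearson correlation) are simple stand-ins for whatever methods a real pipeline would use. Stage 4 is reduced to generating the cross-validation splits; a model would be trained and scored on each split.

```python
# Illustrative only: plain-Python stand-ins for the four pipeline stages.
# None of these functions belong to fsspark; names and thresholds are hypothetical.
from statistics import mean, pvariance


def impute_mean(rows):
    """Stage 1 (pre-processing): replace missing values (None) with the column mean."""
    n_cols = len(rows[0])
    col_means = [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def variance_filter(rows, names, threshold):
    """Stage 2 (univariate filter): keep features whose variance exceeds the threshold."""
    keep = [j for j in range(len(names)) if pvariance([r[j] for r in rows]) > threshold]
    return [[r[j] for j in keep] for r in rows], [names[j] for j in keep]


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant (stage 2 removes those)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def correlation_filter(rows, names, max_abs_corr):
    """Stage 3 (multivariate filter): drop a feature that is highly correlated
    with a feature already kept."""
    cols = [[r[j] for r in rows] for j in range(len(names))]
    keep = []
    for j in range(len(names)):
        if all(abs(pearson(cols[j], cols[k])) <= max_abs_corr for k in keep):
            keep.append(j)
    return [[r[j] for j in keep] for r in rows], [names[j] for j in keep]


def k_fold_indices(n_rows, k):
    """Stage 4 helper: yield (train, test) row-index lists for k-fold cross-validation."""
    for fold in range(k):
        test = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, test


# Toy data: f2 duplicates f1 (scaled), f3 is constant, f4 has a missing value.
names = ["f1", "f2", "f3", "f4"]
rows = [
    [1.0, 2.0, 5.0, 0.3],
    [2.0, 4.0, 5.0, None],
    [3.0, 6.0, 5.0, 0.1],
    [4.0, 8.0, 5.0, 0.9],
    [5.0, 10.0, 5.0, 0.2],
    [6.0, 12.0, 5.0, 0.6],
]

rows = impute_mean(rows)                                  # stage 1
rows, names = variance_filter(rows, names, 1e-9)          # stage 2: drops constant f3
rows, selected = correlation_filter(rows, names, 0.95)    # stage 3: drops redundant f2
folds = list(k_fold_indices(len(rows), k=3))              # stage 4: CV splits

print(selected)
```

The order matters in this sketch: the cheap univariate filter runs first so that the pairwise correlation step, which scales quadratically with the number of features, only sees the features that survived.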

Documentation

The package documentation describes the data structures and feature selection methods implemented in fsspark.

Installation

  • pip
git clone https://github.com/enriquea/fsspark.git
cd fsspark
pip install . -r requirements.txt
  • conda
git clone https://github.com/enriquea/fsspark.git
cd fsspark
conda env create -f environment.yml
conda activate fsspark-venv
pip install . -r requirements.txt

Maintainers