- Free software: MIT license
- Documentation: http://arm-preprocessing.readthedocs.io
- Python: 3.9.x, 3.10.x, 3.11.x, 3.12.x
- Tested OS: Windows, Ubuntu, Fedora, Alpine, Arch, macOS. However, it may also work on other operating systems
arm-preprocessing is a lightweight Python library supporting several key steps involving data preparation, manipulation, and discretisation for Association Rule Mining (ARM). 🧠 Embrace its minimalistic design that prioritises simplicity. 💡 The framework is intended to be fully extensible and offers seamless integration with related ARM libraries (e.g., NiaARM). 🔗
While numerous libraries facilitate data mining preprocessing tasks, this library is designed to integrate seamlessly with association rule mining. It harmonises well with the NiaARM library, a robust numerical association rule mining framework. The primary aim is to bridge the gap between preprocessing and rule mining, simplifying the overall workflow. Additionally, its design allows for the effortless incorporation of new preprocessing methods and fast benchmarking.
- Loading various formats of datasets (CSV, JSON, TXT, TCX) 📊
- Converting datasets to different formats 🔄
- Loading different types of datasets (numerical, discrete, time-series, text, etc.) 📉
- Dataset type identification 🔍
- Dataset statistics 📈
- Discretisation methods 📏
- Data squashing methods 🤏
- Feature scaling methods ⚖️
- Feature selection methods 🎯
To install arm-preprocessing with pip, use:

pip install arm-preprocessing
To install arm-preprocessing on Alpine Linux, please use:

$ apk add py3-arm-preprocessing
To install arm-preprocessing on Arch Linux, please use an AUR helper:

$ yay -Syyu python-arm-preprocessing
The following example demonstrates how to load a dataset from a file (CSV, JSON, TXT). More examples can be found in the examples/data_loading directory:
- Loading a dataset from a CSV file
- Loading a dataset from a JSON file
- Loading a dataset from a TCX file
- Loading a time-series dataset
from arm_preprocessing.dataset import Dataset
# Initialise dataset with filename (without format) and format (csv, json, txt)
dataset = Dataset('path/to/datasets', format='csv')
# Load dataset
dataset.load_data()
df = dataset.data
The following example demonstrates how to handle missing values in a dataset using imputation. More examples can be found in the examples/missing_values directory:
- Handling missing values in a dataset using row deletion
- Handling missing values in a dataset using column deletion
- Handling missing values in a dataset using imputation
from arm_preprocessing.dataset import Dataset
# Initialise dataset with filename and format
dataset = Dataset('examples/missing_values/data', format='csv')
dataset.load()
# Impute missing data
dataset.missing_values(method='impute')
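Conceptually, mean imputation replaces each missing entry with the mean of its column. A minimal pandas sketch of the idea (the toy data is made up for illustration; the library's internal implementation may differ):

```python
import pandas as pd

# Hypothetical toy data with one missing value
df = pd.DataFrame({'age': [20.0, None, 30.0], 'height': [170.0, 180.0, 175.0]})

# Mean imputation: fill each missing entry with its column mean
imputed = df.fillna(df.mean())
print(imputed['age'].tolist())  # [20.0, 25.0, 30.0]
```

Row or column deletion, the other two strategies listed above, simply drop the affected rows (`df.dropna()`) or columns (`df.dropna(axis=1)`) instead.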
The following example demonstrates how to discretise a dataset using the equal width method. More examples can be found in the examples/discretisation directory:
- Discretising a dataset using the equal width method
- Discretising a dataset using the equal frequency method
- Discretising a dataset using k-means clustering
from arm_preprocessing.dataset import Dataset
# Initialise dataset with filename (without format) and format (csv, json, txt)
dataset = Dataset('datasets/sportydatagen', format='csv')
dataset.load_data()
# Discretise dataset using equal width discretisation
dataset.discretise(method='equal_width', num_bins=5, columns=['calories'])
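Equal width discretisation splits a column's range [min, max] into `num_bins` intervals of identical width and maps each value to its interval index. A small pandas sketch with made-up values (not the sportydatagen data, and not the library's internal code):

```python
import pandas as pd

# Hypothetical 'calories' values spanning 100-300; five equal-width
# bins each cover a width of (300 - 100) / 5 = 40
values = pd.Series([100, 150, 200, 250, 300])
bins = pd.cut(values, bins=5, labels=False)
print(bins.tolist())  # [0, 1, 2, 3, 4]
```

Equal frequency binning would instead choose bin edges so that each bin holds roughly the same number of values (`pd.qcut`), and k-means clustering derives the bins from cluster assignments.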
The following example demonstrates how to squash a dataset using Euclidean similarity. More examples can be found in the examples/squashing directory:
from arm_preprocessing.dataset import Dataset
# Initialise dataset with filename and format
dataset = Dataset('datasets/breast', format='csv')
dataset.load()
# Squash dataset
dataset.squash(threshold=0.75, similarity='euclidean')
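Squashing shrinks a dataset by merging highly similar rows into single representatives [1]. The greedy sketch below is illustrative only: it normalises Euclidean distance into a similarity score, groups each unvisited row with all rows above the threshold, and replaces each group with its centroid. The function name and exact grouping strategy are assumptions, not the library's actual algorithm:

```python
import numpy as np

def squash(rows, threshold):
    """Illustrative greedy squashing: each unvisited row absorbs all rows
    whose Euclidean-based similarity exceeds the threshold; every group
    is then replaced by its centroid (column-wise mean)."""
    rows = np.asarray(rows, dtype=float)
    # Largest possible distance in the data, used to map distance -> similarity
    max_dist = np.linalg.norm(rows.max(axis=0) - rows.min(axis=0)) or 1.0
    used = np.zeros(len(rows), dtype=bool)
    squashed = []
    for i in range(len(rows)):
        if used[i]:
            continue
        dists = np.linalg.norm(rows - rows[i], axis=1)
        similarity = 1.0 - dists / max_dist  # 1 = identical, 0 = farthest apart
        group = (~used) & (similarity >= threshold)
        used |= group
        squashed.append(rows[group].mean(axis=0))
    return np.array(squashed)

# Two near-duplicate rows collapse into one; the outlier survives alone
data = [[1.0, 1.0], [1.1, 0.9], [9.0, 9.0]]
print(squash(data, threshold=0.75))
```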
The following example demonstrates how to scale the dataset's features. More examples can be found in the examples/scaling directory:
from arm_preprocessing.dataset import Dataset
# Initialise dataset with filename and format
dataset = Dataset('datasets/Abalone', format='csv')
dataset.load()
# Scale dataset using normalisation
dataset.scale(method='normalisation')
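Normalisation (min-max scaling) rescales each column to the unit interval via x' = (x - min) / (max - min). A compact pandas sketch with made-up values (not the Abalone data, nor the library's internals):

```python
import pandas as pd

# Hypothetical shell-length measurements
df = pd.DataFrame({'length': [0.2, 0.5, 0.8]})

# Min-max normalisation: each column mapped onto [0, 1]
normalised = (df - df.min()) / (df.max() - df.min())
print(normalised['length'].tolist())  # ~[0.0, 0.5, 1.0]
```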
The following example demonstrates how to select features from a dataset. More examples can be found in the examples/feature_selection directory:
from arm_preprocessing.dataset import Dataset
# Initialise dataset with filename and format
dataset = Dataset('datasets/sportydatagen', format='csv')
dataset.load()
# Feature selection
dataset.feature_selection(
    method='kendall', threshold=0.15, class_column='calories')
[1] NiaARM: A minimalistic framework for Numerical Association Rule Mining
[2] uARMSolver: universal Association Rule Mining Solver
[1] I. Fister, I. Fister Jr., D. Novak and D. Verber, Data squashing as preprocessing in association rule mining, 2022 IEEE Symposium Series on Computational Intelligence (SSCI), Singapore, Singapore, 2022, pp. 1720-1725, doi: 10.1109/SSCI51031.2022.10022240.
[2] I. Fister Jr., I. Fister. A brief overview of swarm intelligence-based algorithms for numerical association rule mining. arXiv preprint arXiv:2010.15524 (2020).
This package is distributed under the MIT License. This license can be found online at http://www.opensource.org/licenses/MIT.
This framework is provided as-is, and there are no guarantees that it fits your purposes or that it is bug-free. Use it at your own risk!