pfilter is a command-line tool for filtering a CSV dataset by percentiles.
Motivation. While working on the CaM project, we needed to filter out
GitHub repositories that were too small or too large, judged by their
number of files. No readily available command-line tool could do that,
so we created pfilter.
First, install it from PyPI:
pip install pfilter
Now, execute it with the following flags:
pfilter --csv=foo.csv --c=age --lower=0.05 --upper=0.95 --o=filtered.csv
Here, --csv is the path to your source CSV file, --c is the column to
filter by, --lower is the lower percentile (given as a fraction with a
maximum of 1, so 0.05 is the 5th percentile, or P5 for short), --upper
is the upper percentile (same scale, so 0.95 is the 95th percentile, or
P95 for short), and --o is the path where the filtered dataset is written.
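To make the percentile bounds concrete, here is a minimal Python sketch of what filtering by percentiles means. It is an illustration only, not pfilter's actual code: the function names percentile and filter_by_percentiles are hypothetical, and it assumes linear interpolation between neighbouring values and inclusive bounds, neither of which is stated above.

```python
import math


def percentile(sorted_values, q):
    """Return the q-th quantile (0 <= q <= 1) of an ascending list.

    Assumption: linear interpolation between the two nearest ranks.
    """
    if not sorted_values:
        raise ValueError("cannot take a percentile of an empty list")
    pos = q * (len(sorted_values) - 1)  # fractional index into the list
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac


def filter_by_percentiles(rows, column, lower, upper):
    """Keep rows whose value in `column` lies between the two percentiles.

    `rows` is a list of dicts, such as csv.DictReader produces.
    Assumption: both bounds are inclusive.
    """
    values = sorted(float(row[column]) for row in rows)
    low = percentile(values, lower)
    high = percentile(values, upper)
    return [row for row in rows if low <= float(row[column]) <= high]


# Usage: five rows, keeping only those between P25 and P75 of "age".
rows = [{"name": n, "age": str(a)} for n, a in zip("abcde", [10, 50, 30, 20, 40])]
kept = filter_by_percentiles(rows, "age", 0.25, 0.75)
```

With these five rows, P25 is 20 and P75 is 40, so the rows with ages 10 and 50 are dropped and the remaining rows keep their original order.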
Fork the repository, make changes, and send us a pull request. We will
review your changes and apply them to the master branch shortly, provided
they don't violate our quality standards. To avoid frustration, please run
the full build before sending us your pull request:
poetry build
You will need Python 3.11+ installed.