jmpy

jm's Python utils

Primary language: Python. License: MIT.

Installation

sudo python3 setup.py install

num_stats - detailed statistics in your command line

Besides the utility functions documented below, the package includes a command-line program, num_stats, that reads numbers from STDIN (or from files) and prints their statistics to STDOUT.

Run num_stats --help for help.

For example:

# generate 300 random numbers first
$ python -c 'import numpy;print("\n".join(map(str,numpy.random.normal(size=300))))' > numbers

Then:

# pipe them into the num_stats program
$ cat numbers | num_stats -c
count  = 300	
sum    = 14.511	
mean   = 0.048 ± 1.036	
median = 0.060	
min = -3.473     1% = -2.260    5% = -1.610    25% = -0.652	
max = 2.878      99% = 2.322    95% = 1.628    75% = 0.796	

 <from,    to)  #      statistics of bin count
-----------------------------------------------
-3.473, -2.838  1       0%  0% .
-2.838, -2.203  4   1   2%  1% **.
-2.203, -1.568 15   5   7%  5% *********.
-1.568, -0.932 35   -  18% 12% ***********************.
-0.932, -0.297 58  25  38% 19% **************************************.
-0.297,  0.338 76  mM  63% 25% **************************************************
 0.338,  0.973 52  75  80% 17% **********************************.
 0.973,  1.608 42   -  94% 14% ***************************.
 1.608,  2.243 10  95  98%  3% ******.
 2.243,  2.878  7  99 100%  2% ****.

jmpy/utils

identity

identity(x)

Returns its argument unchanged.

>>> identity(1)
1

filter_null

filter_null(iterable)

Filter out elements that do not evaluate to True

>>> list(filter_null((0, None, 1, '', 'cherry')))
[1, 'cherry']

filter_both

filter_both(predicate, iterable)

Splits the iterable into two lists based on the result of calling predicate on each element: first the elements for which the predicate is true, then the rest.

WARN: Consumes the whole iterable in the process. This is the price for calling the predicate function only once for each element. (See itertools recipes for similar functionality without this requirement.)

>>> filter_both(lambda x: x%2 == 0, range(4))
([0, 2], [1, 3])
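
The single-pass behaviour described above can be illustrated with a minimal sketch. This is not jmpy's actual implementation; the name filter_both_sketch is hypothetical and only mirrors the documented behaviour.

```python
def filter_both_sketch(predicate, iterable):
    """Split iterable into (matching, non_matching) lists in one pass."""
    matching, non_matching = [], []
    for item in iterable:
        # The predicate is evaluated exactly once per element.
        (matching if predicate(item) else non_matching).append(item)
    return matching, non_matching


print(filter_both_sketch(lambda x: x % 2 == 0, range(4)))
```

Building both lists eagerly is what forces the whole iterable to be consumed; a lazy variant would have to either call the predicate twice per element or buffer the elements.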

flatten

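flatten(iterables)

Chains the given iterables into a single flat sequence of elements (one level of nesting is removed).
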
>>> list(flatten(((1, 2, 3), (4, 5, 6))))
[1, 2, 3, 4, 5, 6]

argmax

argmax(pairs)

Given an iterable of pairs (key, value), return the key corresponding to the greatest value. Raises ValueError on empty sequence.

>>> argmax(zip(range(20), range(20, 0, -1)))
0

argmin

argmin(pairs)

Given an iterable of pairs (key, value), return the key corresponding to the smallest value. Raises ValueError on empty sequence.

>>> argmin(zip(range(20), range(20, 0, -1)))
19

argmax_index

argmax_index(values)

Given an iterable of values, return the index of the (first) greatest value. Raises ValueError on empty sequence.

>>> argmax_index([0, 4, 3, 2, 1, 4, 0])
1

argmin_index

argmin_index(values)

Given an iterable of values, return the index of the (first) smallest value. Raises ValueError on empty sequence.

>>> argmin_index([10, 4, 0, 2, 1, 0])
2
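
The four arg* helpers above can all be expressed with the built-in max and min. The following is a sketch under that assumption, not the library's code; the *_sketch names are hypothetical.

```python
from operator import itemgetter


def argmax_sketch(pairs):
    """Key of the (key, value) pair with the greatest value."""
    # max() raises ValueError on an empty iterable and returns the
    # first maximal item it encounters.
    return max(pairs, key=itemgetter(1))[0]


def argmax_index_sketch(values):
    """Index of the first greatest value."""
    # enumerate() yields (index, value) pairs, so the pair version applies.
    return argmax_sketch(enumerate(values))


print(argmax_sketch(zip(range(20), range(20, 0, -1))))
print(argmax_index_sketch([0, 4, 3, 2, 1, 4, 0]))
```

The argmin variants follow the same pattern with min in place of max.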

bucket_by_key

bucket_by_key(iterable, key_fc)

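Sorts the items of iterable into buckets, where the bucket of each item is given by key_fc(item). Buckets appear in the order in which their keys are first seen. For example: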

>>> bucket_by_key([1, 2, -3, 4, 5, 6, -7, 8, -9], lambda num: 'neg' if num < 0 else 'nonneg')
OrderedDict([('nonneg', [1, 2, 4, 5, 6, 8]), ('neg', [-3, -7, -9])])
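
A minimal sketch of this kind of bucketing, assuming an OrderedDict of lists as in the example output. The name bucket_by_key_sketch is hypothetical and this is not necessarily how jmpy implements it.

```python
from collections import OrderedDict


def bucket_by_key_sketch(iterable, key_fc):
    """Group items into lists keyed by key_fc(item), in first-seen key order."""
    buckets = OrderedDict()
    for item in iterable:
        # setdefault creates the bucket the first time a key is seen.
        buckets.setdefault(key_fc(item), []).append(item)
    return buckets


result = bucket_by_key_sketch(
    [1, 2, -3, 4, 5, 6, -7, 8, -9],
    lambda num: 'neg' if num < 0 else 'nonneg',
)
print(result)
```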

first_true_pred

first_true_pred(predicates, value)

Given a list of predicates and a value, returns the index of the first predicate for which predicate(value) is true. Raises IndexError if no such predicate is found.

>>> first_true_pred([lambda x: x%2==0, lambda x: x%2==1], 13)
1

cache_into

cache_into(factory, filename)

Simple pickle caching. Calls factory and stores the result in the pickle file filename. Subsequent calls load the object from the pickle instead of running the factory again.
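
Since this function has no example, here is a minimal sketch of how such pickle caching might work. It is an illustration only, not jmpy's implementation; cache_into_sketch, expensive, and the file name result.pickle are hypothetical.

```python
import os
import pickle
import tempfile


def cache_into_sketch(factory, filename):
    """Return the pickled result from filename, computing it on first use."""
    if os.path.exists(filename):
        with open(filename, "rb") as f:
            return pickle.load(f)
    result = factory()
    with open(filename, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive():
    # Record each invocation so we can see that the cache is used.
    calls.append(1)
    return {"answer": 42}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result.pickle")
    first = cache_into_sketch(expensive, path)   # runs the factory
    second = cache_into_sketch(expensive, path)  # loads from the pickle

print(first == second, len(calls))
```

Note that a cache of this kind never invalidates itself: delete the pickle file to force the factory to run again.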

consuming_length

consuming_length(iterator)

Returns the length of an iterator, consuming its contents in the process. Uses O(1) memory.

>>> consuming_length(range(10))
10
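
One common way to count in constant memory is to sum over a generator, as in this sketch (hypothetical name, not necessarily the library's approach):

```python
def consuming_length_sketch(iterator):
    """Count the items of an iterator without storing them."""
    return sum(1 for _ in iterator)


it = iter("abc")
print(consuming_length_sketch(it))
# The iterator is exhausted afterwards.
print(list(it))
```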

simple_tokenize

simple_tokenize(txt, sep_rexp=r"\W")

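Iterates through the tokens of txt. The keyword argument sep_rexp is a regular expression matching the separators between tokens (by default any non-word character). Uses O(N) memory.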

>>> list(simple_tokenize('23_45 hello, how are  you?'))
['23_45', 'hello', 'how', 'are', 'you']
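
The behaviour in the example is consistent with splitting on the separator regex and dropping empty strings. A sketch under that assumption (simple_tokenize_sketch is a hypothetical name):

```python
import re


def simple_tokenize_sketch(txt, sep_rexp=r"\W"):
    """Yield non-empty tokens of txt separated by matches of sep_rexp."""
    # re.split builds the full list of pieces first, hence O(N) memory.
    return (token for token in re.split(sep_rexp, txt) if token)


print(list(simple_tokenize_sketch('23_45 hello, how are  you?')))
```

The underscore stays inside '23_45' because \W does not match word characters, and the underscore counts as one.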

k_grams

k_grams(iterable, k)

Returns iterator of k-grams of elements from iterable.

>>> list(k_grams(range(4), 2))
[(0, 1), (1, 2), (2, 3)]
>>> list(k_grams((), 2))
[]
>>> list(k_grams((1,), 2))
[]
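
A sliding window over the input is one way to produce k-grams lazily. The sketch below uses a bounded deque; it is an illustration with a hypothetical name, not jmpy's code.

```python
from collections import deque


def k_grams_sketch(iterable, k):
    """Yield tuples of k consecutive elements from iterable."""
    window = deque(maxlen=k)  # the oldest element drops out automatically
    for item in iterable:
        window.append(item)
        if len(window) == k:
            yield tuple(window)


print(list(k_grams_sketch(range(4), 2)))
```

Inputs shorter than k never fill the window, which gives the empty results shown in the examples above.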

uniq

uniq(iterable, count=False)

Similar to the Unix uniq command: collapses runs of consecutive equal elements into one. If the count argument is True, returns (count, element) pairs instead. Has an O(1) memory footprint.

>>> list(uniq([1, 1, 1, 2, 3, 3, 2, 2]))
[1, 2, 3, 2]
>>> list(uniq([1, 1, 1, 2, 3, 3, 2, 2], count=True))
[(3, 1), (1, 2), (2, 3), (2, 2)]
>>> list(uniq([1, None]))
[1, None]
>>> list(uniq([None]))
[None]
>>> list(uniq([]))
[]
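
The same results can be obtained with itertools.groupby, as in this sketch (hypothetical name; the library may implement it differently):

```python
from itertools import groupby


def uniq_sketch(iterable, count=False):
    """Collapse runs of equal consecutive elements, like Unix uniq."""
    for item, run in groupby(iterable):
        if count:
            # Count the run without materialising it, like `uniq -c`.
            yield (sum(1 for _ in run), item)
        else:
            yield item


print(list(uniq_sketch([1, 1, 1, 2, 3, 3, 2, 2])))
print(list(uniq_sketch([1, 1, 1, 2, 3, 3, 2, 2], count=True)))
```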

group_consequent

group_consequent(iterator, key=None)

Groups consecutive elements of an iterable that share the same key and returns each group as a list.

Has an O(maximal group size) memory footprint.

>>> list(group_consequent([0, 2, 1, 3, 2, 1], key=lambda x:x%2))
[[0, 2], [1, 3], [2], [1]]
>>> list(group_consequent([None, None]))
[[None, None]]
>>> [len(g) for g in group_consequent([1, 1, 1, 2, 3, 3, 2, 2])]
[3, 1, 2, 2]

nonempty_strip

nonempty_strip(iterable)

Strips whitespace from each string and drops the strings that end up empty.

>>> list(nonempty_strip(['little ', '    ', '\tpiggy\n']))
['little', 'piggy']

collapse_whitespace

collapse_whitespace(txt)

Collapses each run of whitespace in txt into a single space.

>>> collapse_whitespace("bla   bla")
'bla bla'

num_stats

num_stats(numbers, print=False, print_formats=None)

Computes stats of the numbers, returns an OrderedDict with value and suggested print format

>>> num_stats(range(10))
OrderedDict([('count', 10), ('sum', 45), ('mean', 4.5), ('sd', 2.8722813232690143), ('min', 0), ('1%', 0.09), ('5%', 0.45), ('25%', 2.25), ('50%', 4.5), ('75%', 6.75), ('95%', 8.549999999999999), ('99%', 8.91), ('max', 9)])
>>> print_num_stats(num_stats(range(10)))
count 10
sum 45.000
mean 4.500
sd 2.872
min 0.000
1% 0.090
5% 0.450
25% 2.250
50% 4.500
75% 6.750
95% 8.550
99% 8.910
max 9.000
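
The numbers in the example above are consistent with a population standard deviation and linearly interpolated percentiles. The sketch below reproduces them in pure Python under those two assumptions; num_stats_sketch and percentile_sketch are hypothetical names, and jmpy itself may compute the values differently (for instance with numpy).

```python
import math
from collections import OrderedDict


def percentile_sketch(sorted_values, q):
    """q-th percentile (0-100) by linear interpolation between neighbours."""
    pos = (len(sorted_values) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def num_stats_sketch(numbers):
    """Return an OrderedDict of basic statistics of numbers."""
    values = sorted(numbers)
    if not values:
        raise ValueError("no numbers given")
    count = len(values)
    total = sum(values)
    mean = total / count
    # Population standard deviation (divides by count, not count - 1).
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / count)
    stats = OrderedDict([("count", count), ("sum", total), ("mean", mean),
                         ("sd", sd), ("min", values[0])])
    for q in (1, 5, 25, 50, 75, 95, 99):
        stats["%d%%" % q] = percentile_sketch(values, q)
    stats["max"] = values[-1]
    return stats


for name, value in num_stats_sketch(range(10)).items():
    print(name, "%.3f" % value)
```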

full_stats

full_stats(numbers, count_hist=True, sum_hist=False, bins='sturges', **kwargs)

Prints statistics of a list of numbers to console.

Arguments:

  • count_hist: print a histogram of the number of values in each bin.
  • sum_hist: print a histogram of the SUMS of the values in each bin instead of their counts.
  • bins: the bins argument passed on to numpy.

>>> import numpy as np
>>> np.random.seed(666)
>>> first_peak = np.random.normal(size=100)
>>> second_peak = np.random.normal(loc=4,size=100)
>>> numbers = np.concatenate([first_peak, second_peak])
>>> full_stats(numbers)
count  = 200
sum    = 403.403
mean   = 2.017 ± 2.261
median = 1.874
min = -3.095	 1% = -1.870	 5% = -0.990	25% = -0.045
max = 7.217	99% = 6.063	95% = 5.404	75% = 4.089
<BLANKLINE>
 <from,    to)  #       statistics of bin count
------------------------------------------------
-3.095, -1.949  1        0%  0% *
-1.949, -0.803 18   5.  10%  9% ******************
-0.803,  0.343 48   -.  34% 24% ************************************************
 0.343,  1.488 28       48% 14% ****************************
 1.488,  2.634 12   mM  54%  6% ************
 2.634,  3.780 34       70% 17% **********************************
 3.780,  4.926 39   -.  90% 20% ***************************************
 4.926,  6.071 18  95.  99%  9% ******************
 6.071,  7.217  2      100%  1% **

print_num_stats

print_num_stats(stats, units=None, formats=None, file=None)

Utility function to print results of num_stats function.

>>> print_num_stats(num_stats(range(10)), units={'count':'iterations'}, formats={'sum':'%.5f'})
count 10 iterations
sum 45.00000
mean 4.500
sd 2.872
min 0.000
1% 0.090
5% 0.450
25% 2.250
50% 4.500
75% 6.750
95% 8.550
99% 8.910
max 9.000
>>> print_num_stats(num_stats(range(10)), formats={'sum':'', '1%':'', '5%':''})
count 10
mean 4.500
sd 2.872
min 0.000
25% 2.250
50% 4.500
75% 6.750
95% 8.550
99% 8.910
max 9.000

mod_stdout

@_contextlib.contextmanager
mod_stdout(transform, redirect_fn=_contextlib.redirect_stdout, print_fn=print)

A context manager that modifies every line printed to stdout.

>>> with mod_stdout(lambda line: line.upper()):
...     print("this will be upper")
THIS WILL BE UPPER
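
To illustrate the idea, here is a simplified sketch built on contextlib.redirect_stdout. It is an assumption-laden illustration, not jmpy's implementation: the name mod_stdout_sketch is hypothetical, it omits the redirect_fn and print_fn parameters, and it buffers all output until the block exits rather than transforming lines as they are printed.

```python
import contextlib
import io


@contextlib.contextmanager
def mod_stdout_sketch(transform):
    """Apply transform to every line printed inside the with block."""
    buffer = io.StringIO()
    # Capture everything printed inside the block.
    with contextlib.redirect_stdout(buffer):
        yield
    # stdout is restored here; emit the transformed lines.
    for line in buffer.getvalue().splitlines():
        print(transform(line))


with mod_stdout_sketch(lambda line: line.upper()):
    print("this will be upper")

with mod_stdout_sketch(lambda line: " * " + line):
    print("bullet")
```

The second use shows how a prefixing context manager can be expressed as a special case of the same mechanism.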

prefix_stdout

prefix_stdout(prefix)

A context manager that prefixes every line printed to stdout with prefix.

>>> with prefix_stdout(" * "):
...     print("bullet")
 * bullet