scikit-hep/root_numpy

Use np.fromiter for GeneratorTypes

parnmatt opened this issue · 8 comments

https://github.com/rootpy/root_numpy/blob/master/root_numpy/_hist.py#L47

np.asarray will take any array-like structure and return an array; however this doesn't work for generators/iterators. np.fromiter will work for any iterable.

array = np.fromiter(array, dtype=np.double) in theory should do exactly the same as what is currently used, but also accept generator-like objects too.

I'd make a pull request, but I'd rather this be discussed if this is what is wanted in the project before I do (as it may mean several similar changes throughout the codebase, and several tests to be written).

ndawe commented

fromiter is much slower than asarray due to the constant resizing of the array and copying. So maybe we can first detect if the input is a generator and use fromiter otherwise use asarray:

import timeit

setup = '''
from numpy import ones, fromiter, asarray, int
arr = ones(1000000, dtype=int)
'''

print "fromiter: ", min(timeit.Timer('fromiter(arr, dtype=int)', setup=setup).repeat(3, 100))
print "asarray:  ", min(timeit.Timer('asarray(arr, dtype=int)', setup=setup).repeat(3, 100))
fromiter:  4.12140917778
asarray:   5.48362731934e-05

Easiest way is to use asarray by default and then catch the exception and use fromiter. This would be very python-esque and work quickly.

Actually, I can't seem to get asarray to crash on a generator consistently... For example:

>>> np.asarray(xrange(500), dtype=np.int)  # works
>>> np.asarray((i for i in range(500)), dtype=np.int)  # crashes
TypeError: long() argument must be a string or a number, not 'generator'

I always thought xrange would be a generator, but it seems like asarray can handle it. I guess this gives me more power for my argument in wrapping an exception

try:
  np.asarray((i for i in range(500)), dtype=np.int)
except TypeError as e:
  if "not 'generator'" not in str(e):
    raise
  else:
    print "caught"

works well.

@ndawe I see you're poit, it isn't very practice to have as the default. I agree with @kratsg that catcching the expectation would be best.
@kratsg that's because xrange isn't a generator per se, whereas you are constructing a generator from an iterate in the second.

ndawe commented

What about using inspect.isgenerator upfront instead of catching the exception:

>>> import inspect
>>> inspect.isgenerator(range(10))
False
>>> inspect.isgenerator(xrange(10))
False
>>> inspect.isgenerator((i for i in range(10)))
True
ndawe commented

That is true if it results in faster cleaner code (to prefer try/except). But we should avoid doing anything with the content of the exception message.

ndawe commented

In fact, isgenerator is simply:

isinstance(thing, types.GeneratorType)

So, something along the lines of:

try:
    ... = np.asarray(thing, ...)
except TypeError:
    if isinstance(thing, types.GeneratorType):
        ... = np.fromiter(thing, ...)
    else:
        raise
ndawe commented

Getting back to this. Generally one would use a generator if you don't want to have a big array of all values in memory at once. In that case it's better to just loop on the generator and fill the histogram accordingly. Internally converting the generator to an array in fill_hist kind of defeats the purpose of using a generator in the first place. If you really want to, then nothing is preventing you from using fill_hist(hist, np.fromiter(things)). I'm just a bit hesitant to put support for generators inside fill_hist.