Use np.fromiter for GeneratorTypes
parnmatt opened this issue · 8 comments
https://github.com/rootpy/root_numpy/blob/master/root_numpy/_hist.py#L47
np.asarray
will take any array-like structure and return an array; however this doesn't work for generators/iterators. np.fromiter
will work for any iterable.
array = np.fromiter(array, dtype=np.double)
in theory should do exactly the same as what is currently used, but also accept generator-like objects too.
I'd make a pull request, but I'd rather this be discussed if this is what is wanted in the project before I do (as it may mean several similar changes throughout the codebase, and several tests to be written).
fromiter is much slower than asarray due to the constant resizing of the array and copying. So maybe we can first detect if the input is a generator and use fromiter otherwise use asarray:
import timeit
setup = '''
from numpy import ones, fromiter, asarray, int
arr = ones(1000000, dtype=int)
'''
print "fromiter: ", min(timeit.Timer('fromiter(arr, dtype=int)', setup=setup).repeat(3, 100))
print "asarray: ", min(timeit.Timer('asarray(arr, dtype=int)', setup=setup).repeat(3, 100))
fromiter: 4.12140917778
asarray: 5.48362731934e-05
Easiest way is to use asarray
by default and then catch the exception and use fromiter
. This would be very python-esque and work quickly.
Actually, I can't seem to get asarray
to crash on a generator consistently... For example:
>>> np.asarray(xrange(500), dtype=np.int) # works
>>> np.asarray((i for i in range(500)), dtype=np.int) # crashes
TypeError: long() argument must be a string or a number, not 'generator'
I always thought xrange would be a generator, but it seems like asarray
can handle it. I guess this gives me more power for my argument in wrapping an exception
try:
np.asarray((i for i in range(500)), dtype=np.int)
except TypeError as e:
if "not 'generator'" not in str(e):
raise
else:
print "caught"
works well.
What about using inspect.isgenerator
upfront instead of catching the exception:
>>> import inspect
>>> inspect.isgenerator(range(10))
False
>>> inspect.isgenerator(xrange(10))
False
>>> inspect.isgenerator((i for i in range(10)))
True
That is true if it results in faster cleaner code (to prefer try/except). But we should avoid doing anything with the content of the exception message.
In fact, isgenerator is simply:
isinstance(thing, types.GeneratorType)
So, something along the lines of:
try:
... = np.asarray(thing, ...)
except TypeError:
if isinstance(thing, types.GeneratorType):
... = np.fromiter(thing, ...)
else:
raise
Getting back to this. Generally one would use a generator if you don't want to have a big array of all values in memory at once. In that case it's better to just loop on the generator and fill the histogram accordingly. Internally converting the generator to an array in fill_hist kind of defeats the purpose of using a generator in the first place. If you really want to, then nothing is preventing you from using fill_hist(hist, np.fromiter(things))
. I'm just a bit hesitant to put support for generators inside fill_hist.