choice of returned histogram datatype
manodeep opened this issue · 9 comments
Thanks for writing this fast histogram package. A task that we astronomers tend to do often and repeatedly :)
Coming back to the package itself, it seems to me that the return datatype of the histograms are doubles
. Conceptually, the histogram is simply a count in that bin, and should be of an integral type. Is the return datatype chosen to be double
to reflect the input array type? Or is that in keeping with the numpy
convention?
Changing the histogram increment to be an integral type (i.e., the postfix ++
operator) might also lead to better performance. (There are some opportunity for performance improvements -- e.g., pointer operations, reduction of scope for variables, unrolled loops + separate counters)
Separately, any particular reason to perform a bitwise and (&
) rather than the logical and (&&
) here?
I decided to return doubles for consistency with Numpy, but we could potentially have an option to choose the return type (at least maybe choosing between double and long?). All the other improvements sound good, and the use of &&
was a mistake :)
Would you be interested in submitting a pull request to implement/fix some of these as well as #4?
You mention reduction of scope for variables, but just to let you know, I found that all the variables had to be declared at the top to keep the Windows compilers happy (but maybe you know how to avoid this).
I sent up the trivial stuff in the PR #6.
I have never compiled for Windows. What's the error you get on Windows with these lines:
const double postfac = normx * fnx;
for(i = 0; i < n; i++) {
if (*x >= xmin && *x < xmax) {
const long ix = (*x - xmin) * postfac;
count[ix] += 1.;
}
x++;
}
@manodeep - Thanks! I don't have Windows locally, but you can try and open a PR, and there will be an AppVeyor build in which you can check any error messages if the build fails.
FYI, item += 1
stops working if the numbers get large enough for floating point numbers. However, that number is very large for doubles.
Also, numpy returns int64s:
import numpy as np
a,b = np.histogram([1,2,.3])
print(a.dtype)
# prints int64
I wonder if this changed in Numpy since I wrote this package. I'd certainly be up for returning integer values from the plain histogram and doubles from the weighted one, though I need to check if that might have any undesired side effects on other packages.
I think this changed in 1.10, at least the 64 part of the int was added then. This is also when numpy got a high-speed 1D histogram algorithm. I wonder how it compares to fast-histogram now in 1D?
numpy/numpy@34b582a#diff-9d44603693b9544b900e135a203b3ae2R180, numpy/numpy#7629
Edit, per this issue numpy/numpy#7845, 2D/ND histograms still return a floating type.
The current benchmarks shown in the README are against a later version than 1.10, so this is still quite a bit faster than numpy. I'm not sure if you are familiar with the asv tool for benchmarks, but I keep a running benchmark here of how fast fast-histogram is compared to Numpy: https://astrofrog.github.io/fast-histogram/#benchmarks.Histogram1D.time_histogram1d?x-axis=commit
By the way, the current Numpy code is internally in int64s, and then casts at the end. There's one line that can be dropped to stop the casting. This allows Numpy to count past the point that floating point can no longer add 1.