The vectorized submodule is not fully supported throughout the package
jgostick opened this issue · 6 comments
Describe the bug
The vectorized functionality does not seem to be universally applied.
Minimal Reproducible Example
import chemicals as chem
import numpy as np
from chemicals import vectorized

props = chem.heat_capacity.TRC_gas_data.loc['7732-18-5']  # for water
coeffs = {}
for val in props.index:
    if val.startswith('a'):  # collect the TRC polynomial coefficients
        coeffs[val] = props[val]
Cp = chem.heat_capacity.TRCCp(T=np.array([333, 444]), **coeffs)  # raises ValueError
Additional context
The error message is:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This is occurring because of the line:
if T <= a7:
For this to work in a vectorized way, this needs to be if np.all(T <= a7).
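A minimal standalone demonstration of the problem (T values taken from the example above; a7 = 559.0 is just an illustrative coefficient):

import numpy as np

T = np.array([333, 444])
a7 = 559.0
# "if T <= a7:" raises the ValueError above, because an array of two
# booleans has no single truth value. Reducing with np.all() gives a
# scalar boolean that works in an if statement:
print(np.all(T <= a7))  # True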
Hi Jeff,
Thanks for reaching out. I think maybe I haven't been clear enough in the documentation for that submodule. The chemicals.vectorized module itself has the vectorized functions; importing it doesn't change the entire library to use vectorized functions. Your example works once the path of your function call is changed:
import chemicals as chem
import numpy as np
from chemicals import vectorized

props = chem.heat_capacity.TRC_gas_data.loc['7732-18-5']  # for water
coeffs = {}
for val in props.index:
    if val.startswith('a'):
        coeffs[val] = props[val]
Cp = vectorized.TRCCp(T=np.array([333, 444]), **coeffs)  # vectorized call succeeds
The if np.all(T <= a7) call isn't necessary because the function is wrapped with numpy's vectorize function. This is a nice way of providing vectorization functionality without too much work. The only con is performance. For performance, the chemicals.numba submodule provides C/Fortran level speed, substantially more than numpy.
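To make that concrete, here is a minimal sketch of the mechanism; the scalar function below is an illustrative stand-in, not the actual TRCCp source:

import numpy as np

def scalar_func(T, a7):
    # A scalar branch like the one in TRCCp's source
    if T <= a7:
        return 1.0
    return 2.0

# np.vectorize wraps the scalar function so array inputs broadcast
# element-wise; the loop still runs in Python, so it adds convenience,
# not speed.
vec_func = np.vectorize(scalar_func)
print(vec_func(np.array([333.0, 700.0]), 559.0))  # [1. 2.]

And the numba version for comparison: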
import numpy as np
import chemicals.numba_vectorized
# Arguments are positional: the T array, then the TRC coefficients for water
chemicals.numba_vectorized.TRCCp(np.array([333, 444]), 4.0, 870000.0, 1646.0, 3.111, 1.728, -54010000.0, 559.0, 304.0)
I will try to make the documentation more clear.
Sincerely,
Caleb
I have been implementing my own functions when needed, by pulling the code out of your source and changing the checks, which so far have been the only thing stopping your 'normal' functions from working natively with numpy arrays as inputs. Is there any reason why "you" (or a kindly contributor ;-) ) don't just overhaul the functions accordingly? For instance, the extended Antoine function would be:
import chemicals as chem
import numpy as np

# CAS and T assumed defined, e.g. CAS = '7732-18-5' and T = np.array([300., 350.])
coeffs = chem.vapor_pressure.Psat_data_AntoineExtended.loc[CAS]
_, A, B, C, Tc, to, n, E, F, Tmin, Tmax = coeffs
x = (T - to - 273.15)/Tc
x = np.clip(x, 0, np.inf)  # Changing this line makes it work with numpy
x4 = x*x*x*x
PV = 10.**(A - B/(T+C) + 0.43429*x**n + x4*x4*(E + F*x4))
Also, have you compared a numpy computation like the one above with the numba one? I am not sure the numba performance would be better; or at least, numpy would be so close to C speed on something like that, it's hard to see how it would be a problem. Besides, numba is a huge pain. For instance, try this for water:
numba.COSTALD(np.array([298., 298.]), 647.14, 5.6000000000000006e-05, 0.344)
For me, I get a lovely cryptic message from numba about type mismatches.
BTW, you do not need to make the documentation more clear. I have literally never seen better docs on a free/open source project. I am worried about you actually...I think you spend too much time on this!
Hi Jeff,
Your new example numba call needs to use the numba_vectorized submodule to accept numpy arrays. This is still very much powered by numba, but it gets turned into a true numpy ufunc by numba.vectorize, which allows all sorts of fun broadcasting. As an example, the function could be called with an array of Tc values but a single temperature.
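For instance, a sketch of a working form of your COSTALD example with the same water parameters (assuming the numba_vectorized submodule shown earlier):

import numpy as np
from chemicals import numba_vectorized

# The ufunc version accepts arrays directly:
numba_vectorized.COSTALD(np.array([298.0, 298.0]), 647.14, 5.6e-05, 0.344)
# Broadcasting also allows an array of Tc values with a single temperature:
numba_vectorized.COSTALD(298.0, np.array([647.14, 640.0]), 5.6e-05, 0.344)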
I really do agree numba is frustrating to use, and I empathize with the desire for numpy function compatibility everywhere. Numpy is a really great tool and it can allow prototyping like nothing else I have ever used! This code base is "production" though, not a prototype, and the code has been heavily optimized.
I am not against numpy at all, but I am heavily invested in using PyPy, another accelerator. PyPy and NumPy don't play well together: the C-extension framework can't be sped up by PyPy, and in fact usually makes code in PyPy slower than in CPython. I think the two different ways of using numpy with chemicals are a pretty good compromise, and I can only encourage you to use them. If a third way comes out of automatically numpy-ing Python code, I would consider adding that also.
The speed hit from using numpy is very real, much larger than most people expect. I did a quick example showing the speed of the Antoine function optimized for Numpy, CPython, Numba, and PyPy, with inputs of 1 million, 1000, 100, and 10 temperatures:
| Temperatures | Numpy | CPython | Numba | PyPy |
|---|---|---|---|---|
| 10^6 | 47.7 ms | 287 ms | 25.2 ms | 56 ms |
| 1000 | 45.1 µs | 230 µs | 24.9 µs | 48.3 µs |
| 100 | 17.1 µs | 23 µs | 5.56 µs | 4.74 µs |
| 10 | 14.3 µs | 2.41 µs | 3.23 µs | 461 ns |
For myself, my input vectors tend to be in the 4 to 40 length range, so in this regime I get hit with the ~4x slowdown when using NumPy.
I have attached the Jupyter notebook (which does not include PyPy) as a zip file if you would like to see the difference yourself.
Antoine performance.zip
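If you want to reproduce a comparison like this without the notebook, here is a minimal timing sketch using the TRCCp water coefficients from earlier; the exact numbers will of course vary by machine:

import timeit
import numpy as np
import chemicals.heat_capacity
import chemicals.numba_vectorized

Ts = np.linspace(300.0, 500.0, 1000)
args = (4.0, 870000.0, 1646.0, 3.111, 1.728, -54010000.0, 559.0, 304.0)

# Warm up numba's JIT so compilation time is not measured
chemicals.numba_vectorized.TRCCp(Ts, *args)

# Scalar function called in a plain CPython loop vs. the numba ufunc
t_loop = timeit.timeit(lambda: [chemicals.heat_capacity.TRCCp(T, *args) for T in Ts], number=100)
t_ufunc = timeit.timeit(lambda: chemicals.numba_vectorized.TRCCp(Ts, *args), number=100)
print(f"loop: {t_loop/100*1e6:.1f} us/call, ufunc: {t_ufunc/100*1e6:.1f} us/call")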
Interesting. Our problems are usually limited by the matrix routines, so a few microseconds don't matter much.
Anyway, new issue: I am getting an 'invalid number of arguments' error from all the numba_vectorized functions. Any thoughts?
numba_vectorized.viscosity_gas_Gharagheizi(T=[444, 555], Tc=555, Pc=444, MW=18)
Sorry to be a pain, but we're integrating chemicals into openpnm for phase property estimation, so we need to apply the functions to millions of values, hence my interest in the speed-ups.
Ummm, I just happened to discover that they only work with positional arguments. I think that'll get me going now... but positional arguments scare me; I guess I'll just have to live with it.
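For the record, the working form of the call above, with the same values passed positionally (argument order T, Tc, Pc, MW, matching the scalar function's signature):

import numpy as np
from chemicals import numba_vectorized

# Keyword arguments fail on numba-created ufuncs; positional works:
numba_vectorized.viscosity_gas_Gharagheizi(np.array([444.0, 555.0]), 555.0, 444.0, 18.0)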
Thanks again!