Pykx pandas conversion offers no speedup with short vs float columns
antipisa opened this issue · 3 comments
Python: 3.11.8
Pandas: 2.2.1
Numpy: 1.26.4
There seems to be only a marginal speed benefit when converting kdb+ tables to pandas with the .pd() method if columns are short ints instead of floats. Although the table's memory usage drops accordingly, the time spent converting to a DataFrame does not improve much.
N:50000000;
dat1:([] date:2000.01.01; sym:`A; q1:N?100h; q2:N?5000h; q3:N?50h);
dat2:([] date:2000.01.01; sym:`A; q1:N?100f; q2:N?5000f; q3:N?50f);
Indeed, the size of dat1 is 40% the size of dat2, and yet in python:
handle = pykx.SyncQConnection(host, port)
%timeit df1 = handle('dat1').pd() #short
%timeit df2 = handle('dat2').pd() #float
3.48s per loop
5.76s per loop
The gap narrows the more float/short columns you add.
Is there a way to optimize the call to .pd() when dealing with very large tables whose columns are mostly shorts? Otherwise one can spend a very long time waiting for the conversion.
If you are not already using Pandas 2.0 it's worth upgrading as you will see a 3x speed improvement for these conversions.
Pandas 1.5.3
In [1]: import pykx as kx
In [2]: kx.q['N'] =50000000;
...: dat1 = kx.q('([] date:2000.01.01; sym:`A; q1:N?100h; q2:N?5000h; q3:N?50h)')
...: dat2 = kx.q('([] date:2000.01.01; sym:`A; q1:N?100f; q2:N?5000f; q3:N?50f)')
...:
In [3]: %timeit df1 = dat1.pd()
2.13 s ± 163 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit df2 = dat2.pd()
2.1 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Pandas 2.1.4
In [1]: import pykx as kx
...:
...: kx.q['N'] =50000000;
...: dat1 = kx.q('([] date:2000.01.01; sym:`A; q1:N?100h; q2:N?5000h; q3:N?50h)')
...: dat2 = kx.q('([] date:2000.01.01; sym:`A; q1:N?100f; q2:N?5000f; q3:N?50f)')
...:
In [2]: %timeit df1 = dat1.pd()
738 ms ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit df2 = dat2.pd()
700 ms ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@rianoc-kx this is on pandas 2.2.1. Could you try going over TCP/IP?
Isolating the IPC portion you can see the larger float data is slower to transfer:
In [3]: %timeit dat1 = handle('dat1')
1.13 s ± 49.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit dat2 = handle('dat2')
1.56 s ± 32.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
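A back-of-envelope check (my own arithmetic, not from the thread) lines up with those IPC timings: assuming the date and sym columns serialize identically for both tables and so cancel out of the comparison, the three numeric columns dominate the payload, and a q float (8 bytes) is 4x the width of a q short (2 bytes):

```python
# Rough serialized payload of the three numeric columns only.
N = 50_000_000
short_payload = 3 * N * 2   # three int16 columns, 2 bytes each
float_payload = 3 * N * 8   # three float64 columns, 8 bytes each
print(f"shorts: ~{short_payload / 2**20:.0f} MiB")  # ~286 MiB
print(f"floats: ~{float_payload / 2**20:.0f} MiB")  # ~1144 MiB
```

So the float table pushes roughly 4x the numeric bytes over the wire, consistent with it transferring more slowly.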
Removing the sym column shows that it is responsible for the majority of the conversion time: ~700ms --> ~300ms
In [9]: kx.q['N'] =50000000;
...: dat1 = kx.q('([] date:2000.01.01;q1:N?100h; q2:N?5000h; q3:N?50h)')
...: dat2 = kx.q('([] date:2000.01.01; q1:N?100f; q2:N?5000f; q3:N?50f)')
In [10]: %timeit df1 = dat1.pd()
311 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [11]: %timeit df2 = dat2.pd()
270 ms ± 18.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
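The sym cost is unsurprising: by default q symbols become per-row Python str objects in an object-dtype column. As a plain pandas/NumPy illustration of why that representation is expensive (no pykx involved, and assuming a single repeated symbol as in the example tables), compare object dtype against a categorical:

```python
import numpy as np
import pandas as pd

n = 1_000_000
# What a repeated q symbol column becomes by default: one Python object per row.
obj_col = pd.Series(np.array(["A"] * n, dtype=object))
# The same data as a categorical: small integer codes plus one category value.
cat_col = obj_col.astype("category")
print(obj_col.memory_usage(deep=True))  # per-row str objects dominate
print(cat_col.memory_usage(deep=True))  # roughly one byte of codes per row
```

Building and managing millions of Python objects is what makes the sym column dominate the conversion, whereas numeric columns stay inside contiguous C buffers.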
Going further and removing date as well, we can see it was responsible for most of the remaining conversion time, leaving only the numeric conversions, which are the most direct and therefore fastest:
In [12]: dat1 = kx.q('([] q1:N?100h; q2:N?5000h; q3:N?50h)')
...: dat2 = kx.q('([] q1:N?100f; q2:N?5000f; q3:N?50f)')
In [13]: %timeit df1 = dat1.pd()
47.7 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [14]: %timeit df2 = dat2.pd()
121 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Float arrays can be zero-copied from q into NumPy arrays when building the DataFrame, which makes that conversion effectively a constant time operation.
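The zero-copy point can be illustrated with plain NumPy/pandas (assumed analogous to what pykx does for float vectors): wrapping an existing float64 buffer in a Series reuses the same memory in constant time, while converting an int16 buffer to a wider dtype is an O(n) element-by-element copy:

```python
import numpy as np
import pandas as pd

floats = np.arange(5_000_000, dtype=np.float64)
s = pd.Series(floats, copy=False)             # wraps the existing buffer
assert np.shares_memory(s.to_numpy(), floats)  # no data was copied

shorts = np.zeros(5_000_000, dtype=np.int16)   # stand-in for a q short vector
widened = shorts.astype(np.int64)              # a genuine copy: new buffer
assert not np.shares_memory(widened, shorts)
```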
short -> int -> long does show conversion time increasing with the datatype size:
In [3]: kx.q['N'] =50000000;
In [4]: dat1 = kx.q('([] q1:N?100h; q2:N?5000h; q3:N?50h)')
In [5]: %timeit df1 = dat1.pd()
43.8 ms ± 701 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: dat2 = kx.q('([] q1:N?100i; q2:N?5000i; q3:N?50i)')
In [7]: %timeit df2 = dat2.pd()
58.3 ms ± 304 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: dat3 = kx.q('([] q1:N?100; q2:N?5000; q3:N?50)')
In [9]: %timeit df3 = dat3.pd()
97.4 ms ± 685 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
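That progression is consistent with each integer vector being copied during conversion (as the timings above suggest), so the bytes moved scale with the datatype width. A minimal sketch of the same effect with raw NumPy copies (smaller N here to keep memory modest):

```python
import time
import numpy as np

n = 10_000_000
for dtype in (np.int16, np.int32, np.int64):
    src = np.zeros(n, dtype=dtype)
    t0 = time.perf_counter()
    dst = src.copy()                       # one full pass over the buffer
    elapsed = time.perf_counter() - t0
    print(f"{np.dtype(dtype).name}: {src.nbytes / 2**20:.0f} MiB "
          f"copied in {elapsed * 1e3:.1f} ms")
```

int64 moves 4x the bytes of int16 for the same row count, so copy-bound conversion time grows with dtype width even though the narrower table is "smaller".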