cur.fetch() can be extremely slow
Opened this issue · 9 comments
I haven't quite figured out when this is the case, but some queries can cause cursor.fetch() to take several hours to complete even though just running the query from a hive CLI and piping to a file only take 60 seconds.
Is there anything that I can do to help you diagose this problem?
how large its the data set you are retrieving? Is this the case for all queries or just a few?
Hive 0.13.0.2.1.4.0-632
Correction: piping to a file from the CLI takes about 5-8 minutes, not 1 minute.
The data is stored in a managed table.
The returned data size is 4.7M rows * 13 columns, or about 306MB (Hive 'describe extended tablename' reports that the size of the table is 878MB).
All queries seem to run slowly using cursor.fetch().
My "feeling" is that something in the python code is inefficiently allocating & filling memory, or maybe just grabbing one or a few rows at a time.
When I watch the memory usage of the python process it climbs slowly but steadily.
Using pyhs2 cursor::fetch() on our cluster, reading a 66 column table (a single 10-character date column, an integer column, and 64 float columns):
fetched 3000 rows in 10.7 secs
fetched 30000 rows in 104.2 secs
So it’s reading about 20,000 elements a second (30000 rows x 66 columns ~= 2,000,000 elements).
A crappy HD reads at around 50MB/s on average. If each element was 2,500 bytes, then it should take this long to read. But each element is a float (let’s just say it’s double-precision, or 8 bytes). So the read speed is 300x slower than it should be.
From this, I’d guess that 300,000 rows takes 17 minutes and 3,000,000 rows takes about three hours
I'm running cProfile now...
cProfile stats:
ncalls tottime percall cumtime percall filename:lineno(function) 11753141/11753119 85.004 0.000 279.991 0.000 TTransport.py:54(readAll) 11753118 63.297 0.000 162.265 0.000 thrift_sasl.py:148(read) 643 59.518 0.093 59.518 0.093 {method 'recv' of '_socket.socket' objects} 1980000 49.948 0.000 434.618 0.000 ttypes.py:1773(read) 35261749/35261748 49.587 0.000 49.587 0.000 {len} 7248226 40.663 0.000 189.522 0.000 TBinaryProtocol.py:195(readByte) 7218080 34.901 0.000 307.126 0.000 TBinaryProtocol.py:154(readFieldBegin) 30000 28.844 0.001 478.409 0.016 ttypes.py:1894(read) 11753128 22.575 0.000 22.575 0.000 {method 'read' of 'cStringIO.StringI' objects} 11722981 20.213 0.000 20.213 0.000 {_struct.unpack} 3227507 17.853 0.000 83.466 0.000 TBinaryProtocol.py:200(readI16) 930000 15.269 0.000 121.194 0.000 ttypes.py:1320(read) 1020000 11.462 0.000 72.414 0.000 ttypes.py:1440(read) 4 7.290 1.822 496.383 124.096 cursor.py:219(_fetch) 1980000 7.114 0.000 7.114 0.000 ttypes.py:1764(__init__) 4020468 6.236 0.000 6.236 0.000 {method 'append' of 'list' objects} 3990573 6.000 0.000 6.000 0.000 TBinaryProtocol.py:148(readStructBegin) 3990573 5.984 0.000 5.984 0.000 TBinaryProtocol.py:151(readStructEnd) 1980000 5.471 0.000 5.471 0.000 cursor.py:24(get_value) 932268 5.276 0.000 83.881 0.000 TBinaryProtocol.py:205(readI32) 3227507 4.872 0.000 4.872 0.000 TBinaryProtocol.py:161(readFieldEnd) 4 2.353 0.588 498.736 124.684 cursor.py:71(fetchSet) 1020000 1.922 0.000 1.922 0.000 ttypes.py:1437(__init__) 930000 1.769 0.000 1.769 0.000 ttypes.py:1317(__init__) 314964 1.737 0.000 8.287 0.000 TBinaryProtocol.py:215(readDouble) 30000 0.500 0.000 4.830 0.000 ttypes.py:1500(read) 66 0.403 0.006 0.403 0.006 {pandas.lib.maybe_convert_objects} 4 0.311 0.078 478.833 119.708 ttypes.py:1971(read) 30138 0.187 0.000 1.717 0.000 TBinaryProtocol.py:173(readListBegin) 30149 0.183 0.000 1.545 0.000 TBinaryProtocol.py:220(readString) 33 0.104 0.003 0.104 0.003 {pandas.lib.is_possible_datetimelike_array} 30000 0.061 0.000 0.061 0.000 ttypes.py:1497(__init__) 30000 0.060 0.000 0.060 0.000 ttypes.py:1891(__init__) 1 0.049 0.049 0.049 0.049 {pandas.lib.to_object_array} 30138 0.047 0.000 0.047 0.000 TBinaryProtocol.py:178(readListEnd)
answer is simple: don't use fetch(). use one of the Python standard PEP 249 calls: fetchall() (not a good idea for a large data set), fetchone() (for going one row at a time) or fetchmany() (return a block of records at a time). just using fetch() might (and probably does) overwhelms your system and should not be generally used.
@zklopman-rovi what option would you recommend works best for a large data set? Sorry, I am a little late to the party.
that depends on your needs, and what you consider "large". In general, you
should not overwhelm the memory. Assuming a typical 4GB available RAM, I
would day that around a million normal sized (not very wide) records can
still be handled in fetchall(). If you have more than that, bring them over
in blocks (10,000 or 100,000 records) using fetchmany(). If you get blocks,
you could dispatch them to separate threads to be worked upon (if
feasible). If your number of records is way more than that (billions or
more) you should not be exporting them to a lonely python program to be
processed - harness the power of Hive (or other Hadoop technologies) for
your work and trim it down. Exact details are, of course, dependent on the
application and environment.
hope this helps,
Zachi Klopman
On Wed, May 25, 2016 at 1:55 PM, anshanno notifications@github.com wrote:
@zklopman-rovi https://github.com/zklopman-rovi what option would you
recommend works best for a large data set? Sorry, I am a little late to the
party.—
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#36 (comment)
@zklopman well, I have about 170 million rows and 200 columns so by big data standards, its not that big. It seems however that it is far too large to pull in a reasonable amount of time with this module though. It takes around 45 seconds per 10k block. Fetching seems to be the bottleneck of the module.
Hi,
170,000,000 x 200 x 5 bytes/col (assumption - could be much bigger) ~ 170GB
data, (and we do not even consider protocol overhead here). That is usually
way too much to handle on a little laptop or small server, that usually
don't have more than 16GB RAM. For this size, you should do your
calculations on the cluster or use other means to transfer your data.
As for the interface speed:
10,000x200x5 ~ 10MB in 45s seems slow (about 1/2 MB/s). I would expect it
to be at least 10x faster with normal hardware and network speeds. Of
course, if the typical width of the columns is not 5 but 50 bytes, that
will make a huge difference. There are also other issues of network
performance tuning, but those vary wildly between locations.
hope this helps,
Zachi
On Thu, May 26, 2016 at 9:12 AM, anshanno notifications@github.com wrote:
@zklopman https://github.com/zklopman well, I have about 170 million
rows and 200 columns so by big data standards, its not that big. It seems
however that it is far too large to pull in a reasonable amount of time
with this module though. It takes around 45 seconds per 10k block. Fetching
seems to be the bottleneck of the module.—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#36 (comment)