altaurog/pgcopy

struct.error on Python 3.8.6

Closed this issue · 19 comments

When I'm inserting an iterable (list of lists) to my table, I get this error. Below this error are the lines missing from the traceback (because of Cython), and below those are the datatypes being used in the copy command. I get this error with mgr.copy, and mgr.threading_copy. I know that the iterable is correct, as using execute_batch with psycopg2 on the data works fine.

  File "cboe_postgres.pyx", line 18, in cboe_postgres.run
  File "cboe_postgres.pyx", line 89, in cboe_postgres.run
  File "cboe_postgres.pyx", line 114, in cboe_postgres.run
  File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/pgcopy/copy.py", line 290, in copy
    self.writestream(data, datastream)
  File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/pgcopy/copy.py", line 321, in writestream
    datastream.write(struct.pack(''.join(fmt), *rdat))
struct.error: required argument is not a float
L18    cpdef run(bint start_from_top=True):
L89        with open(path, "r") as f: # idk why, but opening any file in cython causes it to be added to the traceback?
L114                    mgr.threading_copy(rows) # mgr.copy(rows) also fails
        CREATE TABLE Day(
        row_id integer,
        undersymbol text,
        date_id date,
        ticker text,
        expiration text,
        strike real,
        type text,
        open real,
        high real,
        low real,
        close real,
        volume integer,
        iv real,
        delta real,
        day_avgs integer[] DEFAULT '{}',
        iv_chg real DEFAULT 0,
        avg_iv real DEFAULT 0,
        prev_avg_iv real DEFAULT 0,
        spread real DEFAULT 0,
        PRIMARY KEY(row_id)
        )

These are the values for fmt, ''.join(fmt), and *rdat, in case that helps at all:

fmt: ['>h', 'ii', 'i1s', 'ii', 'i1s', 'i10s', 'if', 'i1s', 'if', 'if', 'if', 'if', 'ii', 'if', 'if']

fmt str: >hiii1siii1si10sifi1sififififiiifif

rdat: 14 4 0 1 b'A' 4 7492 1 b'A' 10 b'2020-07-17' 4 40.000 1 b'C' 4 0.0000 4 0.0000 4 0.0000 4 0.0000 4 0 4 2.0359 4 0.9928

Thanks for reporting this. I think I will need some more context. What version of pgcopy are you using? Is this a regression in a new version?

pip show pgcopy:

Name: pgcopy
Version: 1.4.3
Summary: Fast db insert with postgresql binary copy
Home-page: https://pgcopy.readthedocs.io/en/latest/
Author: Aryeh Leib Taurog
Author-email: python@aryehleib.com
License: MIT
Location: /root/.pyenv/versions/3.8.6/lib/python3.8/site-packages
Requires: pytz, psycopg2
Required-by:

Oh, I think this is because it assumes my columns that are defined as a real data type will be float, even though psycopg2 and postgres either don't require them to be, or automatically typecast them, I'm not sure which. I changed the definition from real to numeric and got this error:

  File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/pgcopy/copy.py", line 290, in copy
    self.writestream(data, datastream)
  File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/pgcopy/copy.py", line 318, in writestream
    f, d = formatter(val)
  File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/pgcopy/copy.py", line 135, in <lambda>
    return lambda v: ('i', (-1,)) if v is None else formatter(v)
  File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/pgcopy/copy.py", line 211, in f
    errors.raise_from(ValueError, message, exc)
  File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/pgcopy/errors/py3.py", line 9, in raise_from
    raise exccls(message) from exc
ValueError: error formatting value 40.000 for column strike

One big problem is that the columns can be 40.000, or 40.500, or 41. The place I'm getting this data from makes no such distinction, and while I could make them all floats/ints myself, that would take a lot of cpu and time considering the huge size of the datasets I'm getting.

Thanks for the super quick response by the way!

Yes, indeed, the datatypes are pretty limited: https://pgcopy.readthedocs.io/en/1.4.3/datatypes.html
In general, the cpu time required to convert types is not your rate-limiting factor performance-wise. But if you have huge datasets (say you’re processing all available options), you probably want to use pandas. I am pretty sure you can pass df.iterrows() to pgcopy, but I don’t off the top of my head know if the date formatters will work with pandas Timestamp values.

Sorry, no, I believe you can pass an int where float is expected. It looks like it is getting a different type, perhaps a string.
Try the following:
struct.pack('f', 4)
struct.pack('f', '4')

Also, it looks like rdat has many more values than fmt. I’m not sure why that would be.

The latter pack gives me the error.

    print(struct.pack('f', '4'))
struct.error: required argument is not a float

The length of fmt is 15 and the length of rdat is 29 for some reason? I think it may be duplicating some values.

Actually, the fmt and rdat values you provided seem okay:

>>> import struct
>>> fmt = '>hiii1siii1si10sifi1sififififiiifif'
>>> rdat = [ 14, 4, 0, 1, b'A', 4, 7492, 1, b'A', 10, b'2020-07-17', 4, 40.000, 1, b'C', 4, 0.0000, 4, 0.0000, 4, 0.0000, 4, 0.0000, 4, 0, 4, 2.0359, 4, 0.9928, ]
>>> struct.pack(fmt, *rdat)
b'\x00\x0e\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x01A\x00\x00\x00\x04\x00\x00\x1dD\x00\x00\x00\x01A\x00\x00\x00\n2020-07-17\x00\x00\x00\x04B \x00\x00\x00\x00\x00\x01C\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x04@\x02L0\x00\x00\x00\x04?~($'

That's weird. What OS and Python are you running, as they definitely don't work for me?

Strange indeed. I just tried with python 3.7.1 and python 3.8.2 on linux, works in both.

Could the problem be with datastream.write instead of struct.pack?

This is a struct.pack error: struct.error: required argument is not a float
Are you certain this is the problematic row?

Oh true. Yes, I am definitely sure, I'll show a screenshot of the terminal.

can you post the actual datarow you’re passing to pgcopy, please?

I'll put it in a pastebin as it's huge.

Edit: Never mind I'll just show a single one of the 25k rows, along with accompanying diagnostics:

the row: [0, 'A', datetime.date(2020, 7, 6), 'A', '2020-07-17', '40.000', 'C', '0.0000', '0.0000', '0.0000', '0.0000', '0', '2.0359', 0.9928]
fmt: ['>h', 'ii', 'i1s', 'ii', 'i1s', 'i10s', 'if', 'i1s', 'if', 'if', 'if', 'if', 'ii', 'if', 'if']

fmt str: >hiii1siii1si10sifi1sififififiiifif

rdat: 14 4 0 1 b'A' 4 7492 1 b'A' 10 b'2020-07-17' 4 40.000 1 b'C' 4 0.0000 4 0.0000 4 0.0000 4 0.0000 4 0 4 2.0359 4 0.9928
lengths (fmt/rdat): 15 29

Edit 2: I must have gotten too used to psycopg2 typecasting for me, I can't believe I forgot to check if the values themselves were strings.

yep, '40.000' won’t work here for float value.

Yeah, I see that now. Will automatic typecasting, at least for strings to ints/floats, be considered in the future? There's an interesting library here that I've looked at for personal purposes that could work great for typecasting from strings to numbers if needed.

I’m inclined to say no. That’s a broad area I don’t think is a good match for this project, and it’s handled easily enough outside. I would be more open to addressing issues with pandas compatibility, if there are any.

Ok, sounds good. Thanks for the help!