OverflowError: int too big to convert
rudolfbyker opened this issue · 17 comments
I'm processing some huge JSON files with json_stream in Python, and when using the default tokenizer (from this repo), I get "OverflowError: int too big to convert". My problem is that the JSON file is too large to investigate using a text editor or IDE, so I can't provide a reproduction yet. However, when I switch to the Python tokenizer, it works without errors.
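For clarity, here is roughly what I mean by switching tokenizers. This is just a minimal sketch; I'm assuming json_stream.load() accepts a tokenizer argument and that the pure-Python tokenizer lives at json_stream.tokenizer.tokenize, as I understand from the json-stream docs (the file name is a placeholder):

import json_stream
import json_stream.tokenizer

with open("huge.json", "r") as f:
    # Default: picks up json-stream-rs-tokenizer if installed -> fails with
    # "OverflowError: int too big to convert" on my file.
    # data = json_stream.load(f)

    # Forcing the pure-Python tokenizer works, but is much slower:
    data = json_stream.load(f, tokenizer=json_stream.tokenizer.tokenize)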
Hmmm that's strange... Arbitrary-size integers should be supported since #14.
Search results for "OverflowError: int too big to convert" suggest that this comes from Python itself when one tries to express a large number in too few bytes. But I don't think we do that anywhere?
Will have to have a look.
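For reference, the same message can be produced from pure Python by converting an int into a fixed number of bytes that is too small to hold it, for example:

>>> (2 ** 64).to_bytes(8, "big")  # 2**64 needs 9 bytes, so 8 is too few
Traceback (most recent call last):
  ...
OverflowError: int too big to convert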
Do you use CPython or another Python implementation like PyPy?
And could you perhaps provide the relevant lines from the traceback?
Thanks for your time.
Python:
Python 3.11.1 (tags/v3.11.1:a7a450f, Dec 6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform
>>> platform.python_implementation()
'CPython'
Traceback:
Traceback (most recent call last):
...
File "C:\foo\filter_json.py", line 154, in main
json.dump(
File "C:\bar\python\Lib\json\__init__.py", line 179, in dump
for chunk in iterable:
File "C:\bar\python\Lib\json\encoder.py", line 430, in _iterencode
yield from _iterencode_list(o, _current_indent_level)
File "C:\bar\python\Lib\json\encoder.py", line 326, in _iterencode_list
yield from chunks
File "C:\bar\python\Lib\json\encoder.py", line 406, in _iterencode_dict
yield from chunks
File "C:\bar\python\Lib\json\encoder.py", line 297, in _iterencode_list
for value in lst:
File "C:\bar\python\Lib\site-packages\json_stream\writer.py", line 17, in __next__
return next(self._it)
^^^^^^^^^^^^^^
File "C:\foo\filter_json.py", line 222, in process_results
for result in results:
File "C:\bar\python\Lib\site-packages\json_stream\base.py", line 48, in _iter_items
self._clear_child()
File "C:\bar\python\Lib\site-packages\json_stream\base.py", line 41, in _clear_child
self._child.read_all()
File "C:\bar\python\Lib\site-packages\json_stream\base.py", line 62, in read_all
collections.deque(self._iter_items(), maxlen=0)
File "C:\bar\python\Lib\site-packages\json_stream\base.py", line 50, in _iter_items
item = self._load_item()
^^^^^^^^^^^^^^^^^
File "C:\bar\python\Lib\site-packages\json_stream\base.py", line 215, in _load_item
token_type, k = next(self._stream)
^^^^^^^^^^^^^^^^^^
OSError: I/O error while parsing (index 831999): Custom { kind: Other, error: "OverflowError: int too big to convert" }
Question: Is index 831999 a line number, or something else?
Oh good, it outputs the index - so you should be able to look at what's around that point in your JSON file using a script like this:
with open("/path/to/your/json-file") as f:
    f.seek(831900)
    print(repr(f.read(200)))
That would help a lot in figuring out what's wrong. The numbers may need tweaking, though, if you opened the file as a bytes stream instead of a text stream and your file contains a lot of multi-byte UTF-8 characters (as far as I remember, the index in exceptions is always in characters, not in bytes).
Sorry, wrote the above before I saw your message. index should be a number-of-UTF-8-characters offset (not byte offset) from the start of the file.
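If seeking by a byte offset doesn't line up with that, a character-based variant would be to skip the first 831900 characters by reading them (text-mode seek() only accepts opaque positions, not character counts), roughly like this:

with open("/path/to/your/json-file", "r", encoding="utf-8") as f:
    f.read(831900)            # discard the first 831900 characters
    print(repr(f.read(200)))  # show the next 200 characters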
I get this output with your provided snippet:
' "GVM_BEL_ES_PV_OPN_FC": 0.0,\n "GVM_BEL_ES_PV_OPN_FC_CSMLC": 0.0,\n "GVM_BEL_ES_PV_OPN_FC_FIE": 0.0,\n "GVM_BEL_ES_NB": 0.0,\n "G'
Or without the repr:
"GVM_BEL_ES_PV_OPN_FC": 0.0,
"GVM_BEL_ES_PV_OPN_FC_CSMLC": 0.0,
"GVM_BEL_ES_PV_OPN_FC_FIE": 0.0,
"GVM_BEL_ES_NB": 0.0,
"G
OK so it actually has nothing to do with handling large integers in the JSON itself but something with the basic I/O machinery (hence "I/O error" in the exception, not parsing error or something). Specifically, the issue should be somewhere here:
But no idea what, the only numbers that get transmitted between Python and Rust there are things like buffer size or current cursor position.
Is your file larger than 4 GB? I don't think that's the reason, since AFAICT I use 64-bit cursor positions which fit much more than 4 GB and the error only happens at position 831999 which is way before the 4 GB point, but who knows...
Thanks for the deep investigation. My file is 808055832 bytes. I have verified that it does not contain invalid JSON by running this script:
import json

if __name__ == "__main__":
    with open("original.json", "r") as f:
        data = json.load(f)
    with open("copy.json", "w") as f:
        json.dump(data, f, indent=4)
That takes between 1 and 2 minutes to run on my machine, without errors.
For now, I'm working around this by adding a --stream flag to my program, and using it as follows:
if stream:
    # Stream the input as we process it. This uses less memory, but is slower.
    runs = json_stream.load(f_in)
else:
    # Read the entire input into RAM. This uses more memory, but is faster.
    runs = json.load(f_in)
The pure-python tokenizer is way too slow for me, so I have to choose between the Rust tokenizer, which fails on some input files, and reading everything into memory, which always works, but requires more RAM.
Unfortunately, I have not learned Rust yet, so I can't help you there...
Thanks for the information!
I still don't know what exactly the cause is, but I've just released json-stream-rs-tokenizer version 0.4.23 which should make the error messages you get more verbose (#94) and include, among other information, the original Python traceback.
Could you perhaps try updating to 0.4.23 and re-running what gave you the error, then post the (hopefully) more verbose error message here? Thanks in advance!
I'm back. Sorry for the delay.
With version 0.4.23:
Traceback (most recent call last):
...
File "C:\foo\filter_json.py", line 259, in process_result
for key, value in result.items():
File "C:\python\Lib\site-packages\json_stream\base.py", line 50, in _iter_items
item = self._load_item()
^^^^^^^^^^^^^^^^^
File "C:\python\Lib\site-packages\json_stream\base.py", line 215, in _load_item
token_type, k = next(self._stream)
^^^^^^^^^^^^^^^^^^
OSError: I/O error while parsing (index 831999): Custom { kind: Other, error: "Error seeking to offset 0 (from Cur) in Python text stream: OverflowError: int too big to convert\n(no traceback available)" }
Here is a reproduction that uses random data!
import json

import json_stream
from json_stream import streamable_list, streamable_dict
from random import choice
from string import ascii_lowercase
from tqdm import tqdm


def main() -> None:
    print("Generating ...")
    with open("foo.json", "w") as f:
        json.dump(
            obj=streamable_list(
                tqdm(
                    iterable=random_json(
                        n_lists=100,
                        n_dicts=100,
                        n_items_per_dict=100,
                    ),
                    total=100,
                )
            ),
            fp=f,
            indent=4,
        )

    print("Reading ...")
    with open("foo.json", "r") as f:
        foo = json_stream.load(f)
        for bars in foo:
            for bar in bars:
                for k, v in bar.items():
                    print(k, v)

    print("Done.")


def random_json(
    *,
    n_lists: int,
    n_dicts: int,
    n_items_per_dict: int,
):
    for _ in range(n_lists):
        yield streamable_list(
            random_dicts(n_dicts=n_dicts, n_items_per_dict=n_items_per_dict)
        )


def random_dicts(
    *,
    n_dicts: int,
    n_items_per_dict: int,
):
    for _ in range(n_dicts):
        yield streamable_dict(random_dict_items(n_items_per_dict=n_items_per_dict))


def random_dict_items(
    *,
    n_items_per_dict: int,
):
    for _ in range(n_items_per_dict):
        yield random_string(), random_value()


def random_value() -> int | float | str | bool | None:
    return choice(
        [
            random_int,
            random_float,
            random_string,
            random_bool,
            lambda: None,
        ]
    )()


def random_string() -> str:
    return "".join(choice(ascii_lowercase) for _ in range(10))


def random_int() -> int:
    return choice([-1, 0, 1]) * choice(range(100))


def random_float() -> float:
    return choice([-1, 1]) * choice(range(100)) / 10


def random_bool() -> bool:
    return choice([True, False])


if __name__ == "__main__":
    main()
That helps a lot, thank you very much!!
I ran your script on my local (Linux) machine several times without errors, so I tried to see if it's OS-dependent by running it in CI, and indeed, it only happens on Windows! No wonder I couldn't reproduce it before.
Now that that's clear, the remaining debugging and fix hopefully won't be that difficult, but it will take me some time of course.
All right, thanks to your script (and your script only, as will be explained below) I've managed to figure out the cause:
As discussed on StackOverflow and in an issue on Python's bug tracker, on Windows, seek/tell will in fact sometimes produce cursor positions that don't fit in an unsigned 64-bit integer.
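In other words, something along these lines can occasionally happen on Windows (a sketch of the quirk itself, not of json-stream-rs-tokenizer's code; on other platforms the position stays small):

with open("foo.json", "r") as f:    # text mode: tell() returns an opaque cookie
    f.readline()
    pos = f.tell()
    print(pos, pos.bit_length())    # on Windows the cookie can exceed 64 bits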
So I guess I'll have to change the current representation of an opaque seek position to contain an arbitrary-size integer instead of a u64, and add some tests, which can take some time again. Sorry about that.
In the meantime, you could try and see if the issue disappears with files opened in binary mode and use that as a workaround if it does. It should, because binary files have actual byte positions as cursor positions instead of opaque numbers, but I haven't tested it.
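A minimal sketch of that workaround (untested, as said; I'm assuming json-stream accepts binary file objects directly, and the file name is a placeholder):

import json_stream

with open("input.json", "rb") as f_in:  # binary mode: real byte offsets
    data = json_stream.load(f_in)
    for item in data:
        ...  # process as before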
Not that important, just to explain why it was hard to debug: As you can see from the example in the StackOverflow question, it will at some point spit out one of these huge numbers, but then go back to normal ones for the majority of the next few lines. Moreover, in my own experiments, it only (or at least preferentially) returned those large numbers when trying to obtain the cursor position at the end of a line. But json-stream-rs-tokenizer only obtains cursor positions when refilling its buffer, which has a fixed size, so the chance that this coincides not only with the end of a line but the end of a line at which this issue occurs is very small. Hence it's fairly hard to provoke this artificially and so far, only your randomizing approach has been able to do so reliably.
Wow, this is one of the weirdest Python language quirks I have seen. I'm glad they updated the docs, but it's still very cryptic! Thanks for your effort. I will try binary mode in the meantime.
All right, it should be fixed in 0.4.25.
It seems to work! :)