modin-project/modin

pd.read_json() does not support JSON string input

hamx0r opened this issue · 1 comment

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacOS 10.15.4
  • Modin version (modin.__version__): 0.7.2
  • Python version: 3.7.7
  • Pandas version: 1.0.1
  • Ray version: 0.8.0
  • Code we can use to reproduce:
import modin.pandas as pd
print(pd.read_json("""[{"name": "hamx0r"}]"""))

Related to #554

Describe the problem

Even though #715 adds read_json(), it is limited to:

  1. Reading JSON files, not JSON strings (i.e., JSON fetched from a web source).
  2. JSON files in which each line holds one complete DataFrame row, with every column present in every row.

I would like to see the ability to read JSON strings too, or at least some kind of workaround using BytesIO or similar. Currently, Modin fails even with a simple BytesIO hack:

import modin.pandas as pd
from io import BytesIO
j = b"""[
{"name": "hamx0r"}
]"""
bio = BytesIO(j)
df = pd.read_json(bio, lines=True)
print(df.head())

...results in this error:

...

File "...modin/engines/base/io/file_reader.py", line 15, in get_path
if S3_ADDRESS_REGEX.search(file_path):
TypeError: expected string or bytes-like object
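
The TypeError itself can be reproduced without Modin. This is a minimal sketch, assuming get_path hands the buffer straight to a compiled regex; S3_LIKE_REGEX is a hypothetical stand-in, since the actual S3_ADDRESS_REGEX pattern is not shown in this report:

```python
import re
from io import BytesIO

# Hypothetical stand-in for Modin's S3_ADDRESS_REGEX (real pattern not shown here).
S3_LIKE_REGEX = re.compile(r"s3://(.*?)/(.*)")

bio = BytesIO(b'[{"name": "hamx0r"}]')

try:
    # A compiled regex accepts only str or bytes-like input, not a file-like object.
    S3_LIKE_REGEX.search(bio)
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```

If that assumption holds, any file-like object passed as path_or_buf would fail at this check before its contents are ever read.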

The only workaround I've found is to load with regular pandas, then convert the dataframe to Modin:

import pandas
import modin.pandas as pd
j = """[{"name": "hamx0r"}]"""
df = pandas.read_json(j)
df = pd.DataFrame(df)
print(df.head())
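
The same workaround can be wrapped in a small helper. This is only a sketch: read_json_text and its frame_factory parameter are hypothetical names, not part of Modin or pandas. Wrapping the text in StringIO hands pandas a file-like object rather than a raw string:

```python
import io

import pandas


def read_json_text(text, frame_factory=None, **kwargs):
    """Parse a JSON string with plain pandas, then optionally convert the result.

    Hypothetical helper: pass frame_factory=modin.pandas.DataFrame to get a
    Modin DataFrame back, mirroring the manual workaround.
    """
    # StringIO makes the text file-like, so pandas never treats it as a path.
    frame = pandas.read_json(io.StringIO(text), **kwargs)
    return frame if frame_factory is None else frame_factory(frame)


frame = read_json_text('[{"name": "hamx0r"}]')
print(frame.head())
```

With Modin installed, read_json_text(j, frame_factory=pd.DataFrame), where pd is modin.pandas, should return a Modin DataFrame. The parse itself still runs in plain pandas, so it is not parallelized.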

Source code / logs

This traceback comes from running the "Code we can use to reproduce" above:

Traceback (most recent call last):
  File "code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 2, in <module>
  File "/modin/pandas/io.py", line 143, in read_json
    return DataFrame(query_compiler=BaseFactory.read_json(**kwargs))
  File "/modin/data_management/factories.py", line 60, in read_json
    return cls._determine_engine()._read_json(**kwargs)
  File "/modin/data_management/factories.py", line 64, in _read_json
    return cls.io_cls.read_json(**kwargs)
  File "/modin/engines/base/io/text/json_reader.py", line 13, in read
    return cls.single_worker_read(path_or_buf, **kwargs)
  File "/modin/backends/pandas/parsers.py", line 57, in single_worker_read
    pandas_frame = cls.parse(fname, **kwargs)
  File "/modin/backends/pandas/parsers.py", line 117, in parse
    return pandas.read_json(fname, **kwargs)
  File "/pandas/util/_decorators.py", line 186, in wrapper
    return func(*args, **kwargs)
  File "/pandas/io/json/_json.py", line 608, in read_json
    result = json_reader.read()
  File "/pandas/io/json/_json.py", line 731, in read
    obj = self._get_object_parser(self.data)
  File "/pandas/io/json/_json.py", line 753, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/pandas/io/json/_json.py", line 857, in parse
    self._parse_no_numpy()
  File "/pandas/io/json/_json.py", line 1089, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None
ValueError: Expected object or value

Thanks @hamx0r for creating the issue! This is failing when trying to default to the pandas implementation, and the error earlier in your comment happens when we try to open the file.

We will do our best to get this out in the next release, which is planned for next week.