modin-project/modin

Request to implement read_json function

modin-bot opened this issue · 4 comments

🤖 This is a bot message 🤖

feature_requests@modin.org has been sent an email requesting parallel implementation for read_json.

Note: Issues are created only once per method.

To add: although #715 adds read_json(), it is limited in two ways:

  1. It reads JSON files only, not strings (e.g. JSON fetched from a web source).
  2. Each row of the JSON file has to be a complete DataFrame row's worth of data (and all columns must be present in every row).
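
To make the second limitation concrete, here is a minimal standard-library sketch that checks whether a line-delimited JSON string has that shape: one complete JSON object per line, with the same keys on every line. The function name `has_uniform_rows` is hypothetical and this is not how Modin validates its input; it only illustrates the constraint as described above.

```python
import json


def has_uniform_rows(json_lines_text):
    """Return True if every non-blank line is a JSON object with the same keys.

    Hypothetical helper for illustration only; not part of Modin or pandas.
    """
    expected_keys = None
    for line in json_lines_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            return False  # the line is not a complete JSON value on its own
        if not isinstance(record, dict):
            return False  # a row must be an object mapping column -> value
        keys = set(record)
        if expected_keys is None:
            expected_keys = keys  # the first row defines the columns
        elif keys != expected_keys:
            return False  # a column is missing or extra in this row
    return expected_keys is not None


uniform = '{"name": "a", "age": 1}\n{"name": "b", "age": 2}\n'
ragged = '{"name": "a", "age": 1}\n{"name": "b"}\n'
pretty_printed = '[\n  {"name": "a"}\n]'
```

Under these assumptions, `uniform` passes the check, while `ragged` (a missing column) and `pretty_printed` (one JSON value spread over several lines) do not.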

I would like to see the ability to read JSON strings too, or at least some kind of workaround using BytesIO or similar. Currently (Python 3.7.7, modin 0.7.2, ray 0.8.0, pandas 1.0.1), even a simple BytesIO hack fails:

import modin.pandas as pd
from io import BytesIO
j = b"""
{"name": "hamx0r"}
"""
bio = BytesIO(j)
df = pd.read_json(bio, lines=True)
print(df.head())

...results in this error:

...
  File "...modin/engines/base/io/file_reader.py", line 15, in get_path
    if S3_ADDRESS_REGEX.search(file_path):
TypeError: expected string or bytes-like object
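
For anyone wondering why this raises: judging from the traceback, the path helper runs a regex search on whatever it is given, and `re` only accepts `str` or bytes-like objects, not file-like buffers. Below is a minimal standard-library sketch of that failure mode. The pattern and both function names are assumptions made for illustration; the real `S3_ADDRESS_REGEX` and `get_path` in Modin may look different.

```python
import re
from io import BytesIO

# Hypothetical stand-in for Modin's S3_ADDRESS_REGEX; the real pattern may differ.
S3_ADDRESS_REGEX = re.compile(r"s3://(.*?)/(.*)")


def looks_like_s3_path(file_path):
    """Mimic the failing check: search the regex without checking the type."""
    return S3_ADDRESS_REGEX.search(file_path) is not None


def safe_looks_like_s3_path(file_path):
    """Guarded variant: anything that is not a str cannot be an S3 path."""
    if not isinstance(file_path, str):
        return False
    return S3_ADDRESS_REGEX.search(file_path) is not None


buffer = BytesIO(b'{"name": "hamx0r"}\n')

try:
    looks_like_s3_path(buffer)
    raised_type_error = False
except TypeError:
    # re refuses file-like objects such as BytesIO
    raised_type_error = True
```

A type check like the one in `safe_looks_like_s3_path` is one possible way to let buffers through to the reader; whether that matches the eventual fix in Modin is not something this thread confirms.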

The only workaround I've found is to load with regular pandas, then convert the dataframe to Modin:

import pandas
import modin.pandas as pd
j = """[{"name": "hamx0r"}]"""
df = pandas.read_json(j)
df = pd.DataFrame(df)
print(df.head())

Thanks @hamx0r, this is a bug. Would you be willing to open a bug report for the issue you described so we do not lose track of it? New features and bugs are tracked differently and have different development timeframes. We can fix this much sooner than we can implement all of the read_json functionality.

Done! I wrote up #1379. I'm just getting into modin and appreciate all the hard work!

I think read_json now defaults to pandas, but it is not implemented in parallel. Defaulting to the pandas implementation should mean that the bugs in the original post are fixed now.