databricks/koalas

read_json Error

Maybewuss opened this issue · 6 comments

I am trying to read a JSON file in which each row is a JSON string.

import pandas as pd
import databricks.koalas as ks

pd.read_json(file_path, lines=True) # it works
ks.read_json(file_path) # koalas has no lines parameter and this fails

It raises this error:
TypeError: Unsupported type in conversion to Arrow: ArrayType(StructType(List(StructField(arguments,ArrayType(StructType(List(StructField(alias,ArrayType(StringType,true),true),StructField(argument,StringType,true),StructField(argument_start_index,LongType,true),StructField(role,StringType,true))),true),true),StructField(class,StringType,true),StructField(event_type,StringType,true),StructField(trigger,StringType,true),StructField(trigger_start_index,LongType,true))),true)

Thanks for the report, @Maybewuss!

Could you share a simple example of the JSON file?

{"text": "雀巢裁员4000人:时代抛弃你时,连招呼都不会打!", "id": "409389c96efe78d6af1c86e0450fd2d7", "event_list": [{"event_type": "组织关系-裁员", "trigger": "裁员", "trigger_start_index": 2, "arguments": [{"argument_start_index": 0, "role": "裁员方", "argument": "雀巢", "alias": []}, {"argument_start_index": 4, "role": "裁员人数", "argument": "4000人", "alias": []}], "class": "组织关系"}]}
{"text": "美国“未来为”子公司大幅度裁员,这是为什么呢?任正非正式回应", "id": "5aec2b5b759c5f8f42f9c0156eb3c924", "event_list": [{"event_type": "组织关系-裁员", "trigger": "裁员", "trigger_start_index": 13, "arguments": [{"argument_start_index": 0, "role": "裁员方", "argument": "美国“未来为”子公司", "alias": []}], "class": "组织关系"}]}

Here are two lines from the JSON file.

Thanks for sharing the example, @Maybewuss!

Koalas internally uses pyspark.sql.readwriter.DataFrameReader to read the JSON format, and that reader has no parameter such as lines.
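For reference, a minimal sketch of that code path (assuming an active SparkSession; the file name is a placeholder). PySpark's JSON reader treats each input line as one record by default, which is why pandas' lines=True has no direct counterpart to pass through:

>>> from pyspark.sql import SparkSession

>>> spark = SparkSession.builder.getOrCreate()
>>> sdf = spark.read.json("json_example.json")  # each line is parsed as one JSON record
>>> sdf.printSchema()  # the nested event_list schema shows up here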

Let me investigate if we can support it.

FYI: as a quick workaround for this specific case, you can manually create a PySpark DataFrame and convert it into a Koalas DataFrame for now, as below:

>>> import databricks.koalas as ks
>>> from databricks.koalas.utils import default_session

>>> sdf = default_session().read.load("json_example.json", format="json")
>>> sdf = sdf.select(sdf["event_list"].astype("string"), sdf["id"], sdf["text"])
>>> sdf.to_koalas(index_col="id")  # specify the column to use as the index if needed
                                                          event_list                                id                             text
0  [{[{[], 雀巢, 0, 裁员方}, {[], 4000人, 4, 裁员人数}], 组织关系, 组织关系-裁员, 裁员, 2}]  409389c96efe78d6af1c86e0450fd2d7        雀巢裁员4000人:时代抛弃你时,连招呼都不会打!
1           [{[{[], 美国“未来为”子公司, 0, 裁员方}], 组织关系, 组织关系-裁员, 裁员, 13}]  5aec2b5b759c5f8f42f9c0156eb3c924  美国“未来为”子公司大幅度裁员,这是为什么呢?任正非正式回应
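A note on the design choice above: astype("string") produces Spark's struct representation (as seen in the output), not valid JSON. If you need event_list to stay parseable, a variation using pyspark.sql.functions.to_json (my suggestion, not part of the original workaround) would keep it as a JSON string:

>>> from pyspark.sql import functions as F

>>> sdf = default_session().read.load("json_example.json", format="json")
>>> sdf = sdf.select(F.to_json(sdf["event_list"]).alias("event_list"), sdf["id"], sdf["text"])
>>> sdf.to_koalas(index_col="id")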

Thx~

Just FYI, @Maybewuss:
I just fixed my comments above, so you can simply use to_koalas() rather than the InternalFrame approach I mentioned before.

>>> import databricks.koalas as ks
>>> from databricks.koalas.utils import default_session

>>> sdf = default_session().read.load("json_example.json", format="json")
>>> sdf = sdf.select(sdf["event_list"].astype("string"), sdf["id"], sdf["text"])
>>> sdf.to_koalas(index_col="id")  # specify the column to use as the index if needed

This is a limitation in PySpark itself, so I think we should fix it in PySpark first. Once PySpark supports it, Koalas will support it too.
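For anyone hitting this with other nested schemas, a rough sketch for checking ahead of time which columns would trip the Arrow conversion. It relies on PySpark's internal to_arrow_type helper (in pyspark.sql.types on Spark 2.x and pyspark.sql.pandas.types on Spark 3.x; being internal, it may change), which raises the same TypeError shown above:

>>> from pyspark.sql.types import to_arrow_type  # pyspark.sql.pandas.types on Spark 3.x

>>> sdf = default_session().read.load("json_example.json", format="json")
>>> for field in sdf.schema:
...     try:
...         to_arrow_type(field.dataType)
...     except TypeError:
...         print(field.name, "is not Arrow-convertible; cast it to string first")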