read_json Error
Maybewuss opened this issue · 6 comments
I am trying to read a JSON file in which each row is a JSON string.
import pandas as pd
import databricks.koalas as ks
pd.read_json(file_path, lines=True)  # works
ks.read_json(file_path)  # fails: Koalas read_json has no `lines` parameter
It raises this error:
TypeError: Unsupported type in conversion to Arrow: ArrayType(StructType(List(StructField(arguments,ArrayType(StructType(List(StructField(alias,ArrayType(StringType,true),true),StructField(argument,StringType,true),StructField(argument_start_index,LongType,true),StructField(role,StringType,true))),true),true),StructField(class,StringType,true),StructField(event_type,StringType,true),StructField(trigger,StringType,true),StructField(trigger_start_index,LongType,true))),true)
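For small files, a possible stopgap is to parse with pandas first and convert (a sketch, not a Koalas API; it assumes the data fits in driver memory, and it serializes the nested column up front since list-of-struct data can hit the same Arrow conversion limit):
import json

import pandas as pd
import databricks.koalas as ks

pdf = pd.read_json(file_path, lines=True)
# Serialize the nested list-of-dict column back to JSON strings so the
# pandas -> Koalas (Arrow) conversion only sees plain string data.
pdf["event_list"] = pdf["event_list"].apply(json.dumps)
kdf = ks.from_pandas(pdf)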
Thanks for the report, @Maybewuss
Could you show a simple example of the JSON file?
{"text": "雀巢裁员4000人:时代抛弃你时,连招呼都不会打!", "id": "409389c96efe78d6af1c86e0450fd2d7", "event_list": [{"event_type": "组织关系-裁员", "trigger": "裁员", "trigger_start_index": 2, "arguments": [{"argument_start_index": 0, "role": "裁员方", "argument": "雀巢", "alias": []}, {"argument_start_index": 4, "role": "裁员人数", "argument": "4000人", "alias": []}], "class": "组织关系"}]}
{"text": "美国“未来为”子公司大幅度裁员,这是为什么呢?任正非正式回应", "id": "5aec2b5b759c5f8f42f9c0156eb3c924", "event_list": [{"event_type": "组织关系-裁员", "trigger": "裁员", "trigger_start_index": 13, "arguments": [{"argument_start_index": 0, "role": "裁员方", "argument": "美国“未来为”子公司", "alias": []}], "class": "组织关系"}]}
Here are two lines from the JSON file.
Thanks for sharing the example, @Maybewuss !
Koalas internally uses pyspark.sql.readwriter.DataFrameReader for reading the JSON format, which has no parameter such as lines.
Let me investigate if we can support it.
FYI: As a quick workaround for this case, you can manually create a PySpark DataFrame and convert it into a Koalas DataFrame for now, as below:
>>> import databricks.koalas as ks
>>> from databricks.koalas.utils import default_session
>>> sdf = default_session().read.load("json_example.json", format="json")
>>> sdf = sdf.select(sdf["event_list"].astype("string"), sdf["id"], sdf["text"])
>>> sdf.to_koalas(index_col="id")  # specify the column to use as the index if needed
event_list id text
0 [{[{[], 雀巢, 0, 裁员方}, {[], 4000人, 4, 裁员人数}], 组织关系, 组织关系-裁员, 裁员, 2}] 409389c96efe78d6af1c86e0450fd2d7 雀巢裁员4000人:时代抛弃你时,连招呼都不会打!
1 [{[{[], 美国“未来为”子公司, 0, 裁员方}], 组织关系, 组织关系-裁员, 裁员, 13}] 5aec2b5b759c5f8f42f9c0156eb3c924 美国“未来为”子公司大幅度裁员,这是为什么呢?任正非正式回应
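One caveat on the workaround above: astype("string") yields Spark's struct rendering (the bracketed form in the output), not valid JSON. If you need the column back as parseable JSON strings, pyspark.sql.functions.to_json should do it (same sketch, one changed line):
>>> from pyspark.sql import functions as F
>>> sdf = default_session().read.load("json_example.json", format="json")
>>> sdf = sdf.select(F.to_json(sdf["event_list"]).alias("event_list"), sdf["id"], sdf["text"])
>>> sdf.to_koalas(index_col="id")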
Thx~
Just FYI, @Maybewuss: I just fixed my comment above, so you can simply use to_koalas() rather than the InternalFrame approach I mentioned before.
>>> sdf = default_session().read.load("json_example.json", format="json")
>>> sdf = sdf.select(sdf["event_list"].astype("string"), sdf["id"], sdf["text"])
>>> sdf.to_koalas(index_col="id")  # specify the column to use as the index if needed
This is a limitation from PySpark. I think we should fix it in PySpark first. If they support it, Koalas will support it too.
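For reference, the same Arrow limitation reproduces in plain PySpark, a minimal sketch (assuming Spark 2.x: the config key was later renamed to spark.sql.execution.arrow.pyspark.enabled, and whether this raises or silently falls back to the non-Arrow path depends on spark.sql.execution.arrow.fallback.enabled):
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>>> sdf = spark.read.json("json_example.json")  # Spark reads one JSON object per line by default
>>> sdf.toPandas()  # the nested array-of-struct column hits the same
...                 # "Unsupported type in conversion to Arrow" error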