TEST: opening json file from big dump
evaristoc opened this issue · 0 comments
evaristoc commented
Python:
- the json library won't work directly on a file this size; a plain
json.load(...)
call fails.
The following code won't work:
import json

with open('output.json', 'r') as f_in:
    data = json.load(f_in)  # this will throw a MemoryError
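Reading the dump one record per line keeps only a single line in memory at a time, which is the idea behind the sampling script below. A minimal sketch, assuming each line holds one JSON value (newline-delimited JSON); the file name sample.ndjson and its contents are made up for illustration:

```python
import json

# Hypothetical miniature dump: one JSON value per line (NDJSON).
with open('sample.ndjson', 'w') as f:
    f.write('{"name": "a"}\n{"name": "b"}\n')

records = []
with open('sample.ndjson') as f:
    for line in f:  # only one line is held in memory at a time
        line = line.strip()
        if line:
            records.append(json.loads(line))

print(len(records))  # 2
```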
The code I normally use is the following; this version was written specifically to pull a random sample:
import json
import random

directory = "/bigdumpdata"

counter = -1
sample = {}
with open(directory + "/output.json", "r") as f_in:
    while True:
        record = f_in.readline()
        counter += 1
        if not record:
            break
        if len(record) > 3:
            try:
                # keep roughly 1% of the records
                if random.uniform(0, 1) <= .01:
                    # strip the trailing ",\n" left over from the big JSON array
                    recordjson = json.loads(record[:-2])
                    completed = sorted(
                        (entry["completedDate"], entry["name"])
                        for entry in recordjson
                        if "name" in entry and "completedDate" in entry
                    )
                    if not completed:
                        continue
                    sample[counter] = recordjson
            except ValueError:
                # skip lines that aren't valid JSON (e.g. the array brackets)
                continue

with open(directory + '/outputsample.json', 'w') as f_out:
    json.dump(sample, f_out)
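Sampling with random.uniform keeps about 1% of the records, but the final sample size is unpredictable. If a fixed-size sample is wanted, reservoir sampling draws exactly k records in a single pass over a stream of unknown length. A minimal sketch over a generic iterable (the function name reservoir_sample is my own, not from the script above):

```python
import random

def reservoir_sample(records, k):
    """Keep a uniform random sample of up to k items from a stream."""
    reservoir = []
    for i, item in enumerate(records):
        if i < k:
            reservoir.append(item)
        else:
            # replace an existing element with decreasing probability k/(i+1)
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sampled = reservoir_sample(range(10_000), 100)
print(len(sampled))  # 100
```

In the script above, the same one-pass idea would apply per line: feed each parsed record into the reservoir instead of keeping it with a fixed probability.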
R:
rjson
takes a long time to load and then throws an error, after the file is converted from 3.7 GB into a 5 GB one:
library('rjson')
json_data <- fromJSON(file='output.json')
#Error in paste(readLines(file, warn = FALSE), collapse = "") :
#result would exceed 2^31-1 bytes
jsonlite
also takes a long time to load (about 10 min) but does open the file, again after it is converted from 3.7 GB into a 5 GB one:
library(jsonlite)
json_data <- fromJSON("1_archive/output.json", flatten=TRUE)
# result is an R list