freeCodeCamp/open-data

TEST: opening json file from big dump

evaristoc opened this issue · 0 comments

Python:

  • The json library won't work directly on the file: json.load(...) reads the whole file into memory before parsing it.
    The following code won't work:
import json

with open('output.json', 'r') as f_in:
    data = json.load(f_in)

# this raises a MemoryError
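
For reference, a minimal sketch of reading such a file one line at a time, so that only a single record is held in memory at once. It assumes the dump stores one JSON record per line inside a top-level array, with each record line ending in a comma; `iter_records` is a hypothetical helper name, not part of the json library.

```python
import json


def iter_records(path):
    """Yield (line_number, parsed_record) pairs from a line-per-record dump.

    Assumption: each record sits on its own line, optionally followed by
    a trailing comma; lines holding only "[" or "]" carry no record.
    """
    with open(path, "r") as f_in:
        for line_number, line in enumerate(f_in):
            text = line.strip().rstrip(",")
            if text in ("", "[", "]"):
                continue
            try:
                yield line_number, json.loads(text)
            except ValueError:
                # The line is not valid JSON on its own; skip it.
                continue
```

Because this is a generator, the caller decides what to keep (a count, a filter, a sample) without the full dump ever being loaded.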

The code I normally use is the following; this version was written specifically to draw a random sample of about 1% of the records:

import json
import random


directory = "/bigdumpdata"

counter = -1
sample = {}
with open(directory + "/output.json", "r") as f_in:
    # Iterate line by line so the whole file is never held in memory.
    for record in f_in:
        counter += 1
        # Skip lines too short to hold a record.
        if len(record) <= 3:
            continue
        # Keep roughly 1% of the lines.
        if random.uniform(0, 1) > .01:
            continue
        try:
            # Drop the last two characters (trailing separator and newline) before parsing.
            recordjson = json.loads(record[:-2])
        except ValueError:
            # The line is not valid JSON on its own; skip it.
            continue
        # Keep the record only if at least one entry has both a name and a completedDate.
        if not any("name" in rec and "completedDate" in rec for rec in recordjson):
            continue
        sample[counter] = recordjson

with open(directory + "/outputsample.json", "w") as f_out:
    json.dump(sample, f_out)
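
The 1% filter above yields a sample whose size depends on the length of the file. If a fixed number of records is wanted instead, reservoir sampling can do it in a single pass without knowing the line count in advance. This is a different technique from the one used above, shown only as a minimal sketch; `reservoir_sample`, `k` and `seed` are hypothetical names.

```python
import random


def reservoir_sample(lines, k, seed=None):
    """Return up to k (index, item) pairs chosen uniformly from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append((i, line))
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = (i, line)
    return reservoir
```

Passing the open file object as `lines` keeps memory use bounded by `k` lines; each kept line can then be parsed with json.loads as in the code above.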

R:

  • rjson takes a long time to load and then throws an error, after the file is converted from 3.7GB into a 5GB one:
library('rjson')
json_data <- fromJSON(file='output.json')
#Error in paste(readLines(file, warn = FALSE), collapse = "") : 
#result would exceed 2^31-1 bytes
  • jsonlite takes a long time to load (about 10 min) but does open the file, after it is converted from 3.7GB into a 5GB one:
library(jsonlite)
json_data <- fromJSON("1_archive/output.json", flatten=TRUE)
#the result is an R list