Refactoring idea - data extraction library for FB json files
epogrebnyak opened this issue · 9 comments
Maybe it is worthwhile to separate the data extraction and visualisation functionality? The data extraction utility would accept the working directory with data and produce serialised data for friends, likes, etc. This part can be covered by unit tests.
The visualisation part can then work on the results of data extraction. It would also be useful to expose the data extraction functions to the user, so one can construct one's own visualisations.
Something like below:

```python
import json

import pandas as pd


def read_json(filename: str):
    with open(filename) as f:
        return json.load(f)


def get_timestamp(x: int):
    return pd.Timestamp(x, unit="s")


def decode(s: str):
    return s.encode('latin-1').decode("utf-8")


def get_friends_df(filename: str, key: str):
    df = pd.DataFrame(read_json(filename)[key])
    df['name'] = df['name'].map(decode)
    df['timestamp'] = df['timestamp'].map(get_timestamp)
    return df


friends_df = get_friends_df("friends.json", "friends")
```
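One subtlety worth noting: Facebook's JSON exports store UTF-8 byte sequences as if they were latin-1 text, which is why decode above round-trips through latin-1. A quick illustration (the name is a made-up example):

```python
def decode(s: str) -> str:
    # Re-encode the mojibake string back to its raw bytes via latin-1,
    # then decode those bytes as the UTF-8 they actually are.
    return s.encode("latin-1").decode("utf-8")


# "José" comes out of the export garbled as "JosÃ©":
print(decode("JosÃ©"))  # José
```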
Got some progress here, maybe add more functionality?
https://github.com/epogrebnyak/facebook-json-to-csv/blob/master/friends.py
Yes, adding more functionality would be good.
In fviz everything is currently plugged into classes, which makes it hard to reuse. I think data acquisition should be separate from analysis (like here). Also, Comment and Post look better as structures with final data, not holders of raw information.
The script I wrote starts at providing a folder and ends with providing the clean data, with no intent for plotting.
There are good bits in your code, but they look hidden in classes and are hard to reuse.
In my opinion, putting functions under a certain class helps keep the namespace clean, though of course it makes them harder to find. But they are placed under separate hoods. For reusability, I think I could improve the API docs instead. What do you think?
@itzmeanjan depends on your approach - myself I find it cleaner to follow some kind of a pipeline with functions; it is often more testable too (easier to inject test parameters) and has less duplicate code. The parser part is: folder -> functions to find the file and originate a stream of values from JSON (getter) -> saving the values. To extract something from a JSON file one needs just about the following:
```python
address_book = Getter(
    name="address_book",
    path=["about_you", "your_address_books.json"],
    unpack=lambda xs: xs["address_book"]["address_book"],
    elem=lambda x: (decode(x["name"]), extract_address_book_details(x)),
    columns=["name", "contact"],
)
```
Then you can construct the filename from the folder and address_book.path, and apply [elem(x) for x in unpack(read_json(filename))]. This saves you from creating two extra modules and two new classes for each piece of information (friends, messages, posts, comments, etc.).
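To make the idea concrete, here is a minimal sketch of what such a Getter could look like as a plain dataclass. The field names match the snippet above; the elem lambda and the sample data are simplified, made-up stand-ins (the real extract_address_book_details helper is not shown here):

```python
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Callable, List


def decode(s: str) -> str:
    # Facebook exports store UTF-8 bytes as if they were latin-1; reverse that.
    return s.encode("latin-1").decode("utf-8")


@dataclass
class Getter:
    name: str
    path: List[str]       # path components below the export folder
    unpack: Callable      # raw JSON dict -> list of raw entries
    elem: Callable        # raw entry -> clean tuple matching `columns`
    columns: List[str]

    def filename(self, folder: str) -> str:
        # Construct the JSON file path from the export folder and self.path.
        return os.path.join(folder, *self.path)

    def rows(self, folder: str) -> list:
        # Read the JSON file and map raw entries to clean tuples.
        with open(self.filename(folder)) as f:
            data = json.load(f)
        return [self.elem(x) for x in self.unpack(data)]


address_book = Getter(
    name="address_book",
    path=["about_you", "your_address_books.json"],
    unpack=lambda xs: xs["address_book"]["address_book"],
    elem=lambda x: (decode(x["name"]), x["details"]),  # simplified elem
    columns=["name", "contact"],
)

# Demo on a made-up export folder:
folder = tempfile.mkdtemp()
os.makedirs(os.path.join(folder, "about_you"))
sample = {"address_book": {"address_book": [
    {"name": "JosÃ©", "details": "jose@example.com"},
]}}
with open(address_book.filename(folder), "w") as f:
    json.dump(sample, f)

rows = address_book.rows(folder)
print(rows)  # [('José', 'jose@example.com')]
```

With this shape, each new piece of information (friends, messages, posts, ...) is just one more Getter instance rather than a new module and class, and the pure functions in unpack and elem stay easy to unit-test in isolation.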