Refactoring idea - data extraction library for FB json files
epogrebnyak opened this issue · 9 comments
Maybe it is worthwhile to separate the data extraction and visualisation functionality? The data extraction utility would accept the working directory with data and produce serialised data for friends, likes, etc. This part can be covered by unit tests.
The visualisation part can then work on the results of data extraction. It would also be useful to expose the data extraction functions to the user, so one can construct one's own visualisations.
Something like below:

```python
import json

import pandas as pd


def read_json(filename: str):
    with open(filename) as f:
        return json.load(f)


def get_timestamp(x: int):
    return pd.Timestamp(x, unit="s")


def decode(s: str):
    return s.encode('latin-1').decode("utf-8")


def get_friends_df(filename: str, key: str):
    df = pd.DataFrame(read_json(filename)[key])
    df['name'] = df['name'].map(decode)
    df['timestamp'] = df['timestamp'].map(get_timestamp)
    return df


friends_df = get_friends_df("friends.json", "friends")
```
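One subtlety worth noting: Facebook's JSON exports store UTF-8 byte sequences as if they were latin-1 text, which is why decode above round-trips through latin-1. A quick illustration (the name is a made-up example):

```python
def decode(s: str) -> str:
    # Re-encode the mojibake string back to its raw bytes via latin-1,
    # then decode those bytes as the UTF-8 they actually are.
    return s.encode("latin-1").decode("utf-8")


# "José" comes out of the export garbled as "JosÃ©":
print(decode("JosÃ©"))  # José
```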
Got some progress here, maybe add more functionality?
https://github.com/epogrebnyak/facebook-json-to-csv/blob/master/friends.py
Yes, adding more functionality would be good.
In fviz everything is currently plugged into classes, which makes it hard to reuse. I think data acquisition should be separate from analysis (like here). Also, Comment and Post look better as structures with final data, not holders of raw information.
The script I wrote starts at providing a folder and ends with providing the clean data, with no intent for plotting.
There are good bits in your code, but they look hidden in classes and are hard to reuse.
In my opinion, putting functions under a certain class helps keep the namespace clean, though of course it makes them harder to find. But they are placed under separate hoods. For reusability, I think I could improve the API docs instead. What do you think?
@itzmeanjan depends on your approach - myself I find it cleaner to follow some kind of a pipeline with functions; it is often more testable too (easier to inject test parameters) and has less duplicate code. The parser part is: folder -> functions to find the file and originate a stream of values from JSON (getter) -> saving the values. To extract something from a JSON file one needs just about the following:
```python
address_book = Getter(
    name="address_book",
    path=["about_you", "your_address_books.json"],
    unpack=lambda xs: xs["address_book"]["address_book"],
    elem=lambda x: (decode(x["name"]), extract_address_book_details(x)),
    columns=["name", "contact"],
)
```
Then you can construct the filename from the folder and address_book.path, and apply [elem(x) for x in unpack(read_json(filename))]. This saves you from creating two extra modules and two new classes for each piece of information (friends, messages, posts, comments, etc.).
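To make the idea concrete, here is a minimal sketch of what such a Getter could look like as a plain dataclass. The field names match the snippet above; the elem lambda and the sample data are simplified, made-up stand-ins (the real extract_address_book_details helper is not shown here):

```python
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Callable, List


def decode(s: str) -> str:
    # Facebook exports store UTF-8 bytes as if they were latin-1; reverse that.
    return s.encode("latin-1").decode("utf-8")


@dataclass
class Getter:
    name: str
    path: List[str]       # path components below the export folder
    unpack: Callable      # raw JSON dict -> list of raw entries
    elem: Callable        # raw entry -> clean tuple matching `columns`
    columns: List[str]

    def filename(self, folder: str) -> str:
        # Construct the JSON file path from the export folder and self.path.
        return os.path.join(folder, *self.path)

    def rows(self, folder: str) -> list:
        # Read the JSON file and map raw entries to clean tuples.
        with open(self.filename(folder)) as f:
            data = json.load(f)
        return [self.elem(x) for x in self.unpack(data)]


address_book = Getter(
    name="address_book",
    path=["about_you", "your_address_books.json"],
    unpack=lambda xs: xs["address_book"]["address_book"],
    elem=lambda x: (decode(x["name"]), x["details"]),  # simplified elem
    columns=["name", "contact"],
)

# Demo on a made-up export folder:
folder = tempfile.mkdtemp()
os.makedirs(os.path.join(folder, "about_you"))
sample = {"address_book": {"address_book": [
    {"name": "JosÃ©", "details": "jose@example.com"},
]}}
with open(address_book.filename(folder), "w") as f:
    json.dump(sample, f)

rows = address_book.rows(folder)
print(rows)  # [('José', 'jose@example.com')]
```

With this shape, each new piece of information (friends, messages, posts, ...) is just one more Getter instance rather than a new module and class, and the pure functions in unpack and elem stay easy to unit-test in isolation.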