A module for the Scaled Humanity team's data pipeline.
Questions? Slack i-gao.
Bang exports each batch as a long JSON file that is essentially an exact transcript of the Redux store. This data pipeline abstracts away the parsing of that JSON data.
Usage Instructions:
- clone this repo
- install the package
- import the classes you need, as shown below
```python
# Install bangdatapipeline & multibatch packages
!pip install ./bangdatapipeline

from bangdatapipeline import BangDataPipeline
from multibatch import Multibatch                 # imports default Multibatch
from multibatch.parallelworlds import Multibatch  # or whatever version you add
```
The BangDataPipeline class in the `bangdata.py` file handles JSON retrieval and shaping. The current structure assumes that each round is followed by a mid-survey with the following kinds of questions:
- Viability -- a set of multiple-choice questions whose numerical values we want to average.
- Fracture -- a pair of questions, one Yes/No and one short-response, that indicate whether a participant wants to keep their team and why.
If you want to add analysis of other types of questions, implement a new class in a new file that inherits from the base BangDataPipeline and overrides the relevant functions. Import it as `from bangdatapipeline.FILENAME import CLASSNAME`.
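As a rough sketch, such an extension might look like the following (the file name, class name, method, and SETTINGS keys here are hypothetical, and the survey-JSON access pattern is an assumption about the internals):

```python
# mood.py -- hypothetical extension of BangDataPipeline
from bangdatapipeline import BangDataPipeline

class MoodDataPipeline(BangDataPipeline):
    """Adds parsing for a made-up 'mood' question type."""

    def mood(self, survey_json):
        # Assumes the settings dict carries MOOD_START/MOOD_END indices and
        # that each mid-survey JSON holds an ordered list of answers under
        # "questions" -- adjust to the real JSON shape.
        start = self.settings["MOOD_START"]
        end = self.settings["MOOD_END"]
        return [int(q["value"]) for q in survey_json["questions"][start:end + 1]]
```

You would then import it as `from bangdatapipeline.mood import MoodDataPipeline`.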
Because Bang does not label questions in the JSON file, the BangDataPipeline class requires that you specify which 0-indexed positions these questions fall at. For example, if my survey has 16 questions, the first 14 of which are viability questions, I'd need to specify a SETTINGS object and initialize the BangDataPipeline with those settings. By default, BangDataPipeline is initialized with a simple setting of survey length = 0.
```python
SETTINGS = {
    "VIABILITY_START": 0,
    "VIABILITY_END": 13,
    "FRACTURE_INDEX": 14,
    "FRACTURE_WHY": 15,
    "LENGTH": 16
}

bdp = BangDataPipeline(TOKEN, SETTINGS)  # TOKEN is your Bang API token
```
BangDataPipeline includes a set of functions to analyze a batch. Given a batch ID, BangDataPipeline will fetch the batch's JSON and create tables representing viability scores, fracture results, etc. As an example, see the code block below:
```python
# programmatically select the 5 most recent batches
fetch = bdp.fetch()                       # returns a df
batches = fetch[:5]['batch_id'].tolist()  # select the 5 IDs you want

singleres = bdp.analyze(batches[0])  # analyze one ID
res = bdp.analyze_all(batches)       # loop and analyze all in the list
```
When BangDataPipeline analyzes a batch, it returns a BangDataResult object that acts as the viewer for that individual batch's results. You should never need to import this class directly, but you will need to interact with it.
Some useful fields & functions (here, `res` is an instance of BangDataResult):

Fields:
- `res.batch` = batch ID
- `res.users` = a list of user IDs in the batch
- `res.teams` = a list of team IDs
- `res.refPair1` = a tuple of the experimental rounds (`expRound1` in the JSON)
- `res.refPair2` = a tuple of the second experimental rounds (`expRound2` in the JSON)
- `res.labels` = what the two experimental rounds are called

Functions:
- `parse_chat()` = given a JSON snippet of a chat log, formats a nice table of the time-ordered chat log
- `combine()` = given two tables, zips each entry to form a table where cell (i, j) is (A[i,j], B[i,j])
- `json()` = the batch's raw JSON
- `raw_df()` = base df for analysis
- `team_df()` = df indexed by team with member IDs as entries
- `raw_chats()` = series of chats indexed by team
- `user_df()` = df indexed by user with mid-survey JSONs named by round
- `viability()` = user-indexed viability df
- `fracture()` = user-indexed fracture outcome df
- `manipulation()` = user-indexed manipulation check answers df
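For instance, inspecting a single batch's results might look like the sketch below (the field and function names come from the list above; the exact return shapes, and the assumption that the two dfs passed to `combine()` share a shape, are mine):

```python
res = bdp.analyze(batches[0])  # a BangDataResult

print(res.batch)  # batch ID
print(res.users)  # list of user IDs in the batch
print(res.teams)  # list of team IDs

viability = res.viability()  # user-indexed viability df
fracture = res.fracture()    # user-indexed fracture outcome df
chats = res.raw_chats()      # series of chats indexed by team

# zip two tables cell-by-cell, assuming they share a shape:
# cell (i, j) becomes (viability[i, j], fracture[i, j])
paired = res.combine(viability, fracture)
```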
Multibatch is an engine to aggregate multiple batch results into summary analyses at the study level. The base class is in `multibatch/base.py`; an example of a study-specific child implementation is in `multibatch/parallelworlds.py`.
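As a sketch of what a new study-specific child might look like (the class and method names below are hypothetical, and the aggregation shown is only illustrative; see `multibatch/parallelworlds.py` for a real example):

```python
# mystudy.py -- hypothetical Multibatch child, modeled on parallelworlds.py
import pandas as pd

from multibatch import Multibatch

class MyStudyMultibatch(Multibatch):
    """Aggregates per-batch viability scores into one study-level table."""

    def summarize_viability(self, results):
        # `results` is assumed to be a list of BangDataResult objects,
        # e.g. the output of BangDataPipeline.analyze_all()
        per_batch = {r.batch: r.viability().mean() for r in results}
        return pd.DataFrame(per_batch).T  # rows = batch IDs, columns = questions
```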