/edx2bigquery

Tool to convert & load data from edX platform into BigQuery

Primary LanguagePythonGNU General Public License v2.0GPL-2.0

edx2bigquery

edx2bigquery is a tool for importing edX SQL and log data into Google BigQuery for research and analysis.

To get started, install the Google Cloud SDK first:

https://cloud.google.com/sdk/

Then generate authentication tokens using "gcloud auth login". Make sure that "bq" and "gsutil" work properly.

To install:

python setup.py develop

Also, setup the file edx2bigquery_config.py in your current working directory. Example:

#-----------------------------------------------------------------------------
#
# sample edx2bigquery_config.py file
#
course_id_list = [
        "MITx/2.03x/3T2013",
]

courses = {
    'year2': course_id_list,
    'all_harvardx': [
        "HarvardX/AI12.1x/2013_SOND",
    ],
}

# google cloud project access
auth_key_file = "USE_GCLOUD_AUTH"
auth_service_acct = None

# google bigquery config
PROJECT_ID = "x-data"

# google cloud storage
GS_BUCKET = "gs://x-data"

# local file configuration
COURSE_SQL_BASE_DIR = "X-Year-1-data-sql"
COURSE_SQL_DATE_DIR = '2013-09-08'
TRACKING_LOGS_DIRECTORY = "TRACKING_LOGS"
ORG = 'X-university'

#-----------------------------------------------------------------------------

To run:

edx2bigquery

Command help message:

usage: edx2bigquery [-h] [--course-base-dir COURSE_BASE_DIR]
                    [--course-date-dir COURSE_DATE_DIR]
                    [--start-date START_DATE] [--end-date END_DATE]
                    [--tlfn TLFN] [-v] [--year2] [--clist CLIST]
                    [--force-recompute] [--dataset-latest] [--skip-geoip]
                    [--skip-if-exists] [--nskip NSKIP] [--logs-dir LOGS_DIR]
                    [--dbname DBNAME] [--table TABLE] [--org ORG]
                    [--collection COLLECTION]
                    [--output-project-id OUTPUT_PROJECT_ID]
                    [--output-dataset-id OUTPUT_DATASET_ID]
                    [--output-bucket OUTPUT_BUCKET]
                    command [courses [courses ...]]

usage: %prog [command] [options] [arguments]

Examples of common commands:

edx2bigquery --clist=all_mitx logs2gs 
edx2bigquery setup_sql MITx/24.00x/2013_SOND
edx2bigquery --tlfn=DAILY/mitx-edx-events-2014-10-14.log.gz  --year2 daily_logs
edx2bigquery --year2 person_course
edx2bigquery --year2 report
edx2bigquery --year2 combinepc
edx2bigquery --year2 --output-bucket="gs://harvardx-data" --nskip=2 --output-project-id='harvardx-data' combinepc >& LOG.combinepc

Examples of not-so common commands:

edx2bigquery person_day MITx/2.03x/3T2013 >& LOG.person_day
edx2bigquery --force-recompute person_course --year2 >& LOG.person_course
edx2bigquery testbq
edx2bigquery make_uic --year2
edx2bigquery logs2bq MITx/24.00x/2013_SOND
edx2bigquery person_course MITx/24.00x/2013_SOND >& LOG.person_course
edx2bigquery split DAILY/mitx-edx-events-2014-10-14.log.gz 

positional arguments:
  command               A variety of commands are available, each with different arguments:
                        
                        --- TOP LEVEL COMMANDS
                        
                        setup_sql <course_id> ...   : Do all commands (make_uic, sql2bq, load_forum) to get edX SQL data into the right format, upload to
                                                      google storage, and import into BigQuery.  See more information about each of those commands, below.
                                                      This step is idempotent - it can be re-run multiple times, and the result should not change.
                                                      Returns when all uploads and imports are completed.
                        
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                                                      Accepts the --dataset-latest flag, to use the latest directory date in the SQL data directory.
                                                      Directories should be named YYYY-MM-DD.  When this flag is used, the course SQL dataset name has
                                                      "_latest" appended to it.
                        
                        daily_logs --tlfn=<path>    : Do all commands (split, logs2gs, logs2bq) to get one day's edX tracking logs into google storage 
                                   <course_id>        and import into BigQuery.  See more information about each of those commands, below.
                                   ...                This step is idempotent - it can be re-run multiple times, and the result should not change.
                                                      Returns when all uploads and imports are completed.
                        
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        doall <course_id> ...       : run setup_sql, analyze_problems, logs2gs, logs2bq, axis2bq, person_day, enrollment_day,
                                                      and person_course, for each of the specified courses.  This is idempotent, and can be run
                                                      weekly when new SQL dumps come in.
                        
                        nightly <course_id> ...     : Run sequence of commands for common nightly update (based on having new tracking logs available).
                                                      This includes logs2gs, logs2bq, person_day, enrollment_day
                        
                        --- SQL DATA RELATED COMMANDS
                        
                        waldofy <sql_data_dir>      : Apply HarvardX Jim Waldo conventions to SQL data as received from edX, which renames files to be
                                <course_id> ...       more user friendly (e.g. courseware_studentmodule -> studentmodule) and converts the tab-separated
                                                      values form (*.sql) to comma-separated values (*.csv).  Also compresses the resulting csv files.
                                                      Does this only for the specified course's, because the edX SQL dump may contain a bunch of
                                                      uknown courses, or scratch courses from the edge site, which should not be co-mingled with
                                                      course data from the main production site.  
                                                      
                                                      It is assumed that <sql_data_dir> has a name which contains a date, e.g. xorg-2014-05-11 ;
                                                      the resulting data are put into the course SQL base directory, into a subdirectory with
                                                      name given by the course_id and date, YYYY-MM-DD.
                        
                                                      The SQL files from edX must already be decrypted (not *.gpg), before running this command.
                        
                        make_uic <course_id> ...    : make the "user_info_combo" file for the specified course_id, from edX's SQL dumps, and upload to google storage.
                                                      Does not import into BigQuery.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        sql2bq <course_id> ...      : load specified course_id SQL files into google storage, and import the user_info_combo and studentmodule
                                                      data into BigQuery.
                                                      Also upload course_image.jpg images from the course SQL directories (if image exists) to
                                                      google storage, and make them public-read.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        load_forum <course_id> ...  : Rephrase the forum.mongo data from the edX SQL dump, to fit the schema used for forum
                                                      data in the course BigQuery tables.  Saves this to google storage, and imports into BigQuery.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        makegeoip                   : Creates table of geoip information for IP addresses in person_course table missing country codes.
                                                      Accepts the --table argument to specify the person_course table to use.
                                                      Alternatively, provide the --org argument to specify the course_report_ORG dataset to look
                                                      in for the latest person_course dataset.
                        
                        tsv2csv                     : filter, which takes lines of tab separated values and outputs lines of comma separated values.
                                                      Useful when processing the *.sql files from edX dumps.
                        
                        analyze_problems <c_id> ... : Analyze capa problem data in studentmodule table, generating the problem_analysis table as a result.  
                                                      Uploads the result to google cloud storage and to BigQuery.
                                                      This table is necessary for the insights dashboard.
                        
                        staff2bq <staff.csv>        : load staff.csv file into BigQuery; put it in the "courses" dataset.
                        
                        --- TRACKING LOG DATA RELATED COMMANDS
                        
                        split <daily_log_file> ...  : split single-day tracking log files (should be named something like mitx-edx-events-2014-10-17.log.gz),
                                                      which have aleady been decrypted, into DIR/<course>/tracklog-YYYY-MM-DD.json.gz for each course_id.
                                                      The course_id is determined by parsing each line of the tracking log.  Each line is also
                                                      rephrased such that it is consistent with the tracking log schema defined for import
                                                      into BigQuery.  For example, "event" is turned into a string, and "event_struct" is created
                                                      as a parsed JSON dict for certain event_type values.  Also, key names cannot contain
                                                      dashes or periods.  Uses --logs-dir option, or, failing that, TRACKING_LOGS_DIRECTORY in the
                                                      edx2bigquery_config file.  Employs DIR/META/* files to keep track of which log files have been
                                                      split and rephrased, such that this command's actions are idempotent.
                        
                        logs2gs <course_id> ...     : transfer compressed daily tracking log files for the specified course_id's to Google cloud storage.
                                                      Does NOT import the log data into BigQuery.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        logs2bq <course_id> ...     : import daily tracking log files for the specified course_id's to BigQuery, from Google cloud storage.
                                                      The import jobs are queued; this does not wait for the jobs to complete,
                                                      before exiting.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        mongo2gs <course_id> ...    : extract tracking logs from mongodb (using mongoexport) for the specified course_id and upload to google storage.
                                                      uses the --start-date and --end-date options.  Skips dates for which the correspnding file in google storage
                                                      already exists.  
                                                      Rephrases log file entries to be consistent with the schema used for tracking log file data in BigQuery.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        --- COURSE CONTENT DATA RELATED COMMANDS
                        
                        axis2bq <course_id> ...     : construct "course axis" table, upload to gs, and generate table in BigQuery dataset for the
                                                      specified course_id's.  
                                                      Accepts the "--clist" flag, to process specified list of courses in the config file's "courses" dict.
                        
                        make_cinfo listings.csv     : make the courses.listings table, which contains a listing of all the courses with metadata.
                                                      The listings.csv file should contain the columns Institution, Semester, New or Rerun, Course Number,
                                                      Short Title, Title, Instructors, Registration Open, Course Launch, Course Wrap, course_id
                        
                        --- REPORTING COMMANDS
                        
                        person_day <course_id> ...  : Compute the person_course_day (pcday) for the specified course_id's, based on 
                                                      processing the course's daily tracking log table data.
                                                      The compute (query) jobs are queued; this does not wait for the jobs to complete,
                                                      before exiting.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        enrollment_day <c_id> ...   : Compute the enrollment_day (enrollday2_*) tables for the specified course_id's, based on 
                                                      processing the course's daily tracking log table data.
                                                      The compute (query) jobs are queued; this does not wait for the jobs to complete,
                                                      before exiting.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        person_course <course_id> ..: Compute the person-course table for the specified course_id's.
                                                      Needs person_day tables to be created first.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                                                      Accepts the --force-recompute flag, to force recomputation of all pc_* tables in BigQuery.
                                                      Accepts the --skip-if-exists flag, to skip computation of the table already exists in the course's dataset.
                        
                        report <course_id> ...      : Compute overall statistics, across all specified course_id's, based on the person_course tables.
                                                      Accepts the --nskip=XXX optional argument to determine how many report processing steps to skip.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        combinepc <course_id> ...   : Combine individual person_course tables from the specified course_id's, uploads CSV to
                                                      google storage.
                                                      Does NOT import the data into BigQuery.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        --- TESTING & DEBUGGING COMMANDS
                        
                        rephrase_logs               : process input tracking log lines one at a time from standard input, and rephrase them to fit the
                                                      schema used for tracking log file data in BigQuery.  Used for testing.
                        
                        testbq                      : test authentication to BigQuery, by listing accessible datasets.
                        
                        get_tables <dataset>        : dump information about the tables in the specified BigQuery dataset.
                        
                        get_table_info <dataset> <table_id>   : dump meta-data information about the specified dataset.table_id from BigQuery.
                        
                        delete_empty_tables <course_id> ...   : delete empty tables form the tracking logs dataset for the specified course_id's, from BigQuery.
                                                                Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
  courses               courses or course directories, depending on the command

optional arguments:
  -h, --help            show this help message and exit
  --course-base-dir COURSE_BASE_DIR
                        base directory where course SQL is stored, e.g. 'HarvardX-Year-2-data-sql'
  --course-date-dir COURSE_DATE_DIR
                        date sub directory where course SQL is stored, e.g. '2014-09-21'
  --start-date START_DATE
                        start date for person-course dataset generated, e.g. '2012-09-01'
  --end-date END_DATE   end date for person-course dataset generated, e.g. '2014-09-21'
  --tlfn TLFN           path to daily tracking log file to import, e.g. 'DAILY/mitx-edx-events-2014-10-14.log.gz'
  -v, --verbose         increase output verbosity
  --year2               increase output verbosity
  --clist CLIST         specify name of list of courses to iterate command over
  --force-recompute     force recomputation
  --dataset-latest      use the *_latest SQL dataset
  --skip-geoip          skip geoip processing in person_course
  --skip-if-exists      skip processing in person_course if table already exists
  --nskip NSKIP         number of steps to skip
  --logs-dir LOGS_DIR   directory to output split tracking logs into
  --dbname DBNAME       mongodb db name to use for mongo2gs
  --table TABLE         bigquery table to use, specified as dataset_id.table_id
  --org ORG             organization ID to use
  --collection COLLECTION
                        mongodb collection name to use for mongo2gs
  --output-project-id OUTPUT_PROJECT_ID
                        project-id where the report output should go (used by the report and combinepc commands)
  --output-dataset-id OUTPUT_DATASET_ID
                        dataset-id where the report output should go (used by the report and combinepc commands)
  --output-bucket OUTPUT_BUCKET
                        gs bucket where the report output should go, e.g. gs://x-data (used by the report and combinepc commands)