This repository contains a Django app that populates the peekbank database. A few notes:
- We use Django to enforce the database schema, build an object-relational model and then populate the database
- The workflows below push data to the peekbank_dev database, then copy it to named releases that otherwise remain unchanged. peekbank_dev itself may change at any time.
Follow these directions if you are setting up a new server (e.g., a new EC2 box) from scratch. Otherwise, jump down to "Update the Dev Database" for instructions on how to get into an existing installation and update the database with a newer version of the data. The setup instructions are also useful if you want to add peekbank to an existing server (e.g., an image with Shiny on it), since the MySQL requirements are quite simple by comparison.
- First, SSH onto the server.
- Make sure that you have MySQL server installed. On a Debian-based OS, this is most likely as simple as running
    sudo apt install mysql-server
  You will also need to set up user accounts in the database with appropriate privileges. The config.json file should give you some hints; modify according to your environment.
- Clone this repo into the user folder.
- Get the config.json file with the database credentials and place it in the root of this repo. This file holds the Django settings and passwords, which is why it is not part of the repo.
- Set up a virtual environment, by convention named peekbank-env:
    virtualenv peekbank-env -p python3
  Change python3 to a specific path if you want to use a specific Python installation.
- Activate the venv:
    source peekbank-env/bin/activate
  You should then see the venv name (peekbank-env) in your shell.
- Install the requirements into the venv:
    pip3 install -r requirements.txt
  If this step fails with an error about a missing mysql_config file, install the libmysqlclient-dev package with
    sudo apt install libmysqlclient-dev
  (or something similar depending on your OS).
This server should be ready to run MySQL, so try updating the dev database as below!
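Before moving on, it can help to confirm that the two things the setup steps depend on are in place: a readable config.json in the repo root and a mysql_config binary on the PATH. The following is a minimal sketch, not part of the repo; the function name `preflight` and its messages are hypothetical, and it only checks that config.json parses as JSON, not which keys it contains.

```python
import json
import shutil
from pathlib import Path


def preflight(repo_root, which=shutil.which):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []

    # config.json is expected in the root of the repo (see the setup steps).
    config_path = Path(repo_root) / "config.json"
    if not config_path.is_file():
        problems.append(f"missing {config_path}")
    else:
        try:
            json.loads(config_path.read_text())
        except json.JSONDecodeError as err:
            problems.append(f"{config_path} is not valid JSON: {err}")

    # The setup notes say `pip3 install -r requirements.txt` can fail when
    # mysql_config is missing; on Debian-like systems it comes with
    # libmysqlclient-dev.
    if which("mysql_config") is None:
        problems.append("mysql_config not found on PATH (try installing libmysqlclient-dev)")

    return problems


if __name__ == "__main__":
    for problem in preflight("."):
        print("WARNING:", problem)
```

Run it from the root of the repo; no output means neither check found a problem.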
When you want to update the database, you don't need to install anything new (i.e., MySQL or Python libraries). All you need to do is get on the existing machine, load the appropriate virtual environment so that the system can see the right Python libraries, download the fresh data, and then push that data to MySQL. In more detail:
- cd peekbank to enter the peekbank folder.
- Activate the virtual environment:
    source peekbank-env/bin/activate
- Download the most recent version of all of the files from OSF with the Django command:
    python3 manage.py download_osf --data_root ../peekbank_data_osf
- Right now there is a separate directory called peekbank_data_testing for adding a subset of datasets for testing (by copying them manually from peekbank_data_osf). The script in the next step looks at a folder called peekbank_data, which is symlinked either to the testing directory or to the OSF output directory (when you are ready to process all of the datasets). Change the symlink by deleting it with rm and then recreating it with ln -s <destination> <symlink name>, e.g.,
    ln -s peekbank_data_osf peekbank_data
  so that it runs all datasets, OR
    ln -s peekbank_data_testing peekbank_data
  so that it only looks at the test datasets.
- cd scripts and run ./new_dev_db.sh. This drops the existing database called peekbank_dev (if it exists) and creates a fresh one. It then invokes Django migrations to enforce the correct schema and runs the Django populate command on whatever is in the peekbank_data directory. Finally, it runs a special Django management command that adds another table with the run-length encoding. If a script complains about permissions, make sure it is executable with chmod +x [filename].
Unless this errors out, you should be able to see the new data in the peekbank_dev database when this process finishes.
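The symlink swap described in the steps above (rm followed by ln -s) can also be scripted. This is a minimal sketch under the assumption that you run it from the directory containing peekbank_data and both data directories; the helper name `point_data_symlink` is hypothetical and not part of the repo. It refuses to touch peekbank_data if it is a real directory rather than a symlink.

```python
import os


def point_data_symlink(target, link_name="peekbank_data"):
    """Make link_name a symlink to target, replacing an existing symlink."""
    if not os.path.isdir(target):
        raise FileNotFoundError(f"target directory does not exist: {target}")
    if os.path.islink(link_name):
        # Same effect as `rm peekbank_data`: removes the link, not the data.
        os.remove(link_name)
    elif os.path.exists(link_name):
        # Refuse to delete a real file or directory by accident.
        raise FileExistsError(f"{link_name} exists and is not a symlink")
    # Same effect as `ln -s <target> <link_name>`.
    os.symlink(target, link_name)
```

For example, `point_data_symlink("peekbank_data_testing")` selects the test subset, and `point_data_symlink("peekbank_data_osf")` switches back to the full set of datasets.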
If the contents of peekbank_dev look good when inspected with an SQL client (and, when we have them, pass tests), you can promote the dev database to a named production database with ./dev_to_prod.sh. Supply the new name to this script (e.g., ./dev_to_prod.sh 2021.1); otherwise it will overwrite the default database peekbank. Note that this will overwrite an existing database of the same name, so be careful.
The peekbank application uses a JSON-specified representation of the schema, in static/. This same schema is used in three places:
- by the peekds file readers, in order to parse and validate input files
- by models.py in Django to establish the data model (i.e., Django object relational model) and to define migrations
- by populate_peekbank2.py to populate the fields from CSVs output by peekds
The ingestion pipeline can also be used to check that data meets the requirements specified in the schema (without writing anything to the database) using the --validate_only flag. This is useful for checking the compliance of all datasets:
python3 -m pdb -c c manage.py populate_db --data_root /home/ubuntu/peekbank_data --validate_only
The ingestion pipeline can be run on a single dataset using the --dataset flag:
python3 -m pdb -c c manage.py populate_db --data_root /home/ubuntu/peekbank_data --dataset swingley_aslin_2002
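If you want to validate several datasets one at a time, the two commands above can be assembled programmatically. The sketch below rests on two assumptions that are not confirmed by this README: that --dataset and --validate_only can be combined in one call, and that populate_db exits with a non-zero status when validation fails. The helper names are hypothetical, and the pdb wrapper (-m pdb -c c) from the commands above is left out for brevity.

```python
import subprocess


def build_populate_command(data_root, dataset=None, validate_only=False):
    """Build the argument list for `manage.py populate_db`."""
    command = ["python3", "manage.py", "populate_db", "--data_root", data_root]
    if dataset is not None:
        command += ["--dataset", dataset]
    if validate_only:
        command.append("--validate_only")
    return command


def validate_datasets(data_root, datasets):
    """Validate each dataset separately; return the names whose run failed.

    Assumes a non-zero exit status signals a validation failure.
    Run this from the directory that contains manage.py.
    """
    failed = []
    for name in datasets:
        command = build_populate_command(data_root, name, validate_only=True)
        result = subprocess.run(command)
        if result.returncode != 0:
            failed.append(name)
    return failed
```

For example, `validate_datasets("/home/ubuntu/peekbank_data", ["swingley_aslin_2002"])` would run the validation for that one dataset and return an empty list if the command succeeded.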