SPSM Database API
Command-line tool for downloading data from the SPSM project's PostgreSQL database server.
It takes advantage of two different Database Management Systems (DBMSs) for two different types of tasks: PostgreSQL for securely storing and serving the project's data, and DuckDB for analyzing it locally.
Install the tool
- Create a virtual Python environment with version 3.11 of Python.
- Activate the environment.
- With the environment activated, install this tool using the following command:
pip install git+https://github.com/medialab/spsm-database-api.git
- Test your installation with the following command:
$ spsm --help
Download data
For downloading data from the project's secure PostgreSQL server, you'll need two things:
- A terminal running in the background, in which the remote PostgreSQL server's port is being forwarded to port 54321 on your computer.
- A user profile on the server that has been granted permission to select from tables.
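The first requirement can typically be met with an SSH tunnel. The command below is an illustrative sketch, not the project's documented procedure: the hostname, the remote PostgreSQL port (5432), and the username are assumptions that depend on the server's actual configuration.

```
ssh -N -L 54321:localhost:5432 YOUR.USERNAME@SERVER_HOSTNAME
```

The -L flag forwards local port 54321 to port 5432 as seen from the server; -N keeps the tunnel open without running a remote command.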
Start downloading with spsm download SUBCOMMAND. The terminal will then prompt you to enter the information necessary for all data downloads. If you don't want to be prompted, you can enter the information directly as options after spsm download.
$ spsm download --username 'YOUR.USERNAME' --password YOUR-PASSWORD --download-directory /PATH/TO/DOWNLOAD/DIRECTORY SUBCOMMAND
Download an entire table
To download an entire table onto your computer and/or disk, run the following command:
$ spsm download table
As described above, the terminal will prompt you for the information required by all types of data download, then for the name of the table you want to download. If you don't want to be prompted, you can enter the information with options.
$ spsm download --username 'YOUR.USERNAME' --password YOUR-PASSWORD --download-directory /PATH/TO/DOWNLOAD/DIRECTORY table --table TABLE_NAME
Download select columns from a table
To download only certain columns of a table, which can be a good idea if the table has large text columns and the full file would otherwise be very big, run the following command:
$ spsm download columns
First, as in the spsm download table command, the terminal will prompt you to enter the name of the table you want to download. Then, you'll be asked to enter the columns you want to download.
Local DuckDB database (for analysis)
DuckDB is another Database Management System (DBMS), different from PostgreSQL. Whereas PostgreSQL is great for reliably storing data and managing access, DuckDB excels at efficiently executing analytical queries, and is considerably faster at them than many other DBMSs, including PostgreSQL, as the benchmark below illustrates.
TPC-H benchmark 1
| query | duckdb (sec) | postgres (sec) |
|---|---|---|
| 1 | 0.03 | 1.12 |
| 2 | 0.01 | 0.18 |
| 3 | 0.02 | 0.21 |
| 4 | 0.03 | 0.11 |
| 5 | 0.02 | 0.13 |
| 6 | 0.01 | 0.21 |
| 7 | 0.04 | 0.20 |
| 8 | 0.02 | 0.18 |
| 9 | 0.05 | 0.61 |
| 10 | 0.04 | 0.35 |
| 11 | 0.01 | 0.07 |
| 12 | 0.01 | 0.36 |
| 13 | 0.04 | 0.32 |
| 14 | 0.01 | 0.21 |
| 15 | 0.03 | 0.46 |
| 16 | 0.03 | 0.12 |
| 17 | 0.05 | > 60.00 |
| 18 | 0.08 | 1.05 |
| 19 | 0.03 | 0.31 |
| 20 | 0.05 | > 60.00 |
| 21 | 0.09 | 0.35 |
| 22 | 0.03 | 0.15 |
Furthermore, contrary to PostgreSQL, DuckDB is what's called an "in-process" DBMS and does not require a server. Everything is immediately accessible in a single file. (An example file name, if you want to save a database with the tables you create or import from CSV files, is spsm.db.)
For these reasons, DuckDB is a great solution for working locally with tables you have previously downloaded from the remote database, and for joining them with CSV files on your computer. The only drawback is ensuring the tables you're working with locally are up to date with what's on the PostgreSQL server, but that's less of an issue in the SPSM project.
Execute SQL file
To execute a query saved in an SQL file, call the query subcommand.
$ spsm duckdb query
First, the terminal will ask if you want to save the database to a file, which will preserve data imports for later use. If you do not want to create a database file, DuckDB will process everything in-memory.
Then, the terminal will prompt you to provide the paths to (a) the SQL file in which your query is saved and (b) the CSV file in which you want to export the result.
Once the program reads the provided SQL query, it will search for the referenced tables (relations) in the DuckDB database. If it doesn't find them, you will be prompted to provide the paths to CSV files that contain this information.
If you don't want to be prompted, you can provide the relations' names and their data files directly as options after the command.
$ spsm duckdb --database ":memory:" query --query query.sql --outfile output.csv --table claims downloads/claims.csv --table dataset_de_facto downloads/dataset_de_facto.csv
In the example below, you see that (1) we are working with a new, in-memory DuckDB database, which has no prior tables, and (2) the provided SQL query requires the tables claims and dataset_de_facto. Using the commands above (i.e. spsm download table), we have these referenced tables saved in a folder named ./downloads. The example shows how, having parsed the query in the provided SQL file (./query.sql), the program sequentially asks us to provide these files and then proceeds to create tables in the in-memory database containing their rows and columns. Finally, after preparing the database, the program executes the query and writes the result to the given out-file (./output.csv).
Launch Jupyter notebook
To experiment with SQL queries, launch a Jupyter notebook with spsm duckdb notebook.
$ spsm duckdb notebook
To give the notebook access to relevant files, including CSV files and/or a DuckDB file, you should provide the path to a directory. If you don't provide a path, the notebook's workspace will be launched from wherever you run the command. Finally, if the directory from which the notebook is launched doesn't already have a Jupyter notebook file, one will be created using the same database configuration provided with all spsm duckdb commands, as described above. If the directory already has a Jupyter notebook but you want to create a new one anyway, provide a path with the --new option.
$ spsm duckdb --database './spsm.db' notebook --new 'Notebook.ipynb'
The above command creates a notebook in the current working directory, meaning the location from which the command is run, and it relies on an existing DuckDB database file (spsm.db) in the same directory.
./
|__ Notebook.ipynb
|__ spsm.db
|__ downloads/
    |__ claims.csv
    |__ selection_tweet.csv
    |__ tweet_claim.csv
With the database configuration declared in the parent spsm duckdb command, the subcommand spsm duckdb notebook launches a Jupyter notebook in which code blocks establishing a DuckDB connection are already written.