simple-rolap: Simple relational online analytical processing

This collection of scripts allows maintainable and time-efficient relational online analytical processing through the specification of small, modular SQL queries. It is mainly suited for querying a rarely modified data set, such as that of GHTorrent. The result of each SQL query is saved in corresponding table, which can then be used by subsequent queries. As a result, each query can be independently developed and tested. Automatic dependency analysis of the queries ensures that new queries can use already calculated results and that every time a query is changed all tables that depend on it (and only those) are automatically repopulated.

When using the MySQL engine, the table population timestamps are automatically propagated from the database to the files that track the table creation times in the tables folder. This allows multiple users to work together with separate instances of a simple-rolap repository, without requiring the recalculation of expensive tables. When the database's information schema is locked (e.g. due to the creation of new indices) this part of the dependency generation step can block until the (potentially long) operation finishes. To avoid this, set the SKIP_TIMESTAMPING environment variable, for example by running make SKIP_TIMESTAMPING=1.

The provided functionality is mainly useful in cases where materialized views are unsupported or unusable. The scripts are written for MySQL and SQLite, but they should be easy to port to other relational database systems.

For complex relational OLAP queries, simple-rolap can be combined particularly effectively with RDBUnit. You can find a complete tutorial on using simple-rolap with RDBUnit for mining Git repositories in a technical briefing presented at the 2017 International Conference on Software Engineering. You can cite this work as follows.

Georgios Gousios and Diomidis Spinellis. Mining software engineering data from GitHub. In Proceedings of the 39th International Conference on Software Engineering Companion, ICSE-C '17, pages 501–502, Piscataway, NJ, USA, May 2017. IEEE Press. Technical Briefing. doi:10.1109/ICSE-C.2017.164

Global installation

The simple-rolap scripts are used by including the provided Makefile. Consequently, all that is needed is to provide the repository in suitably accessible location. Here is an example.

cd /usr/local/lib
sudo git clone --depth=1 https://github.com/dspinellis/simple-rolap.git

Alternatively, you can perform a local install by adding two lines in the project's Makefile. (See below.)

The system where simple-rolap is run must have and installation of the database being used, GNU make, GNU sed, and (if you want to visualize the associated dependency graphs) GraphViz.

Use

To start using simple-rolap, create a Makefile that a) specifies the project's configuration by defining a few variables, and b) includes the simple-rolap Makefile. Consider the following self-explanatory example.

# Database engine to use (One of sqlite or mysql)
export RDBMS?=sqlite
# The (default) database containing the data you want to query
export MAINDB?=rxjs-ghtorrent
# The (explicitly specified) database that will contain your queries' results
export ROLAPDB?=stratsel

include /usr/local/src/simple-rolap/Makefile

To work with a local install of simple-rolap change the Makefile's last line into the following.

include simple-rolap/Makefile

simple-rolap/Makefile:
        git clone https://github.com/dspinellis/simple-rolap

If some actions need to be performed before running the queries, you can specify them in a variable named DEPENDENCIES and then add corresponding rules. Example:

export DEPENDENCIES=rxjs-ghtorrent.db

rxjs-ghtorrent.db:
	wget https://github.com/ghtorrent/tutorial/raw/master/rxjs-ghtorrent.db

Then comes the specification of SQL statements that will analyze the data. The simple-rolap system supports two types of modules: those that create tables and those that run queries to create a report. Below is a table creation module.

-- Projects that have been forked

create table stratsel.forked_projects as
  select distinct forked_from as id from projects
  where forked_from is not null;

The SQL statement creates a table in the result database by using tables from the data and the result database. The module must reside in a file named after the table it creates, with the suffix .sql, e.g. forked_projects.sql in the preceding example.

By typing make the module will be run, if and only if its results have never been produced, or if they are older than the result tables on which they depend. This is accomplished by creating a timestamped file for each module execution in a directory named tables.

Each module can contain database engine-specific extensions and also SQL statements that create indices, as shown in the following example.

-- Projects in our candidate set that have their URL blacklisted

create table leadership.blacklisted_projects ENGINE=MyISAM as
  select projects.id
  from leadership.blacklisted_urls
  left join projects
  on blacklisted_urls.url = projects.url;

alter table leadership.blacklisted_projects add index(id);

A query module is a simple SQL SELECT query, such as the following.

-- URLs of popular projects
select projects.id, concat('https://github.com/', substr(url, 30)) as url
from stratsel.popular_projects
left join projects
on projects.id = popular_projects.id;

When make is run on such a module, the results of the query will appear in the directory reports in a file named after the module's name with the suffix .sql replaced by .txt.

To authenticate yourself with the main database, setup suitable environment variables (e.g. PGPASSWORD) or files (e.g. .my.cnf or .pgpass).

Goodies

Here are some more things that the provided Makefile allows you to do.

Run make graph.png or make graph.pdf to generate a diagram of the queries' dependencies.
Run make test to run any RDBUnit unit tests you may have written.
Run make clean to remove all auto-generated files, so that you can start a new analysis from scratch.
Run make tags to create a tags file that many editors can use to automatically navigate between the queries.
Run make sorted-dependencies to create a file named sorted-dependencies with the queries listed in the order determined by their dependencies.

gousiosg/simple-rolap

simple-rolap: Simple relational online analytical processing

Global installation

Use

Goodies

See also