
A preprocessor for the Stack Overflow dump that performs stemming, removes stop words, generates synonyms for tags, and extracts code blocks in the posts table.
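
To give a rough idea of what two of these steps do to a post body, here is a minimal sketch in Java. It is not the project's actual code: the class name PreprocessSketch, the tiny stop-word list, and the regex-based handling of <code> elements are all assumptions made for illustration. Stemming and tag-synonym generation are left out, since they depend on a stemmer and on data this README does not describe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not the project's actual implementation.
class PreprocessSketch {

    // Tiny, hypothetical stop-word list (assumption); the real tool uses its own list.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "the", "is", "to", "in", "of", "how", "do", "i"));

    // Matches the content of <code>...</code> elements, including multi-line content.
    static final Pattern CODE = Pattern.compile("<code>(.*?)</code>", Pattern.DOTALL);

    // Returns the contents of every <code> element found in an HTML body.
    static List<String> extractCode(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher m = CODE.matcher(html);
        while (m.find()) {
            blocks.add(m.group(1));
        }
        return blocks;
    }

    // Drops code elements and remaining HTML tags, lowercases, and removes stop words.
    static String removeStopWords(String html) {
        String text = CODE.matcher(html).replaceAll(" ");
        text = text.replaceAll("<[^>]+>", " ").toLowerCase(Locale.ROOT);
        List<String> kept = new ArrayList<>();
        for (String token : text.split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String body = "<p>How do I sort a list in Java?</p>"
                + "<pre><code>Collections.sort(list);</code></pre>";
        System.out.println(extractCode(body));     // code blocks found in the body
        System.out.println(removeStopWords(body)); // remaining text without stop words
    }
}
```

With the sample body above, the code column would hold the snippet inside <code>, and the text would be reduced to the words that are not in the stop-word list.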



  1. Java 1.8
  2. PostgreSQL 9.3. Configure your DB to accept local connections. An example pg_hba.conf configuration:
# TYPE  DATABASE        USER            ADDRESS                 METHOD
# "local" is for Unix domain socket connections only
local   all             all                                     md5
# IPv4 local connections:
host    all             all               md5
  3. PgAdmin (we used PgAdmin 3), but feel free to use any DB tool for PostgreSQL.

  4. Maven 3

Installing the app.

  1. Download the Stack Overflow (SO) dump of March 2017, which contains the original content from the official SO dump.

  2. In your DB tool, create a new database named stackoverflow2017. This is an example query:

CREATE DATABASE stackoverflow2017
  WITH OWNER = postgres
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_US.UTF-8'
       LC_CTYPE = 'en_US.UTF-8';
  3. Restore the downloaded dump to the created database.

Note: restoring this dump requires at least 100 GB of free space. If your operating system runs on a partition with insufficient free space, create a tablespace pointing to a larger partition and associate the database with it by replacing the "TABLESPACE" value with the new tablespace name: TABLESPACE = tablespacename.

  4. Check that the database is sound. Execute the following SQL command: select title,body,tags,tagssyn,code from posts where title is not null limit 10. The result should list the main fields for 10 posts. At this point, the column "body" should contain HTML tags such as <p>.

  5. Check that Maven is correctly installed. In a terminal, enter the command: mvn --version. This should print the Maven version.

Running the process

  1. Edit the application.properties file under src/main/resources/ and set your database password. The file comes with default values for performing stemming and removing stop words. You only need to fill in one variable: spring.datasource.password=YOUR_DB_PASSWORD. Change spring.datasource.username if your DB user is not postgres.

  2. In a terminal, go to the Project_folder and build the jar file with the Maven command: mvn package -Dmaven.test.skip=true. Check that preprocessor.jar was built under the target folder.

  3. Go to Project_folder/target and run the command to execute the jar: java -Xms1024M -Xmx20g -jar ./preprocessor.jar.
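
Before building, it can help to confirm that the datasource settings from step 1 are actually filled in. The following is a minimal sketch, not part of the project: the class name PropertiesCheckSketch and the method missingKeys are hypothetical, and it only checks the two keys mentioned above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Illustrative sketch only; not part of the project.
class PropertiesCheckSketch {

    // Returns the required keys that are missing, blank, or still set to the placeholder.
    static List<String> missingKeys(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> missing = new ArrayList<>();
        String[] required = {"spring.datasource.username", "spring.datasource.password"};
        for (String key : required) {
            String value = props.getProperty(key, "").trim();
            if (value.isEmpty() || value.equals("YOUR_DB_PASSWORD")) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Sample content standing in for src/main/resources/application.properties.
        String sample = "spring.datasource.username=postgres\n"
                + "spring.datasource.password=YOUR_DB_PASSWORD\n";
        System.out.println(missingKeys(sample)); // lists keys that still need a value
    }
}
```

To check the real file, read its content into a string and pass it to missingKeys; an empty list means both settings have a value.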


The logs are displayed in the terminal, but you can also check whether the process ended successfully by running the following SQL in your DB tool: select title,body,tags,tagssyn,code from posts where title is not null limit 10. The title and body fields should now be stemmed and have their stop words removed, and the tagssyn and code columns should be filled.



This project is licensed under the MIT License. See the LICENSE.md file for details.