A preprocessor for the Stack Overflow dump that performs stemming, removes stop words, generates synonyms for tags, and extracts code blocks in the posts table.
- Java 1.8
- Postgres 9.3. Configure your DB to accept local connections. An example of pg_hba.conf configuration:
# "local" is for Unix domain socket connections only
local all all md5
# IPv4 local connections:
host all all 127.0.0.1/32 md5
- Download the Stack Overflow dump of March 2017, which contains the original content taken from the official Stack Overflow dump.
- In your DB tool, create a new database named stackoverflow2017. An example query:
CREATE DATABASE stackoverflow2017
WITH OWNER = postgres
TABLESPACE = pg_default
LC_CTYPE = 'en_US.UTF-8'
- Restore the downloaded dump to the created database.
Note: restoring this dump requires at least 100 GB of free space. If your operating system runs on a partition with insufficient free space, create a tablespace pointing to a larger partition and associate the database with it by replacing the "TABLESPACE" value with the new tablespace name: TABLESPACE = tablespacename
- Verify that the database is sound. Execute the following SQL command:
select title,body,tags,tagssyn,code from posts where title is not null limit 10
The result should list the main fields for 10 posts. At this point, the "body" column should still contain HTML tags such as <p>.
- Verify that Maven is correctly installed. In a terminal, run:
mvn --version
This should print the installed Maven version.
- Edit the application.properties file under src/main/resources/ and set your database password parameter. The file comes with default values for performing stemming and removing stop words, so the password is the only value you need to fill in. Change spring.datasource.username if your DB user is not postgres.
- In a terminal, go to Project_folder and build the jar file with the Maven command:
mvn package -Dmaven.test.skip=true
Check that preprocessor.jar was built under the target folder.
- Go to Project_folder/target and run the following command to execute the jar:
java -Xms1024M -Xmx20g -jar ./preprocessor.jar
The logs are displayed in the terminal, but you can also check whether the process ended successfully by running the following SQL in your DB tool:
select title,body,tags,tagssyn,code from posts where title is not null limit 10
Observe that the title and body fields are now stemmed and have had their stop words removed, and that the tagssyn and code columns are filled.
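To make the expected result more concrete, here is a minimal Java sketch of two of the transformations described above: pulling code blocks out of a post body and removing stop words from the remaining text. It is not the project's implementation. The class name PreprocessSketch, the tiny stop-word list, and the regular expressions are assumptions made for illustration. It also assumes code in post bodies is wrapped in <pre><code> tags. Stemming, tag synonyms, and HTML entity decoding are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PreprocessSketch {

    // Tiny illustrative stop-word list (an assumption, not the project's list).
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "the", "is", "to", "in", "of", "how", "i", "do"));

    // Captures the contents of <pre><code>...</code></pre> blocks, across lines.
    private static final Pattern CODE_BLOCK = Pattern.compile(
            "<pre[^>]*>\\s*<code[^>]*>(.*?)</code>\\s*</pre>", Pattern.DOTALL);

    // Matches any remaining HTML tag such as <p> or </p>.
    private static final Pattern HTML_TAG = Pattern.compile("<[^>]+>");

    /** Returns the trimmed text of every code block found in an HTML post body. */
    public static List<String> extractCode(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher m = CODE_BLOCK.matcher(html);
        while (m.find()) {
            blocks.add(m.group(1).trim());
        }
        return blocks;
    }

    /** Drops code blocks and HTML tags, lowercases the text, and removes stop words. */
    public static String cleanText(String html) {
        String withoutCode = CODE_BLOCK.matcher(html).replaceAll(" ");
        String plain = HTML_TAG.matcher(withoutCode).replaceAll(" ");
        StringBuilder out = new StringBuilder();
        for (String token : plain.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String body = "<p>How do I sort a list in Java?</p>"
                + "<pre><code>Collections.sort(list);</code></pre>";
        System.out.println(extractCode(body)); // prints [Collections.sort(list);]
        System.out.println(cleanText(body));   // prints sort list java
    }
}
```

In this sketch the code block is what would end up in a column like code, while the cleaned text corresponds to what you should see in body after the preprocessor runs, apart from stemming.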
- Rodrigo Fernandes - Initial work - Muldon
- Carlos Eduardo - Carlos
- Klerisson Paixao - Klerisson
- Marcelo Maia - Marcelo
This project is licensed under the MIT License; see the LICENSE.md file for details.