A preprocessor for the Stack Overflow dump that performs stemming, removes stop words, generates synonyms for tags, and extracts code blocks in the posts table.
Required software:
- Java 1.8
- Postgres 9.3. Configure your DB to accept local connections. An example pg_hba.conf configuration:
...
# TYPE DATABASE USER ADDRESS METHOD
# "local" is for Unix domain socket connections only
local all all md5
# IPv4 local connections:
host all all 127.0.0.1/32 md5
...
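One way to confirm that the md5 rule for 127.0.0.1 works is a short JDBC connection attempt. This is only a sketch under assumptions not stated in this README: the PostgreSQL JDBC driver is on the classpath, the server listens on the default port 5432, the default `postgres` maintenance database exists, the user is `postgres`, and the password is passed as the first argument. The class name `ConnectionCheck` is hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical helper, not part of the project: tries one local connection.
final class ConnectionCheck {

    // Builds a PostgreSQL JDBC URL for the given host, port and database.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        // Assumed values: default port and default maintenance database.
        String url = jdbcUrl("127.0.0.1", 5432, "postgres");
        String user = "postgres";
        String password = args.length > 0 ? args[0] : "";

        try (Connection connection = DriverManager.getConnection(url, user, password)) {
            System.out.println("Connected to PostgreSQL "
                    + connection.getMetaData().getDatabaseProductVersion());
        } catch (SQLException e) {
            // Also reached when no JDBC driver is on the classpath.
            System.out.println("Connection failed: " + e.getMessage());
        }
    }
}
```

If the connection fails with an authentication error, the pg_hba.conf rules and the password are the first things to re-check.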
- Download the SO Dump of March 2017, which contains the original content downloaded from the SO Official Dump.
- In your DB tool, create a new database named stackoverflow2017. This is an example query:
CREATE DATABASE stackoverflow2017
WITH OWNER = postgres
ENCODING = 'UTF8'
TABLESPACE = pg_default
LC_COLLATE = 'en_US.UTF-8'
LC_CTYPE = 'en_US.UTF-8'
CONNECTION LIMIT = -1;
- Restore the downloaded dump to the created database.
Note: restoring this dump requires at least 100 GB of free space. If your operating system runs in a partition with insufficient free space, create a tablespace pointing to a larger partition and associate the database with it by replacing the "TABLESPACE" value with the new tablespace name: TABLESPACE = tablespacename.
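Before restoring, the free space of the target partition can be checked programmatically. The following is a minimal sketch, not part of the project: the class name `DiskSpaceCheck` is hypothetical, the path to check is passed as the first argument (defaulting to the current directory), and 1 GB is counted as 1024^3 bytes.

```java
import java.io.File;

// Hypothetical helper: checks whether a partition has enough usable space.
final class DiskSpaceCheck {

    // Binary gigabyte (1024^3 bytes); an assumption about how "GB" is counted.
    static final long BYTES_PER_GB = 1024L * 1024L * 1024L;

    // Returns true if the partition holding 'location' has at least 'requiredGb' usable.
    static boolean hasFreeSpace(File location, long requiredGb) {
        return location.getUsableSpace() >= requiredGb * BYTES_PER_GB;
    }

    public static void main(String[] args) {
        File location = new File(args.length > 0 ? args[0] : ".");
        long requiredGb = 100; // minimum stated in this README
        if (hasFreeSpace(location, requiredGb)) {
            System.out.println("Enough free space at " + location.getAbsolutePath());
        } else {
            System.out.println("Less than " + requiredGb + " GB free at "
                    + location.getAbsolutePath() + "; consider a tablespace on a larger partition.");
        }
    }
}
```

Run it against the directory that backs the tablespace you intend to use, since that is where the restored data will be written.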
- Verify that the database is sound. Execute the following SQL command:
select title,body,tags,tagssyn,code from posts where title is not null limit 10
The result should list the main fields for 10 posts. Here, the "body" column should contain HTML tags such as <p>.
- Verify that Maven is correctly installed. In a terminal, enter the command:
mvn --version
This should print the installed Maven version.
- Edit the application.properties file under src/main/resources/ and set your database password. The file comes with default values for performing stemming and removing stop words. You only need to set the following variable:
spring.datasource.password=YOUR_DB_PASSWORD
Change spring.datasource.username if your DB user is not postgres.
- In a terminal, go to Project_folder and build the jar file with the Maven command:
mvn package -Dmaven.test.skip=true
Verify that preprocessor.jar is built under the target folder.
- Go to Project_folder/target and run the following command to execute the jar:
java -Xms1024M -Xmx20g -jar ./preprocessor.jar
The logs are displayed in the terminal, but you can check whether the process ended successfully by running the following SQL in your DB tool:
select title,body,tags,tagssyn,code from posts where title is not null limit 10
Observe that the title and body fields are stemmed and have had the stop words removed, and that the tagssyn and code columns are filled.
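To give a rough idea of what a processed row looks like, here is a small illustrative sketch of the kind of transformation described above. It is not the project's implementation: the stop-word list is a tiny made-up sample, `naiveStem` is a crude suffix stripper standing in for a real stemmer, and it is an assumption that code blocks appear as `<pre><code>` elements and are dropped from the cleaned text. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a toy version of stop-word removal, stemming and code extraction.
final class PreprocessSketch {

    // Made-up sample; the real stop-word list is configured by the project.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "is", "in", "to", "how", "of"));

    static final Pattern CODE_BLOCK =
            Pattern.compile("<pre><code>(.*?)</code></pre>", Pattern.DOTALL);
    static final Pattern HTML_TAG = Pattern.compile("<[^>]+>");

    // Collects the contents of every <pre><code> block in the HTML body.
    static List<String> extractCode(String html) {
        List<String> blocks = new ArrayList<String>();
        Matcher matcher = CODE_BLOCK.matcher(html);
        while (matcher.find()) {
            blocks.add(matcher.group(1));
        }
        return blocks;
    }

    // Crude suffix stripping; a stand-in for a real stemming algorithm.
    static String naiveStem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Drops code blocks and HTML tags, lowercases, removes stop words, stems the rest.
    static String cleanText(String html) {
        String withoutCode = CODE_BLOCK.matcher(html).replaceAll(" ");
        String text = HTML_TAG.matcher(withoutCode).replaceAll(" ").toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        for (String token : text.split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(naiveStem(token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String body = "<p>How to sort the lists in Java?</p>"
                + "<pre><code>Collections.sort(list);</code></pre>";
        System.out.println(cleanText(body));   // cleaned body text
        System.out.println(extractCode(body)); // extracted code blocks
    }
}
```

With this sample input, the cleaned text keeps only the stemmed content words and the code is returned separately, which mirrors the split between the body and code columns in the query above.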
- Rodrigo Fernandes - Initial work - Muldon
- Carlos Eduardo - Carlos
- Klerisson Paixao - Klerisson
- Marcelo Maia - Marcelo
This project is licensed under the MIT License; see the LICENSE.md file for details.