Stack Mining

Stack Overflow mining scripts used for the following paper, presented at the 19th International Conference on Mining Software Repositories (MSR '22):

Mining the Usage of Reactive Programming APIs: A Mining Study on GitHub and Stack Overflow.

Complementary scripts, also utilized during the paper production, are available in:

Data

The folders under /assets hold the data either generated by or collected for the scripts' execution. The table below gives a brief description of each folder:

Folder	Description
data explorer	Contains posts collected from Stack Exchange Data Explorer
extracted-posts	Includes JSON files with the posts related to the most relevant topics (RQ3)
lda-results	Contains the results of the last LDA execution
operators-search	Includes the results for the operator search for Rx libraries
operators	Includes JSON files consisting of Rx libraries' operators
result-processing	Contains data presented in the Result section (RQ2)

The file stopwords.txt contains a list of stop words used during preprocessing.

LDA results

The results for the last LDA (Latent Dirichlet Allocation) are available under /assets/2022-01-12 02-21-28/. As detailed in the paper, the execution with the following settings generated the most coherent results:

Parameter	Value
Topics	23
HyperParameters	α=β=0.01
Iterations	1,000

Each result comprises three CSV files that follow the file name patterns below:

[file name of the posts file]_doctopicdist_[#topics]_[analyzed post field].csv - contains the posts' ids and their distribution of topics+proportion, including the dominant topic and its proportion in a separate column for easy retrieval;
[file name of the posts file]_topicdist_[#topics]_[analyzed post field].csv - contains the topic distribution along with each topic's words+proportion, sorted by word proportion in descending order;
[file name of the posts file]_topicdist_[#topics]_[analyzed post field] - topwords.csv - (extra) the same as the above one but presenting the topics only with their top words (set in config) to facilitate the open card sorting technique.

Where:

[file name of the posts file]: is a file under assets/data explorer/consolidated sources and set through config;
[#topics]: number of topics for that specific execution;
[analyzed post field]: either Title or Body (see Configuration).
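As an illustration of the naming pattern above, the following Go sketch builds the three file names for one execution. The function name resultFileNames and its parameters are hypothetical and are not taken from the repository; only the pattern itself comes from the list above.

```go
package main

import "fmt"

// resultFileNames builds the three CSV names for one LDA execution.
// The function and its parameters are hypothetical; only the naming
// pattern comes from this README.
func resultFileNames(postsFile string, topics int, field string) [3]string {
	base := fmt.Sprintf("%s_topicdist_%d_%s", postsFile, topics, field)
	return [3]string{
		fmt.Sprintf("%s_doctopicdist_%d_%s.csv", postsFile, topics, field),
		base + ".csv",
		base + " - topwords.csv",
	}
}

func main() {
	// Example values: default posts file, 23 topics, Body field.
	for _, name := range resultFileNames("all_withAnswers", 23, "Body") {
		fmt.Println(name)
	}
}
```

With these example values, the sketch prints one name per result file, which can help when locating the output of a given execution by hand.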

Execution

Requirements

Most of the scripts are written in Go and were executed with the following version:

Go v1.17.5

Before running the Go scripts, run the following command in a terminal at the root of the project to download the dependencies:

go mod tidy

Scripts

The Go scripts are available under the /cmd folder.

consolidate-sources

Script to unify all the CSV files acquired from Stack Exchange Data Explorer.

go run cmd/consolidate-sources/main.go

:floppy_disk: After execution, the result is available at assets/data explorer/consolidated sources/.

extract-posts

Script to extract the posts of a given topic.

go run cmd/extract-posts/main.go

:floppy_disk: After execution, the result is available at assets/extracted-posts.

lda

Script to execute the LDA algorithm.

go run cmd/lda/main.go

:floppy_disk: After execution, the result is available at assets/lda-results.

open-sort

Script to randomly select posts by topic to facilitate the open sort (topic labeling) execution.

go run cmd/open-sort/main.go

:floppy_disk: After execution, the result is available at assets/opensort.

operators-search

Script to search for operators among the Stack Overflow posts.

go run cmd/operators-search/main.go

:floppy_disk: After execution, the result is available at assets/operators-search.

process-results

Script to process the results and generate information about the topics and their popularity and difficulty.

go run cmd/process-results/main.go

:floppy_disk: After execution, the result is available at assets/result-processing.

Configuration

The LDA script requires some configuration in a JSON file (config.json) under the /configs folder. This JSON file must contain an array of objects, each one representing an LDA execution. Each object must have the following structure (this is the object present by default in config.json):

{
    "fileName": "all_withAnswers",
    "field": "Body",
    "combineTitleBody": true,
    "minTopics": 10,
    "maxTopics": 35,
    "sampleWords": 20
}

Where:

fileName(string): the name of the file with the posts(at assets/data explorer/consolidated sources);
field(string): the field to be considered in LDA (either Title or Body);
combineTitleBody(boolean): set it to combine title and body and assign the result to the post's Body field (only applicable if field is set to "Body");
minTopics(integer): the minimum number of topics to be generated;
maxTopics(integer): the maximum number of topics to be generated;
sampleWords(integer): the amount of sample top words to be included in an extra file with file name ending with - topwords.
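The configuration described above could be loaded in Go roughly as follows. This is a minimal sketch: the LDAConfig struct, the parseConfigs function, and the sanity checks are assumptions made for illustration and may differ from what cmd/lda actually does; only the JSON keys and their types come from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LDAConfig mirrors one object of the config.json array.
// The struct name is hypothetical; the JSON keys come from this README.
type LDAConfig struct {
	FileName         string `json:"fileName"`
	Field            string `json:"field"`
	CombineTitleBody bool   `json:"combineTitleBody"`
	MinTopics        int    `json:"minTopics"`
	MaxTopics        int    `json:"maxTopics"`
	SampleWords      int    `json:"sampleWords"`
}

// parseConfigs decodes the array and applies basic sanity checks.
// The checks are an assumption, not documented behavior of the LDA script.
func parseConfigs(raw []byte) ([]LDAConfig, error) {
	var configs []LDAConfig
	if err := json.Unmarshal(raw, &configs); err != nil {
		return nil, err
	}
	for i, c := range configs {
		if c.Field != "Title" && c.Field != "Body" {
			return nil, fmt.Errorf("config %d: field must be Title or Body, got %q", i, c.Field)
		}
		if c.MinTopics < 1 || c.MinTopics > c.MaxTopics {
			return nil, fmt.Errorf("config %d: invalid topic range %d..%d", i, c.MinTopics, c.MaxTopics)
		}
	}
	return configs, nil
}

func main() {
	raw := []byte(`[{"fileName": "all_withAnswers", "field": "Body",
		"combineTitleBody": true, "minTopics": 10, "maxTopics": 35, "sampleWords": 20}]`)
	configs, err := parseConfigs(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", configs[0])
}
```

Decoding into a slice matches the statement that the file holds an array with one object per LDA execution.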

Stack Exchange Data Explorer

Possible requirements:

Internet browser
Node.js (tested with v14.17.5)

We wrote a small JS script to download the Stack Overflow posts (questions with and without accepted answers) related to the Rx libraries from Stack Exchange Data Explorer (SEDE). It's available at /scripts/data explorer/data-explorer.js. To execute it, one must:

Be logged in SEDE;
Paste the script into the browser's DevTools Console;
Call executeQuery passing 0 (for RxJava), 1 (for RxJS), or 2 (for RxSwift) as a parameter.

Moreover, there's a second script (/scripts/data explorer/rename.js) that can be used to move (and rename) the results to their proper folder, /assets/data explorer/[rx library folder], so they can be further used by the Go consolidate-sources script. For this second JS script to work, one must place the results under /scripts/data explorer/staging area and call the script in a terminal (with node), passing 0 (for RxJava), 1 (for RxJS), or 2 (for RxSwift). For example:

node rename 0

Before executing the Node.js script, one must run the following terminal command within /scripts/data explorer/:

npm install

As detailed in the paper, these were the Stack Overflow tags used:

rx-java, rx-java2, rx-java3 (RxJava)
rxjs, rxjs5, rxjs6, rxjs7 (RxJS)
rx-swift (RxSwift)
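For reference, the numeric arguments accepted by the JS scripts and the tags listed above can be kept together in one lookup table. The Go sketch below is illustrative only; the library type and the libraryFor function are hypothetical and are not part of the repository.

```go
package main

import "fmt"

// library pairs an Rx library with the Stack Overflow tags listed above.
// The type is hypothetical and used here only for illustration.
type library struct {
	Name string
	Tags []string
}

// libraries is indexed by the numeric argument used by the JS scripts
// (0 = RxJava, 1 = RxJS, 2 = RxSwift).
var libraries = []library{
	{Name: "RxJava", Tags: []string{"rx-java", "rx-java2", "rx-java3"}},
	{Name: "RxJS", Tags: []string{"rxjs", "rxjs5", "rxjs6", "rxjs7"}},
	{Name: "RxSwift", Tags: []string{"rx-swift"}},
}

// libraryFor returns the library for an index, or an error if it is out of range.
func libraryFor(index int) (library, error) {
	if index < 0 || index >= len(libraries) {
		return library{}, fmt.Errorf("unknown library index %d", index)
	}
	return libraries[index], nil
}

func main() {
	lib, err := libraryFor(1)
	if err != nil {
		panic(err)
	}
	fmt.Println(lib.Name, lib.Tags)
}
```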

Other Useful Information

Stack Overflow Removed Terms

As defined in the preprocessing phase in the paper, some terms commonly found in the Stack Overflow posts were removed from the corpus. Those include:

differ, specif, deal, prefer, easili, easier, mind, current, solv, proper, modifi, explain, hope, help, wonder, altern, sens, entir, ps, solut, achiev, approach, answer, requir, lot, feel, pretti, easi, goal, think, complex, eleg, improv, look, complic, day, chang, issu, add, edit, remov, custom, suggest, comment, ad, refer, stackblitz, link, mention, detect, face, fix, attach, perfect, mark, reason, suppos, notic, snippet, demo, line, piec, appear
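A term filter of this kind can be sketched in Go as shown below. It is not the repository's preprocessing code: the removedTerms set holds only a handful of the terms listed above, the filterTokens function is hypothetical, and the tokens are assumed to be already lowercased and stemmed.

```go
package main

import "fmt"

// removedTerms holds a few of the stemmed terms listed above (not the full list).
var removedTerms = map[string]struct{}{
	"differ": {}, "specif": {}, "solut": {}, "issu": {}, "snippet": {},
}

// filterTokens returns the tokens that are not in removedTerms.
// Tokens are assumed to be already lowercased and stemmed.
func filterTokens(tokens []string) []string {
	kept := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if _, drop := removedTerms[t]; !drop {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	// Hypothetical stemmed tokens from one post.
	fmt.Println(filterTokens([]string{"observ", "issu", "subscrib", "solut", "merg"}))
}
```

A set backed by a map keeps each membership check constant-time, which matters when the filter runs over every token of every post.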

Topic-Label Mapping

Topic #	Label/Name
0	Concurrency
1	Stream Creation and Composition
2	Typing and Correctness
3	UI for Web-based Systems
4	Input Validation
5	Introductory Questions
6	Testing and Debugging
7	REST API Calls
8	Android Development
9	Data Access
10	State Management and JavaScript
11	Control Flow
12	HTTP Handling
13	Stream Manipulation
14	Error Handling
15	Stream Lifecycle
16	Array Manipulation
17	Web Development
18	General Programming
19	iOS Development
20	Multicasting
21	Timing
22	Dependency Management
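The mapping above can be combined with a post's topic distribution to name its dominant topic. The following Go sketch assumes the distribution is available as a slice of proportions indexed by topic number; the dominantTopic function and the sample proportions are hypothetical, and only the labels come from the table.

```go
package main

import "fmt"

// topicLabels lists the labels from the table above, indexed by topic number.
var topicLabels = []string{
	"Concurrency", "Stream Creation and Composition", "Typing and Correctness",
	"UI for Web-based Systems", "Input Validation", "Introductory Questions",
	"Testing and Debugging", "REST API Calls", "Android Development",
	"Data Access", "State Management and JavaScript", "Control Flow",
	"HTTP Handling", "Stream Manipulation", "Error Handling",
	"Stream Lifecycle", "Array Manipulation", "Web Development",
	"General Programming", "iOS Development", "Multicasting",
	"Timing", "Dependency Management",
}

// dominantTopic returns the index and proportion of the largest entry.
// It returns -1 for an empty distribution.
func dominantTopic(dist []float64) (int, float64) {
	best, bestP := -1, 0.0
	for i, p := range dist {
		if best == -1 || p > bestP {
			best, bestP = i, p
		}
	}
	return best, bestP
}

func main() {
	dist := make([]float64, len(topicLabels))
	dist[14], dist[7] = 0.6, 0.3 // hypothetical proportions for one post
	topic, p := dominantTopic(dist)
	fmt.Printf("topic %d (%s): %.2f\n", topic, topicLabels[topic], p)
}
```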

Tables and Figures

The scripts used to produce some of the tables and figures in the paper are located in the GitHub Mining repository. This was done to facilitate the cross-evaluation (GitHub + SO data) that some of those illustrations required.

alnp/combine-so-mining