Stack Overflow mining scripts used for the following paper, presented at the 19th International Conference on Mining Software Repositories (MSR '22):
Mining the Usage of Reactive Programming APIs: A Mining Study on GitHub and Stack Overflow.
Complementary scripts, also utilized during the paper production, are available in:
The folders under /assets hold the data either generated by or collected for the execution of the scripts. The table below gives a brief description of each folder:
| Folder | Description |
|---|---|
| data explorer | Contains posts collected from Stack Exchange Data Explorer |
| extracted-posts | Includes JSON files with the posts related to the most relevant topics (RQ3) |
| lda-results | Contains the results of the last LDA execution |
| operators-search | Includes the results for the operator search for Rx libraries |
| operators | Includes JSON files consisting of Rx libraries' operators |
| result-processing | Contains data presented in the Result section (RQ2) |
The file stopwords.txt contains a list of stop words used during preprocessing.
The results of the last LDA (Latent Dirichlet Allocation) execution are available under /assets/2022-01-12 02-21-28/. As detailed in the paper, the execution with the following settings generated the most coherent results:
| Parameter | Value |
|---|---|
| Topics | 23 |
| Hyperparameters | α=β=0.01 |
| Iterations | 1,000 |
Each result comprises three CSV files that follow the file name pattern below:
- [file name of the posts file]_doctopicdist_[#topics]_[analyzed post field].csv - contains the posts' ids and their distribution of topics+proportion, including the dominant topic and its proportion in a separate column for easy retrieval;
- [file name of the posts file]_topicdist_[#topics]_[analyzed post field].csv - the topic distribution along with their words+proportion, sorted in descending order of word proportion;
- [file name of the posts file]_topicdist_[#topics]_[analyzed post field] - topwords.csv - (extra) the same as the above one but presenting the topics only with their top words (set in config) to facilitate the open card sorting technique.
Where:
- [file name of the posts file]: is a file under assets/data explorer/consolidated sources and set through config;
- [#topics]: number of topics for that specific execution;
- [analyzed post field]: either Title or Body (see Configuration).
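
The naming pattern above can be sketched in Go as follows. This is only an illustration of how the three file names relate to each other; the function name `resultFileNames` and its signature are hypothetical and are not taken from the scripts in /cmd.

```go
package main

import "fmt"

// resultFileNames builds the three CSV file names for one LDA execution,
// following the pattern described in this README.
// Hypothetical helper: the name and signature are assumptions.
func resultFileNames(postsFile string, topics int, field string) [3]string {
	topicDist := fmt.Sprintf("%s_topicdist_%d_%s", postsFile, topics, field)
	return [3]string{
		// Document/topic distribution, including the dominant topic.
		fmt.Sprintf("%s_doctopicdist_%d_%s.csv", postsFile, topics, field),
		// Topic distribution with all words and their proportions.
		topicDist + ".csv",
		// Same topics, restricted to the top words set in the config.
		topicDist + " - topwords.csv",
	}
}

func main() {
	for _, name := range resultFileNames("all_withAnswers", 23, "Body") {
		fmt.Println(name)
	}
}
```

With the default posts file, 23 topics, and the Body field, the first name printed is all_withAnswers_doctopicdist_23_Body.csv.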
Most of the scripts use Golang as the main language, and they were executed with the following version:
- Go v1.17.5
Before executing the Golang scripts, the following command must be issued in a terminal (inside the root of the project) to download the dependencies:

    go mod tidy

The Go scripts are available under the /cmd folder.
Script to unify all the CSV files acquired from Stack Exchange Data Explorer.

    go run cmd/consolidate-sources/main.go

:floppy_disk: After execution, the result is available at assets/data explorer/consolidated sources/.

Script to extract posts from a given topic.

    go run cmd/extract-posts/main.go

:floppy_disk: After execution, the result is available at assets/extracted-posts.

Script to execute the LDA algorithm.

    go run cmd/lda/main.go

:floppy_disk: After execution, the result is available at assets/lda-results.

Script to generate random posts according to their topics and facilitate the open sort (topic labeling) execution.

    go run cmd/open-sort/main.go

:floppy_disk: After execution, the result is available at assets/opensort.

Script to search for operators among the Stack Overflow posts.

    go run cmd/operators-search/main.go

:floppy_disk: After execution, the result is available at assets/operators-search.

Script to process the results and generate information about the topics, their popularity, and their difficulty.

    go run cmd/process-results/main.go

:floppy_disk: After execution, the result is available at assets/result-processing.
The LDA script requires some configuration in a JSON file (config.json) under the /configs folder. This JSON file is expected to hold an array of objects, each one representing an LDA execution. Each object must have the following structure (this is the object present by default in config.json):

    {
      "fileName": "all_withAnswers",
      "field": "Body",
      "combineTitleBody": true,
      "minTopics": 10,
      "maxTopics": 35,
      "sampleWords": 20
    }

Where:
- fileName (string): the name of the file with the posts (at assets/data explorer/consolidated sources);
- field (string): the field to be considered in LDA (either Title or Body);
- combineTitleBody (boolean): set it to combine title and body and assign the result to the post's Body field (only applicable if field is set to "Body");
- minTopics (integer): the minimum number of topics to be generated;
- maxTopics (integer): the maximum number of topics to be generated;
- sampleWords (integer): the number of sample top words to be included in an extra file whose name ends with - topwords.
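
As a rough illustration of the configuration format, the sketch below decodes an array of such objects and runs a few sanity checks. The `LDAConfig` struct, the `parseConfigs` function, and the validation rules are assumptions made for this example; only the JSON keys and the Title/Body restriction come from this README, and the actual LDA script may read the file differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LDAConfig mirrors one object of config.json.
// Hypothetical type: only the JSON keys are taken from the README.
type LDAConfig struct {
	FileName         string `json:"fileName"`
	Field            string `json:"field"`
	CombineTitleBody bool   `json:"combineTitleBody"`
	MinTopics        int    `json:"minTopics"`
	MaxTopics        int    `json:"maxTopics"`
	SampleWords      int    `json:"sampleWords"`
}

// parseConfigs decodes the JSON array and applies basic checks.
// The checks are assumptions, not rules enforced by the original script.
func parseConfigs(data []byte) ([]LDAConfig, error) {
	var configs []LDAConfig
	if err := json.Unmarshal(data, &configs); err != nil {
		return nil, err
	}
	for i, c := range configs {
		if c.Field != "Title" && c.Field != "Body" {
			return nil, fmt.Errorf("config %d: field must be Title or Body, got %q", i, c.Field)
		}
		if c.MinTopics < 1 || c.MaxTopics < c.MinTopics {
			return nil, fmt.Errorf("config %d: invalid topic range %d..%d", i, c.MinTopics, c.MaxTopics)
		}
	}
	return configs, nil
}

func main() {
	// The default object from config.json, wrapped in an array.
	raw := []byte(`[{"fileName":"all_withAnswers","field":"Body","combineTitleBody":true,"minTopics":10,"maxTopics":35,"sampleWords":20}]`)
	configs, err := parseConfigs(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range configs {
		fmt.Printf("%s: topics %d..%d on %s\n", c.FileName, c.MinTopics, c.MaxTopics, c.Field)
	}
}
```

Decoding into a typed struct makes a misspelled field value or an inverted topic range fail early, before any LDA execution starts.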
Possible requirements:
- Internet browser
- Node.js (tested with v14.17.5)
We wrote a small JS script to download the Stack Overflow posts (questions with and without accepted answers) related to the Rx libraries from Stack Exchange Data Explorer (SEDE).
It's available at /scripts/data explorer/data-explorer.js. To execute it, one must:
- Be logged in to SEDE;
- Paste the script into the browser's DevTools Console;
- Call executeQuery passing 0 (for RxJava), 1 (for RxJS), or 2 (for RxSwift) as a parameter.
Moreover, there's a second script (/scripts/data explorer/rename.js) that can be used to move (and rename) the results to their proper folder, /assets/data explorer/[rx library folder], so they can be further used by the Go consolidate-sources script. For this second JS script to work, one must place the results under /scripts/data explorer/staging area and call the script in a terminal (with node), passing either 0 (for RxJava), 1 (for RxJS), or 2 (for RxSwift). For example:

    node rename 0

Before executing the Node.js script, one must run the following terminal command within /scripts/data explorer/:

    npm install

As detailed in the paper, these were the Stack Overflow tags used:
- rx-java, rx-java2, rx-java3 (RxJava)
- rxjs, rxjs5, rxjs6, rxjs7 (RxJS)
- rx-swift (RxSwift)
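
For reference, the relation between the numeric parameter accepted by the JS scripts (0, 1, 2) and the tags above can be written down as a small Go lookup. This is a sketch only: the `rxLibrary` type and the `libraryFor` function are hypothetical names, and the JS scripts themselves may organize this mapping differently.

```go
package main

import "fmt"

// rxLibrary pairs a library name with its Stack Overflow tags.
// Hypothetical type used only for this illustration.
type rxLibrary struct {
	Name string
	Tags []string
}

// libraries is ordered so that the slice index matches the parameter
// documented for the JS scripts: 0 RxJava, 1 RxJS, 2 RxSwift.
var libraries = []rxLibrary{
	{Name: "RxJava", Tags: []string{"rx-java", "rx-java2", "rx-java3"}},
	{Name: "RxJS", Tags: []string{"rxjs", "rxjs5", "rxjs6", "rxjs7"}},
	{Name: "RxSwift", Tags: []string{"rx-swift"}},
}

// libraryFor returns the library for a given index or an error when
// the index is outside 0..2.
func libraryFor(index int) (rxLibrary, error) {
	if index < 0 || index >= len(libraries) {
		return rxLibrary{}, fmt.Errorf("unknown library index %d", index)
	}
	return libraries[index], nil
}

func main() {
	for i := range libraries {
		lib, _ := libraryFor(i)
		fmt.Printf("%d: %s %v\n", i, lib.Name, lib.Tags)
	}
}
```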
As defined in the preprocessing phase in the paper, some terms commonly found in the Stack Overflow posts were removed from the corpus. Those include:
differ,specif,deal,prefer,easili,easier,mind,current,solv,proper,modifi,explain,hope,help,wonder,altern,sens,entir,ps,solut,achiev,approach,answer,requir,lot,feel,pretti,easi,goal,think,complex,eleg,improv,look,complic,day,chang,issu,add,edit,remov,custom,suggest,comment,ad,refer,stackblitz,link,mention,detect,face,fix,attach,perfect,mark,reason,suppos,notic,snippet,demo,line,piec,appear
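
A minimal sketch of how such a term list could be applied to an already tokenized and stemmed post is shown below. It is not the preprocessing code used for the paper: the function names `newStopSet` and `removeStopTerms` are hypothetical, and the sample tokens are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// newStopSet turns a comma-separated list of terms into a lookup set.
// Hypothetical helper, not part of the original scripts.
func newStopSet(list string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, term := range strings.Split(list, ",") {
		if term = strings.TrimSpace(term); term != "" {
			set[term] = struct{}{}
		}
	}
	return set
}

// removeStopTerms returns the tokens that are not in the stop set,
// preserving their original order.
func removeStopTerms(tokens []string, stop map[string]struct{}) []string {
	kept := make([]string, 0, len(tokens))
	for _, tok := range tokens {
		if _, found := stop[tok]; !found {
			kept = append(kept, tok)
		}
	}
	return kept
}

func main() {
	// First terms of the list above; the tokens are invented sample input.
	stop := newStopSet("differ,specif,deal,prefer,easili")
	tokens := []string{"observ", "differ", "subscrib", "deal", "merg"}
	fmt.Println(removeStopTerms(tokens, stop)) // prints [observ subscrib merg]
}
```

A set lookup keeps the filtering cost proportional to the number of tokens, regardless of how long the term list grows.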
| Topic # | Label/Name |
|---|---|
| 0 | Concurrency |
| 1 | Stream Creation and Composition |
| 2 | Typing and Correctness |
| 3 | UI for Web-based Systems |
| 4 | Input Validation |
| 5 | Introductory Questions |
| 6 | Testing and Debugging |
| 7 | REST API Calls |
| 8 | Android Development |
| 9 | Data Access |
| 10 | State Management and JavaScript |
| 11 | Control Flow |
| 12 | HTTP Handling |
| 13 | Stream Manipulation |
| 14 | Error Handling |
| 15 | Stream Lifecycle |
| 16 | Array Manipulation |
| 17 | Web Development |
| 18 | General Programming |
| 19 | iOS Development |
| 20 | Multicasting |
| 21 | Timing |
| 22 | Dependency Management |
Scripts used to produce some of the tables and figures presented in the paper are located in the GitHub Mining repository. This was done to facilitate the cross evaluation (GitHub + SO data) that some of those illustrations required.