Presented by Dr. Chakkrit (Kla) Tantithamthavorn at the International Conference on Mining Software Repositories (MSR): Education Track on May 26, 2019.
Software analytics focuses on analyzing and modeling a rich source of software data using well-established data analytics techniques in order to glean actionable insights for improving development practices, productivity, and software quality. However, if care is not taken when analyzing and modeling software data, the predictions and insights that are derived from analytical models may be inaccurate and unreliable. The goal of this hands-on tutorial is to guide participants on how to (1) analyze software data using statistical techniques like correlation analysis, hypothesis testing, effect size analysis, and multiple comparisons, (2) develop accurate, reliable, and reproducible analytical models, (3) interpret the models to uncover relationships and insights, and (4) discuss pitfalls associated with analytical techniques including hands-on examples with real software data. R will be the primary programming language. Code samples will be available in a public GitHub repository. Participants will do exercises via either RStudio or Jupyter Notebook through Binder.
Begin by cloning or downloading the tutorial GitHub project https://github.com/awsm-research/tutorial.
To run this tutorial on Jupyter anytime and anywhere, without installing anything locally, launch the repository through Binder.
If you need to install Docker, follow the installation instructions at docker.com (the community edition is sufficient).
Now we'll run the Docker image. It's important to follow the next steps carefully. We're going to mount the local project directory inside the running container so that both the data and the notebooks are available there.
- Open a terminal or command window
- Change to the directory where you expanded the tutorial project or cloned the repo
- To run this tutorial on Jupyter on your local machine, please run the following command:
First, build the Docker image:
docker build -t awsmdocker/binder .
Then, run the container, mounting the current directory at /home/rstudio and publishing port 8888:
docker run -v `pwd`:/home/rstudio -p 8888:8888 awsmdocker/binder:latest
Finally, once the container is running, Jupyter can be accessed at http://localhost:8888/. The required token is printed in the terminal where the container was started.
- Include control factors
- Remove correlated factors
- Build interpretable models
- Explore different parameter settings
- Use out-of-sample bootstrap
- Summarize by a Scott-Knott test
- Visualize the relationship
- Don’t use ANOVA Type-I
- Don’t optimize probability thresholds
- Don’t solely use F-measure
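As a rough illustration of the "Use out-of-sample bootstrap" guideline above, the following minimal sketch trains a logistic regression on a bootstrap sample and evaluates it on the rows that were not drawn. It is not taken from the tutorial's own notebooks: the data are synthetic, and every name in it (`loc`, `churn`, `defect`, `auc`, `oos_iteration`) is a hypothetical choice made for this example. AUC is used as the performance measure because it does not depend on a probability threshold.

```r
# Out-of-sample bootstrap sketch on synthetic data (all names are hypothetical).
set.seed(1)

# Synthetic module-level dataset: two metrics and a binary defect label.
n <- 500
loc <- rexp(n, rate = 1 / 200)   # lines of code
churn <- rexp(n, rate = 1 / 50)  # code churn
p <- plogis(-2 + 0.004 * loc + 0.01 * churn)
data <- data.frame(loc = loc, churn = churn, defect = rbinom(n, 1, p))

# AUC via the rank-sum (Mann-Whitney) formulation; threshold-independent.
auc <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# One bootstrap iteration: train on a sample drawn with replacement,
# then test on the rows that were never drawn (the out-of-sample rows).
oos_iteration <- function(data) {
  idx <- sample(nrow(data), replace = TRUE)
  train <- data[idx, ]
  test <- data[-unique(idx), ]
  fit <- glm(defect ~ loc + churn, data = train, family = binomial())
  auc(predict(fit, newdata = test, type = "response"), test$defect)
}

# Repeat the procedure and summarise the distribution of AUC estimates.
aucs <- replicate(100, oos_iteration(data))
summary(aucs)
```

Reporting the whole distribution of `aucs`, rather than a single train/test split, gives a sense of how stable the performance estimate is. With real data, the formula and the number of repetitions would need to be adapted to the study at hand.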
Dr. Chakkrit (Kla) Tantithamthavorn is a lecturer in the Faculty of Information Technology, Monash University, Australia. His research aims to develop technologies that enable software practitioners to produce the highest quality software systems at the lowest cost. Currently, his research focuses on inventing practical and explainable analytics to prevent future software defects. He is best known as a lead instructor at MSR Education 2019 on Guidelines and Pitfalls for Mining, Analyzing, Modelling, and Explaining Software Defects, and as the author of the ScottKnott ESD R package (a statistical mean comparison test), which has more than 5,000 downloads. More about him is available at http://chakkrit.com.
@inproceedings{tantithamthavorn2018pitfalls,
Author = {Tantithamthavorn, Chakkrit and Hassan, Ahmed E.},
Title = {An Experience Report on Defect Modelling in Practice: Pitfalls and Challenges},
Booktitle = {Proceedings of the International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP'18)},
Pages = {286--295},
Year = {2018}
}
Anyone may contribute to our project. Submit a pull request or raise an issue.
Thank you for working through this tutorial. Feedback and pull requests are welcome.