ICPC-RENE-Paper

This is the experimental package of the paper entitled "What is the Vocabulary of Flaky Tests? An Extended Replication", submitted for publication at ICPC 2021 - Replications and Negative Results (RENE).



Bruno Henrique Pachulski Camara 1, 2,
Marco Aurélio Graciotto Silva 3,
André T. Endo 4,
Silvia Regina Vergilio 2.

1 Centro Universitário Integrado, Campo Mourão, PR, Brazil
2 Department of Computer Science, Federal University of Paraná, Curitiba, PR, Brazil
      bhpachulski@ufpr.br, silvia@inf.ufpr.br
3 Department of Computing, Federal University of Technology - Paraná, Campo Mourão, PR, Brazil
      magsilva@utfpr.edu.br
4 Department of Computing, Federal University of Technology - Parana ́, Cornélio Procópio, PR, Brazil
      andreendo@utfpr.edu.br


This experimental package is organized by research question. For each question, the corresponding files can be executed to reproduce the data presented in the paper.

Abstract

Owing to the widespread adoption of automated tests, software systems have continuously evolved and been delivered with high quality. A recurring issue hurting this scenario is the presence of flaky tests: test cases that may pass or fail non-deterministically. A promising approach, though one still lacking empirical evidence, is to collect static data from automated tests and use it to predict their flakiness. In this paper, we conducted an empirical study to assess the use of code identifiers to predict test flakiness. To do so, we first replicated most parts of the previous study by Pinto et al. (MSR 2020). We extended this replication by using a different Python ML platform (Scikit-learn) and adding different learning algorithms to the analyses. Then, we validated the performance of the trained models using datasets with other flaky tests and from different projects. We successfully replicated the results of Pinto et al. (2020), with minor differences when using Scikit-learn; the additional algorithms performed similarly to the ones used previously. Concerning the validation, we noticed that the recall of the trained models was smaller, with decreases varying across classifiers. This was observed in both intra-project and inter-project test flakiness prediction.
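As a quick illustration of the approach described above, the sketch below shows how test flakiness prediction from code identifiers can be set up with Scikit-learn. It is a minimal sketch, not the pipeline from the paper: the token strings and labels are hypothetical placeholders, whereas the study extracts identifiers from real test code in the replication datasets.

```python
# Minimal sketch of vocabulary-based flakiness prediction with Scikit-learn.
# The sample data below is hypothetical; the study uses identifiers
# extracted from the test code in the replication datasets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Each entry is the "bag of identifiers" of one test case (hypothetical examples).
tests = [
    "sleep thread timeout wait assert",
    "assert equals add list size",
    "connect socket retry timeout close",
    "parse json assert equals field",
]
labels = [1, 0, 1, 0]  # 1 = flaky, 0 = non-flaky

# Turn identifier tokens into count features.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tests)

# Hold out part of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=42, stratify=labels
)

# Train a classifier on the identifier counts and report precision/recall.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```

Validating such a model on other datasets or other projects (as done for the validation step in the paper) amounts to fitting the vectorizer and classifier on one dataset and calling `predict` on tests drawn from another.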

Keywords: test flakiness, regression testing, replication studies, machine learning

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
