multinomial-classification-enem-2019

Problem Definition
The Exame Nacional do Ensino Médio (ENEM) is a Brazilian national standardized test that students use to earn a place at universities in Brazil and abroad (Inep, 2016). With millions of examinees from different social backgrounds, this paper uses the socio-economic data gathered during the 2019 exam application to predict which social class (A to E, following the methodology explained by Carneiro (2021) and used by IBGE) a given applicant belongs to. The microdata can be retrieved here: https://www.gov.br/inep/pt-br/acesso-a-informacao/dados-abertos/microdados/enem (Inep, 2020). In short, 24 questions ask for specific information about goods, education, or work (e.g., the number of cars a family has, if any; the father’s level of education; the type of job the mother does), and the objective of the algorithm is to use all this data to classify an applicant’s social stratum among the five possibilities.
Solution Specification
The algorithm chosen for this multinomial classification task was a Multinomial Logistic Regression (hereafter MLR). Several reasons favored this solution: it does not require assuming normality or homoscedasticity among the independent variables (Starkweather & Moske, 2011), and it offers reasonable accuracy for a “simple” model that copes reasonably well with unbalanced data (there are far more members of classes A and B). The main reason for choosing this model, however, was the computational constraint. Given the size of the explored dataset (1 million entries), more complex solutions that could potentially reach higher accuracy were attempted but failed to run on the available resources. An MLR satisfied the computational constraint while still making it possible to explore as much data as possible. Even with that choice, the complete code took roughly 15 hours to run.
More specifically, the MLR used the lbfgs solver (because it can handle the multinomial loss), an L2 penalty (which adds the squared magnitude of the coefficients to the loss function, to limit the overfitting expected in this high-dimensional dataset), and an inverse regularization parameter equal to 0.5 (this choice is better explained in the context of the cross-validation process, justified in the section below).
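The configuration described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the repository's code: it assumes scikit-learn is the library in use (the solver name lbfgs suggests it, but the text does not say so), and the data, array shapes, and variable names are synthetic stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the encoded questionnaire:
# 2000 applicants, 24 integer-coded answers each (hypothetical data).
X = rng.integers(0, 5, size=(2000, 24))

# Synthetic five-class target (0..4 standing in for classes A to E),
# derived from the first two answers so the model has something to learn.
y = (X[:, 0] + X[:, 1]) // 2

# lbfgs solver with inverse regularization strength C=0.5.
# L2 is scikit-learn's default penalty, so it is not passed explicitly.
# With more than two classes, lbfgs fits a multinomial (softmax) model.
model = LogisticRegression(solver="lbfgs", C=0.5, max_iter=1000)
model.fit(X, y)

# One probability per class for each applicant; each row sums to 1.
proba = model.predict_proba(X[:3])
```

A smaller C means stronger regularization, so C=0.5 penalizes large coefficients more heavily than the library default of 1.0.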
Testing and Analysis
The first step in building the algorithm was loading and cleaning the data. All 5,095,270 entries in the columns of interest (questions Q001 through Q025) were checked for missing data, and none was found. All columns then had their entries encoded as numerical values. The target column (with income information) was recoded to follow the IBGE model of social classification for that year (expressed as the number of minimum wages a household earns), then separated from the other questions and renamed. Because of computational constraints, a subset of 1 million entries was randomly selected from the dataset. This subset was then split into a training set (75%) and a test set (25%).
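The preparation steps above can be sketched as follows. This is an illustration under stated assumptions rather than the repository's code: the data are synthetic, the income question is assumed to be Q006, the cut points that turn income brackets into classes are placeholders (not the IBGE thresholds), and the sample size is scaled down from 1 million.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000  # stand-in for the 5,095,270 real entries

# Synthetic letter-coded answers for questions Q001..Q025.
raw = pd.DataFrame(
    {f"Q{i:03d}": rng.choice(list("ABCDE"), size=n) for i in range(1, 26)}
)

INCOME_COL = "Q006"  # assumption: the income question is not named in the text
raw[INCOME_COL] = rng.choice(list("ABCDEFGHIJKLMNOPQ"), size=n)

# Check for missing data in the columns of interest.
assert raw.isna().sum().sum() == 0

# Encode every answer letter as an integer (A -> 0, B -> 1, ...).
encoded = raw.apply(lambda col: col.map(lambda v: ord(v) - ord("A")))

# Recode income brackets into five classes.
# Placeholder cut points, NOT the IBGE thresholds.
bins = [-1, 3, 6, 9, 12, 16]
labels = ["E", "D", "C", "B", "A"]
social_class = pd.cut(encoded[INCOME_COL], bins=bins, labels=labels)
social_class = social_class.rename("social_class")

# Separate the target from the remaining 24 questions.
features = encoded.drop(columns=[INCOME_COL])

# Random subset (scaled down here), then a 75% / 25% split.
subset = features.sample(n=400, random_state=0)
target = social_class.loc[subset.index]
X_train, X_test, y_train, y_test = train_test_split(
    subset, target, test_size=0.25, random_state=0
)
```

Fixing random_state in both the sampling and the split keeps the subset and the partition reproducible across runs.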
References
Carneiro, T. R. A. (2021, December 10). Faixas Salariais x Classe Social—Qual a sua classe social? A vida é feita de Desconto. https://thiagorodrigo.com.br/artigo/faixas-salariais-classe-social-abep-ibge/
Inep. (2016, November 7). Sobre o Enem—Inep. https://web.archive.org/web/20161107012729/http://portal.inep.gov.br/web/enem/sobre-o-enem
Inep. (2020, November 17). Enem. Instituto Nacional de Estudos e Pesquisas Educacionais Anísio Teixeira | Inep. https://www.gov.br/inep/pt-br/acesso-a-informacao/dados-abertos/microdados/enem
Starkweather, J., & Moske, K. (2011). Multinomial Logistic Regression. https://it.unt.edu/sites/default/files/mlr_jds_aug2011.pdf
