/genderBR

Predict gender from Brazilian first names

Primary LanguageRGNU General Public License v2.0GPL-2.0

genderBR

CRAN_Status_Badge R-CMD-check AppVeyor Build Status Codecov test coverage Package-License

genderBR predicts gender from Brazilian first names using data from the Instituto Brasileiro de Geografia e Estatistica’s 2010 Census.

How does it work?

genderBR’s main function is get_gender, which takes a string with a Brazilian first name and predicts its gender using data from the IBGE’s 2010 Census – specifically, from its API and from an internal dataset.

More specifically, it uses data on the number of females and males with the same name in Brazil, or in a given Brazilian state, and calculates the proportion of female’s uses of it. The function then classifies a name as male or female only when that proportion is higher than a given threshold (e.g., female if proportion > 0.9, or male if proportion <= 0.1); proportions below those threshold are classified as missing (NA). An example:

library(genderBR)
#> 
#> To cite genderBR in publications, use: citation('genderBR')
#> To learn more, visit: fmeireles.com/genderbr

get_gender("joão")
#> [1] "Male"
get_gender("ana")
#> [1] "Female"

Multiple names can be passed at the same function call:

get_gender(c("pedro", "maria"))
#> [1] "Male"   "Female"

And both full names and names written in lower or upper case are accepted as inputs:

get_gender("Mario da Silva")
#> [1] "Male"
get_gender("ANA MARIA")
#> [1] "Female"

Additionally, one can filter results by state with the argument state; or get the probability that a given first name belongs to a female person by setting the prob argument to TRUE (defaults to FALSE).

# What is the probability that the name Ariel belongs to a female person in Brazil?
get_gender("Ariel", prob = TRUE)
#> [1] 0.09219289

# What about differences between Brazilian states?
get_gender("Ariel", prob = TRUE, state = "RJ") # RJ, Rio de Janeiro
#> [1] 0.2627399
get_gender("Ariel", prob = TRUE, state = "RS") # RS, Rio Grande do Sul
#> [1] 0.05144695
get_gender("Ariel", prob = TRUE, state = "SP") # SP, Sao Paulo
#> [1] 0.1294782

Note that a vector with states’ abbreviations is a valid input for get_gender function, so this also works:

name <- rep("Ariel", 3)
states <- c("rj", "rs", "sp")
get_gender(name, prob = T, state = states)
#> [1] 0.26273991 0.05144695 0.12947819

This can be useful also to predict the gender of different individuals living in different states:

df <- data.frame(name = c("Alberto da Silva", "Maria dos Santos", "Thiago Rocha", "Paula Camargo"),
                 uf = c("AC", "SP", "PE", "RS"),
                 stringsAsFactors = FALSE
                 )

df$gender <- get_gender(df$name, df$uf)

df
#>               name uf gender
#> 1 Alberto da Silva AC   Male
#> 2 Maria dos Santos SP Female
#> 3     Thiago Rocha PE   Male
#> 4    Paula Camargo RS Female

Brazilian state abbreviations

The genderBR package relies on Brazilian state abbreviations (acronyms) to filter results. To get a complete dataset with the full name, IBGE code, and abbreviations of all 27 Brazilian states, use the get_states functions:

get_states()
#> # A tibble: 27 x 3
#>    state            abb    code
#>    <chr>            <chr> <int>
#>  1 ACRE             AC       12
#>  2 ALAGOAS          AL       27
#>  3 AMAPA            AP       16
#>  4 AMAZONAS         AM       13
#>  5 BAHIA            BA       29
#>  6 CEARA            CE       23
#>  7 DISTRITO FEDERAL DF       53
#>  8 ESPIRITO SANTO   ES       32
#>  9 GOIAS            GO       52
#> 10 MARANHAO         MA       21
#> # … with 17 more rows

Geographic distribution of Brazilian first names

The genderBR package can also be used to get information on the relative and total number of persons with a given name by gender and by state in Brazil. To that end, use the map_gender function:

map_gender("maria")
#> # A tibble: 27 x 6
#>    nome                   uf    freq populacao sexo    prop
#>    <chr>               <int>   <int>     <int> <chr>  <dbl>
#>  1 Piauí                  22  363139   3118360 ""    11645.
#>  2 Ceará                  23  967042   8452381 ""    11441.
#>  3 Paraíba                25  423026   3766528 ""    11231.
#>  4 Rio Grande do Norte    24  341940   3168027 ""    10793.
#>  5 Alagoas                27  321330   3120494 ""    10297.
#>  6 Pernambuco             26  838534   8796448 ""     9533.
#>  7 Sergipe                28  188619   2068017 ""     9121.
#>  8 Maranhão               21  574689   6574789 ""     8741.
#>  9 Acre                   12   63172    733559 ""     8612.
#> 10 Minas Gerais           31 1307650  19597330 ""     6673.
#> # … with 17 more rows

To specify gender in the consultation, use the optional argument gender (valid inputs are f, for female; m, for male; or NULL, the default option).

map_gender("iris", gender = "m")
#> # A tibble: 23 x 6
#>    nome                uf  freq populacao sexo   prop
#>    <chr>            <int> <int>     <int> <chr> <dbl>
#>  1 Goiás               52   840   6003788 m     14.0 
#>  2 Tocantins           17   156   1383445 m     11.3 
#>  3 Bahia               29   422  14016906 m      3.01
#>  4 Mato Grosso         51    91   3035122 m      3   
#>  5 Minas Gerais        31   512  19597330 m      2.61
#>  6 Distrito Federal    53    65   2570160 m      2.53
#>  7 Espírito Santo      32    69   3514952 m      1.96
#>  8 Rondônia            11    28   1562409 m      1.79
#>  9 Pará                15   129   7581051 m      1.7 
#> 10 Rio de Janeiro      33   225  15989929 m      1.41
#> # … with 13 more rows

Installing

To install genderBR’s last stable version on CRAN, use:

install.packages("genderBR")

To install a development version, use:

if (!require("devtools")) install.packages("devtools")
devtools::install_github("meirelesff/genderBR")

Data

The surveyed population in the Instituto Brasileiro de Geografia e Estatistica’s (IBGE) 2010 Census includes 190,8 million Brazilians – with more than 130,000 unique first names.

To extracts the numer of male or female uses of a given first name in Brazil, the package employs the IBGE’s API and, from in 1.1.0 version, also from an internal dataset containing all the names recorded in the IBGE’s Census. In this service, different spelling (e.g., Ana and Anna, or Marcos and Markos) implies different occurrences, and only names with more than 20 occurrences, or more than 15 occurrences in a given state, are included in the database.

For more information on the IBGE’s data, please check (in Portuguese): https://censo2010.ibge.gov.br/nomes/

Author

Fernando Meireles