trinker/lexicon

Duplicate words in `profanity_list` result in an error

trinker opened this issue · 1 comments

dats <- c( 
    "crowdflower_deflategate", 
    "crowdflower_products", 
    "course_evaluations", 
    "crowdflower_self_driving_cars", 
    "crowdflower_weather", 
    "hotel_reviews", 
    "kaggle_movie_reviews", 
    "cannon_reviews", 
    "kotzias_reviews_amazon_cells"
) 


cdat <- combine_data(dats[1])


sdat <- get_sentences(cdat)
swears <- profanity(sdat, profanity_list = c( 'shit', 'shit'))

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 187093 rows; more than 187044 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the ..

`profanity_list` needs to be checked for duplicates and made unique before it is used in the join.

Use:

fix_profanity_list <- function(x, warn = TRUE, ...){
    if(!is.atomic(x)) stop('A `profanity_list` must be an atomic character vector.')
    if(!is.character(x)) stop('A `profanity_list` must be a character vector.')  
    if (any(grepl('[A-Z]', x), na.rm = TRUE)) {
        if (warn) warning('Upper case characters found in `profanity_list`...\nConverting to lower', call. = FALSE)
        x <- tolower(x)
    }
    if (anyNA(x)) {
        if (warn) warning('missing values found in `profanity_list`...\nRemoving all `NA` values', call. = FALSE)
        x <- x[!is.na(x)]
    }    
    if (anyDuplicated(x) > 0) {
        if (warn) warning('duplicate values found in `profanity_list`...\nRemoving all duplicates', call. = FALSE)
        x <- unique(x)
    }   
    x
}

The function is already in the dev version; it just needs to be called with one line:

profanity_list <- fix_profanity_list(profanity_list)
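Until that line is in place, a caller-side workaround is to clean the list by hand before passing it in. This is only a minimal base R sketch of the same three steps (drop `NA`, lower-case, de-duplicate); `raw_list` and `clean_list` are made-up names for illustration, and the final `profanity()` call is left commented out because it needs the package and the `sdat` object from the example above.

```r
# Hypothetical input: exact duplicates, a case variant, and a missing value
raw_list <- c('shit', 'shit', 'Shit', NA)

# Drop NAs, lower-case, then de-duplicate. Lower-casing comes before
# unique() so that case variants collapse into one entry.
clean_list <- unique(tolower(raw_list[!is.na(raw_list)]))

# No duplicate keys are left to trip up the join
stopifnot(anyDuplicated(clean_list) == 0)
print(clean_list)

# swears <- profanity(sdat, profanity_list = clean_list)
```

Lower-casing before calling `unique()` matters here: entries that differ only by case would otherwise survive de-duplication and then become duplicates once converted.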