Duplicate words in the profanity list result in an error
trinker opened this issue · 1 comments
trinker commented
dats <- c(
    "crowdflower_deflategate",
    "crowdflower_products",
    "course_evaluations",
    "crowdflower_self_driving_cars",
    "crowdflower_weather",
    "hotel_reviews",
    "kaggle_movie_reviews",
    "cannon_reviews",
    "kotzias_reviews_amazon_cells"
)
cdat <- combine_data(dats[1])
sdat <- get_sentences(cdat)
swears <- profanity(sdat, profanity_list = c('shit', 'shit'))
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 187093 rows; more than 187044 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the ..
The profanity list needs to be checked for duplicates and made unique before it is used in the join.
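The error text points at duplicate key values on one side of a join. A minimal sketch of that effect, using base R `merge()` on made-up data frames rather than the join the package actually performs (the names `x` and `i` and their contents are assumptions for illustration only):

```r
# Hypothetical stand-ins: x is the tokenized text, i is the lookup list
x <- data.frame(word = c("a", "b", "c"))
i <- data.frame(word = c("b", "b"), flag = TRUE)  # duplicated key "b"

# Each matching row in x is repeated once per duplicate key in i
n_dup <- nrow(merge(x, i, by = "word"))

# After de-duplicating the lookup list, each match appears once
n_unique <- nrow(merge(x, unique(i), by = "word"))

c(with_duplicates = n_dup, without_duplicates = n_unique)
```

With duplicates in the lookup, the single matching word comes back twice; after `unique()` it comes back once.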
trinker commented
Use:
fix_profanity_list <- function(x, warn = TRUE, ...){

    # Validate the input type
    if (!is.atomic(x)) stop('A `profanity_list` must be an atomic character vector.')
    if (!is.character(x)) stop('A `profanity_list` must be a character vector.')

    # Normalize to lower case
    if (any(grepl('[A-Z]', x), na.rm = TRUE)) {
        if (warn) warning('Upper case characters found in `profanity_list`...\nConverting to lower', call. = FALSE)
        x <- tolower(x)
    }

    # Drop missing values
    if (anyNA(x)) {
        if (warn) warning('missing values found in `profanity_list`...\nRemoving all `NA` values', call. = FALSE)
        x <- x[!is.na(x)]
    }

    # Drop duplicates (anyDuplicated() returns 0 when there are none)
    if (anyDuplicated(x) > 0) {
        if (warn) warning('duplicate values found in `profanity_list`...\nRemoving all duplicates', call. = FALSE)
        x <- unique(x)
    }

    x
}
This function is already in the dev version; profanity() just needs one line that passes its input through it:
profanity_list <- fix_profanity_list(profanity_list)
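As a rough illustration of what that sanitizing step does to a list, here are the same three operations written inline with base R on a made-up vector (`raw_terms` and its contents are hypothetical, not taken from the package):

```r
# Hypothetical input: mixed case, a missing value, and a repeated term
raw_terms <- c("Darn", "darn", NA, "heck", "heck")

# The same three steps as the helper above, applied in the same order
terms <- tolower(raw_terms)    # lower-case everything (NA stays NA)
terms <- terms[!is.na(terms)]  # drop missing values
terms <- unique(terms)         # drop duplicates

terms
anyDuplicated(terms)  # 0 means no duplicate keys remain
```

Lower-casing before de-duplicating matters here: "Darn" and "darn" only collapse into one entry once both are lower case.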