KoNLP function preprocessing in hangulUtils.R
Closed this issue · 2 comments
preprocessing <- function(inputs){
  if(!is.character(inputs)) {
    warning("Input must be legitimate character!")
    return(FALSE)
  }
  newInput <- gsub("[[:space:]]", " ", inputs)
  newInput <- gsub("[[:space:]]+$", "", newInput)
  newInput <- gsub("^[[:space:]]+", "", newInput)
  if((nchar(newInput) == 0) |
     (nchar(newInput) > 20 & length(strsplit(newInput, " ")[[1]]) <= 1)){
    warning(sprintf("It's not kind of right sentence : '%s'", inputs))
    return(FALSE)
  }
  return(newInput)
}
ex_str_A = '가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다'
If inputs is ex_str_A, preprocessing returns FALSE.
ex_str_B = '하하 호호 가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다'
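But if inputs is ex_str_B, preprocessing does not return FALSE; it returns ex_str_B itself.
This is because nchar(newInput) > 20 is TRUE, but length(strsplit(newInput, " ")[[1]]) <= 1 is FALSE: the two short words in front split the input into three tokens, so the long run of characters is never checked on its own.
So if SimplePos09 receives ex_str_B, it could cause a problem (possibly related to memory use in HannanumObj).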
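The two halves of the rule can be evaluated separately to see where ex_str_B slips through. This is a minimal sketch: rule_parts is a hypothetical helper written for this illustration, and the repetition count in long_token is an arbitrary stand-in for the long strings above.

```r
# Stand-in for the long run of "가나다" in the report (count is arbitrary).
long_token <- strrep("가나다", 30)
ex_str_A <- long_token
ex_str_B <- paste("하하 호호", long_token)

# Hypothetical helper: evaluates the two halves of the length rule separately.
rule_parts <- function(x) {
  c(too_long     = nchar(x) > 20,
    single_token = length(strsplit(x, " ")[[1]]) <= 1)
}

print(rule_parts(ex_str_A))  # both TRUE, so preprocessing() rejects the input
print(rule_parts(ex_str_B))  # single_token is FALSE, so the input passes through
```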
This rule, "nchar(newInput) > 20 & length(strsplit(newInput, " ")[[1]]) <= 1", exists to protect KoNLP from abnormal sentences that make the function call wait indefinitely. Solving this problem properly will take some time.
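For illustration only, one way the R-side rule could be tightened is to apply the length limit to each whitespace-separated token instead of to the whole input. The helper below is hypothetical and not part of KoNLP; the threshold of 20 is taken from the existing rule, and long_token is again an arbitrary stand-in.

```r
# Hypothetical helper (not part of KoNLP): TRUE if any whitespace-separated
# token in x is longer than max_chars characters.
has_overlong_token <- function(x, max_chars = 20) {
  tokens <- strsplit(x, "[[:space:]]+")[[1]]
  any(nchar(tokens) > max_chars)
}

long_token <- strrep("가나다", 30)  # stand-in for the long run in the report

print(has_overlong_token(long_token))                      # lone long token
print(has_overlong_token(paste("하하 호호", long_token)))  # long token after short words
print(has_overlong_token("하하 호호"))                     # ordinary short input
```

A per-token limit like this would also reject legitimate long tokens such as URLs, so it trades some recall for safety.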
Added a new InformalEojeolSentenceFilter to HanNanum-Analyzer to prevent this issue.
The next KoNLP release should no longer hang, even when an abnormal sentence is given as input.