KoNLP function preprocessing in hangulUtils.R

Question

Closed this issue 8 years ago · 2 comments

preprocessing <- function(inputs){
  if(!is.character(inputs)) {
    warning("Input must be legitimate character!")
    return(FALSE)
  }
  newInput <- gsub("[[:space:]]", " ", inputs)
  newInput <- gsub("[[:space:]]+$", "", newInput)
  newInput <- gsub("^[[:space:]]+", "", newInput)
  if((nchar(newInput) == 0) |
     (nchar(newInput) > 20 & length(strsplit(newInput, " ")[[1]]) <= 1)){
    warning(sprintf("It's not kind of right sentence : '%s'", inputs))
    return(FALSE)
  }
  return(newInput)
}

ex_str_A = '가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다'
If inputs is ex_str_A, preprocessing() returns FALSE.

ex_str_B = '하하 호호 가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다'
But if inputs is ex_str_B, it returns ex_str_B itself rather than FALSE.

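This is because nchar(newInput) > 20 is TRUE, but length(strsplit(newInput, " ")[[1]]) <= 1 is FALSE: ex_str_B splits into three tokens, so the long run without spaces at the end is never checked on its own.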
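
To make the two halves of the condition easier to see, here is a minimal sketch that evaluates them separately. The names str_a, str_b and token_count are stand-ins introduced for illustration (shorter than ex_str_A and ex_str_B, but still over 20 characters); they are not part of KoNLP.

```r
# Stand-in inputs: a long run with no spaces, and the same run
# preceded by two short words.
str_a <- strrep("가나다", 10)
str_b <- paste("하하 호호", strrep("가나다", 10))

# The token count used in the second half of the condition in preprocessing().
token_count <- function(x) length(strsplit(x, " ")[[1]])

nchar(str_a) > 20        # TRUE
token_count(str_a) <= 1  # TRUE  -> both halves hold, input is rejected
nchar(str_b) > 20        # TRUE
token_count(str_b) <= 1  # FALSE -> the check is skipped, input passes
```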

So if SimplePos09 receives ex_str_B, it could cause a problem (possibly related to memory use in HannanumObj).
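
One possible caller-side workaround, sketched under assumptions, is to check the length of each token instead of the whole input. has_long_token is a hypothetical helper, not a KoNLP function, and the threshold of 20 is simply borrowed from the rule above; whether that is the right limit for the analyzer is not established here.

```r
# Hypothetical helper (not part of KoNLP): TRUE if any whitespace-separated
# token is longer than max_chars characters.
has_long_token <- function(x, max_chars = 20) {
  tokens <- strsplit(x, "[[:space:]]+")[[1]]
  any(nchar(tokens) > max_chars)
}

has_long_token("하하 호호")                               # FALSE
has_long_token(paste("하하 호호", strrep("가나다", 10)))  # TRUE
```

An input for which has_long_token() returns TRUE could then be skipped or split before it is passed on to SimplePos09.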

Answer 1 · 2016-08-09T04:48:40.000Z

The rule "nchar(newInput) > 20 and length(strsplit(newInput, " ")[[1]]) <= 1" is there to protect KoNLP from abnormal sentences that cause the function call to wait indefinitely. Solving this problem properly will take some time.

Answer 2 · 2016-10-21T15:33:07.000Z

Added a new InformalEojeolSentenceFilter to HanNanum-Analyzer to prevent this issue.
The next KoNLP release should no longer hang, even when given an abnormal sentence as input.

haven-jeon/HanNanum-Analyzer@9412eb8