Stemming problem
ensozos opened this issue · 4 comments
ensozos commented
The sentence "12-Gauge Angle" gives "angl" as the stem of "angle", which is correct, but
the sentence "Angle brucket" gives "angle" as the stemmed word.
master commented
Can't reproduce:
scala> val sentenceDataFrame = spark.createDataFrame(Seq((0, "12-Gauge Angle"), (1, "Angle brucket"))).toDF("id", "sentence")
sentenceDataFrame: org.apache.spark.sql.DataFrame = [id: int, sentence: string]
scala> sentenceDataFrame.show
+---+--------------+
| id|      sentence|
+---+--------------+
|  0|12-Gauge Angle|
|  1| Angle brucket|
+---+--------------+
scala> val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
tokenizer: org.apache.spark.ml.feature.Tokenizer = tok_73998f8e2c95
scala> val data = tokenizer.transform(sentenceDataFrame).select("words")
data: org.apache.spark.sql.DataFrame = [words: array<string>]
scala> data.show
+-----------------+
|            words|
+-----------------+
|[12-gauge, angle]|
| [angle, brucket]|
+-----------------+
scala> val stemmer = new Stemmer().setInputCol("words").setOutputCol("stemmed").setLanguage("English").transform(data).show(false)
+-----------------+---------------+
|words            |stemmed        |
+-----------------+---------------+
|[12-gauge, angle]|[12-gaug, angl]|
|[angle, brucket] |[angl, brucket]|
+-----------------+---------------+
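For anyone who wants to check their own installation, the session above can be condensed into a standalone check. This is only a minimal sketch, not part of the library: the object name `StemmerRepro`, the `local[1]` master setting, and the `org.apache.spark.mllib.feature.Stemmer` import path are assumptions (the session above does not show its imports), and it needs Spark plus the stemming package on the classpath.

```scala
import org.apache.spark.ml.feature.Tokenizer
// Assumed import path for the Stemmer; adjust it if your version differs.
import org.apache.spark.mllib.feature.Stemmer
import org.apache.spark.sql.SparkSession

// Hypothetical helper object; the name is illustrative only.
object StemmerRepro {
  // Stems taken from the output of the REPL session above.
  val expected: List[List[String]] =
    List(List("12-gaug", "angl"), List("angl", "brucket"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("stemmer-repro")
      .getOrCreate()
    try {
      // Same two sentences as in the report.
      val sentences = spark
        .createDataFrame(Seq((0, "12-Gauge Angle"), (1, "Angle brucket")))
        .toDF("id", "sentence")

      // Tokenizer lowercases the text and splits it on whitespace.
      val words = new Tokenizer()
        .setInputCol("sentence")
        .setOutputCol("words")
        .transform(sentences)

      val stemmed = new Stemmer()
        .setInputCol("words")
        .setOutputCol("stemmed")
        .setLanguage("English")
        .transform(words)

      // Collect the stems in a stable order so they can be compared.
      val actual = stemmed
        .orderBy("id")
        .select("stemmed")
        .collect()
        .map(_.getSeq[String](0).toList)
        .toList

      assert(actual == expected, s"unexpected stems: $actual")
      println("Stems match the expected output.")
    } finally {
      spark.stop()
    }
  }
}
```

If the assertion fails with "angle" left unstemmed in the second row, the installed stemmer version is the first thing worth checking.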
master commented
Closing as can't reproduce. Feel free to reopen if needed.
ensozos commented
I had the older version (0.1.1), which is why I was getting the bug. With 0.2.0 it works perfectly!
Sorry for the silly mistake, and thank you for your reply.
master commented
No worries :)