master/spark-stemming

Stemming problem

ensozos opened this issue · 4 comments

The sentence "12-Gauge Angle" gives the stemmed word "angl" for "angle", which is correct, but
the sentence "Angle brucket" gives "angle" as the stemmed word.

Can't reproduce:

scala> val sentenceDataFrame = spark.createDataFrame(Seq((0, "12-Gauge Angle"), (1, "Angle brucket"))).toDF("id", "sentence")
sentenceDataFrame: org.apache.spark.sql.DataFrame = [id: int, sentence: string]

scala> sentenceDataFrame.show
+---+--------------+
| id|      sentence|
+---+--------------+
|  0|12-Gauge Angle|
|  1| Angle brucket|
+---+--------------+

scala> val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
tokenizer: org.apache.spark.ml.feature.Tokenizer = tok_73998f8e2c95

scala> val data = tokenizer.transform(sentenceDataFrame).select("words")
data: org.apache.spark.sql.DataFrame = [words: array<string>]

scala> data.show
+-----------------+
|            words|
+-----------------+
|[12-gauge, angle]|
| [angle, brucket]|
+-----------------+

scala> new Stemmer().setInputCol("words").setOutputCol("stemmed").setLanguage("English").transform(data).show(false)
+-----------------+---------------+
|words            |stemmed        |
+-----------------+---------------+
|[12-gauge, angle]|[12-gaug, angl]|
|[angle, brucket] |[angl, brucket]|
+-----------------+---------------+
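
For anyone who wants to rerun this outside the shell, the steps above can be collected into a small standalone program. This is only a sketch under a few assumptions: the import path `org.apache.spark.mllib.feature.Stemmer` is assumed to be where spark-stemming exposes the transformer, the object name `StemmingRepro` and the helper `stem` are made up for illustration, and the local master setting is just a convenience for testing.

```scala
import org.apache.spark.ml.feature.Tokenizer
// Assumption: spark-stemming exposes its transformer under this package.
import org.apache.spark.mllib.feature.Stemmer
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, not part of spark-stemming.
object StemmingRepro {

  // Tokenizes and stems each sentence, returning the stems in input order.
  def stem(spark: SparkSession, sentences: Seq[String]): Seq[Seq[String]] = {
    // Pair each sentence with an id so the original order can be restored.
    val rows = sentences.zipWithIndex.map { case (s, i) => (i, s) }
    val df = spark.createDataFrame(rows).toDF("id", "sentence")

    // Tokenizer lowercases the text and splits it on whitespace.
    val words = new Tokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")
      .transform(df)

    // Same Stemmer settings as in the shell session above.
    val stemmed = new Stemmer()
      .setInputCol("words")
      .setOutputCol("stemmed")
      .setLanguage("English")
      .transform(words)

    stemmed
      .orderBy("id")
      .select("stemmed")
      .collect()
      .map(_.getSeq[String](0).toSeq)
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stemming-repro")
      .master("local[*]") // assumption: local run for a quick check
      .getOrCreate()
    try {
      stem(spark, Seq("12-Gauge Angle", "Angle brucket"))
        .foreach(ws => println(ws.mkString("[", ", ", "]")))
    } finally {
      spark.stop()
    }
  }
}
```

If "angle" stems differently between the two sentences when you run this, it is worth checking which spark-stemming version is on the classpath.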

Closing as can't reproduce. Feel free to reopen if needed.

I had the older version (0.1.1), which is why I was getting the bug. With 0.2.0 it works perfectly!
Sorry for the silly mistake, and thank you for your reply.

No worries :)