JohnSnowLabs/spark-nlp-workshop

How to use exported model from HuggingFace (BERT For Sequence Classification)

xegulon opened this issue · 2 comments

Description

Hi, I have followed the notebook about exporting HuggingFace BERT For Sequence Classification to Spark NLP and completed all the steps successfully, but I can't find a way to use the exported model for inference. So the question is: how do I use it for real inference? Could you add an example at the end of the notebook?

My attempt

This is what I have tried:

from pyspark.ml.pipeline import PipelineModel
from sparknlp.base import *
from sparknlp.annotator import *

document = DocumentAssembler()\
    .setInputCol("sentence")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

bert_embeddings = BertEmbeddings.pretrained(name="test", lang='en') \
    .setInputCols(["document", 'token']) \
    .setOutputCol("embeddings")

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

# the model resulting from the conversion notebook
sequenceClassifier_loaded = BertForSequenceClassification.load("./best_model_1_spark_nlp") \
    .setInputCols(["sentence_embeddings", 'token']) \
    .setOutputCol("class")

pipeline_model = PipelineModel([document, tokenizer, sequenceClassifier_loaded])

model = LightPipeline(pipeline_model)

sentences = [...]

model.fullAnnotate(sentences)

I get the error:

---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
<ipython-input-8-1e53793d7914> in <module>
      1 pipeline_model = PipelineModel([document, tokenizer, sequenceClassifier_loaded])
      2 
----> 3 model = LightPipeline(pipeline_model)
      4 
      5 sentences = [...]

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sparknlp/base.py in __init__(self, pipelineModel, parse_embeddings)
     77     def __init__(self, pipelineModel, parse_embeddings=False):
     78         self.pipeline_model = pipelineModel
---> 79         self._lightPipeline = _internal._LightPipeline(pipelineModel, parse_embeddings).apply()
     80 
     81     @staticmethod

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sparknlp/internal.py in __init__(self, pipelineModel, parse_embeddings)
    265 class _LightPipeline(ExtendedJavaWrapper):
    266     def __init__(self, pipelineModel, parse_embeddings):
--> 267         super(_LightPipeline, self).__init__("com.johnsnowlabs.nlp.LightPipeline", pipelineModel._to_java(),
    268                                              parse_embeddings)
    269 

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pyspark/ml/pipeline.py in _to_java(self)
    331         java_stages = gateway.new_array(cls, len(self.stages))
    332         for idx, stage in enumerate(self.stages):
--> 333             java_stages[idx] = stage._to_java()
    334 
    335         _java_obj =\

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/py4j/java_collections.py in __setitem__(self, key, value)
    236 
    237         elif isinstance(key, int):
--> 238             return self.__set_item(key, value)
    239         else:
    240             raise TypeError("list indices must be integers, not {0}".format(

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/py4j/java_collections.py in __set_item(self, key, value)
    219         command += proto.END_COMMAND_PART
    220         answer = self._gateway_client.send_command(command)
--> 221         return get_return_value(answer, self._gateway_client)
    222 
    223     def __setitem__(self, key, value):

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    330                 raise Py4JError(
    331                     "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
--> 332                     format(target_id, ".", name, value))
    333         else:
    334             raise Py4JError(

Py4JError: An error occurred while calling None.None. Trace:
py4j.Py4JException: Cannot convert com.johnsnowlabs.nlp.annotators.Tokenizer to org.apache.spark.ml.Transformer
	at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
	at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
	at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Thread.java:834)
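
Judging from the trace, Spark NLP's `Tokenizer` is an Estimator, not a fitted `Transformer`, so it cannot be placed directly into a `PipelineModel`. The usual pattern (a sketch of what I would expect to work, assuming the standard Spark NLP API; the model path and example sentence are illustrative, and this is untested):

```python
import sparknlp
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import Tokenizer, BertForSequenceClassification
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# BertForSequenceClassification consumes document + token directly,
# so the separate embeddings stages should not be needed here.
sequenceClassifier = BertForSequenceClassification.load("./best_model_1_spark_nlp") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class")

# Fitting on an empty DataFrame turns the Estimator stages (here, Tokenizer)
# into Transformers, producing a PipelineModel that LightPipeline accepts.
pipeline = Pipeline(stages=[document, tokenizer, sequenceClassifier])
empty_df = spark.createDataFrame([[""]]).toDF("text")
pipeline_model = pipeline.fit(empty_df)

model = LightPipeline(pipeline_model)
result = model.fullAnnotate(["I love this movie!"])
print(result[0]["class"])
```

If that is the intended usage, having it at the end of the export notebook would make the round trip much clearer.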

Your Environment

Not relevant.