gnes-ai/gnes

How to use GNES for text classification?

ilham-bintang opened this issue · 2 comments

Problem and Question

Hi, I have taken a look at the poem project and want to use other data for indexing. How can I use a labeled CSV to do supervised learning on text data?

My data, sample.tsv, with Khmer (Cambodian) text:

intent	question
ClassA	FAQ 1? ចូលទៅប្រើប្រាស់កម្មវិធីនេះ?
ClassA	Another FAQ Question similar to FAQ 1?
ClassB	TestQuestion with thai text ចូលទៅប្រើប្រាស់កម្មវិធីនេះ?
ClassB	Another data sample

What I have tried

  1. I tried passing a Pandas Series, and it raised a gRPC error.
  2. I tried passing a tuple of (intent, question), and it also raised a gRPC error.
  3. I tried indexing the question only, converting the str into bytes. This builds successfully without a gRPC error, but it raises:
W:EncoderService:[enc:emb: 42]:document (doc_id=20) contains no chunks!
W:IndexerService:[ind:_ha: 57]:document (doc_id=10) contains no chunks!
W:EncoderService:[enc:emb: 42]:document (doc_id=22) contains no chunks!
W:IndexerService:[ind:_ha: 57]:document (doc_id=12) contains no chunks!
W:EncoderService:[enc:emb: 42]:document (doc_id=24) contains no chunks!
W:IndexerService:[ind:_ha: 57]:document (doc_id=16) contains no chunks!
E:EncoderService:[enc:emb: 67]:can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
W:EncoderService:[enc:emb: 68]:encoder service throws an exception, the sequel pipeline may not work properly
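
For reference, attempt 3 above can be sketched without any GNES-specific calls. This is a minimal sketch, assuming a tab-separated file with the `intent` and `question` header shown above. `load_questions` is a hypothetical helper name, and how the resulting bytes are handed to the GNES client is left out on purpose because it depends on the GNES version in use.

```python
import csv
import io

def load_questions(tsv_file):
    """Return (labels, docs): intent labels and UTF-8 encoded questions, in row order."""
    # QUOTE_NONE keeps any quote characters inside a question as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    labels, docs = [], []
    for row in reader:
        question = (row["question"] or "").strip()
        if not question:
            continue  # skip rows without a question so no empty document is sent
        labels.append(row["intent"])
        docs.append(question.encode("utf-8"))
    return labels, docs

# In-memory stand-in for sample.tsv (hypothetical rows, one of them empty)
sample = io.StringIO(
    "intent\tquestion\n"
    "ClassA\tFAQ 1?\n"
    "ClassA\tAnother FAQ Question similar to FAQ 1?\n"
    "ClassB\t\n"
    "ClassB\tAnother data sample\n"
)
labels, docs = load_questions(sample)
```

Keeping `labels` in the same order as `docs` means the position of a document can later be mapped back to its intent, even though only the question bytes are indexed.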

Question

How do I train on the labeled CSV data?

Hi. Short answer: it is not straightforward. Long answer: let me explain.

GNES, as its name suggests, focuses on the search scenario. With the recent release of GNES Flow, it becomes more obvious that GNES is to some extent similar to Kubeflow/Airflow: it provides a cloud-native workflow for AI-powered microservices. However, the major difference is that the GNES workflow is designed and optimized for the search scenario only.

If you look at the GNES components and predefined flows, they are completely search-driven. This is because, from day one, this project was designed to be a next-gen search engine, nothing else.

So if you ask me whether it can be used for classification, clustering, recommendation, etc.: maybe it can be done easily, maybe it needs more components or flows. To be honest, I haven't put as much thought into these tasks as I have into search (which is also where my experience lies). Meanwhile, I do welcome people to contribute their ideas on this thread, in particular:

  • What other components besides Encoder, Router, Preprocessor, and Indexer are required to achieve the classification task?
  • What does the classification workflow look like? Can we represent it using GNES Flow?
  • What are the responses to the client in this task, and are they describable via the current response protobuf?

Once these questions are figured out, you will get the answer.
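
As one possible starting point for the workflow question, classification can be framed as search followed by a vote: index the labeled questions, retrieve the top-k most similar ones for a query, and return the majority label. The sketch below is plain Python and does not use any GNES API. The vectors are toy values standing in for encoder output, and `cosine` and `knn_label` are hypothetical names.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_label(query_vec, indexed, k=3):
    """indexed: list of (vector, label). Return the majority label of the k nearest."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy "index": made-up 2-d vectors paired with intent labels
indexed = [
    ([1.0, 0.1], "ClassA"),
    ([0.9, 0.2], "ClassA"),
    ([0.1, 1.0], "ClassB"),
    ([0.2, 0.9], "ClassB"),
]
predicted = knn_label([0.8, 0.3], indexed, k=3)
```

In this framing the search part (encode, index, retrieve top-k) is what the existing components already cover, while the vote over labels is the extra step that would need a home in the flow and in the response.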

Hi @hanxiao,
First of all, thank you for your contribution to this great project and the clear explanation of GNES.

Actually, the issue above was fixed by removing the Cambodian letters.
It seems the Cambodian letters could not be encoded with UTF-8.
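
A quick way to check whether encoding is really the culprit is to round-trip each question through UTF-8 and collect the rows that fail. This is only a diagnostic sketch, and `find_unencodable` is a hypothetical helper, not part of GNES. Khmer script is part of Unicode, so well-formed Khmer strings are expected to pass this check, in which case the cause would have to be looked for elsewhere in the pipeline.

```python
def find_unencodable(texts):
    """Return (index, error message) for every string that fails a UTF-8 round trip."""
    failures = []
    for i, text in enumerate(texts):
        try:
            text.encode("utf-8").decode("utf-8")
        except UnicodeError as exc:
            failures.append((i, str(exc)))
    return failures

questions = [
    "FAQ 1? ចូលទៅប្រើប្រាស់កម្មវិធីនេះ?",
    "Another data sample",
    "broken \ud800 surrogate",  # lone surrogate: cannot be encoded as UTF-8
]
failures = find_unencodable(questions)
```

Only rows reported in `failures` would need cleaning; the rest can be converted to bytes with `str.encode("utf-8")` as they are.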


My answer:
My task is to build an FAQ search (similar to the Poem Search demo). This is purely a search task.

In our very first project, we checked similarity by calculating the distance between sentence vectors from a Flair DocumentEmbedding stack. But the results were pretty bad.

So we found this search engine, which uses a neural network, and its results may be better than measuring cosine distance.

Thanks