deepset-ai/haystack-integrations

create external endpoint forwarder for `haystack` nodes


I saw this post about combining Ray and Haystack, which is what I hacked together at my previous job. However, I was also using a normal FastAPI deployment that I would like to reuse for my pipelines and for external inference.

I would argue that initializing an EntityExtractor node with an endpoint would make sense and might be a cool integration, or a general contribution that I would love to work on.

With *args, **kwargs

from haystack.nodes import EntityExtractor

entity_extractor = EntityExtractor(host="localhost", port=8001)

As a class method

from haystack.nodes import EntityExtractor

entity_extractor = EntityExtractor.from_endpoint(host="localhost", port=8001)

As an additional node

from haystack.nodes import EntityExtractorEndpoint

entity_extractor = EntityExtractorEndpoint(host="localhost", port=8001)
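
For illustration, here is a minimal sketch of what the third option could look like as a custom v1 node. The /extract route, payload shape, and metadata handling are assumptions for the example, not an existing API:

import requests

from haystack.nodes.base import BaseComponent


class EntityExtractorEndpoint(BaseComponent):
    """Forwards documents to a remote entity-extraction service over HTTP."""

    outgoing_edges = 1

    def __init__(self, host: str = "localhost", port: int = 8001):
        super().__init__()
        self.url = f"http://{host}:{port}/extract"  # hypothetical route

    def run(self, documents):
        # Ship the document contents to the remote service and attach the
        # returned entities to each document's metadata.
        payload = {"texts": [doc.content for doc in documents]}
        response = requests.post(self.url, json=payload, timeout=30)
        response.raise_for_status()
        for doc, entities in zip(documents, response.json()["entities"]):
            doc.meta["entities"] = entities
        return {"documents": documents}, "output_1"

    def run_batch(self, documents):
        return self.run(documents=documents)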

Hey @davidberenstein1957 - very interesting idea. We've had people build Haystack apps with FastAPI: https://bichuetti.net/nlp-endpoints-haystack-plus-fastapi

But not, as far as I'm aware, a custom node that wraps the functionality into the component itself - which, as far as I can tell, is what you're proposing here, correct?

Is there any reason you think the EntityExtractor specifically is a good candidate for this? The integrations here are, all in all, an index of custom-built components, document stores, and/or external technologies Haystack can work with. So if this is something you would like to build and contribute, we would definitely have it up here.

cc: @masci

@TuanaCelik What makes more sense: an addition to your direct codebase, or an external integration?

masci commented

Hi @davidberenstein1957 and thanks for this proposal! We had a conversation offline with the rest of the team, and it turned out that we each had a different interpretation of what this feature is supposed to be... So I was wondering if you could elaborate a bit more, maybe by describing in plain text how a Haystack user would take advantage of this feature.

Hi @masci, now I am curious what your different interpretations were. 🤓

Below you can find a more abstract outline.

Assume you are using a micro-service architecture with different transformer nodes A, B, C, and D. Let's say we have two Haystack pipelines, AD and ABCD. On top of that, I might want to use all of the transformers outside of Haystack pipelines too, for testing or for other inference that doesn't directly require Haystack pipelines.

To share resources and allow for more dynamic scaling (k8s autoscaling), I would like to deploy each transformer as a separate micro-service outside of the haystack codebase and use/connect them by initializing nodes that make requests to the endpoints of these micro-services.
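
To make this concrete, here is a minimal sketch of one such micro-service, assuming FastAPI and the v1 EntityExtractor; the /extract route and payload shape are made up and mirror the endpoint node sketch above:

from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

from haystack.nodes import EntityExtractor

app = FastAPI()
extractor = EntityExtractor()  # the model loads once per service replica


class ExtractRequest(BaseModel):
    texts: List[str]


@app.post("/extract")  # hypothetical route
def extract(request: ExtractRequest):
    # Running extraction on raw text keeps the service independent of any
    # Haystack pipeline definition, so it can be scaled (e.g. by k8s) on
    # its own; adapt the call below to the actual node API if it differs.
    return {"entities": [extractor.extract(text) for text in request.texts]}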

At my previous company we separated "business" logic from models for inference, and I see Haystack pipeline definitions as logic describing how models should interact. Secondly, I expect I could get away with one embedding model on CPU, but to scale I might want to run multiple QnA models on GPU. Lastly, this would also allow for using, for example, Neural Magic sparsified models or any other more niche way to deploy models.

I hope this helps 👌

@masci, a reminder.

@anakin87 @TuanaCelik do you have any update on this?

@davidberenstein1957 sorry for the latency here!

Thanks for explaining!

Now I am curious what your different interpretations were

Funnily enough, we had two groups of people, each getting half the story right - we should have just combined our thoughts 😄

To share resources and allow for more dynamic scaling (k8s autoscaling), I would like to deploy each transformer as a separate micro-service outside of the haystack codebase and use/connect them by initializing nodes that make requests to the endpoints of these micro-services.

This is a very interesting perspective, and I am happy to test the new pipeline design against something like this:

  • The new pipelines are designed in a way that their components can easily run standalone.
  • The new components API was simplified; we're still missing some pieces (like asyncio support), but one should expect it to be extremely easy to wrap components individually behind an API service of some sort (REST, GraphQL, or even gRPC).
  • We use types to validate component connections within a pipeline; this could easily be leveraged to validate payloads going back and forth to the aforementioned API service (see the sketch after this list).
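
To illustrate the last two points, here is a minimal sketch of a stand-alone component using the decorator-based API (as it later stabilized; the preview package at the time may differ slightly). The component name and logic are made up:

from typing import List

from haystack.preview import component  # preview import path at the time


@component
class KeywordTagger:
    """A toy stand-alone component, usable with or without a pipeline."""

    @component.output_types(keywords=List[str])
    def run(self, text: str):
        # The typed run() signature is exactly the I/O metadata a service
        # wrapper could reuse to validate request and response payloads.
        return {"keywords": [word for word in text.split() if word.istitle()]}


# Usable standalone, no pipeline graph required:
result = KeywordTagger().run(text="Haystack meets Ray Serve")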

With that said, one could imagine Haystack having a different kind of pipeline that would still follow the connection graph but, instead of calling run on components that were previously loaded in memory, would call some sort of GET http://component_foo.local/run endpoint, get the result, and pass it to the next. Ideally, such a pipeline should be able to run "hybrid" graphs where some of the connected components are "remote".

Something I feel quite strongly about: I wouldn't let Haystack orchestrate the components. That's too hard to get right, with a concrete risk of being too opinionated and hence not flexible enough. What Haystack can do out of the box to support such a use case is provide an easy-to-use service API wrapper for components and a pipeline capable of running workflows like these.

Happy to keep the ball rolling and clarify any of the points above!

@masci Don't worry about the latency.

So, from your side, it does make sense to implement something like this?

I assume the new pipeline design you are describing is for v2?

With that said, one could imagine Haystack having a different kind of pipeline that would still follow the connection graph but, instead of calling run on components that were previously loaded in memory, would call some sort of GET http://component_foo.local/run endpoint, get the result, and pass it to the next. Ideally, such a pipeline should be able to run "hybrid" graphs where some of the connected components are "remote".

I agree "remote" components that where and how can I start working on them. I think the API design above gives some starting point but I am unsure if this aligns with v2.

Something I feel quite strongly about: I wouldn't let Haystack orchestrate the components. That's too hard to get right, with a concrete risk of being too opinionated and hence not flexible enough. What Haystack can do out of the box to support such a use case is provide an easy-to-use service API wrapper for components and a pipeline capable of running workflows like these.

I agree - for me, it makes sense to create an API implementation without a specific focus on anything related to orchestration. I will start working based on the assumption that orchestration would already be handled perfectly.

Arf, thanks for the reminder.

I was thinking about this yesterday; here's what I would do. Apologies in advance for mentioning concepts and code that are still under the preview package.

Say you have a simple pipeline with 3 components: you call pipeline.run(), passing a Query to some sort of retriever, use the result to build an LLM prompt, and finally pass the prompt to a local model:

┌────────┐                
│ Query  │─────┐          
└────────┘     ▼          
     ┌───────────────────┐
     │                   │
     │     Retriever     │
     │                   │
     │                   │
     └───────────────────┘
               │          
               │          
               ▼          
     ┌───────────────────┐
     │                   │
     │   PromptBuilder   │
     │                   │
     │                   │
     └───────────────────┘
               │          
               │          
               ▼          
     ┌───────────────────┐
     │                   │
     │ HFLocalGenerator  │
     │                   │
     │                   │
     └───────────────────┘

pipeline.run() returns the response.

Now say you "Ray Serve" an instance of HFLocalGenerator (remember, with the new design all components are supposed to be usable stand-alone, outside a pipeline graph). It's safe to assume that Ray (or FastAPI, or a serverless function, and so on) would require a request payload compatible with the component's inputs and return a response payload compatible with the component's outputs.
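
For example, here is a rough sketch of serving the generator with Ray Serve; the import path for HFLocalGenerator is an assumption, and the payload convention simply mirrors the component's run() inputs and outputs:

from ray import serve
from starlette.requests import Request

# Import path is an assumption; HFLocalGenerator is the component named above.
from haystack.preview.components.generators import HFLocalGenerator


@serve.deployment
class GeneratorService:
    """Serves a single stand-alone component behind HTTP via Ray Serve."""

    def __init__(self):
        self.generator = HFLocalGenerator()

    async def __call__(self, request: Request) -> dict:
        # Request payload = the component's run() inputs;
        # response payload = the component's run() outputs.
        payload = await request.json()
        return self.generator.run(**payload)


serve.run(GeneratorService.bind())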

Since components have metadata describing their inputs and outputs (we use it to validate the connections in a pipeline, so that you don't connect an integer output to a string input of the subsequent node), I was thinking about a RemoteClient component that would work like this: you pass it the component you want to wrap (in this case, let's say HFLocalGenerator), and in its run() method, instead of executing Python code, it would make HTTP calls, packing the request/response payloads according to the wrapped component's metadata. The above pipeline would then look like this:

┌────────┐                                                       
│ Query  │─────┐                                                 
└────────┘     ▼                                                 
     ┌───────────────────┐                                       
     │                   │                                       
     │     Retriever     │                                       
     │                   │                                       
     │                   │                                       
     └───────────────────┘                                       
               │                                                 
               │                                                 
               ▼                                                 
     ┌───────────────────┐                                       
     │                   │                                       
     │   PromptBuilder   │                                       
     │                   │                                       
     │                   │                                       
     └───────────────────┘                                       
               │                                                 
               │                                                 
               ▼                                                 
     ┌───────────────────┐                  ┌───────────────────┐
     │                   │                  │                   │
     │RemoteClient(HFLoca│                  │ HFLocalGenerator  │
     │   lGenerator())   │ ◀ ─ ─HTTP ─ ─ ─ ▶│                   │
     │                   │                  │                   │
     └───────────────────┘                  └───────────────────┘

The stand-alone HFLocalGenerator could be used by any HTTP client of course, not just from the pipeline.

Note that the current pipeline execution is synchronous, so this RemoteClient would block until the standalone component finishes, just as happens today when we use remote inference endpoints like OpenAI - I'm not sure how bad this limitation would be for this use case.
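
To make the idea concrete, here is a rough sketch of such a RemoteClient; it glosses over how the wrapped component's I/O metadata would be copied over, so names and details here are illustrative, not a real API:

import requests


class RemoteClient:
    """Sketch only: run() forwards its inputs over HTTP instead of executing locally."""

    def __init__(self, wrapped_component, url: str):
        # A full implementation would copy the wrapped component's input/output
        # metadata here so pipeline connection validation keeps working.
        self.wrapped = wrapped_component
        self.url = url  # e.g. "http://component_foo.local/run"

    def run(self, **inputs):
        # Pack the run() inputs as the request payload; the JSON response is
        # returned as this component's outputs. This blocks until the remote
        # component finishes, matching today's synchronous pipeline execution.
        response = requests.post(self.url, json=inputs, timeout=60)
        response.raise_for_status()
        return response.json()

A hybrid pipeline could then connect RemoteClient(HFLocalGenerator(), "http://component_foo.local/run") wherever HFLocalGenerator would normally sit.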

@masci I've not forgotten this and would like to stay involved, but I'll be a bit busy in the coming weeks. Thanks for the context and the outline. Do you have example code for components I might use to outline the RemoteClient? W.r.t. the RemoteClient: I think it, and the custom nodes that can be wrapped by it, are great ideas. This would also potentially allow a lot more Haystack integrations via these kinds of nodes, for packages that want a cheap out-of-the-box integration.