ahyatt/llm

Errors when function calling with Claude


In a scratch buffer, evaluating

(let ((provider (make-llm-claude
                 :key (exec-path-from-shell-getenv "CLAUDE_API_KEY")
                 :chat-model "claude-3-5-sonnet-20240620")))
  (llm-tester-function-calling-conversation-sync provider))

(adapt the :key field suitably) yields a debugger error that begins:

Debugger entered--Lisp error: (error "LLM request failed with code 400: Bad Request (additional information: ((type . error) (error (type . invalid_request_error) (message . messages.1.tool_use_id: Extra inputs are not permitted))))")
error("LLM request failed with code %d: %s (additional information: %s)" 400 "Bad Request" ((type . "error") (error (type . "invalid_request_error") (message . "messages.1.tool_use_id: Extra inputs are not permitted"))))

Evaluating

(let ((provider (make-llm-claude
                 :key (exec-path-from-shell-getenv "CLAUDE_API_KEY")
                 :chat-model "claude-3-5-sonnet-20240620")))
  (llm-tester-function-calling-conversation-async provider))

yields the following in *llm-tester*:

FAILURE: async function calling conversation for llm-claude, error of type error received: Error invalid_request_error: 'messages.1.tool_use_id: Extra inputs are not permitted'

Are these known issues?

I encountered a different error "invalid_request_error: 'messages.1.content: Input should be a valid list'" when trying to return function calls to Claude in my own code, but figured I'd start by trying to understand the built-in tests.

It hasn't happened before in testing, but APIs often change slightly (or it's possible that some error prevented the Claude tests I usually run from ever reaching this test, in a way I never noticed).

I can reproduce your clear example, though, so let me look into it. Thanks for the report!

The error you reported is trivial to fix, but I now realize why I never ran this test for Claude: what Claude expects to happen for conversations is very different from what everyone else expects. In particular, with OpenAI and others you are expected to send the function call result back to the LLM for a further textual response, but Claude does this in one step, so no extra call is necessary. It will take some experimenting to find an elegant solution for this, if one exists.

Thanks for taking a look and sharing your thoughts on this.

I wasn't able to understand what you mean when you say "Claude does this in one step". Step 3 in https://docs.anthropic.com/en/docs/build-with-claude/tool-use seems to consist of sending the result back to Claude, albeit in the form of a new user block rather than something else.

Anyway, happy to help brainstorm or experiment on this.

When I say "Claude does this in one step": normally you get the function call results as a special return value from the LLM, and then you have to call back into the LLM to get text that uses those results. Claude, by contrast, returns text (what it calls chain-of-thought) accompanying the function call itself, so it doesn't need or want the results sent back without further user input.
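To make the contrast concrete, here is a minimal sketch using the library's llm-chat API (the return values shown in comments are illustrative, not actual outputs):

```elisp
;; OpenAI-style providers: the first call returns the function call
;; results; a second call on the same prompt sends them back to the
;; LLM and yields the textual answer.
(let ((result (llm-chat provider prompt)))  ; e.g. ((add . 8))
  (llm-chat provider prompt))               ; e.g. "3 + 5 is 8."

;; Claude (as described above): the text accompanying the function
;; call arrives with the first response, so no second call is wanted.
(llm-chat provider prompt)
```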

I have a new branch with a fix. Can you take a look and give feedback? I suspect there's something better we can do for function-calling conversations, since all the clients will basically have to implement the same logic, but I haven't hit upon a good solution yet. At least the solution in the branch is good enough to fix this issue.

Thanks. I tried testing it out, but I'm actually still getting the same error I alluded to in my first message here, which is apparently unrelated to the tests. Or maybe I'm just using it wrong. Here's what I'm trying:

(let* ((use-claude nil)
       (provider
        (if use-claude
            (make-llm-claude
             :key (exec-path-from-shell-getenv "CLAUDE_API_KEY")
             :chat-model "claude-3-5-sonnet-20240620")
          (make-llm-openai
           :key (exec-path-from-shell-getenv "OPENAI_API_KEY")
           :chat-model "gpt-4o")))
       (prompt
        (llm-make-chat-prompt
         "Compute 3+5."
         :temperature 0.1
         :functions
         (list (make-llm-function-call
                :function (lambda (a b)
                            (+ a b))
                :name "add"
                :description "Sums two numbers."
                :args (list (make-llm-function-arg
                             :name "a"
                             :description "A number."
                             :type 'integer
                             :required t)
                            (make-llm-function-arg
                             :name "b"
                             :description "A number."
                             :type 'integer
                             :required t))))))
       (response-1 (llm-chat provider prompt)))
  (message "Response 1: %s" response-1)
  (let ((response-2 (llm-chat provider prompt)))
    (message "Response 2: %s" response-2)))

Evaluating this in a scratch buffer produces the expected output:

Response 1: ((add . 8))
Response 2: The result of 3 + 5 is 8.

But if we change use-claude from nil to t, then the second call to llm-chat produces the following backtrace:

Debugger entered--Lisp error: (error "LLM request failed with code 400: Bad Request (additional information: ((type . error) (error (type . invalid_request_error) (message . messages.1.content: Input should be a valid list))))")
  error("LLM request failed with code %d: %s (additional information: %s)" 400 "Bad Request" ((type . "error") (error (type . "invalid_request_error") (message . "messages.1.content: Input should be a valid list"))))
  llm-request-plz-sync-raw-output("https://api.anthropic.com/v1/messages" :headers (("x-api-key" . "sk-ant-XXX") ("anthropic-version" . "2023-06-01") ("anthropic-beta" . "tools-2024-04-04")) :data (("temperature" . 0.1) ("tools" (("name" . "add") ("description" . "Sums two numbers.") ("input_schema" (type . object) (properties ("a" (type . integer) (description . "A number.")) ("b" (type . integer) (description . "A number."))) (required "a" "b")))) ("model" . "claude-3-5-sonnet-20240620") ("stream" . :json-false) ("max_tokens" . 4096) ("messages" (("role" . user) ("content" . "Compute 3+5.")) (("role" . assistant) ("content" . 8)))) :timeout nil)
  ...

I'd be happy to try debugging this further, but thought I'd check that I'm not doing something wrong.

Interesting, thanks for sharing the error. This is a different error than the first one (note the error you're getting back from Claude). It looks like your code is correct, and I'm testing Claude in a very similar way right now in the integration tests I just added (see llm-integration-test.el) in main (I'll merge main back into this branch now). If you would like to look into this, that would be helpful, but I'll take a look later today.

I tried your example out yesterday on the claude-fc branch, but I was unable to reproduce your error. I'll try again, but I'd like to create a pull request soon, merge it into main, and put out a release within the next few days.

I tried it with that branch and the error persisted. I wasn't quickly able to diagnose it; will send details later.

Yes, reproduced again just now. The main difference between the two cases is in the interactions field of the value of prompt before the final call to llm-chat, which for OpenAI is given by

(#s(llm-chat-prompt-interaction user "Compute 3+5."
                                nil)
 #s(llm-chat-prompt-interaction assistant
                                (((id
                                   . "call_HmW4jbtMrAcGPmW9mnVFw5nv")
                                  (function
                                   (name . "add")
                                   (arguments
                                    . "{\"a\":3,\"b\":5}"))))
                                nil)
 #s(llm-chat-prompt-interaction function 8
                                #s(llm-chat-prompt-function-call-result
                                   "call_HmW4jbtMrAcGPmW9mnVFw5nv"
                                   "add" 8)))

and for Claude by

(#s(llm-chat-prompt-interaction user "Compute 3+5."
                                nil)
 #s(llm-chat-prompt-interaction assistant 8
                                #s(llm-chat-prompt-function-call-result
                                   "toolu_012hyPgAZMwxLz7BZYsSb6YW"
                                   "add" 8)))

It's mentioned in llm-claude.el that there is no function role for Claude, which explains why assistant is used instead. Anyway, from this we get the API error I mentioned in my earlier message. Happy to try anything else you might suggest, but I'm traveling this week and probably won't get a chance to debug properly until next weekend.

Looking again, I see I missed the second call you made (so I only tried to reproduce the error in the first call, which isn't a problem). Yes, what you are trying to do is not correct: for Claude, you can't call it twice in a row, so you treat the function call like a normal response and then append a user response. This is why I introduced a new function, llm-should-send-function-result-back: if it returns nil, as it does for Claude, you shouldn't be making that second call.

Thanks. In my last example, adding (llm-chat-prompt-append-response prompt "Please continue the conversation, using the function call result.") between setting response-1 and response-2 produces the same error, so I don't understand how to have a back-and-forth with claude where the result of a function call is processed without encountering this error.

To focus the issue further, here's some code that seems equivalent to llm-tester-function-calling-conversation-sync, except in the specifics of the function call. This produces the same error. Is this code still incorrect?

(let* ((provider (make-llm-claude
                  :key (exec-path-from-shell-getenv "CLAUDE_API_KEY")
                  :chat-model "claude-3-5-sonnet-20240620"))
       (prompt
        (llm-make-chat-prompt
         "Compute 3+5."
         :temperature 0.1
         :functions
         (list (make-llm-function-call
                :function (lambda (a b)
                            (+ a b))
                :name "add"
                :description "Sums two numbers."
                :args (list (make-llm-function-arg
                             :name "a"
                             :description "A number."
                             :type 'integer
                             :required t)
                            (make-llm-function-arg
                             :name "b"
                             :description "A number."
                             :type 'integer
                             :required t))))))
       (responses nil))
  (push (llm-chat provider prompt) responses)
  (when (llm-should-send-function-result-back provider)
    (push (llm-chat provider prompt) responses))
  (llm-chat-prompt-append-response prompt "Please continue the conversation using the result of the function call.")
  (push (llm-chat provider prompt) responses)
  (when (llm-should-send-function-result-back provider)
    (push (llm-chat provider prompt) responses))
  (llm-tester-log "SUCCESS: Provider %s had a function conversation and got results %s"
                  (type-of provider)
                  (nreverse responses)))

Thanks! I can reproduce this. I think the issue is that we send the result back as an integer in this case, not a string, which Claude can't parse. But we shouldn't be sending the raw result back; we should be sending something closer to a representation of what Claude sent us. According to the docs, the user should send back the result of execution, and can do so in a variety of ways. It probably isn't necessary to match the format exactly, but I'll have to experiment and see what works well.
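For reference, the shape the Anthropic tool-use docs describe is roughly the following, shown in the same alist form that appears in the backtrace earlier in this thread (the tool_use_id is copied from that trace and is illustrative):

```elisp
;; A well-formed follow-up user message per the Anthropic tool-use
;; docs: "content" must be a list of blocks, and the execution
;; result is sent as a string inside a tool_result block.
(("role" . "user")
 ("content" . ((("type" . "tool_result")
                ("tool_use_id" . "toolu_012hyPgAZMwxLz7BZYsSb6YW")
                ("content" . "8")))))
```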

I've reworked my solution: we no longer need llm-should-send-function-result-back. For Claude, the client should always send the results back as well, but we'll do it as a user response, which is what the Claude documentation says should happen. I also fixed the stringification issue I mentioned before.
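With that rework, a function-calling conversation should look the same for every provider; a hedged sketch of the intended client-side flow:

```elisp
;; Same two-call pattern for every provider: for Claude, the library
;; now sends the function results back as a user response internally.
(let* ((results (llm-chat provider prompt))   ; function call results
       (answer  (llm-chat provider prompt)))  ; textual follow-up
  (message "Results: %s, answer: %s" results answer))
```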

Solution sounds good to me. Here's a related bug (please let me know if you'd prefer that I open a fresh issue for such things):

(let* ((use-claude t)
       (two-questions t)
       (provider
        (if use-claude
            (make-llm-claude
             :key (exec-path-from-shell-getenv "ANTHROPIC_KEY")
             :chat-model "claude-3-5-sonnet-20240620")
          (make-llm-openai
           :key (exec-path-from-shell-getenv "OPENAI_KEY")
           :chat-model "gpt-4o")))
       (prompt
        (llm-make-chat-prompt
         (if two-questions "Compute 2+3 and 4+5." "Compute 2+3.")
         :temperature 0.1
         :functions
         (list (make-llm-function-call
                :function (lambda (a b)
                            (+ a b))
                :name "add"
                :description "Sums two numbers."
                :args (list (make-llm-function-arg
                             :name "a"
                             :description "A number."
                             :type 'integer
                             :required t)
                            (make-llm-function-arg
                             :name "b"
                             :description "A number."
                             :type 'integer
                             :required t))))))
       (responses nil))
  (push (llm-chat provider prompt) responses)
  (push (llm-chat provider prompt) responses))

This fails on the latest main with the error

"LLM request failed with code 400: Bad Request (additional information: ((type . error) (error (type . invalid_request_error) (message . messages.2: the following `tool_use` ids were not found in `tool_result` blocks: {'toolu_01XUFLBrcUZbLTTe6gFZEfgH'}))))"

but works fine if I set either use-claude or two-questions to nil.
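For what it's worth, the error text suggests that every tool_use id Claude emits must be answered by a matching tool_result block in the single follow-up user message. A sketch in the same alist form as before (the first id is taken from the error above; the second id and both results are illustrative):

```elisp
;; One tool_result block per tool_use id, all in one user message.
(("role" . "user")
 ("content" . ((("type" . "tool_result")
                ("tool_use_id" . "toolu_01XUFLBrcUZbLTTe6gFZEfgH")
                ("content" . "5"))
               (("type" . "tool_result")
                ("tool_use_id" . "toolu_01-second-id-illustrative")
                ("content" . "9")))))
```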

Can you open a new bug for this one? Yes, it's a bug, but multiple function calls are unsupported by many providers, so fixing this is probably low priority.

Sure, done.

Regarding your comment, I see this as a breaking bug for function calling with Claude, since we have no direct control over when Claude decides to attempt multiple function calls. Happy to help debug when I get the chance.