1989Ryan/llm-mcts

Is this article essentially prompt engineering?

hxypqr opened this issue · 3 comments

The multiplication, VirtualHome path planning, and shortest-journey problem solutions mentioned in the article all seem to involve passing the output of the LLM to specially designed calling functions in specific scenarios, while MCTS appears to act more as a buffer layer between the LLM and the calling functions. Please correct me if I am wrong; thank you for your time in advance.

First, thank you for raising your question!

This paper is not about prompt engineering. Our core idea is that you can use different prompts to make the LLM act as different modules in a compositional planning algorithm (MCTS, in our case). MCTS is not just a buffer layer; it is a search algorithm that performs lookahead search in a world model (a simulation of the domain) and anticipates the consequences of different actions. Sometimes this search can be very computationally inefficient due to the complexity of the problem, while the LLM has commonsense knowledge that can supplement the information MCTS needs and accelerate the search. The algorithm integrates information from the LLM, which plays different roles in the solution (world model and commonsense policy), not just a buffer layer.
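For intuition only, here is a rough sketch (not the actual code in this repo) of what "LLM as different modules inside MCTS" can look like: the LLM policy supplies action priors used during selection, and the LLM world model supplies simulated transitions during lookahead. The names `llm_policy`, `llm_world_model`, and `reward_fn` are placeholders for illustration.

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}   # action -> Node
        self.visits = 0
        self.value = 0.0

def uct_select(node, priors, c=1.0):
    """Pick the child maximising estimated value plus an exploration bonus weighted by the LLM prior."""
    def score(action, child):
        q = child.value / (child.visits + 1e-8)
        u = c * priors.get(action, 0.0) * math.sqrt(node.visits) / (1 + child.visits)
        return q + u
    return max(node.children.items(), key=lambda kv: score(*kv))

def mcts_simulation(root, llm_policy, llm_world_model, reward_fn, depth=10):
    # Selection / expansion: the LLM acts as a commonsense policy, proposing
    # priors over candidate actions at every node.
    node = root
    for _ in range(depth):
        priors = llm_policy(node.state)          # e.g. prompted: "what should the agent do next?"
        for action in priors:
            if action not in node.children:
                # Expansion: the LLM acts as a world model, predicting the next state.
                node.children[action] = Node(llm_world_model(node.state, action), parent=node)
        _, node = uct_select(node, priors)
    # Backpropagation: the outcome reward is propagated back to the root.
    value = reward_fn(node.state)
    while node is not None:
        node.visits += 1
        node.value += value
        node = node.parent
```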

To explore when and why this strategy helps, we compare using the LLM purely as a policy against using it to support a compositional algorithm, as we do. We evaluate planning, shortest journey, and multiplication, analyse the sample complexity of each strategy, and draw an empirical conclusion: choose the strategy whose sample complexity is lower.

Thank you for your answer.
Sorry, I did not explain what I meant by the buffer layer. What I actually want to express is that, due to the architecture of the LLM, direct RAG runs into the problem that the knowledge base and the LLM cannot be well coupled. More strictly speaking, the problem lies in ranking the recalled results (an LLM that was not trained on newly injected knowledge does not know how to interpret it); put differently, the LLM cannot learn through the prompt alone how the input knowledge base should be embedded into its latent space. With MCTS as a buffer, first, the instability caused by long-range reasoning in latent space over long text inputs can be greatly reduced. Second, adding a Gaussian noise term to the model's policy output and continuously improving the model through feedback signals and iterated outputs can enhance the performance of the model itself. It can even be proven that, as long as the training process is long enough, the shortest-path problem on a graph with a finite number of vertices and edges will eventually be solved optimally. Is my understanding correct?
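To make the second point concrete, here is a rough, hypothetical sketch of what I mean by perturbing the policy output (just my own illustration, not code from the paper):

```python
import numpy as np

def noisy_priors(action_scores, sigma=0.5, rng=None):
    """Perturb the LLM's raw action scores with Gaussian noise, then renormalise."""
    rng = rng or np.random.default_rng()
    noisy = np.asarray(action_scores, dtype=float) + rng.normal(0.0, sigma, size=len(action_scores))
    exp = np.exp(noisy - noisy.max())   # softmax with numerical stabilisation
    return exp / exp.sum()

# Three candidate actions scored by the LLM; the noise keeps the search exploring,
# and feedback from search outcomes would then be used to improve the policy.
print(noisy_priors([2.0, 0.5, -1.0]))
```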
I'm not sure whether my understanding is correct, but I'm also wondering whether there is any room for improvement in this approach when the graph is sparsely connected, i.e., when the signals obtained by MCTS are very weak, for example when using the LLM for mathematical reasoning tasks. Thank you for your time in advance.

That is a great insight! I think improvements could be made in reward shaping. As it stands, our reward signal is only received once the outcome of the planning/reasoning is known, while the intermediate steps lack feedback. A potential improvement is to provide denser rewards, for example via divide and conquer: decompose a problem into subgoals and give a reward for each subgoal reached. The LLM's commonsense policy helps in this regard, but it would work better with a denser reward. For sparse graph search, it is hard to optimise algorithmically without additional information. We assume the LLM can provide that additional information (subgoal decomposition, heuristics, etc.) so that we can make the search more efficient.
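As a concrete illustration, here is a minimal sketch of the reward-shaping idea under the assumption that the LLM can propose subgoals; `goal_reached` and `subgoals` are placeholders for illustration, not code from this repo.

```python
def sparse_reward(state, goal, goal_reached):
    # Feedback only when the final outcome is reached.
    return 1.0 if goal_reached(state, goal) else 0.0

def shaped_reward(state, goal, subgoals, goal_reached, per_subgoal=0.2):
    # Denser feedback: partial credit for every subgoal achieved so far
    # (e.g. subgoals proposed by the LLM), plus the full reward at the goal.
    reward = sum(per_subgoal for sg in subgoals if goal_reached(state, sg))
    if goal_reached(state, goal):
        reward += 1.0
    return reward
```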