What is the great fear and concern about Q*? That it can teach itself math tabula rasa (Latin for "blank slate")… with enough compute… in the foreseeable future… due to exponential development… in an Intelligence Explosion… an AI that teaches itself math could find a way to break encryption. The financial system, critical infrastructure, everything digital would then be vulnerable to hacking and to a loss of control to an AGI.
What is Q*? My first approach
The beginning
Logical thinking and self-learning are capabilities that many thinkers consider prerequisites for artificial general intelligence (although there is still no standard definition of AGI; Google offers one of the first approaches to a definition [3]). An AGI therefore requires absolute correctness in its output in order to be applied to all (human) processes, something that OpenAI itself repeatedly emphasizes in blog posts: "In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still produce logical errors, often called _hallucinations_. Mitigating hallucinations is a critical step towards building aligned AGI." Sam Altman put it even more concretely in a video.
In addition, AGI requires general knowledge at expert level in all areas of knowledge (one must not forget the "G" in AGI: generalization). In this respect, the breakthrough at OpenAI reported by Reuters and The Information seemed to be a key step on the road to AGI, which may have frightened many. Content creators such as "AI Explained" and "Matthew Berman" have made excellent videos about this, which I can also highly recommend.
Reuters wrote at the time:
“Nov 22 (Reuters) – Ahead of OpenAI CEO Sam Altman’s four days in exile, several staff researchers wrote a letter to the board of directors warning of a powerful artificial intelligence discovery that they said could threaten humanity, two people familiar with the matter told Reuters. (…) Some at OpenAI believe Q* (pronounced Q-Star) could be a breakthrough in the startup’s search for what’s known as artificial general intelligence (AGI), one of the people told Reuters. OpenAI defines AGI as autonomous systems that surpass humans in most economically valuable tasks. (…) Given vast computing resources, the new model was able to solve certain mathematical problems, the person said on condition of anonymity because the individual was not authorized to speak on behalf of the company. Though only performing math on the level of grade-school students, acing such tests made researchers very optimistic about Q*’s future success, the source said. (…) Researchers consider math to be a frontier of generative AI development. Currently, generative AI is good at writing and language translation by statistically predicting the next word, and answers to the same question can vary widely. But conquering the ability to do math — where there is only one right answer — implies AI would have greater reasoning capabilities resembling human intelligence. This could be applied to novel scientific research, for instance, AI researchers believe. (…) “Four times now in the history of OpenAI, the most recent time was just in the last couple weeks, I’ve gotten to be in the room, when we sort of push the veil of ignorance back and the frontier of discovery forward, and getting to do that is the professional honor of a lifetime,” he said at the Asia-Pacific Economic Cooperation summit.”
The Information reported similarly at the time [4].
Iterative solution finding through process subdivision
System 2 thinking is explored in more detail in a research paper from OpenAI ("Let's Verify Step by Step" [6], published by Ilya Sutskever and Jan Leike (ex-OpenAI), among others). A similar idea is already used today in prompting, by telling a model to "think step by step" or to "divide the task into subsections", which is of course only a superficial attempt at System 2 thinking, since the model is not designed for this in terms of its architecture ("Prompting techniques like "take a deep breath" and "think step by step" are now expanding into advanced methods for inference with parallel computation and heuristics (some fundamentals of search)."). One part of the paper and its conclusions is the so-called "Process Reward Model" (PRM, see below). In principle, this is an evaluation of the individual process steps: instead of evaluating the result as a whole, points are awarded for each reasoning step.
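As a rough illustration of the difference between an outcome reward model (which scores only the final answer) and a process reward model (which scores every step), here is a toy Python sketch. The scoring functions are dummy stand-ins, and aggregating step scores by taking the minimum is just one reasonable choice; none of this is OpenAI's actual implementation.

```python
from typing import List

def toy_step_score(step: str) -> float:
    """Stand-in for a learned per-step verifier: here we simply pretend
    that a step containing an obvious arithmetic error scores low."""
    return 0.1 if "2 + 2 = 5" in step else 0.9

def prm_score(steps: List[str]) -> float:
    """Process-level score: aggregate per-step scores. Taking the minimum
    means a single bad reasoning step sinks the whole solution."""
    return min(toy_step_score(s) for s in steps)

def orm_score(final_answer: str) -> float:
    """Outcome-level score: judge only the final answer."""
    return 0.9 if final_answer.strip() == "4" else 0.1

solution = [
    "We need to compute 2 + 2.",
    "First, 2 + 2 = 5.",                          # faulty intermediate step
    "Correcting ourselves, the final answer is 4.",
]
print("outcome score:", orm_score("4"))           # high, although the reasoning was flawed
print("process score:", prm_score(solution))      # low, because one step was wrong
```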
"This allows finer-tuned generation with reasoning problems, by sampling over the maximum average reward or other metrics, instead of just relying on one score (standard RMs are called outcome RMs in this literature). Using Best-of-N sampling, essentially generating a bunch of times and using the one that scored the highest by the reward model (the inference time cousin of Rejection Sampling popularized with Llama 2), PRMs outperform standard RMs on reasoning tasks." (ibid.)

This method is also supported by the so-called "Tree of Thoughts": the paper "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" presents a framework (ToT) that builds on large language models and improves their problem-solving capabilities through structured, planned decision-making [7]. In contrast to the traditional Chain of Thought (CoT) method, which relies on a single sequential chain of decisions, ToT enables the simultaneous exploration of multiple thoughts and the evaluation of these paths to solve problems more effectively. The ToT framework consists of four main components:

1. Thought decomposition: breaking the problem down into smaller, manageable steps (thoughts).
2. Thought generation: proposing candidates for the next thought step.
3. State evaluation: heuristically evaluating the progress of different thought paths.
4. Search algorithm: systematically exploring the thought tree with algorithms such as breadth-first search (BFS) or depth-first search (DFS).

In experiments on tasks such as the "Game of 24", creative writing and mini crossword puzzles, ToT showed significant improvements over conventional methods. For example, ToT achieved a success rate of 74% in the "Game of 24", while the CoT method only achieved 4%. So we see here too that planned, structured, stepwise decision-making is essential for the accuracy of the solution. "The innovations that make this click are the chunking of reasoning steps and prompting a model to create new reasoning steps. ToT seems like the first "recursive" prompting technique for improving inference performance, which sounds remarkably close to the AI Safety concern of recursively self-improving models (though I am not an expert)."
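To make the ToT loop concrete, here is a minimal breadth-first sketch in Python. The thought generator and state evaluator below are toy stand-ins (in the paper, an LLM plays both roles); what matters is the structure: propose candidate next thoughts, heuristically score each partial path, keep the best few, and repeat.

```python
from typing import List

def generate_thoughts(path: List[str], k: int = 3) -> List[str]:
    """Stand-in for thought generation: propose k candidate next steps."""
    last = path[-1] if path else "start"
    return [f"{last} -> option {i}" for i in range(k)]

def evaluate_state(path: List[str]) -> float:
    """Stand-in for state evaluation: a real ToT system would ask an LLM
    how promising this partial line of reasoning looks."""
    return -float(len(path[-1]))                 # dummy heuristic

def tree_of_thoughts_bfs(depth: int = 3, breadth: int = 2, k: int = 3) -> List[str]:
    frontier: List[List[str]] = [[]]             # paths kept at the current level
    for _ in range(depth):                       # thought decomposition: a fixed number of steps
        candidates = [path + [thought]
                      for path in frontier
                      for thought in generate_thoughts(path, k)]   # thought generation
        candidates.sort(key=evaluate_state, reverse=True)          # state evaluation
        frontier = candidates[:breadth]          # BFS-style pruning: keep only the b best paths
    return frontier[0]                           # best complete path found

print(tree_of_thoughts_bfs())
```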
The paper itself frames the problem as follows: "Large language models are capable of solving tasks that require complex multistep reasoning by generating solutions in a step-by-step chain-of-thought format (Nye et al., 2021; Wei et al., 2022; Kojima et al., 2022). However, even state-of-the-art models are prone to producing falsehoods — they exhibit a tendency to invent facts in moments of uncertainty (Bubeck et al., 2023). These hallucinations (Maynez et al., 2020) are particularly problematic in domains that require multi-step reasoning, since a single logical error is enough to derail a much larger solution. Detecting and mitigating hallucinations is essential to improve reasoning capabilities."

The paper states that process-supervised models perform better at solving complex mathematical problems. Process supervision evaluates each intermediate step, much like A* evaluates each node expansion. This "chain of thought" under process supervision resembles Kahneman's System 2 thinking in that it represents deliberate reasoning that evaluates each logical step. We can therefore see that System 2 thinking, i.e. thinking in process steps, not only leads to more precise results, but is also an essential component in solving complex tasks. There are various ways to implement it. PRM could be part of how Q* finds solutions, as it originates from OpenAI's own research, and ToT presumably as well. Unfortunately, a more precise classification is not yet possible and cannot be derived from the available sources.

Q* is probably a combination of Q-learning and A* search. OpenAI's Q* algorithm is considered a breakthrough in AI research, particularly in the development of AI systems with human-like reasoning capabilities. Q* combines elements of Q-learning and A* (A-star) search, which improves goal-oriented reasoning and solution finding. The algorithm reportedly shows impressive capabilities in solving complex mathematical problems (without prior training data) and symbolizes an evolution towards artificial general intelligence (AGI). It is, presumably, a fusion of Q-learning and A* search (as others also suggest).
It is based on the idea of self-learning and predictive planning. "Self-play is the idea that an agent can improve its gameplay by playing against slightly different versions of itself because it'll progressively encounter more challenging situations. In the space of LLMs, it is almost certain that the largest portion of self-play will look like AI Feedback rather than competitive processes." "Look-ahead planning is the idea of using a model of the world to reason into the future and produce better actions or outputs. The two variants are based on Model Predictive Control (MPC), which is often used on continuous states, and Monte-Carlo Tree Search (MCTS), which works with discrete actions and states."

What is Q-learning? Different theories

Theory 1: "Q-learning is a type of reinforcement learning, a method where AI learns to make decisions by trial and error. In Q-learning, an agent learns to make decisions by estimating the "quality" of action-state combinations. **The difference between this approach and OpenAI's current approach—known as Reinforcement Learning Through Human Feedback or RLHF—is that it does not rely on human interaction and does everything on its own**. Imagine a robot navigating a maze. With Q-learning, it learns to find the quickest path to the exit by trying different routes, receiving positive rewards set by its own design when it moves closer to the exit and negative rewards when it hits a dead end. Over time, through trial and error, the robot develops a strategy (a "Q-table") that tells it the best action to take from each position in the maze. This process is autonomous, relying on the robot's interactions with its environment. (…) **In Q-learning, Q* represents the desired state in which an agent knows exactly the best action to take in every state to maximize its total expected reward over time**. In math terms, it satisfies the Bellman Equation."

Theory 2 (the Q* algorithm from MRPPS): "One way to explain the process is to consider the fictitious detective Sherlock Holmes trying to solve a complex case. He gathers clues (semantic information) and connects them logically (syntactic information) to reach a conclusion. The Q* algorithm works similarly in AI, combining semantic and syntactic information to navigate complex problem-solving processes. This would imply that OpenAI is one step closer to having a model capable of understanding its reality beyond mere text prompts and more in line with the fictional J.A.R.V.I.S. (for GenZers) or the Bat Computer (for boomers). So, while Q-learning is about teaching AI to learn from interaction with its environment, the Q* algorithm is more about improving AI's deductive capabilities. Understanding these distinctions is key to appreciating the potential implications of OpenAI's "Q*." Both hold immense potential in advancing AI, but their applications and implications vary significantly."

Of course, we don't know what is actually relevant in Q*. However, I clearly lean towards Theory 1, because it is more consistent with the papers OpenAI has already published.
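For completeness, here is the standard reinforcement-learning math behind Theory 1: the optimal action-value function Q* satisfies the Bellman optimality equation, and the tabular Q-learning update rule converges towards it (α is the learning rate, γ the discount factor):

```latex
% Bellman optimality equation for the optimal action-value function Q*
Q^{*}(s,a) = \mathbb{E}\left[\, r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \;\middle|\; s_t = s,\ a_t = a \right]

% Tabular Q-learning update rule that converges towards Q*
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```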
What is A* search?
A* search is a method for finding an optimal path from a start state to a goal state. It uses a heuristic function to estimate the remaining cost and guide the search towards the best path, and it guarantees that the solution found is optimal if the heuristic is admissible (i.e. never overestimates the true cost). In short, the algorithm finds the shortest or cheapest solution (given an admissible heuristic), is applicable to many different problems (flexible), adaptable and robust. A* is similar to Monte Carlo Tree Search (MCTS) in some respects, but fundamentally different, and arguably better suited here, because it searches systematically with a heuristic for an optimal path instead of relying on random simulations for decision making ("A* Search Without Expansions: Learning Heuristic Functions with Deep Q-Networks" [8]). The Q* search described in that paper uses the principle of A* to find the best path by combining path costs and heuristic values. By integrating deep Q-networks (DQNs), it can compute the costs and heuristic values of all child nodes of a state in a single forward pass, which significantly reduces the computational effort. The stepwise computation and validation in Q* is similar to the process supervision used in STaR to minimize hallucinations (on STaR, see below). One research scientist at Meta summarized this on X (formerly Twitter) as follows: "From my past experience on OpenGo (reproduction of AlphaZero), A* can be regarded as a deterministic version of MCTS with value (i.e., heuristic) function Q only. This should be suitable for tasks in which the state is easy to evaluate given the action, but the actions are much harder to predict given the state. Math problems seem to fit this scenario quite well."
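For intuition, here is a compact textbook A* in Python (not whatever OpenAI may have built); the graph and heuristic are supplied by the caller. The Q* search variant from the paper cited above would, roughly speaking, replace the per-child cost and heuristic lookups with a single DQN forward pass over all children of the expanded node.

```python
import heapq
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Tuple

Node = Hashable

def a_star(start: Node,
           is_goal: Callable[[Node], bool],
           neighbors: Callable[[Node], Iterable[Tuple[Node, float]]],
           h: Callable[[Node], float]) -> Optional[List[Node]]:
    """Returns an optimal path from start to a goal node, or None.
    Optimality is guaranteed if h never overestimates the true remaining cost."""
    open_heap: List[Tuple[float, Node]] = [(h(start), start)]   # ordered by f = g + h
    g: Dict[Node, float] = {start: 0.0}                         # best known path cost
    parent: Dict[Node, Node] = {}
    closed = set()

    while open_heap:
        _, current = heapq.heappop(open_heap)
        if current in closed:
            continue
        if is_goal(current):
            path = [current]
            while current in parent:                            # reconstruct the path
                current = parent[current]
                path.append(current)
            return path[::-1]
        closed.add(current)
        for nxt, cost in neighbors(current):
            new_g = g[current] + cost
            if new_g < g.get(nxt, float("inf")):
                g[nxt] = new_g
                parent[nxt] = current
                heapq.heappush(open_heap, (new_g + h(nxt), nxt))
    return None

# Tiny usage example on a number line: walk from 0 to 5, heuristic = distance left.
print(a_star(0, lambda n: n == 5,
             lambda n: [(n - 1, 1.0), (n + 1, 1.0)],
             lambda n: abs(5 - n)))   # [0, 1, 2, 3, 4, 5]
```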
STaR: Step-by-Step Rationalization
Conclusion
"OpenAI executives told employees that the company believes it is currently on the first level, according to the spokesperson, but on the cusp of reaching the second, which it calls 'Reasoners.' This refers to systems that can do basic problem-solving tasks as well as a human with a doctorate-level education who doesn't have access to any tools. At the same meeting, company leadership gave a demonstration of a research project involving its GPT-4 AI model that OpenAI thinks shows some new skills that rise to human-like reasoning, according to a person familiar with the discussion who asked not to be identified because they were not authorized to speak to press. Reached for comment about the demonstration, the spokesperson said OpenAI is always testing new capabilities internally, a common practice in the industry."

However, it is quite clear where the path is heading. OpenAI says quite directly that Tier 3 will, for them, be agents: "According to the levels OpenAI has come up with, the third tier on the way to AGI would be called 'Agents,' referring to AI systems that can spend several days taking actions on a user's behalf. Level 4 describes AI that can come up with new innovations. And the most advanced level would be called 'Organizations.'" [11]

We don't yet know exactly how Q*, aka Strawberry, works. So far these are all just hypotheses about how it could work, but I think they are quite plausible. It is not a secret but a fact that models need to start implementing System 2 thinking in their architecture. To achieve the highest accuracy and overcome hallucinations as far as possible (RAG will not be enough for that), this iterative, step-by-step process is necessary. We do not know how System 2 thinking will be implemented, but I have tried to show that there are already valid ways of doing it today. The second essential ingredient is self-learning. Independent of external data and independent of RLHF, there needs to be a way of self-learning through evaluation. PRM could help here by evaluating the individual reasoning steps of the language model, not just the result, within the Tree of Thoughts process mentioned above. Beyond the theses I have outlined, there are numerous other ways to make the output of large language models more accurate by means of pathfinding, planning and self-learning; most recently, someone on X has made further suggestions along these lines based on published research [12].

So, as I said at the beginning, we can only speculate about the scientific basis of Q* at the moment. Judging by the name and the plausibility of the technology, however, I still think it is Q-learning, a form of A* search, ToT and PRM. But I could be wrong. I am firmly convinced, though, that planning and System 2 thinking were key guiding principles of Q* and account for its success. Q* will probably deliver superior accuracy in its results. Through self-learning, pathfinding and process subdivision it should achieve at least similar, if not better, results than Google DeepMind's AlphaProof and AlphaGeometry 2 (which recently achieved silver at the International Mathematical Olympiad). Q* is probably the closest thing we have to AGI. Whether and how much compute and energy are necessary for this remains unclear (though it seems a great deal of both is needed).

References

[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]