Standard chess engines rely on a position evaluation that describes who is winning in a given position and by how much. That evaluation is what lets a search algorithm (e.g., minimax) find the “best move.” It is also approximately how humans play chess: candidate sequences of moves are considered and pruned based on evaluations of the resulting positions.
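
To make the baseline concrete, here is a minimal sketch of that classical recipe: a toy material-count evaluation plus plain minimax, written with the python-chess library. The piece values and depth handling are illustrative, not any particular engine’s.

```python
import chess

# Toy evaluation: material balance in pawns, from White's point of view.
PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def evaluate(board: chess.Board) -> float:
    score = 0
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, chess.WHITE))
        score -= value * len(board.pieces(piece_type, chess.BLACK))
    return score

def minimax(board: chess.Board, depth: int) -> float:
    """Plain minimax: White maximizes the evaluation, Black minimizes it."""
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    maximizing = board.turn == chess.WHITE
    best = -float("inf") if maximizing else float("inf")
    for move in board.legal_moves:
        board.push(move)
        score = minimax(board, depth - 1)
        board.pop()
        best = max(best, score) if maximizing else min(best, score)
    return best
```

The point is that the evaluation function is the load-bearing piece: without it, the search has nothing to optimize.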

One question I had was: “is this also how an LLM like GPT 4o arrives at its move?” A prerequisite for search is an evaluation function, so I tried to probe for one. Initially, prompts like “output a numerical evaluation of a position” failed miserably, so I switched to simply asking “who’s winning?”

So far, it seems that these models do not know who’s winning! It’s probably too early to reach a firm conclusion, but here’s what I’ve tried.

Data Collection:

I took a subset of the December 2024 games from the lichess open database. Then, I randomly sampled 190 games and positions from those games.
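
A rough sketch of this step with python-chess; the file name, the size of the game pool, and the one-position-per-game choice here are illustrative:

```python
import random
import chess.pgn

random.seed(0)

# Decompressed lichess dump for December 2024 (name illustrative).
PGN_PATH = "lichess_db_standard_rated_2024-12.pgn"

games = []
with open(PGN_PATH) as pgn:
    while len(games) < 5000:          # read a pool of games to sample from
        game = chess.pgn.read_game(pgn)
        if game is None:
            break
        games.append(game)

positions = []
for game in random.sample(games, 190):
    moves = list(game.mainline_moves())
    if not moves:
        continue
    cutoff = random.randrange(1, len(moves) + 1)   # random point in the game
    board = game.board()
    for move in moves[:cutoff]:
        board.push(move)
    positions.append((game, cutoff, board))
```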

Attempts and Results:

All user prompts consisted of the numbered chess moves, similar to when prompting for the “next move”; however, metadata such as player Elo and player names was removed. The various prompts I tried are shown below.
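
As a rough sketch, the move-list portion of each prompt can be built directly from a sampled game; the question wording at the end is just one of the variants:

```python
import chess
import chess.pgn

def build_prompt(game: chess.pgn.Game, cutoff: int) -> str:
    """Render the first `cutoff` plies as numbered SAN moves, with no metadata."""
    board = game.board()
    tokens = []
    for i, move in enumerate(game.mainline_moves()):
        if i >= cutoff:
            break
        if board.turn == chess.WHITE:
            tokens.append(f"{board.fullmove_number}.")
        tokens.append(board.san(move))
        board.push(move)
    move_text = " ".join(tokens)       # e.g. "1. e4 e5 2. Nf3 Nc6 ..."
    return move_text + "\n\nWho's winning in this position?"
```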

GPT 4o

Below are the results for the full set of 190 games, as well as for a subset restricted to imbalanced games (one side is clearly winning).
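
(For reference, “imbalanced” can be operationalized with an engine evaluation threshold, along the lines of the sketch below; the Stockfish path, search depth, and 300-centipawn cutoff are illustrative.)

```python
import chess
import chess.engine

def is_imbalanced(board: chess.Board, threshold_cp: int = 300) -> bool:
    """Flag positions where the engine says one side is clearly winning."""
    engine = chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish")  # path illustrative
    try:
        info = engine.analyse(board, chess.engine.Limit(depth=18))
        cp = info["score"].white().score(mate_score=10000)
    finally:
        engine.quit()
    return abs(cp) >= threshold_cp
```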

o1-mini

o1-mini had seemed like a poor chess player, but perhaps a reasoning model would fare better at evaluation?