In 2025, will I be able to play Civ against an LLM?

I've been enjoying Unciv, an open source Civ 5 clone that can run on a "potato". Though the game mechanics are well designed, the player-bot interaction is very limited. A small menu of possible actions like "Declare War" or "Propose a Peace Treaty" is the only way to communicate with the AI players.

This market resolves YES if, sometime in 2025, it's possible to play either Unciv, Freeciv, or any official version of Sid Meier's Civilization with LLMs powering the other civilizations and city-states.

These are the specific features that must be provided:

  1. I can open a text chat with any bot player to discuss diplomacy, disclose information, and make trade deals.

  2. An LLM is ultimately responsible for each individual action taken by each bot, instead of simply being an NPC dialogue generator.

  3. The LLM needs to be a SOTA model.

  4. The bots should be having dialogues with each other as well, in exactly the same way that I'd have dialogues with them.

Clarification (h/t @LiamZ):

The LLM can have access to any tools, including querying "advice" from traditional AI bot code, as long as it still qualifies for point (2) above.

Clarification to point (1):

Any game-relevant social behavior should be possible, where "game-relevant" excludes things like "Passing the Turing Test", but includes things like "Negotiating a secret pact to betray another Civ at a particular moment" (a double-betrayal in this circumstance is also acceptable, as long as the LLM player does it intentionally and not as a result of e.g. losing the context of the previous conversation it had).

Clarification to point (3) (h/t @TheAllMemeingEye, @ampdot):

The model appears in the top 20 of this list at some point during the year.

I won't bet in this market.

Does it count for point 2 if the LLM only gives broad strategies to a normal Civ AI and declares the wars? Something like the LLM deciding "we want to be friendly with players 1 and 2, and we want to get ready for player 3," then a few turns later saying to declare war, while the troop movements, production, and such are managed by a normal Civ AI?

It's fine as long as (a) the LLM directly performs all the game interactions itself and (b) the AI player integrates its decisions with its dialogue history.

For example, if I opened a chat with an AI player and said "I'll pay you 500 gold if you move your explorer at cell_1 to cell_2, then wait 5 turns without using it, and then donate it to the Persians after moving it to cell_3" it should be capable of doing those specific actions if it agrees to do them. Though, it could also deliberately decide to take my gold upfront and then default on our tacit agreement.
Things like "I'll pay you 500 gold to pass the Turing Test" aren't relevant, though.

If you're asking if the AI is allowed to use tools to decide what the best move is to take, then yes it's allowed to use any tool. But being able to beat me isn't a requirement for the question.

Is the most SOTA free-to-use model OK?

Currently I'm thinking about using the top twenty as mentioned here, which will probably include the best SOTA free models. What do you think about that?

Sure that works 👍

This seems pretty trivial to do even today, particularly because you don't require it to be good at the game.

The game state can easily be condensed into a small amount of text; you can ask an LLM for an encoding scheme that it will understand, and it often outputs a better compression than existing compression algorithms by cutting out concepts it doesn't need. A 1M context window is more than sufficient.
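For concreteness, a minimal sketch of what that condensed text encoding could look like; the fields and format are placeholder assumptions I'm inventing here, not anything a real game exposes:

```python
# Hypothetical sketch: flatten a Civ-like game state into compact text
# for an LLM prompt. Field names and format are invented for illustration.
def encode_state(state: dict) -> str:
    lines = [f"turn {state['turn']} | you: {state['me']}"]
    for civ in state["civs"]:
        lines.append(f"civ {civ['name']} rel={civ['relation']} mil={civ['military']}")
    for city in state["cities"]:
        lines.append(f"city {city['name']} @{city['pos']} pop={city['pop']}")
    for unit in state["units"]:
        lines.append(f"unit {unit['type']} @{unit['pos']} hp={unit['hp']}")
    return "\n".join(lines)

example = {
    "turn": 42, "me": "Rome",
    "civs": [{"name": "Persia", "relation": "neutral", "military": 310}],
    "cities": [{"name": "Roma", "pos": (10, 7), "pop": 6}],
    "units": [{"type": "explorer", "pos": (12, 9), "hp": 100}],
}
print(encode_state(example))  # stays small even for late-game states
```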

The two limiting factors are likely that the official Civilization 7 game in 2025 won't have API calls or whatever is required to access Claude 4 Sonnet, and more importantly that the game would just be too expensive to play like this.

If it's trivial and easily represented, then implement it and share a repo with working code. Also, the question is explicitly not limited to "the official Civilization 7 game."

I said above that it can't be done reasonably, so I won't do that. The cost of making the number of API calls required to set up a game like that would be extraordinary.

I lost millions of dollars in the FTX scams; I can't be fooling around with API calls to implement something like this.

You don’t need to make lots of API calls to do the part you said was trivial and easy.

I'd like clarification on this as well. There's nothing in the question which specifies how good the gameplay has to be, or how fast the AI player has to be.

It would be quite easy to take an open source civ game and have it make a call to a local llama model for every decision. It would also be terrible.
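As a sketch of what I mean, assuming a local OpenAI-compatible endpoint (the kind llama.cpp's server or Ollama expose); the URL, model name, and action strings are all placeholders:

```python
# Sketch: ask a local model to pick one legal action per decision point.
# Endpoint, model name, and action format are assumptions for illustration.
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # hypothetical local server

def choose_action(state_text: str, legal_actions: list[str]) -> str:
    prompt = (
        f"Game state:\n{state_text}\n\n"
        f"Legal actions: {', '.join(legal_actions)}\n"
        "Reply with exactly one legal action and nothing else."
    )
    resp = requests.post(ENDPOINT, json={
        "model": "llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    })
    choice = resp.json()["choices"][0]["message"]["content"].strip()
    # Small local models often answer off-script; fall back to a default.
    return choice if choice in legal_actions else legal_actions[0]
```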

@dayoshi I'd be fine running it overnight (like how a correspondence match would work). I don't care how bad it is at tactics/strategy as long as its behavior is consistent and coherent. Do you believe that can be achieved with a simple modification to an existing bot?

You can encode it with pictures. Gemini, Claude, and GPT-4o today all support image inputs.

You could, but I've found that it's very expensive to process images. They have a huge number of tokens, which would almost certainly be represented more cheaply as text.

@ampdot I bet eventually you could just point a camera at a screen and give ChatGPT access to your mouse and keyboard. But right now it's very hard for multimodal models to extract detailed spatial info from images. Example.

No, this would not be successful. His screen does not show where the opponents' units are. In the early game, he probably doesn't even know how many opponents exist, and has seen only a tiny portion of the world. This AI can't be developed through imagery alone.

@SteveSokolowski you mean, the AI player would have the same limitations as a human, with needing to scroll around to see the visible map?

The LLM needs to be a SOTA model.

What does "SOTA model" mean? e.g. would LLaMa 3.1 8B suffice today? what if it's e.g. Phi 3.5 7B, which is SOTA for its size? does it need to be SOTA on release date or market close date?

It needs to be one of the best existing models when I play against it. How about this: it qualifies if it's in the top twenty on this page. How would you prefer to word that, and/or do you know of a better metric to use?

How do you expect a language model to model things which are not language (point 2)? Wouldn't it make more sense to use the language model as an interface to either a traditional Civ AI or a statistical model trained on the game actions, like the OpenAI DOTA bot from a few years ago?

I have a small project where I built a custom RL environment for a multiplayer game and trained a few agents; the plan is to hook them up to a small LLM and allow them to send messages to each other. The point is that the middle layers of the LLM would be connected directly to the other network, and would still be subject to normal backprop. So the LLM literally becomes part of the agent and learns the game; it is not a passive "descriptor" of events but something that can actually be negotiated with.

I don't have huge expectations that it will work as intended (credit assignment is too inefficient), but something like that is one path to achieving the results OP wants.
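Roughly this wiring, as a PyTorch sketch with made-up sizes; the LM is deliberately left unfrozen so the policy-gradient loss backprops into it:

```python
# Sketch: a policy whose trunk includes a small LM, so RL updates the LM too.
# Model choice, feature size, and action count are arbitrary placeholders.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class LMPolicyAgent(nn.Module):
    def __init__(self, lm_name="distilgpt2", state_dim=64, n_actions=16):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(lm_name)
        self.lm = AutoModel.from_pretrained(lm_name)       # left unfrozen on purpose
        hidden = self.lm.config.hidden_size
        self.state_encoder = nn.Linear(state_dim, hidden)  # game features -> LM space
        self.policy_head = nn.Linear(2 * hidden, n_actions)

    def forward(self, chat_text: str, game_state: torch.Tensor):
        toks = self.tokenizer(chat_text, return_tensors="pt")
        lm_summary = self.lm(**toks).last_hidden_state[:, -1]  # last-token summary
        fused = torch.cat([lm_summary, self.state_encoder(game_state)], dim=-1)
        return torch.distributions.Categorical(logits=self.policy_head(fused))

agent = LMPolicyAgent()
dist = agent("Persia offers peace for 200 gold.", torch.randn(1, 64))
action = dist.sample()  # a policy-gradient loss on this flows into the LM weights
```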

That being said, of course, I doubt anyone would actually do it.

Civ/freeciv/unciv seem extremely burdensome envs to actually train a RL agent in.

The actually plausible path is just prompt engineering + bells and whistles.

This makes sense, but I think at that point it's not just an off-the-shelf LLM. I have no doubt one can engineer some ML-based agent and integrate it with an LLM, but it's not obvious how the LLM would have any connection between the subtleties of the game state and the text tokens without significant training effort.

It's just not clear to me if OP means to restrict it to LLMs or if they intended to allow multi-modal models.

@LiamZ what I want is for the bots to be able to discuss things with me and each other, reach a consensus on an issue, make threats, brag, bargain and whatever else real political agents can do with language. They need to be language models of some kind in order to do that. IIRC the OpenAI DOTA bot couldn't talk.

In the setup I imagine, the bots would have prompts with condensed descriptions of what happened each turn (exactly like the condensed notifications the player sees). If multimodal vision is good enough in 2025 to interpret a hexagonal board, that would make the design much easier; otherwise, a text representation could be used for the current board state (an important part of it would be a list of city/unit coordinates, or the relative distances between them). The LLM could also be fed whatever heuristics or feature representation the traditional AI was using, and have the option to sample it to see what it would do in the current situation or in some hypothetical that the LLM is considering.
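As a sketch of what I mean, the per-turn prompt for one bot might be assembled like this (all the names here are hypothetical):

```python
# Hypothetical sketch: build one bot's per-turn prompt from condensed
# notifications, a text board summary, recent diplomacy, and an optional
# "what would the scripted AI do?" hint.
def build_turn_prompt(civ_name: str, notifications: list[str],
                      board_summary: str, dialogue_log: list[str],
                      scripted_ai_suggestion: str | None = None) -> str:
    parts = [
        f"You are {civ_name}, a civilization in a Civ 5-like game.",
        "Notifications this turn:",
        *[f"- {n}" for n in notifications],
        f"Board summary:\n{board_summary}",
        "Recent diplomacy:",
        *[f"- {m}" for m in dialogue_log[-10:]],  # keep context bounded
    ]
    if scripted_ai_suggestion:
        parts.append(f"The traditional game AI would do: {scripted_ai_suggestion}")
    parts.append("Decide your actions and any messages to send, and remember "
                 "commitments you made in past conversations.")
    return "\n".join(parts)
```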

See, it's details like this that keep me from betting here. For example, it isn't clear what you mean by "the LLM is considering." There's no running internal dialogue and reasoning over time in current LLMs; they do next-token prediction when prompted at inference time. A system with some internal reasoning mechanism would be different from a large statistical model of our language.

I should have said "LLM agent" to avoid the ambiguity: systems like this.

Got it, thanks. To be transparent, I have a very different credence for "some agent integrated with an LLM component" vs. "an off-the-shelf SOTA general-purpose LLM" successfully applied to novel complex tasks, so these distinctions change a lot about what I think the probability for this question should be.

An LLM needs to decide the action that is performed, by outputting the action itself. An off-the-shelf general-purpose LLM could very well do that, much like ChatGPT can do function calling and interact with external APIs. As for who or what has agency once LLMs are augmented with tools, IMO it's mainly a semantic question whether e.g. ChatGPT is merely a "Python interpreter integrated with an LLM component". But since it's the LLM issuing the commands, I think it's more natural to regard it as the central part of the system.
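For illustration, the game actions could be exposed as tools in the standard OpenAI function-calling format, so it's the LLM that emits every action; these particular action schemas are invented:

```python
# Sketch: game actions exposed as OpenAI-style function-calling tools.
# The engine executes whatever call the model emits, so the LLM is the one
# issuing commands. Action names and fields are hypothetical.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "move_unit",
            "description": "Move one of your units to a target hex.",
            "parameters": {
                "type": "object",
                "properties": {
                    "unit_id": {"type": "string"},
                    "target": {"type": "string", "description": "e.g. 'cell_2'"},
                },
                "required": ["unit_id", "target"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "declare_war",
            "description": "Declare war on another civilization.",
            "parameters": {
                "type": "object",
                "properties": {"target_civ": {"type": "string"}},
                "required": ["target_civ"],
            },
        },
    },
]
```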

You can completely change the Python interpreter without touching the LLM generating the code. I don't consider the interpreter central to the LLM: they started with external calls to Python because Jupyter notebooks are well established and there's plenty of Python code and notebooks in the training set, not because of some deep integration with the language.

The fact that it isn't central to the model is actually a strength, because it means the only barrier to doing the same with other languages is reliable code generation and safety rails.

To make this clearer: often when I use tools like ChatGPT for things like Python, I am interpreting their Python output myself. I can determine the corresponding behavior of the code using my own expertise and then immediately give feedback in the chat without needing to run a separate interpreter. I do not think my brain is part of the LLM, even though I am interpreting code and providing it feedback.

I'm sorry, I think I was unclear in my last message. And now I don't really see where we're agreeing or disagreeing anymore.

To emphasize: I don't think a Civ agent of the kind I mentioned would be an "agent with an LLM component", for the same reasons that I don't think ChatGPT's built-in Python interpreter makes it a "Python interpreter with an LLM component". In your example, you are helping the LLM to help you by giving it more context about your problem, which establishes who is the tool and who is the user. The example doesn't seem to contradict anything about the Civ agent mentioned above.

I'm not actively disagreeing; the point is that where you draw the line for "part of the LLM" seems to matter for the question. I don't think the Python interpreter, whether it's the actual Python program or my internal model of it, is part of the LLMs you interact with in ChatGPT. If you do think that, then it's less clear where the line is for what you mean by "LLM", since you are including non-language-model components. That raises the credence of you potentially resolving YES, because it means you will consider things outside the probabilistic language model to be part of the LLM if they are in a product together.

Thanks for the clarification. Can you give me a hypothetical example where all the criteria are met, but a YES resolution would be unsatisfying?

Basically, it's not clear how you are differentiating between "the model is interfacing with an external program (Python, a traditional Civ AI, etc.) for processing information" and "the model isn't an interface for something else."

One could imagine implementing a traditional Civ AI in Python, throwing it into ChatGPT, and telling it to run that and pick from a list of arguments based on pkls of game states, using whichever data structures make sense for different aspects.

That’s much easier to do than training the LLM to have weights corresponding to intelligent play in some plain text representation of the game state.

One could imagine implementing a traditional Civ AI in Python, throwing it into ChatGPT, and telling it to run that and pick from a list of arguments based on pkls of game states, using whichever data structures make sense for different aspects.

That sounds like it would qualify for this question, as long as that ChatGPT instance also performed the natural-language reasoning component. See the example here.
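As a rough sketch of that pattern (every object and method below is hypothetical): the scripted AI ranks candidate moves, and the LLM, which also holds the diplomacy conversations, makes the final call, so it can honor or deliberately break a promise:

```python
# Hypothetical sketch: a scripted Civ AI proposes moves, the LLM decides.
def decide(llm, scripted_ai, state, dialogue_log):
    candidates = scripted_ai.rank_moves(state)[:5]  # top scripted suggestions
    prompt = (
        f"The scripted AI suggests (best first): {candidates}\n"
        f"Promises you have made: {dialogue_log.promises()}\n"
        "Pick one move. Deviate from the suggestions if a promise, or a "
        "deliberate betrayal, requires it. Answer with the move only."
    )
    return llm.complete(prompt)
```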

The skill level of the agent isn't very interesting to me. What interests me is the generality of its social behavior: the approximation to a political agent that tries to persuade and to manipulate with words and promises. To be capable of that, it must also be vulnerable to the same tactics on the receiving end. This seems more fascinating to me than a better Stockfish or AlphaGo.