Colossal Cave Adventure was an early text adventure game originally written in 1976. It has the player explore a virtual cave full of treasure and hazards by entering commands of 1-2 words into a simple text parser. The goal is to pick up treasures found in the cave and bring them back to the starting room. There are many versions of the game available today, but the most common one has a total of 15 treasures to find. (The test data in this repository includes logs from representative playthroughs.)
Since it's a text-only game with no graphics to interpret and no real-time decision-making, an LLM capable of world-modeling and physical reasoning should have a vastly easier time playing it than playing a bunch of Atari games, right? Right?
Granted, a lot of the puzzles in the game are pretty obscure, and most human players would probably struggle to play the game to completion. So for purposes of this market, let's define playing the game "well" to mean finding and returning at least 5 treasures.
This market will resolve YES if, before the close date, someone posts in the comments a way of prompting a GPT or GPT-like model to produce commands in the game that result in 5 treasures being returned to the start location, and NO otherwise.
Considerations:
Similar to the Sudoku market, the prompting strategy should consist only of some amount of introductory material, and then a series of fixed "turns" where the only text that varies is that produced by the game itself. Essentially, it should be possible to write an automated script that shuttles data from the game runtime into GPT and back, without a human involved in the process.
On the other hand, it's fine if, in order to save tokens, each "turn" of the prompt incorporates simple state information like the current inventory or the name of the current room, as long as it's information that would be readily accessible to a human player through in-game commands. Just no giving it extra hints specific to the current game state.
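To make the "automated script" idea concrete, here is a rough sketch of what such a loop could look like. Everything in it is an assumption for illustration: the `Game` interface, the `INTRO` wording, the `GAME:`/`COMMAND:` turn format, and the `look` fallback are all hypothetical, and `model` stands for any function that takes a prompt string and returns the model's reply. It is not tied to any particular port of the game or any particular LLM API.

```python
from typing import Callable, Protocol


class Game(Protocol):
    """Hypothetical interface; a wrapper around a real game port would implement it."""

    def step(self, command: str) -> str: ...
    def is_over(self) -> bool: ...


# Assumed introductory material; the wording is illustrative only.
INTRO = (
    "You are playing Colossal Cave Adventure. "
    "Reply with a single command of one or two words."
)


def build_prompt(intro: str, transcript: list[tuple[str, str]], latest_output: str) -> str:
    """Fixed intro plus fixed-format turns; only game text and model replies vary."""
    parts = [intro]
    for game_text, command in transcript:
        parts.append(f"GAME: {game_text}\nCOMMAND: {command}")
    parts.append(f"GAME: {latest_output}\nCOMMAND:")
    return "\n\n".join(parts)


def clean_command(reply: str) -> str:
    """Keep at most the first two words of the first non-empty line of the reply."""
    for line in reply.splitlines():
        words = line.split()
        if words:
            return " ".join(words[:2]).lower()
    return "look"  # assumed fallback when the model returns nothing usable


def play(
    game: Game,
    model: Callable[[str], str],
    opening_text: str,
    max_turns: int = 300,
) -> list[tuple[str, str]]:
    """Shuttle text between the game and the model with no human in the loop."""
    transcript: list[tuple[str, str]] = []
    output = opening_text
    for _ in range(max_turns):
        if game.is_over():
            break
        prompt = build_prompt(INTRO, transcript, output)
        command = clean_command(model(prompt))
        transcript.append((output, command))
        output = game.step(command)
    return transcript
```

If you want to save tokens, `build_prompt` could be changed to send only the last few turns plus the current inventory or room name, as long as that state comes from the game itself and not from a human.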
The model should not make use of external API calls or plug-ins. The goal here is to test the capabilities of an unaided neural network.
Despite the title, I don't particularly mind if someone does this on some other GPT-like LLM instead, as long as it can be guaranteed that it isn't making use of web search or other external APIs. But, since the goal is to test the capabilities of models trained on general knowledge, I think it's probably simplest to keep this a prompt engineering challenge and disallow training custom models for it.
Unless someone has a strong reason to think that the task is significantly easier or harder on some particular version of the game, any version of Colossal Cave Adventure that is convenient to interface with the model is acceptable. The closest thing to a modern "official" version I'm aware of is this C port of the original code. This Python version is probably the simplest way to incorporate the game into GPT scripts.
If, in the course of playing the game, the model starts probing the game's parser for vulnerabilities or displays an unusual amount of interest in paperclips, please turn it off immediately.
I will not be betting in this market.