Summary
Geodesic is going to use prediction markets to select its projects for MARS 4.0, and we need your help to make the markets run efficiently! Please read through the proposals, and then trade on the markets for the proposals you think might succeed or fail. We intend to choose the best proposals in two weeks!
Full proposals are in the Google Doc linked below; links to the markets are in the section "The Projects".
Google Doc (similar content to this post + full proposal overviews).
LessWrong post (similar content to this post).
Introduction
Geodesic is a new AI safety startup focused on research that is impactful for short AGI/ASI timelines. As part of this, we are committed to mentoring several projects in the Mentorship for Alignment Research Students (MARS) program, run by the Cambridge AI Safety Hub (CAISH).
We are also excited about new ways to choose and fund research that reflect the aggregated perspectives of our team and the broader community. One way of doing this is using conditional prediction markets, also known as Futarchy, where people bet on the outcomes of taking various actions so that the predicted-best action can be taken.
We believe a system similar to this might be really useful for deciding on future research proposals, agendas, and grants. Good rationalists test their beliefs, and as such, we are doing a live-fire test to see if the theory works in practice.
We are going to apply this to select research projects for MARS 4.0, an AI safety upskilling program like MATS or SPAR, based in Cambridge, UK. We have drafted a number of research proposals, and want the community to bet on how likely good outcomes are for each project (conditional on it being selected). We will then choose the projects that are predicted to do best.
To our knowledge, this is the first time Futarchy will be publicly used to decide on concrete research projects.
Futarchy
For those familiar with Futarchy / decision markets, feel free to skip this section. Otherwise, we will do our best to explain how it works.
When you want to make a decision with Futarchy, you first need a finite set of possible actions and a success metric whose true value will be known at some point in the future. Then, for each action, a prediction market is created to predict the future value of the success metric conditional on that action being taken. At some fixed time, the action with the highest predicted success is chosen, and all trades on the other markets are reverted. When the actual value of the success metric is finally known, the market for the chosen action is resolved, and those who predicted correctly are rewarded for their insight. This creates an incentive structure that rewards people with good information or insight for trading on the markets, improving the prediction for each action and, overall, leading you to take the decision that the pool of traders thinks will be best.
As a concrete example, consider a company deciding whether or not to fire a CEO, and using the stock price one year after the decision as the success metric. Two markets would be created, one predicting the stock price if they're fired, and one predicting the stock price if they're kept on. Then, whichever one is trading higher at decision time is used to make the decision.
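To make the mechanism concrete, here is a minimal sketch of the decision rule in Python, applied to the CEO example. The market names and prices are made up for illustration; this is not code we actually run.

```python
from dataclasses import dataclass

@dataclass
class ConditionalMarket:
    action: str               # the action this market is conditioned on
    predicted_metric: float   # current market estimate of the success metric (e.g. stock price)

def choose_action(markets: list[ConditionalMarket]) -> ConditionalMarket:
    """Pick the action whose conditional market predicts the highest success metric.

    Trades on all other markets are reverted (resolved N/A); only the chosen
    market later resolves against the realised value of the metric.
    """
    return max(markets, key=lambda m: m.predicted_metric)

# Hypothetical prices for the fire-the-CEO example:
markets = [
    ConditionalMarket(action="fire CEO", predicted_metric=112.0),
    ConditionalMarket(action="keep CEO", predicted_metric=104.5),
]

chosen = choose_action(markets)
print(f"Take action: {chosen.action!r} (predicted stock price {chosen.predicted_metric})")
# -> Take action: 'fire CEO' (predicted stock price 112.0)
```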
For those interested in further reading about Futarchy, Robin Hanson has written extensively about it. Some examples include its foundations and motivation, speculation about when and where it might be useful, and why it can be important to let the market decide.
The Metrics
Unlike a company's stock price, there is no single clear metric by which research can be judged. Because of this, we've decided on a small selection of binary outcomes that will each be predicted separately; we will then use their average to make the final decisions (a sketch of this scoring is included after the clarifications below). We're not claiming these are the best metrics to judge a research project by, but we think they will be appropriate for the MARS program and sufficient for this experiment. The outcomes are:
1. A LessWrong post is produced within 6 months and gains 50 upvotes or more within a month of posting.
2. If a LessWrong post is produced, it gains 150 upvotes or more within a month of posting.
3. A paper is produced and uploaded to arXiv within 9 months.
4. If a paper is produced, it is accepted to a top ML conference (ICLR, ICML, or NeurIPS) within 6 months of being uploaded to arXiv.
5. If a paper is produced, it receives 10 citations or more within one year of being uploaded to arXiv.
Clarifications:
Unless otherwise stated, timeframes are given from when the research begins, i.e. the start of the MARS program.
Updates to posts and papers will be considered the same entity as the original for the purposes of outcome resolution (e.g. if a paper is produced and uploaded to arXiv within 9 months but is edited afterwards before being accepted at a conference, outcome (4) still resolves YES).
Some outcomes are conditional on others: outcome (2) will resolve N/A if (1) resolves NO, and outcomes (4) and (5) will resolve N/A if (3) resolves NO.
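As a rough sketch of the scoring described above: each proposal's score is taken to be the mean of the five market-predicted probabilities, and the highest-scoring proposal in each group is selected. The unweighted mean is our reading of "use their average", and all prices below are placeholders, not real market data.

```python
from statistics import mean

# Placeholder market prices (predicted probabilities) for the five outcomes, per Group 1 proposal.
group_1_prices = {
    "Goal Crystallisation":                      [0.60, 0.20, 0.45, 0.10, 0.15],
    "Post-training order and CoT Monitorability": [0.55, 0.25, 0.50, 0.12, 0.20],
    "Online Learning for Sabotage Mitigation":    [0.50, 0.15, 0.40, 0.08, 0.10],
}

# Score each proposal by the unweighted mean of its five outcome probabilities,
# then pick the predicted-best proposal within the group.
scores = {name: mean(prices) for name, prices in group_1_prices.items()}
winner = max(scores, key=scores.get)
print(f"Selected from Group 1: {winner} (score {scores[winner]:.2f})")
```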
The Projects
There are three groups of proposals, and from within each group we will choose the single predicted-best project. Each group corresponds to the mentor(s) for its projects. To read through each of the proposals and the bios for each mentor, please see the Google Doc. Currently we plan to close the markets and choose the projects on Monday the 12th of October. Below we provide a quick overview of each proposal and link to its respective prediction market.
Group 1: Cameron Tice
Goal Crystallisation: Investigating whether AI models that fake alignment perfectly during training retain their original goals or have them changed despite following through with deceptive alignment.
Post-training order and CoT Monitorability: Examining whether different post-training orders (applying RLHF after reasoning training) affect models' tendency to obfuscate deceptive reasoning in their chain-of-thought outputs.
Online Learning for Research Sabotage Mitigation: Attempting to use online learning to train AI systems away from research sabotage by creating deliberately underperforming model organisms and applying RL to try and improve their research performance.
Group 2: Puria Radmard & Shi Feng
Salient features of self-models: Testing whether LLMs have genuine self-models or just recognize stylistic patterns by examining if self-recognition training generalizes across different types of content.
Exploring more metacognitive capabilities of LLMs: Investigating whether LLMs can metacognitively monitor their own internal probability distributions and predictive models, with implications for deceptive alignment and AI safety.
Group 3: Lennie Wells
Model organisms resisting generalisation: Testing whether AI models learn the distribution of tasks under which humans have good oversight, and resist generalisation beyond this distribution.
Detection game: Running a ‘detection game’ to investigate how we can best prompt trusted monitors to detect research sabotage.
Research sabotage dataset: Creating a public dataset of tasks reflecting current and future AI safety research that can be used to study underelicitation and sandbagging.
Model Emulation: Can we use LLMs to predict other LLMs' capabilities?
Go trade!
We hope to use prediction markets to effectively choose which research projects we should pursue, as well as to conduct a fun experiment on the effectiveness of Futarchy for real-world decision making. The incentive structure of a prediction market motivates those who have good research taste or insights to implicitly share their beliefs and knowledge with us, helping us make the best decision possible. That said, anyone is free to join in and trade, and the more people who do, the better the markets perform. So we need your help! Please read through the proposals, trade on the markets, and be a part of history by taking part in this experiment!