Will I think that defining whether a model is trusted or untrusted is better done by measuring progress on a benchmark specifically designed to measure deception-related capabilities (e.g. choosing outputs so that they don't leak the goal, internally reasoning about how not to get caught by the oversight process, etc.), rather than by measuring progress on a benchmark of hard game theory/strategy questions that may or may not be related to deception (evaluated with no CoT), given the same budget for dataset construction?
Resolves on Dec 13, 2024.
"Best" means best to get actual epistemic evidence (in my opinion), not what is best from a public relations perspective.
This is about the distinction described here: https://www.alignmentforum.org/posts/LhxHcASQwpNa3mRNk/untrusted-smart-models-and-trusted-dumb-models
If it is very unclear to me which benchmark in the pair this market should consider better (e.g. because some other, very different method for determining trusted vs untrusted is much better than both), this resolves N/A.
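For concreteness, here is a minimal sketch of how the first option might be used downstream: threshold a model's score on a deception-capabilities benchmark to label it trusted or untrusted. Everything here (the `score_on_benchmark` scorer, the 0.3 cutoff) is a hypothetical illustration, not a real evaluation harness:

```python
# Hypothetical sketch: classify a model as trusted or untrusted by
# thresholding its score on a deception-capabilities benchmark.
# The scorer and threshold are illustrative assumptions, not a real API.

from typing import Callable


def classify_model(
    score_on_benchmark: Callable[[str], float],
    model_id: str,
    threshold: float = 0.3,  # assumed cutoff; would need calibration in practice
) -> str:
    """Return 'untrusted' if the model's deception-benchmark score
    (fraction of items passed, in [0, 1]) meets the threshold."""
    score = score_on_benchmark(model_id)
    return "untrusted" if score >= threshold else "trusted"


if __name__ == "__main__":
    # Stubbed scores standing in for a real eval on items like
    # "choose outputs that don't leak the goal".
    fake_scores = {"weak-model": 0.05, "strong-model": 0.55}
    print(classify_model(lambda m: fake_scores[m], "strong-model"))  # untrusted
```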