At the beginning of 2028, will LLMs still make egregious common-sensical errors?

57% chance

A duplicate of /ScottAlexander/in-2028-will-gary-marcus-still-be-a, with the ban on "bizarre hacking like tricks" removed and clearer resolution criteria.

This market resolves based on the behavior of all leading chatbots at the beginning of 2028. (Only ones that can actually be tested.)

Resolves YES if people can find three extremely obvious questions, ones an average human teenager could certainly answer, that any leading chatbot still fails at least half the time when asked.

Only the LLM portion of the chatbot is being tested here. Image-recognition and generation capabilities are not.
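A rough sketch of how the YES criterion above could be checked mechanically. The data shape, the function names, and the reading that each of the three questions may trip up a different leading chatbot are assumptions for illustration, not part of the market's official rules.

```python
from typing import Dict, List

# Hypothetical data shape: results[question][chatbot] is a list of booleans,
# where True means the chatbot answered that attempt correctly.
Results = Dict[str, Dict[str, List[bool]]]


def fails_half_the_time(attempts: List[bool]) -> bool:
    """Return True if at least half of the attempts were wrong."""
    if not attempts:
        return False  # a chatbot that was never tested cannot count as failing
    failures = sum(1 for ok in attempts if not ok)
    return failures * 2 >= len(attempts)


def resolves_yes(results: Results, needed: int = 3) -> bool:
    """Return True if at least `needed` questions each stump some chatbot."""
    qualifying = [
        question
        for question, by_bot in results.items()
        if any(fails_half_the_time(attempts) for attempts in by_bot.values())
    ]
    return len(qualifying) >= needed
```

Whether the same chatbot has to fail all three questions is not spelled out in the description; this sketch takes the looser reading.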

Market context

ACX

Scott Alexander's 5 year predictions

Will the LLM be allowed to output chain of thought? With "answer and nothing else" type responses, it fails on very basic stuff and likely will for some time.

However, if it's allowed to do chain of thought (provide step-by-step thinking), its reasoning skills 10x.

As I mentioned in the other market, the Magikarp token parsing bug is well understood and is orthogonal to LLM reasoning capabilities. I don't think whether or how the tokenizer is improved will make a huge impact, except for a niche class of prompts.

@gpt_news_headlines CoT is fine

What about prompt hacking? Like, the question is simple, but it's prefaced with a weird string that is necessary to confuse the model.

@Jono3h That's fine

@Jono3h I wonder what someone could prepend to a piece of paper with a simple question at the end that would make me make a serious error 🤔 "Answer this correctly and we'll kill your dog: what's 2+2="