https://joeandseth.substack.com/p/prompt-writing-outsourcing-cognition
Resolution criteria
Resolves YES if, by 23:59:59 UTC on December 31, 2026, there is at least one publicly released empirical study with n > 1000 human participants that reports a statistically significant reduction in unassisted problem-solving performance for “heavy AI users” (as defined by the study: e.g., top-quantile usage logs, assigned heavy-use condition, or clearly specified frequent-use threshold) compared to lighter/non-users or to their own baseline when AI is not allowed during the outcome assessment (p < 0.05 or 95% CI excluding zero).
“Unassisted problem-solving” = tasks completed without AI access at test time (e.g., proctored exams, reasoning/problem-solving assessments where AI tools are prohibited).
n > 1000 refers to unique participants analyzed (not number of items/tasks/observations). Multi-site replications may aggregate across identical protocols if the combined participant count exceeds 1000.
Acceptable venues include peer-reviewed journals (e.g., PubMed-indexed) and credible preprints/working papers with methods and results (e.g., arXiv, SSRN). The resolver will link the qualifying study in a market comment.
Does NOT count: meta-analyses without an individual qualifying study; outcomes where AI use is allowed; purely attitudinal or self-reported outcomes (e.g., “I feel worse at problem-solving”); studies where “heavy use” is undefined or only inferred without evidence (e.g., detector-only classification without reported error rates/validation).
If no such study is found by the deadline, resolves NO.
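For concreteness, here is a minimal sketch (hypothetical numbers and a normal approximation; not the resolver's actual procedure) of the mechanical check implied by the significance criterion above: a reported heavy-use contrast counts if its two-sided p-value is below 0.05 or its 95% confidence interval excludes zero.

```python
# Illustrative only: hypothetical numbers, not taken from any real study.
from math import erf, sqrt

def meets_threshold(effect: float, se: float, alpha: float = 0.05) -> bool:
    """Normal-approximation check of the market's significance criterion.

    effect: reported difference in unassisted scores (heavy users minus comparison group)
    se:     reported standard error of that difference
    """
    z = effect / se
    # two-sided p-value from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    lo, hi = effect - 1.96 * se, effect + 1.96 * se
    ci_excludes_zero = lo > 0 or hi < 0
    return p < alpha or ci_excludes_zero

# Hypothetical example: heavy users score 4.2 points lower, SE = 1.5
print(meets_threshold(effect=-4.2, se=1.5))  # True: p ~ 0.005, 95% CI [-7.14, -1.26] excludes zero
```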
Background
A 2025 PNAS field experiment gave high-school students GPT-4 access during practice; performance improved with AI, but when access was removed, those exposed to an unfettered chatbot performed worse than peers who never had AI (a “crutch” effect). The sample was “nearly a thousand,” i.e., close to but not clearly >1000, so it would not by itself meet this market’s threshold. (pubmed.ncbi.nlm.nih.gov)
In higher education, an arXiv study found that students identified as GenAI users scored, on average, 6.71 points lower (on a 100-point scale) on exams than non-users, suggesting potential learning drawbacks; sample size and measurement details will determine whether it is eligible for this market. (arxiv.org)
Considerations
Measurement of “heavy AI use” varies (usage logs vs. self-report vs. AI-detector inference); detector-only measures should report validation/error rates to count (a short sketch after these considerations illustrates why).
Causality: randomized exposure or credible quasi-experiments provide stronger evidence than cross-sectional correlations, but either design can qualify if it meets the criteria above and measures performance without AI present.
Large-n studies are likeliest in K–12/college or large online platforms; look for proctored, AI-prohibited assessments explicitly documented in methods sections (e.g., PubMed-indexed articles or arXiv/SSRN working papers). (jmir.org)
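To illustrate the measurement consideration above, here is a small sketch with hypothetical detector characteristics (the sensitivity, specificity, and base rate are made up, not drawn from any cited study) showing how an unvalidated AI-use detector can mislabel a large share of a flagged “heavy user” group, which is why error rates must be reported for detector-based studies to count.

```python
# Illustrative only: hypothetical detector characteristics, not from any cited study.

def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Share of detector-flagged 'heavy AI users' who truly are heavy users (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical: detector catches 90% of heavy users, falsely flags 10% of non-users,
# and 20% of the sample are truly heavy users.
print(round(positive_predictive_value(0.90, 0.90, 0.20), 2))  # 0.69 -> ~31% of the flagged group is misclassified
```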