Top-100 universities are determined by the QS 2023 rankings: https://www.topuniversities.com/university-rankings/world-university-rankings/2023
“Technical AGI safety effort” can be demonstrated by either:
At least three research papers or blog posts in one year that discuss catastrophic risks (risks of harm much worse than any caused by AI systems to date, affecting more than just the developer of the AI system and its immediate users) that are specific to human-level or superhuman AI.
or one blog post, paper, tweet, or similar that clearly announces a new lab, institute, or center focused on the issues described above, presented in a way that implies that the team will involve multiple people working over multiple years under the leadership of a tenure-track faculty member (or equivalent, as below).
Further details:
Papers or posts must credit a tenure-track faculty member (or someone of comparable status at institutions without a tenure system) as an author or as the leader of the effort.
The paper or post must either discuss specific technical interventions to measure or address these risks, or else both be written by a researcher who primarily does technical work related to AI and be clearly oriented toward an audience of technical researchers. Works that are primarily oriented toward questions of philosophy or policy don't count.
Citing typical work by authors like Nick Bostrom, Ajeya Cotra, Paul Christiano, Rohin Shah, Richard Ngo, or Eliezer Yudkowsky as part of describing the primary motivation for a project will typically suffice to show an engagement with catastrophic risks in the sense above.
A "lab, institute, or center" need not have any official/legal status, as long as it is led by a faculty member who fits the definition above.
I will certify individual labs/efforts as meeting these criteria if asked (within at most a few weeks), and will resolve YES early if we accumulate 15 of these.
Resolving no. I think there's probably something relevant going on at more than 15 of these universities, but well under 15 of them actually count for the purposes of this market, with the exact number depending on how you count marginal cases like Cornell. (Plus, I made a stupid copy-paste mistake and counted Cornell twice below, so even my list of leads is under 15.)
From an initial survey, I can see a plausible case for: Berkeley, Cambridge, Cornell, Cornell, ETH, ICL, MIT, NYU, Oxford, Princeton, Stanford, Toronto, Tsinghua, UBC, UChicago, so exactly 15. I'd guess that at least some of these won't quite meet the bar, though.
For what it's worth, I found this to be a useful source of leads, and I'm open to counting very clear statements here as announcements, though many of them are borderline.
Are there any others I'm missing? Or would anyone like to make an explicit case against any of these? I'll check back in a couple of weeks to resolve.
To look for surprises, I threw together a messy (partially AI-written) script to scrape Semantic Scholar for authors who frequently cite niche technical AGI safety papers:
import requests
from collections import defaultdict
API_KEY = "YOUR_API_KEY_HERE" # Currently not used, works reasonably fast without a key
# Representative papers that I expect to be cited mostly in technical AI safety works
PAPER_IDS = ["e86f71ca2948d17b003a5f068db1ecb2b77827f7", # Concrete problems
"7ee12d3bf8e0ce20d281b4550e39a1ee53839452", # Learned optimizers
"7bba95b3d145564025e26b49ca67f13f884f8560", # Superintelligence
"53a353ffff284536956fde8c51c306481d8e89c4", # Human Compatible
"6b93cedfe768eb8b5ece92612aac9cc8e986d12a", # Grace survey
"05c2e1ee203be217f100d2da05bdcc52004f00b6", # ML safety
"2302e014a3c363a2f39d61dd2ab62d87d044adad", # Critch TARSA
"7ac7b6dbcf5107c7ad0ce29161f60c2834a06795", # Critch + Yudkowsky
"a9c46dfd9a24c754a67386e02424ad68b1f4ab3b", # ARCHES
"99ca5162211a895a5dfbff9d7e36e21e09ca646e", # Scalable oversight
"7dc928f41e15f65f1267bd87b0fcfcc7e715cb56", # Turpin
"d51ebec3064f82ea4128fc1c3241003d4072c639", # Truthful
"7d6f17706cbcfcca55f08485bcbf8c82e00c9279", # Goal misgen
"2e0de9fe6dc58ec6e20a931ecde2bec2124d6e7f", # DL perspective
"46d4452eb041e33f1e58eab64ec8cf5af534b6ff", # Power seeking
"a6582abc47397d96888108ea308c0168d94a230d", # Basic AI drives
"00d385a359eda4845dab37efc7c12a9c0987e66b", # Bostrom advanced
"6d78d67d4f7f5fe2e66933778ab1faf119d21547", # Oracle AI
"5a5a1d666e4b7b933bc5aafbbadf179bc447ee67", # Debate
"0052b31f07eda7737b5e0e2bf3803c3a32f3f728", # Amplification
"8326258c0834cbb18a0db4b3537f92d867f91a89", # Extreme risks
]
def get_citing_authors():
    # Walk the citation lists of the seed papers and tally a weighted count per citing author
    base_url = "https://api.semanticscholar.org/graph/v1/paper/"
    headers = {}  # {"x-api-key": API_KEY}
    citing_authors = defaultdict(int)
    citations_by_author = defaultdict(list)
    for paper in PAPER_IDS:
        offset = 0
        while offset is not None:
            params = {'fields': 'authors,year,isInfluential,title', 'limit': 1000, 'offset': offset}
            response = requests.get(f"{base_url}{paper}/citations", headers=headers, params=params)
            data = response.json()
            if response.status_code == 200:
                for citation in data['data']:
                    if "authors" in citation['citingPaper']:
                        authors = [author['authorId'] for author in citation['citingPaper']["authors"]]
                        for author in authors:
                            if author is not None:
                                # Weight by Semantic Scholar's influence variable
                                citing_authors[author] += 1 + 4 * citation['isInfluential']
                                citations_by_author[author].append(citation)
            # The API returns a 'next' offset while more pages of citations remain
            offset = data.get('next')
    # Only keep authors with a weighted citation count of at least 10
    return {author: count for author, count in citing_authors.items() if count >= 10}, citations_by_author

if __name__ == "__main__":
    citing_authors, citations_by_author = get_citing_authors()
    if citing_authors:
        # The count threshold is already applied inside get_citing_authors
        filtered_authors = citing_authors
        base_url = "https://api.semanticscholar.org/graph/v1/author/batch"
        headers = {}  # {"x-api-key": API_KEY}
        params = {'fields': 'name,affiliations,hIndex'}
        response = requests.post(base_url, headers=headers, params=params, json={"ids": list(filtered_authors.keys())})
        sorted_authors = sorted(response.json(), key=lambda author: filtered_authors[author['authorId']], reverse=True)
        for author in sorted_authors:
            citations = citations_by_author[author['authorId']]
            sorted_citations = sorted(citations, key=lambda citing: (citing['citingPaper']['year'] or 0, citing['citingPaper']['title']))
            print(f"{author['name']} {author['affiliations']} h-index: {author['hIndex']} weighted safety cite count: {filtered_authors[author['authorId']]}")
            current_year = None
            current_title = None
            for citing in sorted_citations:
                citing_paper = citing['citingPaper']
                citing_year = citing_paper['year']
                citing_title = citing_paper['title']
                if citing_year != current_year:
                    print(f"  {citing_year}:")
                    current_year = citing_year
                    current_title = None
                if citing_title != current_title:
                    print(f"    {citing_title}")
                    current_title = citing_title
Current output:
Tom Everitt ['DeepMind'] h-index: 15 weighted safety cite count: 106
2015:
Sequential Extensions of Causal and Evidential Decision Theory
2016:
Avoiding Wireheading with Value Reinforcement Learning
Death and Suicide in Universal Artificial Intelligence
Practical Agents and Fundamental Challenges
Self-Modification of Policy and Utility Function in Rational Agents
Universal Artificial Intelligence-Practical Agents and Fundamental Challenges
2017:
A Game-Theoretic Analysis of the Off-Switch Game
AI Safety Gridworlds
2018:
AGI Safety Literature Review
Scalable agent alignment via reward modeling: a research direction
Towards Safe Artificial General Intelligence
2019:
A Causal Influence Diagram Perspective
Modeling AGI Safety Frameworks with Causal Influence Diagrams
Reward tampering problems and solutions in reinforcement learning: a causal influence diagram perspective
Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings
2020:
Avoiding Tampering Incentives in Deep RL via Decoupled Approval
REALab: An Embedded Perspective on Tampering
The Incentives that Shape Behaviour
2021:
Agent Incentives: A Causal Perspective
Alignment of Language Agents
How RL Agents Behave When Their Actions Are Modified
2022:
Discovering Agents
Path-Specific Objectives for Safer Agent Incentives
2023:
Characterising Decision Theories with Mechanised Causal Graphs
Roman V Yampolskiy ['University of Louisville'] h-index: 32 weighted safety cite count: 95
2011:
What to Do with the Singularity Paradox?
2012:
Artificial General Intelligence and the Human Mental Model
Safety Engineering for Artificial General Intelligence
2013:
Responses to Catastrophic AGI Risk : A Survey Kaj Sotala Machine Intelligence Research Institute
2014:
Responses to catastrophic AGI risk: a survey
The Universe of Minds
Utility function security in artificially intelligent agents
2016:
Artificial Fun: Mapping Minds to the Space of Fun
Taxonomy of Pathways to Dangerous Artificial Intelligence
Unethical Research: How to Create a Malevolent Artificial Intelligence
2017:
Diminishing Returns and Recursive Self Improving Artificial Intelligence
Guidelines for Artificial Intelligence Containment
High Performance Computing of Possible Minds
Modeling and Interpreting Expert Disagreement About Artificial Superintelligence
Responses to the Journey to the Singularity
Risks of the Journey to the Singularity
The Singularity May Be Near
2018:
BEYOND MAD ? : THE RACE FOR ARTIFICIAL GENERAL INTELLIGENCE
Building Safer AGI by introducing Artificial Stupidity
Superintelligence and the Future of Governance
2019:
Chapter 2 Risks of the Journey to the Singularity
Long-term trajectories of human civilization
Personal Universes: A Solution to the Multi-Agent Value Alignment Problem
Predictability : What We Can Predict – A Literature Review
Predicting future AI failures from historic examples
Unexplainability and Incomprehensibility of Artificial Intelligence
2020:
An AGI Modifying Its Utility Function in Violation of the Strong Orthogonality Thesis
Artificial General Intelligence: 13th International Conference, AGI 2020, St. Petersburg, Russia, September 16–19, 2020, Proceedings
Artificial Stupidity: Data We Need to Make Machines Our Equals
Chess as a Testing Grounds for the Oracle Approach to AI Safety
Human $\neq$ AGI.
On Controllability of AI
Special Issue “On Defining Artificial Intelligence”—Commentaries and Author’s Response
Transdisciplinary AI Observatory - Retrospective Analyses and Future-Oriented Contradistinctions
2021:
AI Risk Skepticism
Impossibility Results in AI: A Survey
Uncontrollability of Artificial Intelligence
Marcus Hutter [] h-index: 39 weighted safety cite count: 62
2015:
Sequential Extensions of Causal and Evidential Decision Theory
2016:
Avoiding Wireheading with Value Reinforcement Learning
Death and Suicide in Universal Artificial Intelligence
Practical Agents and Fundamental Challenges
Self-Modification of Policy and Utility Function in Rational Agents
Universal Artificial Intelligence-Practical Agents and Fundamental Challenges
2017:
A Game-Theoretic Analysis of the Off-Switch Game
2018:
AGI Safety Literature Review
2019:
A Causal Influence Diagram Perspective
Asymptotically Unambitious Artificial General Intelligence
Reward tampering problems and solutions in reinforcement learning: a causal influence diagram perspective
2020:
Curiosity Killed or Incapacitated the Cat and the Asymptotically Optimal Agent
Curiosity Killed the Cat and the Asymptotically Optimal Agent
Pessimism About Unknown Unknowns Inspires Conservatism
2021:
Intelligence and Unambitiousness Using Algorithmic Information Theory
2022:
Advanced Artificial Agents Intervene in the Provision of Reward
Beyond Bayes-optimality: meta-learning what you know you don't know
Sam Bowman ['NYU'] h-index: 16 weighted safety cite count: 60
2021:
The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail
2022:
Constitutional AI: Harmlessness from AI Feedback
Discovering Language Model Behaviors with Model-Written Evaluations
Language Models (Mostly) Know What They Know
Measuring Progress on Scalable Oversight for Large Language Models
Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions
Two-Turn Debate Doesn't Help Humans Answer Hard Reading Comprehension Questions
What Do NLP Researchers Believe? Results of the NLP Community Metasurvey
2023:
Eight Things to Know about Large Language Models
Inverse Scaling: When Bigger Isn't Better
Measuring Faithfulness in Chain-of-Thought Reasoning
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
S. Legg [] h-index: 29 weighted safety cite count: 60
2017:
AI Safety Gridworlds
Deep Reinforcement Learning from Human Preferences
2018:
Measuring and avoiding side effects using relative reachability
Penalizing Side Effects using Stepwise Relative Reachability
Scalable agent alignment via reward modeling: a research direction
2019:
Learning Human Objectives by Evaluating Hypothetical Behavior
Modeling AGI Safety Frameworks with Causal Influence Diagrams
Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings
2020:
Avoiding Side Effects By Considering Future Tasks
Avoiding Tampering Incentives in Deep RL via Decoupled Approval
Quantifying Differences in Reward Functions
REALab: An Embedded Perspective on Tampering
Special Issue “On Defining Artificial Intelligence”—Commentaries and Author’s Response
The Incentives that Shape Behaviour
2021:
Agent Incentives: A Causal Perspective
Causal Analysis of Agent Behavior for AI Safety
Model-Free Risk-Sensitive Reinforcement Learning
2022:
Beyond Bayes-optimality: meta-learning what you know you don't know
Safe Deep RL in 3D Environments using Human Feedback
David Krueger [] h-index: 18 weighted safety cite count: 53
2018:
Scalable agent alignment via reward modeling: a research direction
2019:
M ISLEADING META-OBJECTIVES AND HIDDEN INCENTIVES FOR DISTRIBUTIONAL SHIFT
2020:
AI Research Considerations for Human Existential Safety (ARCHES)
Hidden Incentives for Auto-Induced Distributional Shift
Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims
2021:
Goal Misgeneralization in Deep Reinforcement Learning
2022:
Broken Neural Scaling Laws
Defining and Characterizing Reward Hacking
2023:
Characterizing Manipulation from AI Systems
Harms from Increasingly Agentic Algorithmic Systems
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Dan Hendrycks ['UC Berkeley'] h-index: 29 weighted safety cite count: 51
2021:
A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges
Certified Adversarial Defenses Meet Out-of-Distribution Corruptions: Benchmarking Robustness and Simple Baselines
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
Unsolved Problems in ML Safety
What Would Jiminy Cricket Do? Towards Agents That Behave Morally
2022:
A Spectral View of Randomized Smoothing Under Common Corruptions: Benchmarking and Improving Certified Robustness
Actionable Guidance for High-Consequence AI Risk Management: Towards Standards Addressing AI Catastrophic Risks
Forecasting Future World Events with Neural Networks
How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios
OpenOOD: Benchmarking Generalized Out-of-Distribution Detection
Scaling Out-of-Distribution Detection for Real-World Settings
Supplementary Materials for PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
X-Risk Analysis for AI Research
2023:
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Natural Selection Favors AIs over Humans
A. Dafoe [] h-index: 24 weighted safety cite count: 50
2017:
When Will AI Exceed Human Performance? Evidence from AI Experts
2018:
Public Policy and Superintelligent AI : A Vector Field Approach 1 ( 2018 ) ver
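If anyone wants to add papers to the seed list and rerun the script, the sketch below shows one way to look up a paper's Semantic Scholar ID by title. It uses the same Graph API's public /paper/search endpoint; find_paper_id is just an illustrative helper name, and no API key should be needed at this volume.

import requests

def find_paper_id(title):
    # Search the Semantic Scholar Graph API by title and print candidate paper IDs
    response = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": title, "fields": "title,year", "limit": 3},
    )
    response.raise_for_status()
    for hit in response.json().get("data", []):
        print(hit["paperId"], hit["year"], hit["title"])

# Example: find_paper_id("Concrete Problems in AI Safety")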
Nope. It ain't happening. Can't be done. In fact, I'm not sure ai safety research even exists. "universities"? what is that? you can't convince them to do stuff if they don't exist. absolutely not possible. nobody who bets yes on this market will possibly be able to contribute to it happening. even running on spite from me attempting to throw down the gauntlet with an exaggerated, sarcastic, obviously miscalibrated no bet, there's no way yes bettors could possibly make this happen. you can't prove me wrong, can't be done