What % of alignment forum karma will be pro-interpretability vs anti this year?
Resolved YES on Sep 24

On 2024/09/13 I will sample from all posts on the Alignment Forum published between 2023/09/13 and 2024/09/13 that express an opinion on whether prosaic interpretability is net useful for aligning future, dangerous AI, weighted by their karma. (So a post with 4 karma is twice as likely to be picked as one with 2 karma.)

If the sampled post contributes to prosaic interpretability or is in favor of past/future interpretability research, this question resolves to "yes".

I won't vote on this. I hope, but do not guarantee, to maintain an updated list of the posts I'll sample from, with their labels, somewhere here.
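To make the weighting concrete, here is a minimal sketch (separate from the actual resolution script further down; the two posts are made up) that empirically checks the 2:1 ratio claimed above:

python sketch:

import random

# Two hypothetical posts: one with 4 karma, one with 2 karma.
titles = ["four-karma post", "two-karma post"]
karma = [4, 2]

# Sample many times; the 4-karma post should come up about twice as often.
draws = [random.choices(titles, weights=karma, k=1)[0] for _ in range(60_000)]
ratio = draws.count("four-karma post") / draws.count("two-karma post")
print(f"empirical ratio: {ratio:.2f}")  # should be close to 2.0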


Looking back, this question wasn't very thrilling.
I do think there is value in predicting next year's Alignment Forum zeitgeist, but ideally I'd have better questions to ask.
Maybe I will just repeat this exact question again, though.

It's clearly positive on interp as it contributes to the field itself.
Also, it includes this snippet encouraging future work:
> It'd be good to investigate a better L1 penalty than L1(sqrt(x)). This can be done empirically by throwing lots of L1 loss terms at the wall, or there may be a more analytical solution. Let me know if you have any ideas! Comments and dms are welcome. 
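For context on that snippet, here is a minimal sketch of what an "L1(sqrt(x))" sparsity penalty on SAE latent activations could look like; this assumes PyTorch, and the coefficient and variable names are illustrative rather than taken from the post:

python sketch:

import torch

def sqrt_l1_penalty(latent_acts: torch.Tensor, coeff: float = 1e-3) -> torch.Tensor:
    # Apply the L1 penalty to the square root of the activations instead of the raw values.
    # Since sqrt is sub-linear, activations above 1 are penalized less than under plain L1,
    # while activations below 1 are penalized more.
    return coeff * latent_acts.abs().sqrt().sum()

# Example with random stand-in activations: a batch of 8 samples, 16 SAE latents.
acts = torch.rand(8, 16)
print(sqrt_l1_penalty(acts))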

I selected text from the interpretability tag, edited out redundant information with vim, and then had Python choose the post I will read.
The chosen post is "Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features"

python script:

import random

with open("./manifold-afxai-list.txt", "r") as file:
    text = file.read()

weights = []
titles = []
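# Each blank-line-separated entry in the file is the post's karma on the first line,
# followed by its title on the second line (see the extracted list below).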
for entry in text.split("\n\n"):
    title_weight = entry.split("\n")
    weights.append(int(title_weight[0]))
    titles.append(title_weight[1])

assert len(weights) == len(titles)
print("There are", len(titles), "new posts on AF with the interp tag since last year.")

# Setting the resolution date as a seed.
# This does not work if you anticipated I'd do this, but for this one time you get a tiny bit of evidence that I'm not cherry-picking a seed I like.
# In future instances it would be nice to use a public rng (one option is sketched below the script).
# Though, I'm not 100% sure your device would get the same result with the same seed...
random.seed(20240913)

# I never ran this before having chosen the above seed.
[title] = random.choices(titles, weights=weights, k=1)
print("The chosen post is:", title)

Extracted and edited text from AF; above every title is the post's karma:

13
AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization

32
Showing SAE Latents Are Not Atomic Using Meta-SAEs

24
Measuring Structure Development in Algorithmic Transformers

7
Finding Deception in Language Models

30
Calendar feature geometry in GPT-2 layer 8 residual stream SAEs

17
Extracting SAE task features for in-context learning

55
You can remove GPT2’s LayerNorm by fine-tuning for an hour

25
Self-explaining SAE features

94
The ‘strong’ feature hypothesis could be wrong

4
Limitations on the Interpretability of Learned Features from Sparse Dictionary Learning

25
Pacing Outside the Box: RNNs Learn to Plan in Sokoban

29
BatchTopK: A Simple Improvement for TopK-SAEs

25
Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

29
JumpReLU SAEs + Early Access to Gemma 2 SAEs

14
Truth is Universal: Robust Detection of Lies in LLMs

54
A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

25
SAEs (usually) Transfer Between Base and Chat Models

20
An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs

18
Stitching SAEs of different sizes

53
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

35
[Interim research report] Activation plateaus & sensitive directions in GPT2

40
Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

43
OthelloGPT learned a bag of heuristics

39
Interpreting Preference Models w/ Sparse Autoencoders

12
Representation Tuning

43
Compact Proofs of Model Performance via Mechanistic Interpretability

103
SAE feature geometry is outside the superposition hypothesis

18
Attention Output SAEs Improve Circuit Analysis

1
Analysing Adversarial Attacks with Linear Probing

10
SAEs Discover Meaningful Features in the IOI Task

59
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network

20
Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity.

42
Apollo Research 1-year update

22
Announcing Human-aligned AI Summer School

69
EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

56
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

28
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

4
Visualizing neural network planning

28
Mechanistic Interpretability Workshop Happening at ICML 2024!

33
Transcoders enable fine-grained interpretable circuit analysis for language models

75
Refusal in LLMs is mediated by a single direction

33
Superposition is not "just" neuron polysemanticity

39
Improving Dictionary Learning with Gated Sparse Autoencoders

25
ProLU: A Nonlinearity for Sparse Autoencoders

40
[Full Post] Progress Update #1 from the GDM Mech Interp Team

36
[Summary] Progress Update #1 from the GDM Mech Interp Team

57
Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

144
Transformers Represent Belief State Geometry in their Residual Stream

44
Sparsify: A mechanistic interpretability research agenda

41
A Selection of Randomly Selected SAE Features

30
SAE-VIS: Announcement Post

51
SAE reconstruction errors are (empirically) pathological

43
Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

36
Stagewise Development in Neural Networks

13
AtP*: An efficient and scalable method for localizing LLM behaviour to components

10
Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features 

18
Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems

25
Understanding SAE Features with the Logit Lens

33
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

1
What’s in the box?! – Towards interpretability by distinguishing niches of value within neural networks.

81
Timaeus's First Four Months

6
Difficulty classes for alignment properties

45
Addressing Feature Suppression in SAEs

28
Attention SAEs Scale to GPT-2 Small

39
Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

84
Toward A Mathematical Framework for Computation in Superposition

35
Sparse Autoencoders Work on Attention Layer Outputs

13
Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

10
Mech Interp Challenge: January - Deciphering the Caesar Cipher Model

6
Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5)

12
Fact Finding: How to Think About Interpreting Memorisation (Post 4)

10
Fact Finding: Simplifying the Circuit (Post 2)

48
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

4
Assessment of AI safety agendas: think about the downside risk

13
Interpreting the Learning of Deceit

27
Finding Sparse Linear Connections between Features in LLMs

34
Refusal mechanisms: initial experiments with Llama-2-7b-chat

55
Deep Forgetting & Unlearning for Safely-Scoped LLMs

32
Intro to Superposition & Sparse Autoencoders (Colab exercises)

16
Incidental polysemanticity

29
Polysemantic Attention Head in a 4-Layer Transformer

32
Growth and Form in a Toy Model of Superposition

6
Mech Interp Challenge: November - Deciphering the Cumulative Sum Model

42
Charbel-Raphaël and Lucius discuss interpretability

13
Machine Unlearning Evaluations as Interpretability Benchmarks

75
Announcing Timaeus

29
Thoughts On (Solving) Deep Deception

1
Can we isolate neurons that recognize features vs. those which have some other role?

41
Revealing Intentionality In Language Models Through AdaVAE Guided Sampling

42
Investigating the learning coefficient of modular addition: hackathon project

38
[Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small

37
Paper: Understanding and Controlling a Maze-Solving Policy Network

14
Attributing to interactions with GCPD and GWPD

36
You’re Measuring Model Complexity Wrong

56
Comparing Anthropic's Dictionary Learning to Ours

110
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

7
Ideation and Trajectory Modelling in Language Models

7
Mech Interp Challenge: October - Deciphering the Sorted List Model

16
New Tool: the Residual Stream Viewer

28
High-level interpretability: detecting an AI's objectives

9
Announcing the CNN Interpretability Competition

16
Impact stories for model internals: an exercise for interpretability researchers

63
Sparse Autoencoders Find Highly Interpretable Directions in Language Models

26
Interpretability Externalities Case Study - Hungry Hungry Hippos

23
Three ways interpretability could be impactful

11
Uncovering Latent Human Wellbeing in LLM Embeddings

14
Mech Interp Challenge: September - Deciphering the Addition Model

I'll probably resolve this by sampling from posts with the interpretability tag: https://www.alignmentforum.org/tag/interpretability-ml-and-ai?sortedBy=new. Since there are 84 posts between now and one year ago, I'll first sample, then read the single post the sample landed on, and resolve this market based on its sentiment.
If it is neither pro nor anti, I'll resample (a minimal sketch of this is below).

Any objections? If not, expect me to resolve tomorrow or the day thereafter.
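A minimal sketch of that resample-until-decisive rule; the sentiment_of helper is hypothetical (in practice I read the post myself), and the post list is a placeholder:

python sketch:

import random

def sentiment_of(title: str) -> str:
    # Hypothetical stand-in for reading the post and judging its stance;
    # it returns a random label here just so the sketch runs end to end.
    return random.choice(["pro", "anti", "neutral"])

titles = ["Post A", "Post B", "Post C"]  # placeholder titles
weights = [4, 2, 10]                     # placeholder karma

rng = random.Random(20240913)
while True:
    [title] = rng.choices(titles, weights=weights, k=1)
    verdict = sentiment_of(title)
    if verdict != "neutral":  # resample posts that are neither pro nor anti
        break

print(title, "->", verdict)  # resolves YES if the verdict is "pro"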

@Jono3h happy 1-year manifold usage 🥳
