On 2024/09/13 I will sample one post, weighted by karma, from all posts on the Alignment Forum published between 2023/09/13 and 2024/09/13 that express an opinion on whether prosaic interpretability is net useful for aligning future, dangerous AI. (So a post with 4 karma is twice as likely to get picked as one with 2 karma.)
If the sampled post contributes to prosaic interpretability or is in favor of past/future interpretability research, this question resolves to "yes".
I won't vote on this. I hope, but do not guarantee, to maintain an updated list of the posts I'll sample from, with their labels, somewhere here.
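To make the weighting concrete, here is a tiny sketch (my own illustration, not part of the resolution procedure itself) of how karma-weighted sampling behaves: a 4-karma post should get picked roughly twice as often as a 2-karma one.

```python
import random
from collections import Counter

# Toy illustration of karma-weighted sampling: the 4-karma post should be
# chosen about twice as often as the 2-karma post over many draws.
posts = ["post_a (4 karma)", "post_b (2 karma)"]
counts = Counter(random.choices(posts, weights=[4, 2], k=60_000))
print(counts)  # roughly 40000 vs 20000
```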
Looking back, this question wasn't very thrilling.
I do think there is use in predicting next year's Alignment Forum zeitgeist, but ideally I'd have better questions to ask.
Maybe I will just repeat this exact question again, though.
The chosen post is clearly positive on interp, as it contributes to the field itself.
Also, it includes this snippet encouraging future work:
> It'd be good to investigate a better L1 penalty than L1(sqrt(x)). This can be done empirically by throwing lots of L1 loss terms at the wall, or there may be a more analytical solution. Let me know if you have any ideas! Comments and dms are welcome.
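As I read it, the penalty the post refers to is the standard L1 sparsity loss applied to the square root of the SAE activations rather than to the activations directly. A minimal sketch of that idea (my own illustration with made-up toy values, not code from the post):

```python
import torch

def l1_penalty(acts: torch.Tensor) -> torch.Tensor:
    # Standard L1 sparsity penalty on SAE feature activations.
    return acts.abs().sum()

def sqrt_l1_penalty(acts: torch.Tensor) -> torch.Tensor:
    # L1(sqrt(x)): the same penalty applied to the square root of the
    # activations, which weights small activations relatively more heavily
    # and large activations less heavily than plain L1.
    return acts.abs().sqrt().sum()

acts = torch.tensor([0.01, 0.5, 2.0])  # toy activations, not from the post
print(l1_penalty(acts).item(), sqrt_l1_penalty(acts).item())
```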
I selected text from the interpretability tag, edited out redundant information with vim, and then had Python choose the post I will read.
The chosen post is "Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features".
Python script:
import random

with open("./manifold-afxai-list.txt", "r") as file:
    text = file.read()

weights = []
titles = []
for entry in text.split("\n\n"):
    title_weight = entry.split("\n")
    weights.append(int(title_weight[0]))
    titles.append(title_weight[1])
assert len(weights) == len(titles)
print("There are", len(titles), "new posts on AF with the interp tag since last year.")

# Setting the resolution date as the seed.
# This does not work if you anticipate that I'd do this, but for this time you get a tiny bit of evidence that I'm not cherry-picking a seed I like.
# In future instances it would be nice to use a public RNG.
# Though, I'm not 100% sure your device would get the same result with the same seed...
random.seed(20240913)
# I never ran this before having chosen the above seed.
[title] = random.choices(titles, weights=weights, k=1)
print("The chosen post is:", title)
Extracted and edited text from AF; above every title is the post's karma:
13
AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization
32
Showing SAE Latents Are Not Atomic Using Meta-SAEs
24
Measuring Structure Development in Algorithmic Transformers
7
Finding Deception in Language Models
30
Calendar feature geometry in GPT-2 layer 8 residual stream SAEs
17
Extracting SAE task features for in-context learning
55
You can remove GPT2’s LayerNorm by fine-tuning for an hour
25
Self-explaining SAE features
94
The ‘strong’ feature hypothesis could be wrong
4
Limitations on the Interpretability of Learned Features from Sparse Dictionary Learning
25
Pacing Outside the Box: RNNs Learn to Plan in Sokoban
29
BatchTopK: A Simple Improvement for TopK-SAEs
25
Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions
29
JumpReLU SAEs + Early Access to Gemma 2 SAEs
14
Truth is Universal: Robust Detection of Lies in LLMs
54
A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
25
SAEs (usually) Transfer Between Base and Chat Models
20
An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs
18
Stitching SAEs of different sizes
53
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
35
[Interim research report] Activation plateaus & sensitive directions in GPT2
40
Decomposing the QK circuit with Bilinear Sparse Dictionary Learning
43
OthelloGPT learned a bag of heuristics
39
Interpreting Preference Models w/ Sparse Autoencoders
12
Representation Tuning
43
Compact Proofs of Model Performance via Mechanistic Interpretability
103
SAE feature geometry is outside the superposition hypothesis
18
Attention Output SAEs Improve Circuit Analysis
1
Analysing Adversarial Attacks with Linear Probing
10
SAEs Discover Meaningful Features in the IOI Task
59
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
20
Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity.
42
Apollo Research 1-year update
22
Announcing Human-aligned AI Summer School
69
EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024
56
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
28
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
4
Visualizing neural network planning
28
Mechanistic Interpretability Workshop Happening at ICML 2024!
33
Transcoders enable fine-grained interpretable circuit analysis for language models
75
Refusal in LLMs is mediated by a single direction
33
Superposition is not "just" neuron polysemanticity
39
Improving Dictionary Learning with Gated Sparse Autoencoders
25
ProLU: A Nonlinearity for Sparse Autoencoders
40
[Full Post] Progress Update #1 from the GDM Mech Interp Team
36
[Summary] Progress Update #1 from the GDM Mech Interp Team
57
Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight
144
Transformers Represent Belief State Geometry in their Residual Stream
44
Sparsify: A mechanistic interpretability research agenda
41
A Selection of Randomly Selected SAE Features
30
SAE-VIS: Announcement Post
51
SAE reconstruction errors are (empirically) pathological
43
Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders
36
Stagewise Development in Neural Networks
13
AtP*: An efficient and scalable method for localizing LLM behaviour to components
10
Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features
18
Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems
25
Understanding SAE Features with the Logit Lens
33
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To
1
What’s in the box?! – Towards interpretability by distinguishing niches of value within neural networks.
81
Timaeus's First Four Months
6
Difficulty classes for alignment properties
45
Addressing Feature Suppression in SAEs
28
Attention SAEs Scale to GPT-2 Small
39
Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small
84
Toward A Mathematical Framework for Computation in Superposition
35
Sparse Autoencoders Work on Attention Layer Outputs
13
Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization
10
Mech Interp Challenge: January - Deciphering the Caesar Cipher Model
6
Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5)
12
Fact Finding: How to Think About Interpreting Memorisation (Post 4)
10
Fact Finding: Simplifying the Circuit (Post 2)
48
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)
4
Assessment of AI safety agendas: think about the downside risk
13
Interpreting the Learning of Deceit
27
Finding Sparse Linear Connections between Features in LLMs
34
Refusal mechanisms: initial experiments with Llama-2-7b-chat
55
Deep Forgetting & Unlearning for Safely-Scoped LLMs
32
Intro to Superposition & Sparse Autoencoders (Colab exercises)
16
Incidental polysemanticity
29
Polysemantic Attention Head in a 4-Layer Transformer
32
Growth and Form in a Toy Model of Superposition
6
Mech Interp Challenge: November - Deciphering the Cumulative Sum Model
42
Charbel-Raphaël and Lucius discuss interpretability
13
Machine Unlearning Evaluations as Interpretability Benchmarks
75
Announcing Timaeus
29
Thoughts On (Solving) Deep Deception
1
Can we isolate neurons that recognize features vs. those which have some other role?
41
Revealing Intentionality In Language Models Through AdaVAE Guided Sampling
42
Investigating the learning coefficient of modular addition: hackathon project
38
[Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small
37
Paper: Understanding and Controlling a Maze-Solving Policy Network
14
Attributing to interactions with GCPD and GWPD
36
You’re Measuring Model Complexity Wrong
56
Comparing Anthropic's Dictionary Learning to Ours
110
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
7
Ideation and Trajectory Modelling in Language Models
7
Mech Interp Challenge: October - Deciphering the Sorted List Model
16
New Tool: the Residual Stream Viewer
28
High-level interpretability: detecting an AI's objectives
9
Announcing the CNN Interpretability Competition
16
Impact stories for model internals: an exercise for interpretability researchers
63
Sparse Autoencoders Find Highly Interpretable Directions in Language Models
26
Interpretability Externalities Case Study - Hungry Hungry Hippos
23
Three ways interpretability could be impactful
11
Uncovering Latent Human Wellbeing in LLM Embeddings
14
Mech Interp Challenge: September - Deciphering the Addition Model
I'll probably resolve this by sampling from posts with the interpretability tag: https://www.alignmentforum.org/tag/interpretability-ml-and-ai?sortedBy=new. Since there are 84 posts between now and one year ago, I'll first sample, then read the single post the sample landed on, and resolve this market based on its sentiment.
If it is neither pro nor anti, I'll resample.
Any objections? If not, expect me to resolve tomorrow or the day thereafter.
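In code, the procedure I have in mind looks roughly like this (a sketch with placeholder names; I'm assuming a post that takes no clear stance gets dropped before resampling, which the comment above doesn't actually specify):

```python
import random

def resolve(posts, stance_of):
    """posts: list of (title, karma) pairs; stance_of: title -> 'pro', 'anti', or 'neutral'."""
    remaining = list(posts)
    while remaining:
        # Karma-weighted draw of a single post.
        [picked] = random.choices(remaining, weights=[karma for _, karma in remaining], k=1)
        stance = stance_of(picked[0])
        if stance in ("pro", "anti"):
            return picked[0], stance
        # Assumption: drop a neutral post before resampling.
        remaining.remove(picked)
    return None, "unresolved"
```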