Will I have access to a program that can reliably determine pronoun and verb referents by the end of 2023?
closes 2024 · 82% chance

I want to be able to have a computer program change the gender of a person in a passage. This is a hard problem. Consider the sentence:

Alice went to the store because she was thirsty.

Changing the gender of the subject to male requires changing the pronoun "she" to "he".

Bob went to the store because he was thirsty.

Changing their gender to neutral is worse, because now the verb "was" must be conjugated differently as well.

Alex went to the store because they were thirsty.

The hard part is figuring out which pronouns refer to which people. (Pronoun coreference.) The best tool I've found for this so far is Hugging Face's NeuralCoref 4.0, which, like all contemporary AI tools tasked with real-world problems, does just well enough to give you hope, then completely lets you down as soon as it encounters the slightest hiccup.
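
For reference, here's roughly what querying NeuralCoref looks like (a minimal sketch; it needs spaCy 2.x, and the extension attributes below are the ones the library documents):

```python
# Minimal NeuralCoref sketch (requires spaCy 2.x plus the neuralcoref package).
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

doc = nlp("Alice went to the store because she was thirsty.")
print(doc._.has_coref)        # whether any coreference cluster was found
print(doc._.coref_clusters)   # e.g. [Alice: [Alice, she]]
print(doc._.coref_resolved)   # text with pronouns replaced by their antecedents
```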

The other problem is similar: figuring out which verbs need to be conjugated and which people they refer to. (Subject-verb agreement.) In the sentence:

Alice was thirsty, so she went to the store and was disappointed that apple juice was out of stock.

The first "was" refers to the noun "Alice", and would not need to change if Alice's gender changed to neutral. The next verb is "went", which also doesn't need to be conjugated. But the next "was" refers to the pronoun "she", and would need to change to "were" if that pronoun became the linguistically-plural "they", as in:

Alex was thirsty, so they went to the store and were disappointed that apple juice was out of stock.

And then the last "was" refers to the noun "apple juice", and also does not need to be conjugated.
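
A dependency parse gets at part of this: for each verb, ask which token the parser marks as its subject, and only re-conjugate the verbs whose subject is (or corefers with) the person being changed. A rough spaCy sketch of that idea (not a complete solution; conjoined verbs in particular often share a subject that the parse attaches only to the first verb, which is exactly where this breaks down):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alice was thirsty, so she went to the store and was "
          "disappointed that apple juice was out of stock.")

# For each verb or auxiliary, list the tokens the parser marks as its subject.
for token in doc:
    if token.pos_ in ("VERB", "AUX"):
        subjects = [child.text for child in token.children
                    if child.dep_ in ("nsubj", "nsubjpass")]
        print(f"{token.text!r} -> subjects: {subjects}")
```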

Both of these problems are quite challenging. In the general case, they require a semantic understanding of the sentence, not just a syntactic one, and this sort of problem is actually used as a test for AI intelligence. (/IsaacKing/will-ai-pass-the-winograd-schema-ch)

Luckily for me, the environment in which I need this system to run is quite restricted and non-adversarial. While I still doubt there are any simple rules that can do what I need without some form of machine learning involved, I can avoid giving the system any intentionally-challenging examples like Winograd schemas. I just need it to be able to handle the sort of phrases that are likely to show up in my MTG rules question database.

Will I be able to build or gain access to a system that can do this to my satisfaction by the end of 2023?

It doesn't need to be perfect, it just needs to be good enough that it saves me and the other question-writers some effort. (Right now we have to manually label each pronoun and verb with the person it refers to.)

Jacy Reese Anthis bought Ṁ71 of NO

90% on this market seems crazy high to me. Coreference is a very hard NLP problem on which even fine-tuned models are still barely above 80% on benchmarks, and keep in mind that (i) real-world performance tends to be much worse than benchmark performance, and (ii) even if @IsaacKing got 95% accuracy, that still might not save the question-writers much effort, because they would still need to manually read each sentence. It might save effort if the model also output well-calibrated confidence estimates (e.g., "I'm 99.9% confident Sentence B is correctly labeled, but Sentences D and G need human checks"), assuming 1 mistake in every 1,000 questions is acceptable, but very few LLM applications have anywhere near that sort of reliability reporting.

Jacy Reese Anthis bought Ṁ0 of YES

I think the best counterarguments to my position are that pronoun coreference and subject-verb agreement in the MTG rules question database might be particularly easy coreference problems and that human scanning of the 'first draft' LLM-edited passage is (much) easier than editing it by hand, even if the draft has errors more than 10% or 20% of the time. Those currently seem unconvincing. If someone put 100 MTG prompts into GPT-4 with a decent prompt and showed it got >90% of them right, I'd update my view by at least a few percentage points.

ShadowyZephyr is predicting YES at 91%

"Reliable" is very loosely defined here. I still want to see one example where GPT-4 fails, because I don't think it's been formally benchmarked on coreference resolution.

firstuserhere

@JacyAnthis Wondering why NO when the market's constraints are loose enough to make GPT-4 a sufficient solution?

Jacy Reese Anthis is predicting NO at 93%

@firstuserhere I disagree that there is anything close to a sufficient solution on most reasonable operationalizations of the resolution criteria. I was writing a comment explaining my bet when you posted this comment; see that comment for more detail.

Isaac

We ended up switching to entirely gender-neutral pronouns in the software this question was originally about, so I'm no longer investigating this problem for that purpose. I can either resolve this market N/A, or leave it open and try to resolve it based on whether a program appears that would have been good enough for the task. What do traders think would be better?

ShadowyZephyr is predicting YES at 92%

@IsaacKing If you make a market, don’t resolve it N/A for a reason that was never listed in the criteria. You made a market; now stick to it. It doesn’t matter whether you still care about the problem.

Also, unless you can provide an example where GPT-4 often fails no matter how the prompt is written, this should already resolve YES.

firstuserhere is predicting YES at 92%

@ShadowyZephyr Agreed. If you don't have access to GPT-4, feel free to send the lists of problems to me and I can share the responses.

ShadowyZephyr is predicting YES at 93%

@firstuserhere

You don’t need API access; there are about 10 different services that let you pay for it. Isaac can definitely get access.

ShadowyZephyr bought Ṁ20 of YES

Is GPT-4 not sufficient to resolve this to YES?

firstuserhere bought Ṁ100 of YES

@ShadowyZephyr Yeah, GPT-4 seems like enough to resolve this market.

Jacy Reese Anthis bought Ṁ30 of NO

@ShadowyZephyr @firstuserhere The market is based on Isaac's opinion, so we can't really say without him, but for what it's worth, computer scientists do not yet see coreference resolution as a solved problem—and certainly not one solved by GPT-4. See, for example, this leaderboard.

firstuserhere is predicting YES at 94%

@JacyAnthis I'm buying based on "it doesn't need to be perfect, it just needs to be good enough that it saves me and the other question-writers some effort".

Jacy Reese Anthis bought Ṁ39 of NO

@firstuserhere One issue is that this is hard to partially automate. It seems like you still have to look through each sentence manually even if the AI does a first pass with, say, a 90% success rate. (I believe we're still below 90% on benchmarks, which often means 80% or lower in the real world because models overfit.)

ShadowyZephyr is predicting YES at 90%

@JacyAnthis Can you guys give an example where GPT-4 fails?

Isaac sold Ṁ7 of NO

Here's GPT-3.5 hallucinating a new pronoun, then correctly identifying that that pronoun occurs nowhere in the sentence I asked about.

S G is predicting YES at 90%

@IsaacKing It works with GPT-4.

ShadowyZephyr is predicting YES at 90%

@IsaacKing I'm sure some prompt tweaking could get this to work on 3.5. (Maybe even just giving an example.) It seems to have misinterpreted the pronoun question as a request to list female pronouns because the speaker is female, the way people often say "my pronouns are she/her".

🦔

@JacyAnthis Those leaderboards aren’t active and don’t have entries for recent LLMs. They also don’t show the ceiling on performance, which is well under 100% for typical NLP datasets. I’d bet that a well-designed prompt for GPT-4 is very close to ceiling and adequate for most purposes.

Noa Nabeshima bought Ṁ100 of YES

I'd like you to change the gender of a person in a passage.

Before: Alice went to the store because she was thirsty.

Command: Replace Alice with a Bob, male.

After: Bob went to the store because he was thirsty.

Before: Bob went to the store because he was thirsty.

Command: Replace Bob with Alex, gender-neutral.

After: Alex went to the store because they were thirsty.

Before: Alice was thirsty, so she went to the store and was dissapointed that apple juice was out of stock.

Command: Replace Alice with Alexei, gender-neutral.

After: Alexei was thirsty, so they went to the store and were disappointed that apple juice was out of stock.

Before: If you can give Isaac a prompt where ChatGPT consistently gets this correct on the sort of sentences he cares about, that will be sufficient to resolve this market to YES. But given Isaac's experience with ChatGPT, he highly doubts it'll be able to do that.

Command: Replace Isaac with Marilyn, gender-neutral.

After: If you can give Marilyn a prompt where ChatGPT consistently gets this correct on the sort of sentences they care about, that will be sufficient to resolve this market to YES. But given Marilyn's experience with ChatGPT, they highly doubt it'll be able to do that.


Last "After:" was generated. Try out this prompt with text-davinci-003, maybe?

Noa Nabeshima is predicting YES at 86%

@NoaNabeshima Also some spacing was dropped when I copied this, you might want to add spacing between examples.

Mira

Not really a problem with your market, but there has to be a way to make an uncomputable sentence out of this requirement...

"Bob will go to the store. Alex's gender is male iff the Turing Machine described by the sentence '[...]' halts; he will go to the store.", and then figuring out if "he" binds Alex or Bob requires solving the halting problem.

Isaac is predicting YES at 83%

@Mira Magic is Turing complete, so not impossible

Isaac is predicting NO at 84%

@Mira Nice try, but the gender of the participants is not a part of the sentence to be parsed, it's supplied externally.

S G bought Ṁ30 of YES

ChatGPT seems to work already:

Isaac bought Ṁ23 of NO

@SG It fails in that very screenshot! It changed the noun "Alice" into the pronoun "He". Nouns should never change; I only want pronouns and verbs to change.

Isaac is predicting NO at 50%

If you can give me a prompt where ChatGPT consistently gets this correct on the sort of sentences I care about, that will be sufficient to resolve this market to YES. But given my experience with ChatGPT, I highly doubt it'll be able to do that.

S G is predicting YES at 50%
Isaac is predicting NO at 59%

@SG Hmm. Playing around with it a bit myself, it's better than I expected. I'll need to test it more extensively before resolving this to YES, but it's promising. I'd want some very consistent results before I'd feel safe putting any output from ChatGPT up on a production site.

Isaac sold Ṁ15 of NO

Oh, I'd also need there to be an API, which I don't think ChatGPT has yet.

But maybe the original GPT-3 can also do this, and that does have an API I can use.

L is predicting YES at 83%

@IsaacKing I don't think GPT3.5 is at the level you ask for, but I do think further iteration will get models that are.

Isaac is predicting NO at 83%

@L Have an example where GPT-3 fails?

Tоm bought Ṁ100 of YES

@IsaacKing Now that there's a ChatGPT API, is it sufficient to resolve this question YES with SG's prompt?

ShadowyZephyr is predicting YES at 89% (edited)

@IsaacKing GPT-3.5 has an API, and GPT-4 has an API that is waitlisted. However, the market description does not mention an API of any kind, only that you'll "gain access" to a system that can do this. Do you have an example where GPT-3.5 fails when specified to keep proper nouns?
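
For what it's worth, a minimal sketch of what calling it through the API looks like (the system prompt wording here is just an illustration, not a tested recipe):

```python
import openai

openai.api_key = "sk-..."  # your own key

SYSTEM = ("Rewrite the user's sentence so the named person uses they/them pronouns. "
          "Only change pronouns and verb conjugations; never change nouns or proper names.")

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Alice was thirsty, so she went to the store "
                                    "and was disappointed that apple juice was out of stock."},
    ],
    temperature=0,
)
print(resp["choices"][0]["message"]["content"])
```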
