Will I have access to a program that can reliably determine pronoun and verb referents by the end of 2023?
Resolved NO (Apr 16)

I want to be able to have a computer program change the gender of a person in a passage. This is a hard problem. Consider the sentence:

Alice went to the store because she was thirsty.

Changing the gender of the subject to male requires changing the pronoun "she" to "he".

Bob went to the store because he was thirsty.

Changing their gender to neutral is worse, because now the verb "was" must be conjugated differently as well.

Alex went to the store because they were thirsty.

The hard part is figuring out which pronouns refer to which people. (Pronoun coreference.) The best tool I've found for this so far is Hugging Face's NeuralCoref 4.0, which, as with all contemporary AI tools tasked with real-world problems, does just well enough to give you hope, then completely lets you down as soon as it encounters the slightest hiccup.
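To see why naive substitution isn't enough, here's a minimal rule-based sketch (pure Python, not NeuralCoref's actual API) that blindly swaps every feminine pronoun to a masculine one, with no idea which person each pronoun refers to:

```python
import re

# Naive pronoun substitution table. Even this mapping is already a
# simplification: "her" is ambiguous between "him" (object) and "his"
# (possessive); picking "his" here is wrong for sentences like "I saw her."
PRONOUN_MAP = {"she": "he", "her": "his", "hers": "his", "herself": "himself"}

def naive_swap(sentence: str) -> str:
    """Swap every feminine pronoun, preserving capitalization."""
    def replace(match):
        word = match.group(0)
        swapped = PRONOUN_MAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    pattern = r"\b(" + "|".join(PRONOUN_MAP) + r")\b"
    return re.sub(pattern, replace, sentence, flags=re.IGNORECASE)

# Works on the easy one-person case...
print(naive_swap("Alice went to the store because she was thirsty."))
# -> Alice went to the store because he was thirsty.

# ...but with two people it rewrites pronouns that may refer to the
# wrong person, which is exactly the coreference problem:
print(naive_swap("Alice thanked Beth because she was helpful."))
# -> Alice thanked Beth because he was helpful.
```

The second example shows the failure mode: without coreference resolution, there is no way to know whether "she" is Alice or Beth, so a blind swap is wrong whenever only one of them changes gender.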

The other problem is similar: figuring out which verbs need to be conjugated and which people they refer to. (Subject-verb agreement.) In the sentence:

Alice was thirsty, so she went to the store and was disappointed that apple juice was out of stock.

The first "was" refers to the noun "Alice", and would not need to change if Alice's gender changed to neutral. The next verb is "went", which also doesn't need to be conjugated. But the next "was" refers to the pronoun "she", and would need to change to "were" if that pronoun became the linguistically-plural "they", as in:

Alex was thirsty, so they went to the store and were disappointed that apple juice was out of stock.

And then the last "was" refers to the noun "apple juice", and also does not need to be conjugated.
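Given hand-labeled referents, which is what the question-writers currently produce manually, the rewrite step itself is mechanical. Here's a sketch assuming a hypothetical label format in which each pronoun and verb is paired with the thing it refers to (this is not the actual database format, and it assumes every labeled pronoun refers to the person being changed):

```python
PRONOUN_SWAP = {"she": "they", "he": "they"}
VERB_PLURAL = {"was": "were", "is": "are", "has": "have", "does": "do"}

def regender_neutral(tokens, person):
    """tokens: list of (word, referent) pairs, where referent names the
    person or noun that a pronoun/verb points at (None for other words).
    Swaps pronouns referring to `person` to singular they, and
    re-conjugates verbs whose subject is a swapped pronoun."""
    out = []
    for word, referent in tokens:
        lower = word.lower()
        if referent == person and lower in PRONOUN_SWAP:
            # Pronoun referring to the changed person: she -> they.
            word = PRONOUN_SWAP[lower]
        elif referent in PRONOUN_SWAP and lower in VERB_PLURAL:
            # Verb whose subject is the swapped pronoun must agree with
            # the linguistically-plural "they": was -> were.
            word = VERB_PLURAL[lower]
        out.append(word)
    return " ".join(out)

# The example sentence, with hand-labeled referents. Note the three
# different labels on "was": Alice, the pronoun, and apple juice.
labeled = [
    ("Alice", None), ("was", "alice"), ("thirsty,", None),
    ("so", None), ("she", "alice"), ("went", "alice"),
    ("to", None), ("the", None), ("store", None), ("and", None),
    ("was", "she"), ("disappointed", None), ("that", None),
    ("apple", None), ("juice", None), ("was", "apple juice"),
    ("out", None), ("of", None), ("stock.", None),
]

print(regender_neutral(labeled, "alice"))
```

Only the "was" labeled with the pronoun changes to "were"; the ones labeled with "alice" and "apple juice" stay singular, matching the analysis above. (Name replacement is trivial and omitted.) The entire difficulty of the problem is producing those labels automatically instead of by hand.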

Both of these problems are quite challenging. In the general case, they require a semantic understanding of the sentence, not just a syntactic one, and this sort of problem is actually used as a test for AI intelligence. (/IsaacKing/will-ai-pass-the-winograd-schema-ch)

Luckily for me, the environment in which I need this system to run is quite restricted and non-adversarial. While I still doubt there are any simple rules that can do what I need without some form of machine learning involved, I can avoid giving the system any intentionally-challenging examples like Winograd schemas. I just need it to be able to handle the sort of phrases that are likely to show up in my MTG rules question database.

Will I be able to build or gain access to a system that can do this to my satisfaction by the end of 2023?

It doesn't need to be perfect, it just needs to be good enough that it saves me and the other question-writers some effort. (Right now we have to manually label each pronoun and verb with the person it refers to.)



Haven't found anything, and GPT-4 can't even do Winograd schemas consistently.

bought Ṁ10 of NO

Er, no-ope. At the current level of performance, I'd go for experting the hell out of this program (i.e., making it rule-based) if I really needed it to work, but that's not something you're likely to get before the 1st of January. So unless your "good enough" is very weak, I predict NO. (My overcautious Ṁ10 only reflects caution about your resolution criteria; if it were "good enough for my taste," I'd bet much more on NO.)

bought Ṁ71 of NO

90% on this market seems crazy high to me. Coreference is a very hard NLP problem on which even fine-tuned models are still barely above 80% on benchmarks, and keep in mind (i) real-world performance tends to be much worse than benchmark performance, and (ii) even if @IsaacKing got 95% accuracy, that still might not save the question-writers much effort, because they would still need to manually read each sentence. It might save effort if the model also emitted reliable confidence measures (e.g., "I'm 99.9% confident Sentence B is correctly labeled, but Sentences D and G need human checks"), assuming 1 mistake in every 1,000 questions is acceptable, but very few LLM applications have anywhere near that sort of reliability.
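For what it's worth, the triage workflow described in that parenthetical (auto-accepting high-confidence labels and flagging the rest for human checks) is trivial once calibrated per-sentence confidences exist; getting calibrated confidences at all is the hard part. A hypothetical sketch:

```python
def triage(predictions, threshold=0.999):
    """Split (sentence, confidence) pairs into auto-accepted rewrites and
    sentences flagged for manual review. Assumes calibrated confidences,
    which, as noted above, few current models actually provide."""
    accepted = [s for s, c in predictions if c >= threshold]
    flagged = [s for s, c in predictions if c < threshold]
    return accepted, flagged

# Illustrative data only; these confidences are made up.
sample = [("Sentence B", 0.9995), ("Sentence D", 0.82), ("Sentence G", 0.74)]
accepted, flagged = triage(sample)
print(accepted)  # ['Sentence B']
print(flagged)   # ['Sentence D', 'Sentence G']
```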

bought Ṁ0 of YES

I think the best counterarguments to my position are that pronoun coreference and subject-verb agreement in the MTG rules question database might be particularly easy coreference problems and that human scanning of the 'first draft' LLM-edited passage is (much) easier than editing it by hand, even if the draft has errors more than 10% or 20% of the time. Those currently seem unconvincing. If someone put 100 MTG prompts into GPT-4 with a decent prompt and showed it got >90% of them right, I'd update my view by at least a few percentage points.

predicted YES

Reliable is very loosely defined here. I still want to see 1 example where GPT-4 fails, because I don't think it's been formally benchmarked on coreference resolution.

bought Ṁ71 of NO

@JacyAnthis Wondering why NO when the market's constraints are loose enough to make GPT-4 a sufficient solution

predicted NO

@firstuserhere I disagree that there is anything close to a sufficient solution on most reasonable operationalizations of the resolution criteria. I was writing a comment explaining my bet when you posted this comment, so there's more detail.

We ended up switching to entirely gender-neutral pronouns in the software this question was originally about, so I'm no longer investigating this problem for that purpose. I can either resolve this market N/A, or leave it open and try to resolve it based on whether a program appears that would have been good enough for the task. What do traders think would be better?

predicted YES

@IsaacKing If you make a market, don’t just resolve it N/A if it was never listed in the criteria. You made a market, now stick to it, it doesn’t matter whether you care about the problem anymore.

Also, unless you can provide an example where GPT-4 often fails no matter how the prompt is written, this should resolve to YES already.

predicted YES

@ShadowyZephyr Agreed. If you don't have access to GPT-4, feel free to send the list of problems to me; I can share the responses.

predicted YES

@firstuserhere

You don’t need API access; there are about 10 different services that let you pay for it. Isaac can definitely get access.

bought Ṁ20 of YES

Is GPT-4 not sufficient to resolve this to YES?

bought Ṁ100 of YES

@ShadowyZephyr Yeah, GPT-4 seems like enough to resolve this market.

bought Ṁ30 of NO

@ShadowyZephyr @firstuserhere The market is based on Isaac's opinion, so we can't really say without him, but for what it's worth, computer scientists do not yet see coreference resolution as a solved problem—and certainly not one solved by GPT-4. See, for example, this leaderboard.

predicted YES

@JacyAnthis I'm buying based on - it doesn't need to be perfect, it just needs to be good enough that it saves me and the other question-writers some effort

bought Ṁ39 of NO

@firstuserhere One issue is that this is hard to partially automate. It seems like you still have to look through each sentence manually even if the AI does a first pass with, say, a 90% success rate. (I believe we're still below 90% on benchmarks, which often means 80% or lower in the real world because models overfit.)

predicted YES

@JacyAnthis Can you guys give an example where GPT-4 fails

sold Ṁ7 of NO

Here's GPT-3.5 hallucinating a new pronoun, then correctly identifying that that pronoun occurs nowhere in the sentence I asked about.

predicted YES

@IsaacKing It works with GPT-4.

predicted YES

@IsaacKing I'm sure some prompt tweaking could get this to work on 3.5 (maybe even just giving an example). It seems to have misinterpreted the request, listing female pronouns because the speaker is female, like how people often say "my pronouns are she/her".

@JacyAnthis Those leaderboards aren’t active and don’t have entries for recent LLMs. They also don’t show the ceiling on performance, which is well under 100% for typical NLP datasets. I’d bet that a well-designed prompt for GPT-4 is very close to ceiling and adequate for most purposes.

bought Ṁ100 of YES

I'd like you to change the gender of a person in a passage.

Before: Alice went to the store because she was thirsty.

Command: Replace Alice with Bob, male.

After: Bob went to the store because he was thirsty.

Before: Bob went to the store because he was thirsty.

Command: Replace Bob with Alex, gender-neutral.

After: Alex went to the store because they were thirsty.

Before: Alice was thirsty, so she went to the store and was disappointed that apple juice was out of stock.

Command: Replace Alice with Alexei, gender-neutral.

After: Alexei was thirsty, so they went to the store and were disappointed that apple juice was out of stock.

Before: If you can give Isaac a prompt where ChatGPT consistently gets this correct on the sort of sentences he cares about, that will be sufficient to resolve this market to YES. But given Isaac's experience with ChatGPT, he highly doubts it'll be able to do that.

Command: Replace Isaac with Marilyn, gender-neutral.

After: If you can give Marilyn a prompt where ChatGPT consistently gets this correct on the sort of sentences they care about, that will be sufficient to resolve this market to YES. But given Marilyn's experience with ChatGPT, they highly doubt it'll be able to do that.


Last "After:" was generated. Try out this prompt with text-davinci-003, maybe?

predicted YES

@NoaNabeshima Also some spacing was dropped when I copied this, you might want to add spacing between examples.

Not really a problem with your market, but there has to be a way to make an uncomputable sentence out of this requirement...

"Bob will go to the store. Alex's gender is male iff the Turing Machine described by the sentence '[...]' halts; he will go to the store.", and then figuring out if "he" binds Alex or Bob requires solving the halting problem.

predicted YES

@Mira Magic is Turing-complete, so it's not impossible.

predicted NO

@Mira Nice try, but the gender of the participants is not a part of the sentence to be parsed, it's supplied externally.

