I want to be able to have a computer program change the gender of a person in a passage. This is a hard problem. Consider the sentence:
Alice went to the store because she was thirsty.
Changing the gender of the subject to male requires changing the pronoun "she" to "he".
Bob went to the store because he was thirsty.
Changing their gender to neutral is worse, because now the verb "was" must be conjugated differently as well.
Alex went to the store because they were thirsty.
The hard part is figuring out which pronouns refer to which people. (Pronoun coreference.) The best tool I've found for this so far is Huggingface's NeuralCoref 4.0, which, as with all contemporary AI tools being tasked with real-world problems, does just well enough to give you hope, then completely lets you down as soon as it encounters the slightest hiccup.
The other problem is similar: figuring out which verbs need to be conjugated and which people they refer to. (Subject-verb agreement.) In the sentence:
Alice was thirsty, so she went to the store and was disappointed that apple juice was out of stock.
The first "was" refers to the noun "Alice", and would not need to change if Alice's gender changed to neutral. The next verb is "went", which also doesn't need to be conjugated. But the next "was" refers to the pronoun "she", and would need to change to "were" if that pronoun became the linguistically-plural "they", as in:
Alex was thirsty, so they went to the store and were disappointed that apple juice was out of stock.
And then the last "was" refers to the noun "apple juice", and also does not need to be conjugated.
Both of these problems are quite challenging. In the general case, they require a semantic understanding of the sentence, not just a syntactic one, and this sort of problem is actually used as a test of machine intelligence. (/IsaacKing/will-ai-pass-the-winograd-schema-ch)
Luckily for me, the environment in which I need this system to run is quite restricted and non-adversarial. While I still doubt there are any simple rules that can do what I need without some form of machine learning involved, I can avoid giving the system any intentionally-challenging examples like Winograd schemas. I just need it to be able to handle the sort of phrases that are likely to show up in my MTG rules question database.
Will I be able to build or gain access to a system that can do this to my satisfaction by the end of 2023?
It doesn't need to be perfect, it just needs to be good enough that it saves me and the other question-writers some effort. (Right now we have to manually label each pronoun and verb with the person it refers to.)
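As a rough illustration of the manual labeling described above, here is a minimal Python sketch. It assumes a hypothetical annotation format (not the one actually used in the question database) in which each name, pronoun, and pronoun-governed verb is a labeled slot, and everything else is plain text. The names `PRONOUNS`, `VERBS`, `TEMPLATE`, and `render` are made up for this example. Capitalization at the start of a sentence is not handled.

```python
# Hypothetical annotation format: a template is a list whose items are either
# plain strings or hand-labeled slots of the form (kind, person_id, key).
PRONOUNS = {
    "male": {"subj": "he", "obj": "him", "poss": "his"},
    "female": {"subj": "she", "obj": "her", "poss": "her"},
    "neutral": {"subj": "they", "obj": "them", "poss": "their"},
}

# Verb forms as (singular, plural). Only verbs whose subject is a pronoun
# are labeled as slots; verbs governed by a name stay plain text.
VERBS = {"be_past": ("was", "were"), "be_pres": ("is", "are")}


def render(template, people):
    """Fill a labeled template for the given {person_id: (name, gender)}."""
    out = []
    for part in template:
        if isinstance(part, str):
            out.append(part)
            continue
        kind, person, key = part
        name, gender = people[person]
        if kind == "name":
            out.append(name)
        elif kind == "pronoun":
            out.append(PRONOUNS[gender][key])
        elif kind == "verb":
            singular, plural = VERBS[key]
            # Singular "they" takes plural verb agreement.
            out.append(plural if gender == "neutral" else singular)
    return "".join(out)


# The apple juice sentence: the first "was" follows the name and the last
# "was" follows "apple juice", so neither is a slot.
TEMPLATE = [
    ("name", "p1", None),
    " was thirsty, so ",
    ("pronoun", "p1", "subj"),
    " went to the store and ",
    ("verb", "p1", "be_past"),
    " disappointed that apple juice was out of stock.",
]

print(render(TEMPLATE, {"p1": ("Alice", "female")}))
print(render(TEMPLATE, {"p1": ("Alex", "neutral")}))
```

The rendering step is the easy part; the effort this market is about lies in producing the labels automatically instead of by hand.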
90% on this market seems crazy high to me. Coreference is a very hard NLP problem on which even fine-tuned models still score barely above 80% on benchmarks. Keep in mind that (i) real-world performance tends to be much worse than benchmarks, and (ii) even if @IsaacKing were to get 95% accuracy, that still might not save question-writers much effort, because they would still need to manually read each sentence. Maybe that would save effort if the model also output trustworthy confidence estimates (e.g., "I'm 99.9% confident Sentence B is correctly labeled, but Sentences D and G need human checks," if 1 mistake in every 1000 questions is acceptable), but very few LLM applications have anywhere near that sort of reliability output.
I think the best counterarguments to my position are that pronoun coreference and subject-verb agreement in the MTG rules question database might be particularly easy coreference problems and that human scanning of the 'first draft' LLM-edited passage is (much) easier than editing it by hand, even if the draft has errors more than 10% or 20% of the time. Those currently seem unconvincing. If someone put 100 MTG prompts into GPT-4 with a decent prompt and showed it got >90% of them right, I'd update my view by at least a few percentage points.
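To make the "reliability output" idea concrete, here is a minimal Python sketch of confidence-based triage. It assumes the model returns a per-sentence confidence, which, as noted above, few LLM applications actually provide. The function name `triage` and the scores are invented purely for illustration.

```python
def triage(labeled, threshold=0.999):
    """Split (sentence_id, confidence) pairs into auto-accepted and needs-review.

    The 0.999 default mirrors the "1 mistake in every 1000" tolerance; it only
    means something if the confidences are well calibrated.
    """
    accepted = [sid for sid, conf in labeled if conf >= threshold]
    review = [sid for sid, conf in labeled if conf < threshold]
    return accepted, review


# Made-up confidences, not real model output.
scores = [("A", 0.9995), ("B", 0.9991), ("D", 0.62), ("G", 0.91)]
accepted, review = triage(scores)
print("auto-accept:", accepted)
print("human check:", review)
```

Human effort is only saved on the auto-accepted sentences, so the approach pays off only when most sentences clear the threshold and the confidences can be trusted.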
Reliable is very loosely defined here. I still want to see 1 example where GPT-4 fails, because I don't think it's been formally benchmarked on coreference resolution.
@JacyAnthis Wondering why NO when the market's constraints are loose enough to make GPT-4 a sufficient solution
@firstuserhere I disagree that there is anything close to a sufficient solution on most reasonable operationalizations of the resolution criteria. I was writing a comment explaining my bet when you posted this comment, so there's more detail.
We ended up switching to entirely gender-neutral pronouns in the software this question was originally about, so I'm no longer investigating this problem for that purpose. I can either resolve this market N/A, or leave it open and try to resolve it based on whether a program appears that would have been good enough for the task. What do traders think would be better?
@IsaacKing If you make a market, don’t just resolve it N/A if it was never listed in the criteria. You made a market, now stick to it, it doesn’t matter whether you care about the problem anymore.
Also, unless you can provide an example where GPT-4 often fails no matter how the prompt is written, this should resolve to YES already.
@ShadowyZephyr agreed. If you don't have access to gpt 4, feel free to send the lists of problems to me, i can share the responses
You don’t need API access; there are about 10 different services that let you pay for it. Isaac can definitely get access.
Is GPT-4 not sufficient to resolve this to YES?
@ShadowyZephyr yeah, gpt-4 seems enough to resolve this market
@ShadowyZephyr @firstuserhere The market is based on Isaac's opinion, so we can't really say without him, but for what it's worth, computer scientists do not yet see coreference resolution as a solved problem—and certainly not one solved by GPT-4. See, for example, this leaderboard.
@JacyAnthis I'm buying based on - it doesn't need to be perfect, it just needs to be good enough that it saves me and the other question-writers some effort
@firstuserhere One issue is that this is hard to partially automate. It seems like you still have to look through each sentence manually even if the AI does a first pass with, say, a 90% success rate. (I believe we're still below 90% on benchmarks, which often means 80% or lower in the real world because models overfit.)
@JacyAnthis Can you guys give an example where GPT-4 fails
Here's GPT 3.5 hallucinating a new pronoun, then correctly identifying that that pronoun occurs nowhere in the sentence I asked about.
@IsaacKing It works with GPT-4.
@IsaacKing I'm sure some prompt tweaking could get this to work on 3.5. (Maybe even just giving an example). It seems to have misinterpreted pronouns as listing female pronouns because the speaker is female, like how people often would say "my pronouns are she/her"
@JacyAnthis Those leaderboards aren’t active and don’t have entries for recent LLMs. They also don’t show the ceiling on performance, which is well under 100 for typical NLP datasets. I’d bet that a well-designed prompt for GPT-4 is very close to ceiling and adequate for most purposes.
I'd like you to change the gender of a person in a passage.
Before: Alice went to the store because she was thirsty.
Command: Replace Alice with Bob, male.
After: Bob went to the store because he was thirsty.
Before: Bob went to the store because he was thirsty.
Command: Replace Bob with Alex, gender-neutral.
After: Alex went to the store because they were thirsty.
Before: Alice was thirsty, so she went to the store and was disappointed that apple juice was out of stock.
Command: Replace Alice with Alexei, gender-neutral.
After: Alexei was thirsty, so they went to the store and were disappointed that apple juice was out of stock.
Before: If you can give Isaac a prompt where ChatGPT consistently gets this correct on the sort of sentences he cares about, that will be sufficient to resolve this market to YES. But given Isaac's experience with ChatGPT, he highly doubts it'll be able to do that.
Command: Replace Isaac with Marilyn, gender-neutral.
After: If you can give Marilyn a prompt where ChatGPT consistently gets this correct on the sort of sentences they care about, that will be sufficient to resolve this market to YES. But given Marilyn's experience with ChatGPT, they highly doubt it'll be able to do that.
Last "After:" was generated. Try out this prompt with text-davinci-003, maybe?
@NoaNabeshima Also some spacing was dropped when I copied this, you might want to add spacing between examples.
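For anyone who wants to reuse the few-shot prompt above programmatically, here is a minimal Python sketch that assembles it with a blank line between examples. The helper name `build_prompt` and the final test passage are hypothetical, and the sketch stops at producing the prompt string; sending it to a model is left out because that depends on which API is used.

```python
# Few-shot examples adapted from the prompt above: (before, command, after).
EXAMPLES = [
    ("Alice went to the store because she was thirsty.",
     "Replace Alice with Bob, male.",
     "Bob went to the store because he was thirsty."),
    ("Bob went to the store because he was thirsty.",
     "Replace Bob with Alex, gender-neutral.",
     "Alex went to the store because they were thirsty."),
    ("Alice was thirsty, so she went to the store and was disappointed "
     "that apple juice was out of stock.",
     "Replace Alice with Alexei, gender-neutral.",
     "Alexei was thirsty, so they went to the store and were disappointed "
     "that apple juice was out of stock."),
]

INSTRUCTION = "I'd like you to change the gender of a person in a passage."


def build_prompt(examples, passage, command):
    """Return a few-shot prompt that ends with an open "After:" line."""
    blocks = [INSTRUCTION]
    for before, cmd, after in examples:
        blocks.append(f"Before: {before}\nCommand: {cmd}\nAfter: {after}")
    # The final block is left open for the model to complete.
    blocks.append(f"Before: {passage}\nCommand: {command}\nAfter:")
    # Blank lines between blocks restore the spacing lost when copying.
    return "\n\n".join(blocks)


# Hypothetical passage, only to show the shape of the output.
prompt = build_prompt(
    EXAMPLES,
    "Carol taps her lands because she wants to cast a spell.",
    "Replace Carol with Sam, gender-neutral.",
)
print(prompt)
```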
Not really a problem with your market, but there has to be a way to make an uncomputable sentence out of this requirement...
"Bob will go to the store. Alex's gender is male iff the Turing Machine described by the sentence '[...]' halts; he will go to the store.", and then figuring out if "he" binds Alex or Bob requires solving the halting problem.
ChatGPT seems to work already:
If you can give me a prompt where ChatGPT consistently gets this correct on the sort of sentences I care about, that will be sufficient to resolve this market to YES. But given my experience with ChatGPT, I highly doubt it'll be able to do that.
@SG Hmm. Playing around with it a bit myself, it's better than I expected. I'll need to test it more extensively before resolving this to YES, but it's promising. I'd want some very consistent results before I'd feel safe putting any output from ChatGPT up on a production site.
Oh, I'd also need there to be an API, which I don't think ChatGPT has yet.
But maybe the original GPT-3 can also do this, and that does have an API I can use.
@IsaacKing I don't think GPT3.5 is at the level you ask for, but I do think further iteration will get models that are.
@IsaacKing Now that there's a ChatGPT API, is it sufficient to resolve this question YES with SG's prompt?
@IsaacKing GPT-3.5 has an API, and GPT-4 has an API that is waitlisted. However, the market description does not mention an API of any kind, only that you'll "gain access" to a system that can do this. Do you have an example where GPT-3.5 fails when specified to keep proper nouns?