Resolves YES on credible reports that GPT-4 is mostly able to reply to and produce long Morse Code messages.
Resolves NO on credible reports that over 1000 external users have access to GPT-4 and it still makes significant errors with Morse.
I’ll accept the use of very simple technical workarounds (like separating characters with spaces to circumvent tokenization, IDK).
Somewhat better-specified criteria, starting with the most precise:
“Long messages” – at least 20 arbitrary English words of input and output; hand-crafted cherry-picked examples like 20×“Hello” don’t count.
“Mostly able” – as major-error-free as in English in at least 50% of trials.
“Simple workarounds” – I can still recognize the actual message without using any tools other than eyes and a Morse transcriber.
“Credible reports” – at least a couple of sources that are not strongly disputed within two weeks.
I expect the arbitrary and imprecise quantification may bite me; I’ll wait for more information and use my best judgment if necessary.
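For anyone who wants to tally trials against the two most precise criteria above, here is a rough sketch. The function names, and the "at least half the words are distinct" stand-in for "arbitrary", are my own assumptions and not part of the market rules. Judging whether a trial is "as major-error-free as in English" still has to be done by a human.

```python
from typing import Sequence


def is_long_message(text: str, min_words: int = 20) -> bool:
    """'Long messages': at least 20 words, and not one word repeated."""
    words = text.lower().split()
    # Assumption: "arbitrary" is approximated by requiring that at least
    # half of the minimum word count is distinct, which rules out 20 x "Hello".
    return len(words) >= min_words and len(set(words)) >= min_words // 2


def mostly_able(trial_passed: Sequence[bool], threshold: float = 0.5) -> bool:
    """'Mostly able': at least 50% of trials judged free of major errors."""
    if not trial_passed:
        return False
    return sum(trial_passed) / len(trial_passed) >= threshold
```

Each entry in `trial_passed` is a human verdict on one trial. The code only does the counting.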
Inspired by: https://www.lesswrong.com/posts/hDePh3KReBMNBJfzx/gpt-3-catching-fish-in-morse-code.
Close date updated to 2024-12-31 12:00 am
Anyways, here are 20 arbitrary English words from your market description. This is my first test of Morse code on GPT-4 (though I also tested Bing before buying shares a few days ago).
Encoding Prompt is:
Translate the following to Morse Code ``` Resolves YES on credible reports that GPT-4 is mostly able to reply to and produce long Morse Code messages. Resolves ```
Encoding Response is:
Here is the translation of the provided text into Morse Code: Resolves YES on credible reports that GPT-4 is mostly able to reply to and produce long Morse Code messages. .-. . ... --- .-.. ...- . ... -.-- . ... --- -. -.-. .-. . -.. .. -... .-.. . .-. . .--. --- .-. - ... - .... .- - --. .--. - -....- ....- .. ... -- --- ... - .-.. -.-- .- -... .-.. . - --- .-. . .--. .-.. -.-- - --- .- -. -.. .--. .-. --- -.. ..- -.-. . .-.. --- -. --. -- --- .-. ... . -.-. --- -.. . -- . ... ... .- --. . ... .-.-.- Resolves .-. . ... --- .-.. ...- . ...
Decoded message is:
The decoded Morse code message is as follows: "RESOLVE SY ONCE CREDIBLE REPORTS THAT GPT-64 IS MOSTLY ABLE TO REPLY TO AND PRODUCE LONG MORSE CODE MESSAGES. RESOLVE"
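To check a transcript like this without relying on the model's own decoding, a small Morse transcriber is enough. This is a minimal sketch: the table is standard International Morse, but the function name and the "/" word separator are my own assumptions (the response above separates every symbol with a single space and does not mark word boundaries).

```python
# International Morse code: letters, digits, and a few punctuation marks.
MORSE_TO_CHAR = {
    ".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E", "..-.": "F",
    "--.": "G", "....": "H", "..": "I", ".---": "J", "-.-": "K", ".-..": "L",
    "--": "M", "-.": "N", "---": "O", ".--.": "P", "--.-": "Q", ".-.": "R",
    "...": "S", "-": "T", "..-": "U", "...-": "V", ".--": "W", "-..-": "X",
    "-.--": "Y", "--..": "Z",
    "-----": "0", ".----": "1", "..---": "2", "...--": "3", "....-": "4",
    ".....": "5", "-....": "6", "--...": "7", "---..": "8", "----.": "9",
    ".-.-.-": ".", "--..--": ",", "..--..": "?", "-....-": "-",
}


def decode_morse(morse: str) -> str:
    """Decode space-separated Morse symbols; '/' (assumed) separates words."""
    decoded_words = []
    for word in morse.split("/"):
        # Unknown symbols become '?' so errors stay visible instead of vanishing.
        letters = [MORSE_TO_CHAR.get(symbol, "?") for symbol in word.split()]
        decoded_words.append("".join(letters))
    return " ".join(w for w in decoded_words if w)
```

With this table, the fragment `--. .--. - -....- ....-` from the encoding response decodes to `GPT-4`.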
I think it resolves YES.
Fed in some random words. It totally messed up the encoding step this time.
I'm very impressed! It's able to encode and decode long texts with almost zero typos. I guess the resolution is mostly subjective at this point, because there are some typos.
How does this resolve if it's not able to translate in one step, but it is able to consistently translate if you prompt it to think step-by-step?
@Nikola What do you and other people here think I should do?
I have to draw the line somewhere, and it only gets more difficult with more doors left open. If we go toward the extreme, at some point, we get to the level of hand-holding and answer-feeding that everyone would agree is not fair.
That said, the original description did leave doors open with a vague standard of “simple workarounds”. If the sequence of prompts was a short fixed script to follow, I feel like that should probably be acceptable – not yet committed to it, though, in the slightest.
Please provide feedback.
More loose thoughts – what else should count as a “simple workaround”?
what if success requires a very long prompt, e.g. an entire book’s worth of rules and priming?
what if success requires a very specific combination of parameters (like temperature) to be replicable enough?
Assuming GPT-4 is architecturally similar to GPT-3 (which is to say, a big Transformer), it seems like this will turn largely on whether GPT-4 uses an input encoding (like GPT-3's BPE) that prevents the model from seeing individual characters. That choice is itself an engineering trade-off: a bigger effective input buffer, at the cost of being able to solve character-level problems like word spelling or Morse encoding.
For a general-purpose language model you would probably want the bigger input buffer, since that helps in a large variety of situations while the other problems could be seen as just parlor games.
It is possible that GPT-4 is different enough from GPT-3 in architecture that this reasoning doesn't hold, that OpenAI chooses to take on character-level problems as a priority for whatever reason (perhaps there are instances of such problems that are actually impactful), or that GPT-4 through sheer scale is able to overcome such limitations and learn word spellings through the fog of token encoding.
I still think 74% is a bit overpriced though and will sell down a tad, myself.
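To make the tokenization point concrete, here is a toy illustration. It is not GPT-3's or GPT-4's real tokenizer: the vocabulary is made up, and greedy longest-match stands in for actual BPE merges. It only shows why a whole word can reach the model as one opaque token, while the space-separated workaround from the description exposes each character.

```python
# Hypothetical toy vocabulary, NOT the real GPT-3/GPT-4 BPE vocabulary.
TOY_VOCAB = ["morse", "code", "mor", "se", "co", "de", " "] + list(
    "abcdefghijklmnopqrstuvwxyz"
)


def toy_tokenize(text: str) -> list:
    """Greedy longest-match tokenization over TOY_VOCAB (a stand-in for BPE)."""
    tokens = []
    i = 0
    while i < len(text):
        # Pick the longest vocabulary entry that matches at position i;
        # fall back to the raw character if nothing in the vocabulary matches.
        match = max(
            (t for t in TOY_VOCAB if text.startswith(t, i)),
            key=len,
            default=text[i],
        )
        tokens.append(match)
        i += len(match)
    return tokens


print(toy_tokenize("morse code"))  # whole words become single tokens
print(toy_tokenize("m o r s e"))   # spaced-out input yields one token per character
```

In the first call the model would see three tokens and never the letters inside them. In the second, every letter is its own token.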
M$5 worth of thoughts: I guess you can just add a (much cheaper) transcription layer on top, and both Morse and, more importantly, Braille will work perfectly fine, so it’s not like this question has much direct practical value. I still find it a moderately interesting indicator of what the model got to learn and what it can do.
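A transcription layer of the kind mentioned above really is cheap. Here is a minimal sketch for the Morse direction only. The function name, the " / " word separator, and the choice to silently drop unsupported characters are my own assumptions, and a Braille layer would need its own table.

```python
import string

# International Morse code for A-Z and 0-9, listed in alphabetical/numeric order.
LETTERS = (".- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- -. --- "
           ".--. --.- .-. ... - ..- ...- .-- -..- -.-- --..").split()
DIGITS = "----- .---- ..--- ...-- ....- ..... -.... --... ---.. ----.".split()

CHAR_TO_MORSE = dict(zip(string.ascii_uppercase + string.digits, LETTERS + DIGITS))
CHAR_TO_MORSE.update({".": ".-.-.-", ",": "--..--", "?": "..--..", "-": "-....-"})


def encode_morse(text: str) -> str:
    """Encode text as Morse: symbols separated by spaces, words by ' / '."""
    encoded_words = []
    for word in text.upper().split():
        # Assumption: characters missing from the table are dropped.
        symbols = [CHAR_TO_MORSE[ch] for ch in word if ch in CHAR_TO_MORSE]
        if symbols:
            encoded_words.append(" ".join(symbols))
    return " / ".join(encoded_words)
```

The model would then only need to produce plain English, and this layer would handle the character-level part.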