Will the next major LLM by OpenAI use a new tokenizer?
2025 · 76% chance
  1. The GPT-2 model used r50k_base: vocab size = 50k

  2. The GPT-3 model used r50k_base: vocab size = 50k

  3. The GPT-3.5 model used cl100k_base: vocab size = 100k

  4. The GPT-4 model used cl100k_base: vocab size = 100k


What if there are significantly more new tokens, e.g. representing images or audio, but the tokens representing text are pretty much unchanged?

@firstuserhere So YES if there's a GPT-4.5/5 that uses a tokeniser not on this list, and NO if there's a GPT-4.5/5 that uses a tokeniser that is on this list?

Do you consider GPT-4 Turbo to be a new iteration? What counts as the "next major LLM"?

@oh No, GPT-4 Turbo is part of the same family; it does not qualify as the next major LLM release.