![](/_next/image?url=https%3A%2F%2Fstorage.googleapis.com%2Fmantic-markets.appspot.com%2Fcontract-images%2FSss19971997%2F878dfa87a206.jpg&w=3840&q=75)
Will Llama 3-multimodal be natively mixed-multimodal? (VQ-VAE+next token prediction)
50% chance
Vision-language models currently follow two common paradigms.
The first is the LLaVA approach, where one attaches a CLIP-like vision encoder to an LLM through a projection layer.
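A minimal sketch of this first paradigm, with made-up dimensions (the actual encoder, projection, and LLM sizes vary by model):

```python
import numpy as np

# Hypothetical dimensions: a CLIP-like encoder emits 576 patch
# embeddings of size 1024; the LLM expects embeddings of size 4096.
num_patches, d_vision, d_llm = 576, 1024, 4096

rng = np.random.default_rng(0)
patch_embeddings = rng.standard_normal((num_patches, d_vision))

# The "projection" is typically a learned linear layer (or small MLP)
# mapping vision features into the LLM's embedding space.
W_proj = rng.standard_normal((d_vision, d_llm)) * 0.02
visual_tokens = patch_embeddings @ W_proj  # shape: (576, 4096)

# These continuous vectors are prepended to the text token embeddings
# and fed to the LLM; the image is never turned into discrete tokens.
assert visual_tokens.shape == (num_patches, d_llm)
```

The key point is that the image enters the LLM as continuous vectors, so this style of model can read images but cannot generate them token by token.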
The second is the Gemini/LVM approach, where a VQ-VAE compresses pictures into discrete tokens and the model then simply does autoregressive next-token prediction over the combined stream. It is suspected that GPT-4o is also trained this way, which would explain why it can generate images with excellent text rendering.
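The second paradigm can be sketched as follows; the grid, latent, and codebook sizes here are small illustrative placeholders, and a real VQ-VAE learns both encoder and codebook:

```python
import numpy as np

# Hypothetical setup: a VQ-VAE encoder has mapped an image to an 8x8
# grid of latent vectors; a codebook of 512 entries discretizes them.
grid, d_latent, codebook_size = 8, 32, 512

rng = np.random.default_rng(0)
latents = rng.standard_normal((grid * grid, d_latent))
codebook = rng.standard_normal((codebook_size, d_latent))

# Vector quantization: each latent becomes the index of its nearest
# codebook entry, turning the image into 64 discrete tokens.
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
image_tokens = dists.argmin(axis=1)  # shape: (64,), integer indices

# These indices are spliced into the text token stream and the model
# is trained with plain next-token prediction, so it can both "read"
# and generate images autoregressively.
assert image_tokens.shape == (grid * grid,)
assert image_tokens.max() < codebook_size
```

Because images and text share one discrete vocabulary, a single autoregressive objective covers both understanding and generation, which is what "natively mixed-multimodal" refers to in the question title.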
Note that Meta has just announced Chameleon: Mixed-Modal Early-Fusion Foundation Models.
Will Llama 3 multimodal (or Llama 3 vision) be trained with the second approach?
Related questions
Will Llama-3 be multimodal?
77% chance
Top 3 Multimodal Vision2Language Model by EOY 2024? (by Organization/Company)
Will a Mamba 7b model trained on 2 trillion tokens outperform Llama2-13B
66% chance
Will OpenAI's next major LLM release support video input?
55% chance
By 2030 will we have video-to-video where an LLM can continue any video prompt in any way you like?
76% chance
Will Llama 3 use Mixture of Experts?
3% chance
Will OpenAI announce a multi-modal AI capable of any input-output modality combination by end of 2025? ($1000M subsidy)
85% chance
Will a SOTA open-sourced LLM forecasting system make major use of quasilinguistic neural reps (QNRs) before 2027?
22% chance
Will Llama-3 (or next open Meta model) be obviously good in its first-order effects on the world?
87% chance
Will Meta release a Llama 3 405B multi-modal open source before the end of 2024?
99% chance