
Will Llama 3-multimodal be natively mixed-multimodal? (VQ-VAE+next token prediction)
Vision Language Models currently have two common paradigms:
The first is the LLaVA approach, where one assembles a CLIP-like vision encoder with an LLM through a learned projection.
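For illustration, here is a minimal PyTorch sketch of that projection recipe. The dimensions, module names, and the two-layer MLP projector are assumptions for the example, not the actual LLaVA code:

```python
import torch
import torch.nn as nn

class ProjectionVLM(nn.Module):
    """Glue a frozen vision tower onto an LLM via a learned projection (LLaVA-style)."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Two-layer MLP projector; a single linear layer is also common.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features, text_embeddings):
        # image_features: (batch, num_patches, vision_dim) from a CLIP-like encoder
        # text_embeddings: (batch, seq_len, llm_dim) from the LLM's embedding table
        image_tokens = self.projector(image_features)
        # Prepend projected image tokens to the text sequence; the LLM then
        # attends over both (the LLM itself is omitted here).
        return torch.cat([image_tokens, text_embeddings], dim=1)


# Shape check only: 576 patch features + 32 text tokens -> 608 fused positions.
vlm = ProjectionVLM()
fused = vlm(torch.randn(1, 576, 1024), torch.randn(1, 32, 4096))
print(fused.shape)  # torch.Size([1, 608, 4096])
```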
The second is the Gemini/LVM approach, where one uses a VQ-VAE to compress images into discrete tokens and then simply does autoregressive next-token prediction. It is suspected that GPT-4o is also trained this way, which would explain why it can generate images with excellent text rendering.
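A toy sketch of that second recipe, assuming a nearest-neighbour VQ codebook and an illustrative shared vocabulary size (none of these numbers come from Gemini, LVM, or any Llama model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQImageTokenizer(nn.Module):
    """Toy VQ quantizer: map continuous patch features to discrete codebook ids."""

    def __init__(self, codebook_size=8192, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def encode(self, patch_features):
        # patch_features: (batch, num_patches, dim) from a VQ-VAE encoder
        # Squared L2 distance to every codebook vector, then take the nearest.
        dists = (
            patch_features.pow(2).sum(-1, keepdim=True)
            - 2 * patch_features @ self.codebook.weight.t()
            + self.codebook.weight.pow(2).sum(-1)
        )
        return dists.argmin(dim=-1)  # (batch, num_patches) discrete token ids


TEXT_VOCAB = 32000  # illustrative text vocabulary size
tokenizer = VQImageTokenizer()

# Offset image ids past the text vocabulary so both modalities share one token space.
image_ids = tokenizer.encode(torch.randn(1, 256, 256)) + TEXT_VOCAB
text_ids = torch.randint(0, TEXT_VOCAB, (1, 32))
sequence = torch.cat([text_ids, image_ids], dim=1)  # mixed text+image token sequence

# Plain next-token prediction over the unified vocabulary; `logits` stands in
# for the output of a decoder-only transformer.
logits = torch.randn(1, sequence.size(1), TEXT_VOCAB + 8192)
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),
    sequence[:, 1:].reshape(-1),
)
```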
Note that Meta has just announced Chameleon: Mixed-Modal Early-Fusion Foundation Models.
Will Llama 3 multimodal or Llama 3 vision be trained with the second approach?