Will Llama 3-multimodal be natively mixed-multimodal? (VQ-VAE+next token prediction)

Vision-language models currently follow two common paradigms.

The first is the LLaVA approach, where a CLIP-like vision encoder is attached to an LLM through a projection layer.
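A minimal numpy sketch of that projection idea (all dimensions and weights here are hypothetical, not LLaVA's actual configuration): the vision encoder's patch features are linearly mapped into the LLM's embedding space and concatenated with text token embeddings.

```python
import numpy as np

# Hypothetical dimensions: CLIP-like patch features (d_vision) are projected
# into the LLM's embedding space (d_llm), then prepended to text embeddings.
rng = np.random.default_rng(0)
num_patches, d_vision, d_llm, num_text = 16, 512, 1024, 8

clip_features = rng.standard_normal((num_patches, d_vision))  # vision encoder output
W_proj = rng.standard_normal((d_vision, d_llm)) * 0.02        # learned projection
text_embeds = rng.standard_normal((num_text, d_llm))          # LLM token embeddings

image_embeds = clip_features @ W_proj                # project into LLM space
llm_input = np.concatenate([image_embeds, text_embeds], axis=0)
print(llm_input.shape)  # (24, 1024): image "tokens" followed by text tokens
```

The LLM then attends over this combined sequence as if the projected image features were ordinary token embeddings; only the projection (and optionally the LLM) is trained.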

The second is the Gemini/LVM approach, where a VQ-VAE compresses images into discrete tokens and the model then simply does autoregressive next-token prediction over them. It is suspected that GPT-4o is also trained this way, which would explain why it can generate images with excellent text rendering.
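A toy numpy sketch of the second approach (codebook size, vocabulary offset, and token ids are all hypothetical): each continuous image latent is quantized to the index of its nearest codebook entry, and the resulting discrete image tokens share one vocabulary with text tokens so a single autoregressive model can predict both.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook_size, d_code, num_patches = 1024, 64, 16

codebook = rng.standard_normal((codebook_size, d_code))   # VQ-VAE codebook
encoder_out = rng.standard_normal((num_patches, d_code))  # continuous patch latents

# Quantize: each patch latent maps to the index of its nearest codebook entry.
dists = ((encoder_out[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
image_tokens = dists.argmin(axis=1)  # discrete ids in [0, codebook_size)

# Mixed-modal sequence: offset image ids past a hypothetical 32000-word text
# vocabulary, so one next-token-prediction objective covers both modalities.
text_tokens = np.array([1, 4031, 278, 9850])  # hypothetical text ids
sequence = np.concatenate([text_tokens, image_tokens + 32000])
print(sequence.shape)  # (20,)
```

Because images become ordinary tokens in the sequence, the same transformer that generates text can generate images token by token, which is what "natively mixed-multimodal" refers to here.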

Note that Meta has just announced Chameleon: Mixed-Modal Early-Fusion Foundation Models.

Will Llama 3 multimodal (or Llama 3 Vision) be trained with the second approach?
