
Will Llama 3-multimodal be natively mixed-multimodal? (VQ-VAE+next token prediction)
Vision Language Models currently have two common paradigms:
The first is the LLaVA approach, where one assembles a CLIP-like vision encoder with an LLM through a learned projection.
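For illustration, here is a minimal PyTorch sketch of that projection recipe. The dimensions, module names, and the two-layer MLP projector are assumptions for the example, not the actual LLaVA code:

```python
import torch
import torch.nn as nn

class ProjectionVLM(nn.Module):
    """Glue a frozen vision tower onto an LLM via a learned projection (LLaVA-style)."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Two-layer MLP projector; a single linear layer is also common.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features, text_embeddings):
        # image_features: (batch, num_patches, vision_dim) from a CLIP-like encoder
        # text_embeddings: (batch, seq_len, llm_dim) from the LLM's embedding table
        image_tokens = self.projector(image_features)
        # Prepend projected image tokens to the text sequence; the LLM then
        # attends over both (the LLM itself is omitted here).
        return torch.cat([image_tokens, text_embeddings], dim=1)


# Shape check only: 576 patch features + 32 text tokens -> 608 fused positions.
vlm = ProjectionVLM()
fused = vlm(torch.randn(1, 576, 1024), torch.randn(1, 32, 4096))
print(fused.shape)  # torch.Size([1, 608, 4096])
```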
The second is the Gemini/LVM approach, where one uses a VQ-VAE to compress images into discrete tokens and then simply does autoregressive next-token prediction. It is suspected that GPT-4o is also trained this way, which would explain why it can generate images with excellent text rendering.
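A toy sketch of that second recipe, assuming a nearest-neighbour VQ codebook and an illustrative shared vocabulary size (none of these numbers come from Gemini, LVM, or any Llama model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQImageTokenizer(nn.Module):
    """Toy VQ quantizer: map continuous patch features to discrete codebook ids."""

    def __init__(self, codebook_size=8192, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def encode(self, patch_features):
        # patch_features: (batch, num_patches, dim) from a VQ-VAE encoder
        # Squared L2 distance to every codebook vector, then take the nearest.
        dists = (
            patch_features.pow(2).sum(-1, keepdim=True)
            - 2 * patch_features @ self.codebook.weight.t()
            + self.codebook.weight.pow(2).sum(-1)
        )
        return dists.argmin(dim=-1)  # (batch, num_patches) discrete token ids


TEXT_VOCAB = 32000  # illustrative text vocabulary size
tokenizer = VQImageTokenizer()

# Offset image ids past the text vocabulary so both modalities share one token space.
image_ids = tokenizer.encode(torch.randn(1, 256, 256)) + TEXT_VOCAB
text_ids = torch.randint(0, TEXT_VOCAB, (1, 32))
sequence = torch.cat([text_ids, image_ids], dim=1)  # mixed text+image token sequence

# Plain next-token prediction over the unified vocabulary; `logits` stands in
# for the output of a decoder-only transformer.
logits = torch.randn(1, sequence.size(1), TEXT_VOCAB + 8192)
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),
    sequence[:, 1:].reshape(-1),
)
```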
Note that Meta has just announced Chameleon: Mixed-Modal Early-Fusion Foundation Models.
Will Llama 3 multimodal or Llama 3 vision be trained with the second approach?