The Transformer architecture, introduced by Vaswani et al. (2017), has become the dominant approach in natural language processing because its self-attention mechanism captures long-range dependencies and allows massive parallelization. Since then, various adaptations of the architecture have emerged, such as GPT, which removed the encoder portion of the model. This question asks whether the first robotic Artificial General Intelligence (AGI) will be developed using a transformer-like architecture, counting both the original design and its potential modifications.
Before January 1st, 2100, will the first robotic AGI be developed using a transformer-like architecture at its core, as determined by credible reports or academic publications?
This question will resolve positively if, before January 1st, 2100, credible reports or academic publications provide evidence that the first robotic AGI has been developed using a transformer-like architecture as its main cognitive component, which includes the original design as well as any variants thereof, as defined below.
A "robotic AGI" is defined by this other question on Manifold Markets. The "first" robotic AGI will be the first unified single AI system that is clearly, credibly, and near-uncontroversially documented to meet the criteria described in the linked question.
The robotic AGI must at its core utilize an architecture that meets the following requirements, including any combination of the modifications and alterations listed:
Core architecture components: The architecture must consist of an encoder, a decoder, or both, with multi-head self-attention mechanisms, position-wise feed-forward networks, and positional encoding.
Self-attention mechanism: The architecture must use a self-attention mechanism or a close approximation thereof. This includes:
a. Modifications to the matrix algebra to address the quadratic complexity, such as kernelized attention, while retaining the core self-attention functionality.
b. Approximation techniques to compute self-attention, such as sparse attention, low-rank approximation, or other methods that preserve the essential characteristics of the self-attention mechanism and maintain a similar mathematical form.
c. Variations in the multi-head attention mechanism, such as incorporating dynamic weights or adaptive computation time.
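As an illustration of the self-attention criterion, here is a minimal NumPy sketch that contrasts standard softmax self-attention with one kernelized variant of the kind item (a) describes. It is not part of the resolution criteria. The function names are hypothetical, the `elu(x) + 1` feature map is just one choice that appears in the linear-attention literature, and learned projections, multiple heads, and masking are all omitted.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (n, n): quadratic in sequence length
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feature_map(x):
    """elu(x) + 1: a strictly positive feature map (an assumed choice for this sketch)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def kernelized_attention(q, k, v):
    """Replace exp(q . k) with phi(q) . phi(k) and reorder the matrix products.

    The (n, n) weight matrix is never formed, so the cost grows linearly
    with sequence length n instead of quadratically.
    """
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                     # (d, d_v) summary of all keys and values
    z = phi_q @ phi_k.sum(axis=0)        # (n,) per-query normalizer
    return (phi_q @ kv) / z[:, None]
```

Both functions return, for each position, a weighted average of the rows of `v`. They differ only in how the weights are computed, which is the sense in which the kernelized form keeps "a similar mathematical form".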
Encoder and decoder alterations: Variations of the architecture that retain the core functionality, such as:
a. Removing the encoder, as seen in GPT, or removing the decoder, as seen in BERT.
b. Modifying the encoder or decoder layers while maintaining the core structure, including but not limited to changes to layer normalization, added gating mechanisms, or attention routing.
c. Incorporating additional layers or components, such as memory or state layers, external memory access, or recurrent connections.
d. Employing depthwise or pointwise convolutions in place of, or in addition to, fully connected layers.
e. Utilizing different layer types, such as convolutional layers, recurrent layers, or capsule networks, in combination with self-attention mechanisms.
f. Introducing non-autoregressive methods for parallel decoding in the decoder portion of the architecture.
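To illustrate item (a) of the encoder and decoder alterations, the sketch below shows the usual practical difference between a decoder-only model in the style of GPT and an encoder-only model in the style of BERT: the same attention computation with or without a causal mask. This is a simplified sketch under stated assumptions. The helper name `attention_weights` is hypothetical, and real models differ in many other respects, such as training objective and output heads.

```python
import numpy as np

def attention_weights(scores, causal):
    """Turn raw (n, n) attention scores into row-normalized weights.

    causal=True  : position i may only attend to positions j <= i (decoder-style).
    causal=False : every position attends to every position (encoder-style).
    """
    scores = np.array(scores, dtype=float)   # work on a copy
    if causal:
        n = scores.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly upper triangle
        scores[future] = -np.inf             # future positions get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# With uniform scores, the mask alone determines the pattern.
decoder_style = attention_weights(np.zeros((4, 4)), causal=True)
encoder_style = attention_weights(np.zeros((4, 4)), causal=False)
```

In `decoder_style` the first row puts all its weight on the first position, while every row of `encoder_style` spreads weight evenly over all four positions.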
Other minor modifications: The architecture may include additional modifications, provided they do not fundamentally alter the core components of the transformer architecture. Examples include but are not limited to:
a. Changes to the activation functions, such as using variants of ReLU, sigmoid, or other nonlinear functions.
b. Alterations to the normalization techniques, such as using weight normalization, layer normalization, or group normalization.
c. Adjustments to the layer connectivity patterns, including skip connections, dense connections, or other topological changes.
d. Variations in the positional encoding methods, such as learned positional encoding, relative positional encoding, or sinusoidal encoding with modified frequencies.
e. Adaptations to the optimization algorithms, including changes to the learning rate schedules, adaptive optimizers, or regularization techniques.
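As a concrete reference for item (d) of the minor modifications, the following sketch computes the sinusoidal positional encoding of the original Transformer, with the frequency base exposed as a parameter so that "modified frequencies" amounts to changing one argument. The function name and the `base` parameter are hypothetical conveniences for this example, and an even `d_model` is assumed.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model, base=10000.0):
    """PE[pos, 2i] = sin(pos / base^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    assert d_model % 2 == 0, "this sketch assumes an even model dimension"
    pos = np.arange(n_positions)[:, None]      # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model / 2)
    angles = pos / base ** (2 * i / d_model)   # one frequency per sin/cos pair
    pe = np.empty((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)               # even columns
    pe[:, 1::2] = np.cos(angles)               # odd columns
    return pe

original = sinusoidal_positions(6, 8)              # base 10000, as in Vaswani et al. (2017)
modified = sinusoidal_positions(6, 8, base=500.0)  # same form, different frequencies
```

A learned or relative positional encoding would replace this function entirely, but under the criteria above all of these count as variations rather than as a different architecture.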
If, before January 1st, 2100, credible reports or academic publications provide strong evidence that the first robotic AGI was developed using a transformer-like architecture meeting the criteria specified above, the question will resolve positively. If no such evidence is provided by the deadline, the question will resolve negatively.
What does "at its core" mean? There are many possible architectures where you have a transformer (or multiple transformers) making up a larger system, possibly including non-transformer pieces as well, with no meaningful "core" of the overall architecture.