The [RetNet paper](https://arxiv.org/abs/2307.08621) claims some pretty cool results in the small-model range (up to 6.7B parameters). Will anyone attempt to generalize that to a large model?
From @kipply here:
An attempt at a new architecture, but it immediately opens with a plot showing they couldn't scale their baseline transformer properly, an inspirational quote, and an impossible triangle used as a diagram? This makes me think it's not that promising.
There’s also a theory someone floated on Twitter: it’s suspicious that they stopped scaling at 7B, which happens to be the scale where outlier features start to appear. I’m confused about why that’s relevant.
I would love me some PyTorch code that runs that, ha ha. GCP can offer you compute to investigate for free, if you apply.
But it is a waste of money and CO2 to train it unless you have really good data and "LLMOps" (I shouldn't say that, lol, as I am betting NO on that market).
@1a3orn To follow up on this: I'm happy to wait a month or two for additional info to come out, but won't wait longer than that.