Note: GNU AGPLv3. Industry labs won’t touch this with a hundred foot pole. Given that they’re the only ones with access to serious resources, it could be a while before we see a large model of this architecture
There's a really easy way around this, and that is to offer high salaries to the authors to join the company and reimplement their work.
For researchers, in all honesty that is a very, very good reason to go GPL. If someone wants to profit off of it, it's not that they can't use the code commercially; they are just forced to hire you or pay you to dual-license it.
There's no reason why a company whose stock goes up $10B due to your model can't cut you a few million of that.
These researchers are being recruited regardless of what license they put on their academic code. In fact, I really doubt anyone in the industry cares about the license for this work. It's not a patent.
Yes, that’s why it’ll take time. There’s so much stuff competing for researchers’ attention, and experimentation with this takes so much time and $$$, that if it wasn’t for Sepp Hochreiter on the list of authors this could get ignored entirely. IOW it’s not the seller’s market for novel architectures right now.
You can't outspend the industry labs given the compute inflation in transformer architectures (unless you are ridiculously well connected in the venture/sovereign funding communities).
And realistically, do we need another GPT4 evaluation paper?
That is by far not the only thing industry labs are working on currently. I work in one. My group might be unusual, but I can’t name a single currently active project here that is not a departure from Transformers one way or another. I expect a ton of such efficiency oriented work in the next 4-5 years. We can’t be burning money as inefficiently as we do it right now.
IANAL, and this is not legal advice, but I don't think it really impacts anything for academic research. That said, Legal usually has a major fit when AGPL is even peripherally mentioned.
Interestingly, the AGPL has been something of a "boogey-man" to some commercial entities, going back 20+ years now. The GPL too, albeit to a lesser extent. Anyway, this may well be a great opportunity for a firm who bothers to look a bit deeper and say "OK, maybe the AGPL isn't something to be scared of after all". Just comply with the terms and "no harm, no foul".
That's only true IF you're thinking specifically of the GNU AGPL. The AGPL - more broadly construed - has a pre-GNU history that goes back further. That said, the GNU AGPL evolved out of the original AGPL. See:
In March 2002, Affero, Inc. published the original Affero General Public License (AGPLv1) for use with the Affero project and made the new license available for use by other software-as-a-service developers.
This is exciting because it is an architecture that had so much promise, but we could never solve the gradient/parallelization problems better than transformers.
This code will allow people to experiment and see if it is a viable architecture at foundation/frontier model scale.
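The gradient problem mentioned above is easy to see numerically. Here's a toy sketch (my own made-up example, nothing from the xLSTM paper): backprop through time multiplies by the recurrent Jacobian at every step, so gradients shrink geometrically with sequence length in a vanilla RNN.

```python
import numpy as np

# Toy illustration of the classic RNN vanishing-gradient problem.
# The weight scale and dimensions are arbitrary, chosen to show the
# shrinking case; with larger weights you get the exploding case instead.

rng = np.random.default_rng(0)
d = 32
W = rng.normal(scale=0.05, size=(d, d))  # small recurrent weights

grad = np.eye(d)
norms = []
for t in range(1, 101):
    grad = W.T @ grad                    # one backprop-through-time step
    if t in (1, 50, 100):
        norms.append(np.linalg.norm(grad))

print(norms)  # gradient norm collapses as we go further back in time
```

Gated architectures (LSTM originally, and now xLSTM's exponential gating) exist largely to fight exactly this decay.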
Could someone provide a quick summary where they stand compared to transformer architectures? Do they have real world scale results that are competitive?
- They outperform transformers at lower parameter counts. Time will tell if that holds up at larger scales
- Their compute scales linearly with sequence length (vs. quadratic for attention), which means with a longer context window they will be faster and cheaper than transformers
- It's been mostly academic as far as I know, only just recently being published. I don't think there's been an opportunity to use them at 'real world scale' yet, although tbh I'm a little uncertain what you mean by it.
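To make the linear-scaling point concrete, here's a back-of-the-envelope sketch. The cost functions are rough approximations I'm assuming for illustration (attention does roughly n² · d work per layer over n tokens of width d, a fixed-state recurrent layer roughly n · d²), so the ratio comes out to n/d:

```python
# Hypothetical per-layer FLOP estimates, illustration only:
# self-attention pays for pairwise token interactions, a linear-time
# recurrent layer pays one fixed-size state update per token.

def attention_cost(n, d):
    """Rough per-layer work for self-attention on n tokens of width d."""
    return n * n * d

def recurrent_cost(n, d):
    """Rough per-layer work for a linear-time recurrent layer."""
    return n * d * d

d = 1024
for n in (1_024, 8_192, 65_536):
    ratio = attention_cost(n, d) / recurrent_cost(n, d)
    print(f"context {n:>6}: attention/recurrent cost ratio = {ratio:.1f}")
```

So at short contexts the two are comparable, but once the context is much longer than the model width the recurrent approach wins by the same factor the context grew.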
It does seem to match transformers, but I wouldn't say it meaningfully outperforms them in terms of quality vs parameters.
Model: #Params (M), SlimPajama (15B) ppl ↓
- GPT-3: 356M, 14.26
- Llama: 407M, 14.25
- H3: 420M, 18.23
- Mamba: 423M, 13.70
- Hyena: 435M, 17.59
- RWKV-4: 430M, 15.62
- RWKV-5: 456M, 16.53
- RWKV-6: 442M, 17.40
- RetNet: 431M, 16.23
- HGRN: 411M, 21.83
- GLA: 412M, 19.56
- HGRN2: 411M, 16.77
- xLSTM[1:0]: 409M, 13.43
- xLSTM[7:1]: 408M, 13.48
There are more detailed perplexity and task benchmarks in the paper. Overall, all the architectures perform very similarly on every benchmark, sometimes xLSTM is slightly ahead but not always, and the difference is not really meaningful.
This is great news though: it means we are not losing anything by switching to xLSTM, and we get important advantages like the scalable context window.
I'm quite excited about this because we can potentially have the LLM remember what you say and do few-shot persistent learning from user interaction (updating "itself", the state vector). It would be very interesting if LLMs were no longer static. Although I'm sure it will be a challenge to train the model to keep such learnings in its memory long-term.
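What I mean by "updating itself" is roughly this (a toy recurrent cell, not the actual xLSTM update rule, and the weights are random stand-ins): the model's memory is a fixed-size state vector that each interaction nudges, rather than a context window that eventually scrolls off.

```python
import numpy as np

# Toy sketch of persistent state across user turns. W, U, and the
# "embedded message" inputs are all hypothetical placeholders.

rng = np.random.default_rng(0)
d = 16
W = rng.normal(scale=0.1, size=(d, d))   # recurrent weights (stand-in)
U = rng.normal(scale=0.1, size=(d, d))   # input weights (stand-in)

def step(state, x):
    """One recurrent update: the state vector is the only memory."""
    return np.tanh(W @ state + U @ x)

state = np.zeros(d)
for turn in range(3):                     # three user "interactions"
    x = rng.normal(size=d)                # stand-in for an embedded message
    state = step(state, x)                # each turn updates the memory

print(state.shape)  # memory stays a fixed-size vector no matter how long
```

The open question, as you say, is training the model so that useful learnings actually survive in that state over many turns instead of being overwritten.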
I'm not clear on what advantage this architecture has over mamba/Griffin. They also have the linear scaling, better sequence parallelism and are competitive in performance with transformers.
Are there any studies on predicting neural architecture scaling? E.g. a small training dataset which indicates performance on a large training dataset?
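The usual approach is scaling laws: run a few cheap pilot models, fit a power law, and extrapolate. A minimal sketch of that fit, with entirely made-up pilot numbers (not from any paper):

```python
import numpy as np

# Fit loss ~ a * N^b in log-log space on small "pilot" runs, then
# extrapolate to a larger model size. All data points are invented
# for illustration.

params = np.array([1e6, 3e6, 1e7, 3e7])   # hypothetical pilot model sizes
losses = np.array([5.2, 4.6, 4.0, 3.5])   # hypothetical eval losses

# Linear fit in log-log space: log(loss) = log(a) + b * log(N)
b, log_a = np.polyfit(np.log(params), np.log(losses), 1)
a = np.exp(log_a)

def predicted_loss(n):
    return a * n ** b                      # fitted b comes out negative

print(f"fitted exponent b = {b:.3f}")
print(f"extrapolated loss at 1B params: {predicted_loss(1e9):.2f}")
```

Whether a new architecture follows the same curve as transformers is exactly what's unknown here, which is why people want someone to spend the compute on a large run.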