Note: GNU AGPLv3. Industry labs won’t touch this with a hundred foot pole. Given that they’re the only ones with access to serious resources, it could be a while before we see a large model of this architecture
There's a really easy way around this, and that is to offer high salaries to the authors to join the company and reimplement their work.
For researchers, in all honesty that is a very, very good reason to go GPL. If someone wants to profit off of it, it's not that they can't use the code commercially; they are just forced to hire you or pay you to dual-license it.
There's no reason why a company whose stock goes up $10B due to your model can't cut you a few million of that.
These researchers are being recruited regardless of what license they put on their academic code. In fact, I really doubt anyone in the industry cares about the license for this work. It's not a patent.
Yes, that’s why it’ll take time. There’s so much stuff competing for researchers’ attention, and experimentation with this takes so much time and $$$, that if it wasn’t for Sepp Hochreiter on the list of authors this could get ignored entirely. IOW it’s not the seller’s market for novel architectures right now.
You can't outspend the industry labs given the compute inflation in transformer architectures (unless you are ridiculously well connected in the venture/sovereign funding communities).
And realistically, do we need another GPT4 evaluation paper?
That is by far not the only thing industry labs are working on currently. I work in one. My group might be unusual, but I can’t name a single currently active project here that is not a departure from Transformers one way or another. I expect a ton of such efficiency oriented work in the next 4-5 years. We can’t be burning money as inefficiently as we do it right now.
IANAL, and this is not legal advice, but I don't think it really impacts anything for academic research. That said, Legal usually has a major fit when AGPL is even peripherally mentioned.
Interestingly, the AGPL has been something of a "boogey-man" to some commercial entities, going back 20+ years now. The GPL too, albeit to a lesser extent. Anyway, this may well be a great opportunity for a firm who bothers to look a bit deeper and say "OK, maybe the AGPL isn't something to be scared of after all". Just comply with the terms and "no harm, no foul".
That's only true IF you're thinking specifically of the GNU AGPL. The AGPL - more broadly construed - has a pre-GNU history that goes back further. That said, the GNU AGPL evolved out of the original AGPL. See:
In March 2002, Affero, Inc. published the original Affero General Public License (AGPLv1) for use with the Affero project and made the new license available for use by other software-as-a-service developers.
This is exciting because it is an architecture that had so much promise, but we could never solve the gradient/parallelization problems better than transformers.
This code will allow people to experiment and see if it is a viable architecture at foundation/frontier model scale.
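The gradient problem mentioned above is easy to see numerically. Here's a toy sketch (my own made-up example, nothing from the xLSTM paper): backprop through time multiplies by the recurrent Jacobian at every step, so gradients shrink geometrically with sequence length in a vanilla RNN.

```python
import numpy as np

# Toy illustration of the classic RNN vanishing-gradient problem.
# The weight scale and dimensions are arbitrary, chosen to show the
# shrinking case; with larger weights you get the exploding case instead.

rng = np.random.default_rng(0)
d = 32
W = rng.normal(scale=0.05, size=(d, d))  # small recurrent weights

grad = np.eye(d)
norms = []
for t in range(1, 101):
    grad = W.T @ grad                    # one backprop-through-time step
    if t in (1, 50, 100):
        norms.append(np.linalg.norm(grad))

print(norms)  # gradient norm collapses as we go further back in time
```

Gated architectures (LSTM originally, and now xLSTM's exponential gating) exist largely to fight exactly this decay.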
Could someone provide a quick summary where they stand compared to transformer architectures? Do they have real world scale results that are competitive?
- They outperform transformers at lower parameter counts. Time will tell if that holds up at larger scales
- Their compute scales linearly with sequence length (vs. quadratic for attention), which means with a longer context window they will be faster and cheaper than transformers
- It's been mostly academic as far as I know, only just recently being published. I don't think there's been an opportunity to use them at 'real world scale' yet, although tbh I'm a little uncertain what you mean by it.
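To make the linear-scaling point concrete, here's a back-of-the-envelope sketch. The cost functions are rough approximations I'm assuming for illustration (attention does roughly n² · d work per layer over n tokens of width d, a fixed-state recurrent layer roughly n · d²), so the ratio comes out to n/d:

```python
# Hypothetical per-layer FLOP estimates, illustration only:
# self-attention pays for pairwise token interactions, a linear-time
# recurrent layer pays one fixed-size state update per token.

def attention_cost(n, d):
    """Rough per-layer work for self-attention on n tokens of width d."""
    return n * n * d

def recurrent_cost(n, d):
    """Rough per-layer work for a linear-time recurrent layer."""
    return n * d * d

d = 1024
for n in (1_024, 8_192, 65_536):
    ratio = attention_cost(n, d) / recurrent_cost(n, d)
    print(f"context {n:>6}: attention/recurrent cost ratio = {ratio:.1f}")
```

So at short contexts the two are comparable, but once the context is much longer than the model width the recurrent approach wins by the same factor the context grew.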
It does seem to match transformers, but I wouldn't say it meaningfully outperforms them in terms of quality vs parameters.
Model: #Params (M), SlimPajama (15B) ppl ↓
- GPT-3: 356M, 14.26
- Llama: 407M, 14.25
- H3: 420M, 18.23
- Mamba: 423M, 13.70
- Hyena: 435M, 17.59
- RWKV-4: 430M, 15.62
- RWKV-5: 456M, 16.53
- RWKV-6: 442M, 17.40
- RetNet: 431M, 16.23
- HGRN: 411M, 21.83
- GLA: 412M, 19.56
- HGRN2: 411M, 16.77
- xLSTM[1:0]: 409M, 13.43
- xLSTM[7:1]: 408M, 13.48
There are more detailed perplexity and task benchmarks in the paper. Overall, all the architectures perform very similarly on every benchmark, sometimes xLSTM is slightly ahead but not always, and the difference is not really meaningful.
This is great news though: it means we are not losing anything by switching to xLSTM, and we get important advantages like the scalable context window.
I'm quite excited about this because we can potentially have the LLM remember what you say and do few-shot persistent learning from user interaction (updating "itself", the state vector). It would be very interesting if LLMs were no longer static. Although I'm sure it will be a challenge to train the model to keep such learnings in its memory long-term.
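What I mean by "updating itself" is roughly this (a toy recurrent cell, not the actual xLSTM update rule, and the weights are random stand-ins): the model's memory is a fixed-size state vector that each interaction nudges, rather than a context window that eventually scrolls off.

```python
import numpy as np

# Toy sketch of persistent state across user turns. W, U, and the
# "embedded message" inputs are all hypothetical placeholders.

rng = np.random.default_rng(0)
d = 16
W = rng.normal(scale=0.1, size=(d, d))   # recurrent weights (stand-in)
U = rng.normal(scale=0.1, size=(d, d))   # input weights (stand-in)

def step(state, x):
    """One recurrent update: the state vector is the only memory."""
    return np.tanh(W @ state + U @ x)

state = np.zeros(d)
for turn in range(3):                     # three user "interactions"
    x = rng.normal(size=d)                # stand-in for an embedded message
    state = step(state, x)                # each turn updates the memory

print(state.shape)  # memory stays a fixed-size vector no matter how long
```

The open question, as you say, is training the model so that useful learnings actually survive in that state over many turns instead of being overwritten.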
I'm not clear on what advantage this architecture has over mamba/Griffin. They also have the linear scaling, better sequence parallelism and are competitive in performance with transformers.
Are there any studies on predicting neural architecture scaling? E.g. a small training dataset which indicates performance on a large training dataset?
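The usual approach is scaling laws: run a few cheap pilot models, fit a power law, and extrapolate. A minimal sketch of that fit, with entirely made-up pilot numbers (not from any paper):

```python
import numpy as np

# Fit loss ~ a * N^b in log-log space on small "pilot" runs, then
# extrapolate to a larger model size. All data points are invented
# for illustration.

params = np.array([1e6, 3e6, 1e7, 3e7])   # hypothetical pilot model sizes
losses = np.array([5.2, 4.6, 4.0, 3.5])   # hypothetical eval losses

# Linear fit in log-log space: log(loss) = log(a) + b * log(N)
b, log_a = np.polyfit(np.log(params), np.log(losses), 1)
a = np.exp(log_a)

def predicted_loss(n):
    return a * n ** b                      # fitted b comes out negative

print(f"fitted exponent b = {b:.3f}")
print(f"extrapolated loss at 1B params: {predicted_loss(1e9):.2f}")
```

Whether a new architecture follows the same curve as transformers is exactly what's unknown here, which is why people want someone to spend the compute on a large run.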