xLSTM code release by NX-AI (github.com/nx-ai)
123 points by badlogic on June 5, 2024 | 28 comments


Note: GNU AGPLv3. Industry labs won’t touch this with a hundred-foot pole. Given that they’re the only ones with access to serious resources, it could be a while before we see a large model of this architecture.


There's a really easy way around this, and that is to offer high salaries to the authors to join the company and reimplement their work.

For researchers, in all honesty, that is a very, very good reason to go GPL. If someone wants to profit off of it, it's not that they can't use the code commercially; they're just forced to hire you or pay you to dual-license it.

There's no reason why a company whose stock goes up $10B due to your model can't cut you a few million of that.


These researchers are being recruited regardless of what license they put on their academic code. In fact, I really doubt anyone in the industry cares about the license for this work. It's not a patent.


Most “serious” companies would probably rewrite it from scratch anyway


Reimplementation from paper is pretty common, though, no?


Yes, that’s why it’ll take time. There’s so much stuff competing for researchers’ attention, and experimentation with this takes so much time and $$$, that if it weren’t for Sepp Hochreiter on the list of authors this could get ignored entirely. IOW, it’s not a seller’s market for novel architectures right now.


You can't outspend the industry labs given the compute inflation in transformer architectures (unless you are ridiculously well connected in the venture/sovereign funding communities).

And realistically, do we need another GPT4 evaluation paper?


That is by far not the only thing industry labs are working on currently. I work in one. My group might be unusual, but I can’t name a single currently active project here that is not a departure from Transformers one way or another. I expect a ton of such efficiency oriented work in the next 4-5 years. We can’t be burning money as inefficiently as we do it right now.


How does AGPLv3 impact a lab's ability to do research on an implementation?


IANAL, and this is not legal advice. I don’t think it really impacts anything for academic research, but Legal usually has a major fit when the AGPL is even peripherally mentioned.


Interestingly, the AGPL has been something of a "boogey-man" to some commercial entities, going back 20+ years now. The GPL too, albeit to a lesser extent. Anyway, this may well be a great opportunity for a firm who bothers to look a bit deeper and say "OK, maybe the AGPL isn't something to be scared of after all". Just comply with the terms and "no harm, no foul".


You might be thinking of a different license, since the AGPL is not yet twenty years old.


> the AGPL is not yet twenty years old.

That's only true IF you're thinking specifically of the GNU AGPL. The AGPL - more broadly construed - has a pre-GNU history that goes back further. That said, the GNU AGPL evolved out of the original AGPL. See:

https://en.wikipedia.org/wiki/GNU_Affero_General_Public_Lice...

In March 2002, Affero, Inc. published the original Affero General Public License (AGPLv1) for use with the Affero project and made the new license available for use by other software-as-a-service developers.


It doesn't. They're able to; they just choose not to because of an irrational fear/hatred of the license.


This is exciting because it's an architecture that had so much promise, but whose gradient/parallelization problems we could never solve as well as transformers did.

This code will allow people to experiment and see if it is a viable architecture at foundation/frontier model scale.
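(For context on the gradient side: the paper's sLSTM replaces sigmoid gates with exponential ones, which don't saturate the same way, and keeps them numerically stable with a running log-space max. A toy scalar sketch of those update equations as I read them from the paper — output gate omitted, inputs invented:)

    import numpy as np

    def slstm_step(c, n, m, i_pre, f_pre, z):
        # Exponential gates overflow easily, so track a running
        # log-space maximum m and rescale both gates by it.
        m_new = max(f_pre + m, i_pre)    # stabilizer state
        i = np.exp(i_pre - m_new)        # stabilized input gate
        f = np.exp(f_pre + m - m_new)    # stabilized forget gate
        c = f * c + i * z                # cell state
        n = f * n + i                    # normalizer state
        return c, n, m_new, c / n        # hidden output h = c/n

    c = n = 0.0
    m = -np.inf
    for i_pre, f_pre, z in [(2.0, 1.0, 0.5), (50.0, 1.0, -0.3)]:
        c, n, m, h = slstm_step(c, n, m, i_pre, f_pre, z)
        print(h)  # stays finite even with huge pre-activations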


Recent and related:

xLSTM: Extended Long Short-Term Memory - https://news.ycombinator.com/item?id=40294650 - May 2024 (73 comments)


Could someone provide a quick summary where they stand compared to transformer architectures? Do they have real world scale results that are competitive?


- They outperform transformers at lower parameter counts. Time will tell if that holds up at higher parameter counts.

- They scale linearly in compute with sequence length, which means at longer context windows they will be faster and cheaper than transformers (see the sketch after this list)

- It's been mostly academic as far as I know, only just recently being published. I don't think there's been an opportunity to use them at 'real world scale' yet, although tbh I'm a little uncertain what you mean by it.
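(Not from the paper, but to put a number on the linear-vs-quadratic point — a toy cost model with invented constants:)

    # Attention does O(T^2) work over a context of T tokens; a
    # fixed-state recurrent model like xLSTM does O(T). Constants
    # below are invented purely for illustration.
    def attention_cost(T, d=4096):
        return T * T * d      # every token attends to every token

    def recurrent_cost(T, d=4096):
        return T * d * d      # fixed-size state update per token

    for T in (10_000, 100_000, 1_000_000):
        print(T, attention_cost(T) / recurrent_cost(T))
    # the ratio grows as T/d: ~2.4x at 10k tokens, ~244x at 1M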


It does seem to match transformers, but I wouldn't say it meaningfully outperforms them in terms of quality vs parameters.

Model — #Params (M) — perplexity on SlimPajama (15B tokens), lower is better:

- GPT-3: 356M, 14.26

- Llama: 407M, 14.25

- H3: 420M, 18.23

- Mamba: 423M, 13.70

- Hyena: 435M, 17.59

- RWKV-4: 430M, 15.62

- RWKV-5: 456M, 16.53

- RWKV-6: 442M, 17.40

- RetNet: 431M, 16.23

- HGRN: 411M, 21.83

- GLA: 412M, 19.56

- HGRN2: 411M, 16.77

- xLSTM[1:0]: 409M, 13.43

- xLSTM[7:1]: 408M, 13.48

There are more detailed perplexity and task benchmarks in the paper. Overall, all the architectures perform very similarly on every benchmark, sometimes xLSTM is slightly ahead but not always, and the difference is not really meaningful.
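(For anyone unfamiliar with the metric: perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better. A minimal illustration with made-up numbers:)

    import math

    # log-probabilities the model assigned to the actual next tokens
    log_probs = [-2.1, -0.3, -4.0, -1.2]    # made-up values
    ppl = math.exp(-sum(log_probs) / len(log_probs))
    print(ppl)  # exp(1.9) ~ 6.69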

This is great news though, it means we are not losing anything by switching to xLSTM and we get important advantages like the scalable context window.

I'm quite excited about this because we can potentially have the LLM remember what you say and do few-shot persistent learning from user interaction (updating "itself", the state vector). It would be very interesting if LLMs were no longer static. Although I'm sure it will be a challenge to train the model to keep such learnings in its memory long-term.
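(To give a flavor of what that state is: the paper's mLSTM keeps a matrix memory C_t updated as C_t = f_t C_{t-1} + i_t v_t k_t^T. A toy NumPy sketch of that update — output gate omitted, gates fixed to constants instead of learned projections, dimensions invented:)

    import numpy as np

    d = 8                                   # toy head dimension
    C = np.zeros((d, d))                    # matrix memory C_t
    n = np.zeros(d)                         # normalizer state n_t

    def mlstm_step(C, n, q, k, v, f=0.9, i=0.1):
        C = f * C + i * np.outer(v, k)      # decay, then store k/v pair
        n = f * n + i * k
        h = C @ q / max(abs(n @ q), 1.0)    # stabilized read-out
        return C, n, h

    rng = np.random.default_rng(0)
    for _ in range(5):
        q, k, v = rng.standard_normal((3, d))
        C, n, h = mlstm_step(C, n, q, k, v)
    # C and n are the "memory" of everything seen so far; persisting
    # them across sessions is the kind of thing described above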

The paper: https://arxiv.org/abs/2405.04517


> It would be very interesting if LLMs were no longer static.

A bit of a nightmare too: instructions keep piling up that you can no longer openly access and remove.


Linear scaling for context is also a big deal. Flash attention partially solved this for transformers, but xLSTM seems promising!


Deeper dive by Yannic if you want it: https://www.youtube.com/watch?v=0OaEv1a5jUM


I'm not clear on what advantage this architecture has over Mamba/Griffin. They also have linear scaling and better sequence parallelism, and are competitive in performance with transformers.


The whole field seems to be having issues with comparisons right now.

We really don't even know how Mamba vs Griffin compare.


state tracking...


Are there any studies on predicting neural architecture scaling? E.g. a small training dataset which indicates performance on a large training dataset?
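(That's essentially what the scaling-laws literature does — Kaplan et al. and the Chinchilla paper fit power laws of loss vs. compute/params/data from smaller runs and extrapolate. A minimal sketch of that kind of fit, with invented data points:)

    import numpy as np
    from scipy.optimize import curve_fit

    # loss(compute) ~ a * compute^(-b) + c, fit on small runs
    def power_law(x, a, b, c):
        return a * (x / 1e15) ** (-b) + c   # normalized for stability

    flops = np.array([1e15, 1e16, 1e17, 1e18])   # invented small runs
    loss = np.array([3.90, 2.89, 2.42, 2.19])    # invented final losses
    (a, b, c), _ = curve_fit(power_law, flops, loss, p0=(2.0, 0.3, 2.0))
    print(power_law(1e21, a, b, c))   # extrapolated loss at 10^21 FLOPs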


Congrats to the x.AI team!


This release was not from xAI, it was from NX-AI.



