huggingface/transformers

Add RWKV2 (fast)

leondz opened this issue Β· 72 comments

Model description

I would like to implement a new model architecture.

Short description

RWKV v2 is an "RNN with transformer-level performance, without using attention. Similar to Apple's Attention Free Transformer. All trained models open-source. Inference is very fast (even on CPUs) and might work on cell phones. There's also a GPT-type implementation." -- (Hochreiter's description)

RWKV v2 is parallelizable because the time-decay of each channel is data-independent (and trainable). For example, in a usual RNN you can adjust the time-decay of a channel from, say, 0.8 to 0.5 (these are called "gates"), while in RWKV v2 you simply move the information from a W-0.8 channel to a W-0.5 channel to achieve the same effect. RWKV can leverage GPUs, but doesn't need to.
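
To make the parallelization point concrete, here is a minimal NumPy sketch (mine, for illustration only, not code from the RWKV repo) showing that a fixed per-channel decay lets the same computation be written either as a step-by-step recurrence or as one parallel weighted sum over past positions:

  import numpy as np

  T, C = 8, 4                        # sequence length, number of channels
  np.random.seed(0)
  x = np.random.randn(T, C)          # per-position inputs for each channel
  w = np.random.rand(C)              # per-channel decay in [0, 1): data-independent, trainable

  # Recurrent form: state_t = w * state_{t-1} + x_t
  state = np.zeros(C)
  rnn_out = np.empty((T, C))
  for t in range(T):
      state = w * state + x[t]
      rnn_out[t] = state

  # Parallel form: out_t = sum_{s <= t} w^(t-s) * x_s, computed for all t at once
  diff = np.arange(T)[:, None] - np.arange(T)[None, :]                # t - s
  powers = w[None, None, :] ** np.clip(diff, 0, None)[:, :, None]     # w^(t-s)
  causal = (diff >= 0)[:, :, None]                                    # ignore the future
  par_out = np.einsum('tsc,sc->tc', np.where(causal, powers, 0.0), x)

  assert np.allclose(rnn_out, par_out)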

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

Implementation and weights

There's an implementation at BlinkDL/RWKV-LM which also gives a detailed description of the model internals and some performance benchmarks. Model weights currently are being trained for a few datasets, including the Pile (see e.g. BlinkDL/RWKV-v2-RNN-Pile) and Danish Gigaword by me. Both will be openly available - some checkpoints for the Pile already are, even though it's an ongoing process.

Status

The model seems quite exciting and I'm able to replicate preliminary results. I'm already talking with @BlinkDL about the implementation. I'm happy to implement/port the model architecture (for both RNN and GPT variants), tokenizer, and tests myself (and have already started) and would appreciate help and advice.

-- on second thoughts: it's not immediately clear to me how many people will use this particular model, or how it will perform. What I'd really like to do is implement and develop it on Hub, and see if it's useful/popular there. I spent some time with the docs, and the route to adding new model architectures seems to preferentially support adding directly to transformers. Tooling for new model architectures that worked on Hub (e.g. cookiecutter, class organisation, and tests) would be super neat. Is that something there's any interest in?

-- on second thoughts: it's not immediately clear to me how many people will use this particular model, or how it will perform.

To answer your question: If it performs better than the other CausalLM models out there, it will most likely get used. Make a PR, build an initial version that can be run on HF, and see if any of the HF devs are willing to chime in. I am interested in this work, particularly because it solves a problem I haven't seen solved before: being able to run CausalLM models on CPU. And my work stretches beyond the KoboldAI team; I know there are more people out there who would benefit from CPU models, given the high prices of GPUs right now.

Work is going OK. We're porting the GPT-like part to Transformers first, for training and inference, and will work out the fast RNN inference-only part after the GPT part passes tests.

xloem commented

Where is your work at? I have worked on this model and would like to contribute. I'm also experienced now at troubleshooting parts of this model (mostly inference accuracy, though), and have spent time understanding the CUDA kernels. I have some experience with adapting new codebases to unexpected feature-set combinations.

I'm also curious how this one is coming along. (I just saw the original paper today. Not sure how I missed it...)

@leondz are you guys still working on this? I am looking to get into this if this can work on edge devices

xloem commented

Some time ago I looked a little into continuing this, but other things came up.
After that experience, I would recommend that future implementers start a fresh fork rather than working off the existing one: very little has been done, so it takes extra effort to learn the existing state without much return.
For the record:
leondz's branch is at https://github.com/leondz/transformers/tree/rwkv-v2 .
I added smidges to it at https://github.com/xloem/transformers/tree/rwkv-v2 and https://github.com/xloem/transformers/tree/rwkv-v2-disable_non_clm_for_now .

Since that work, RWKV is on version 4 now (although the changes between versions are not generally complex): https://github.com/BlinkDL/RWKV-LM

I can't understand why this hasn't seen wider adoption. It makes me a bit skeptical. If it's better in all ways compared to the original transformer paper why wouldn't we see adoption from Meta, OpenAI, DeepMind etc?

xloem commented

You could ask the same about any model or technology near the top of a leaderboard. Things happen because people do the work or make the business decisions behind them happening. There are scads and scads of things better than the original transformer paper, but they're not normative yet.

I can't understand why this hasn't seen wider adoption. It makes me a bit skeptical. If it's better in all ways compared to the original transformer paper why wouldn't we see adoption from Meta, OpenAI, DeepMind etc?

This is better but GPT is good enough for most applications.
I will just keep training larger models. RWKV 14B release soon.

I can't understand why this hasn't seen wider adoption. It makes me a bit skeptical. If it's better in all ways compared to the original transformer paper why wouldn't we see adoption from Meta, OpenAI, DeepMind etc?

It's not presented well or clearly. I am working on a fork / Hugging Face integration that answers those questions; this is pretty much a breakthrough model imo, I am just making sure the claimed runtimes hold up. It's still in the R&D phase; the adoption phase comes soon after.

I spent about a month working on this but the code wasn't stable and wasn't version controlled in the normal way, which made refactoring really tricky. Then time ran out. I think if the engineering side of things is fixed, and there's a stable release, it's a great model - definitely more data-efficient than competitors, which is really the core factor now.

I can't understand why this hasn't seen wider adoption. It makes me a bit skeptical. If it's better in all ways compared to the original transformer paper why wouldn't we see adoption from Meta, OpenAI, DeepMind etc?

For our own project we have basic support for it worked around in with the original code base, but the reason we don't fine-tune it or support it properly is that Hugging Face support is missing and we are tightly integrated with Hugging Face. I assume other providers/projects have the same issue. For adoption I'd love to see RWKV land in Hugging Face so we can begin to offer it to our users the proper way, without them relying on manual steps, and without missing features for this model.

Yeah, but why doesn't OpenAI literally just spend one month on this with 10 guys and use it? I think it has some drawback, but no one can tell me what it is... It feels reasonable that all new papers from Google and OpenAI should use this.

There are a number of papers with similar "exponential moving average" design now.

For example, S4D is using slightly fancier kernels: https://github.com/HazyResearch/state-spaces (while I find simple kernels are enough).

RWKV is weaker at LAMBADA (compared with GPT) when the model is small (< 3B), but I find that adding a single tiny QKV attention is enough to solve it (it helps a small model copy words from the prompt).

Moreover, it's reasonable to expect a competitive linear-time attention model, because when human novelists write very long stories the speed is consistent (except GRRM lol).
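
For illustration only, here is a rough PyTorch sketch of what such a tiny QKV attention add-on could look like; the module name, inner dimension, and placement are my assumptions rather than the actual RWKV-4a code:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class TinyAttention(nn.Module):
      """Hypothetical single-head attention with a small inner dimension, meant to
      sit alongside the usual RWKV blocks and help a small model copy words from
      the prompt; names and sizes here are assumptions for illustration."""
      def __init__(self, d_model: int, d_tiny: int = 64):
          super().__init__()
          self.q = nn.Linear(d_model, d_tiny, bias=False)
          self.k = nn.Linear(d_model, d_tiny, bias=False)
          self.v = nn.Linear(d_model, d_model, bias=False)
          self.d_tiny = d_tiny

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          # x: (batch, seq, d_model); a causal mask keeps the block autoregressive
          T = x.size(1)
          scores = self.q(x) @ self.k(x).transpose(-2, -1) / self.d_tiny ** 0.5
          mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
          scores = scores.masked_fill(mask, float("-inf"))
          return x + F.softmax(scores, dim=-1) @ self.v(x)  # cheap residual add-on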

I don't think this project is well known; there's a huge ecosystem based on just what works right now, i.e. T5 and GPTx. For example, Perceiver IO and Perceiver AR by DeepMind seem to do something similar to get linear attention. To get this project to that level of popularity we have to build various production-level proofs; most people already understand the challenges of the T5 and GPTx series. Second, from a product perspective the model isn't as important as the data. People are betting that it's smarter to deploy a product with shitty AI and wait for the improvements before investing in the R&D: they build the product and make it easy to replace the AI portion of it in 10 minutes. These factors make it difficult for projects and independent researchers to get the spotlight they need.

"...this is the only architecture that has infinite context length."

Wait, really?... How did I miss that? I thought it was just a faster, more efficient approach.

xloem commented

The context length is presently limited by the accuracy of the floating point representation, due to the heavily simplified and unified architecture. RWKV is a strong combination of speed and long-context.

Right, okay. Well, that's pretty compelling, for sure...

The context length is presently limited by the accuracy of the floating point representation, due to the heavily simplified and unified architecture. RWKV is a strong combination of speed and long-context.

I think it's also limited by memory as well

xloem commented

There is no memory limit associated with context length that I am aware of with these models. State can be retained in a recurrent manner, providing for using only however much memory is available for accelerated parallel operation.
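
As a toy illustration of that point (the numbers and the step function are made up, this is not the RWKV API): the recurrent state is a fixed-size vector, so the memory needed to consume a context does not grow with its length:

  import numpy as np

  d_state = 1024
  state = np.zeros(d_state, dtype=np.float32)        # fixed size, no matter how long the context

  def step(state, token_embedding, decay=0.99):
      # stand-in for one recurrent step: fold the new input into the state
      return decay * state + (1.0 - decay) * token_embedding

  for t in range(100_000):                           # 100k tokens later...
      state = step(state, np.random.randn(d_state).astype(np.float32))

  print(state.nbytes, "bytes of state after 100k tokens")   # still just d_state floats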

There is no memory limit associated with context length that I am aware of with these models. State can be retained in a recurrent manner, providing for using only however much memory is available for accelerated parallel operation.

So you are telling me that the context is effectively encoded into the state. I am referring to the context length the model consumes. I guess what you are trying to say is that, because we have a state, the model can look into that state for any context size, and as a result it has an infinite context length? I looked into the code and it says

  T_MAX = 1024 # increase this if your ctx_len is long [NOTE: TAKES LOTS OF VRAM!]

so it appears to have a limit based on memory. @BlinkDL can you clarify?

Since the model support for this stalled, perhaps someone on HF's side such as @younesbelkada can help get this model supported?

There is no memory limit associated with context length that I am aware of with these models. State can be retained in a recurrent manner, providing for using only however much memory is available for accelerated parallel operation.

So you are telling me that the context is effectively encoded into the state. I am referring to the context length the model consumes. I guess what you are trying to say is that, because we have a state, the model can look into that state for any context size, and as a result it has an infinite context length? I looked into the code and it says

  T_MAX = 1024 # increase this if your ctx_len is long [NOTE: TAKES LOTS OF VRAM!]

so it appears to have a limit based on memory. @BlinkDL can you clarify?

I am not using the correct method to train it because I am lazy. But you can always finetune the model to support longer ctxlen. For example, fine-tuned to 4096 here:

https://huggingface.co/BlinkDL/rwkv-4-pile-3b

With the correct training method, I estimate the effective ctx_len can at least be 100K.

xloem commented

I suspect technically if you used a rational number representation rather than floating point it would have infinite context length.

Aside: I’m not an ML researcher, but I don’t know why downscaling like this doesn’t get more attention. It seems context length could be fully infinite by re-encoding past information for what is helpful for future states, and a network wired to discover its own architecture would quickly find this.

So it doesn't have "infinite" ctx_len, then.

RNN has infinite ctx_len if you use the correct training & inference method.

I am just being lazy because when the model is small it can't even generate perfect results for 1024 ctx_len.

So I will improve it only after the 50B-param model.

I suspect technically if you used a rational number representation rather than floating point it would have infinite context length.

Correct. And you can use FP64 to make it practically infinite.
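
A back-of-the-envelope check of that point, assuming an arbitrary example decay of 0.99 per channel: the step at which an old token's decayed contribution underflows is what effectively bounds the horizon, and it sits much further out in FP64 than in FP32:

  import numpy as np

  # count the steps until a constant per-channel decay drops below the smallest
  # normal float, i.e. the point where an old token's contribution is effectively lost
  for dtype in (np.float32, np.float64):
      decay = dtype(0.99)                  # example decay, chosen arbitrarily
      factor, steps = dtype(1.0), 0
      while factor > np.finfo(dtype).tiny:
          factor *= decay
          steps += 1
      print(dtype.__name__, "fades out after about", steps, "steps")
  # roughly 8.7e3 steps for float32 versus 7.0e4 for float64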

So then I ask again. Why hasn't this architecture shown wider adoption?

Why not try the model first lol. I believe 99.9+% of researchers haven't even tried it.

Some results, and user feedback:

[RWKV-demo screenshot]

So then I ask again. Why hasn't this architecture shown wider adoption?

Go and try the model, and you might see why (or not, depending on who you are).

So then I ask again. Why hasn't this architecture shown wider adoption?

Really simple: it's not on HF. I am waiting for this to be implemented in HF so I can use my trainer on it.

Hey guys, I am working on this: https://github.com/ArEnSc/Production-RWKV. It uses a Hugging Face-like interface and sets up RWKV quickly. Right now it only supports greedy decoding; it's very early days and I have not published the package just yet. I am going to support a few more samplers later, and it only loads the 1.5B model right now because I need to write some JSON configs. The project aims to stay close to the research while providing an avenue for production. We will get some optimizations in, as well as tables showing the results.

Hi everyone!
Happy to see that the community is very excited about this addition, and thanks @ArEnSc for the great repo, which will definitely make the integration process smoother.
Would someone mind opening a Pull Request to add this model (even if it's still a draft)? We'll be more than happy to help you, together with @ArthurZucker, on the conversion process. It seems that all the building blocks (architecture, tokenizer + model weights) are here, so the conversion should be quite easy.

Hi everyone! Happy to see that the community is very excited about this addition, and thanks @ArEnSc for the great repo, which will definitely make the integration process smoother. Would someone mind opening a Pull Request to add this model (even if it's still a draft)? We'll be more than happy to help you, together with @ArthurZucker, on the conversion process. It seems that all the building blocks (architecture, tokenizer + model weights) are here, so the conversion should be quite easy.

Great :) For optimal inference, we need to support both the RNN mode and the GPT mode of RWKV.

The idea is, we can use the GPT mode to process a very long prompt and generate the correct hidden state for the RNN mode, such that the RNN mode can efficiently continue from it.
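
For readers following along, a toy sketch of that two-mode handoff (the class and method names are made up, not the real implementation): the parallel "GPT mode" digests the prompt once and hands its final state to the cheap recurrent "RNN mode", which then continues token by token:

  import numpy as np

  class ToyRWKV:
      """Toy stand-in, only to show the state handoff; not the real architecture."""
      def __init__(self, d_state=16, vocab=100, seed=0):
          rng = np.random.default_rng(seed)
          self.emb = rng.standard_normal((vocab, d_state))
          self.head = rng.standard_normal((d_state, vocab))
          self.decay = 0.9

      def rnn_step(self, state, token):
          # RNN mode: one cheap recurrent step, fixed-size state in and out
          state = self.decay * state + self.emb[token]
          return state, state @ self.head

      def gpt_forward(self, tokens):
          # "GPT mode": digest the whole prompt in one pass; here it is just an
          # unrolled loop, a real kernel would process positions in parallel
          state = np.zeros(self.emb.shape[1])
          for tok in tokens:
              state, logits = self.rnn_step(state, tok)
          return state, logits

  model = ToyRWKV()
  state, logits = model.gpt_forward([1, 5, 42, 7])    # long prompt handled once
  for _ in range(5):                                  # then continue recurrently
      next_token = int(np.argmax(logits))
      state, logits = model.rnn_step(state, next_token)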

xloem commented

I would personally propose a hybrid mode that can do GPT-style extended contexts in an RNN way. This provides for training on very long contextual data if float64 is used, by processing the parts in sequence.

I would personally propose a hybrid mode that can do GPT-style extended contexts in an RNN way. This provides for training on very long contextual data if float64 is used, by processing the parts in sequence.

Yeah I will begin with "x% probability to extend the last sample" when training :)

So (quick sketch of the sampling below):

(1-x) prob. of chunkLen
x*(1-x) prob. of chunkLen*2
x^2*(1-x) prob. of chunkLen*3
...
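
For concreteness, a minimal sketch of that sampling scheme as I read it (not the actual training code): keep extending the sample with probability x, so a training example spans k chunks with probability x^(k-1) * (1-x):

  import random

  def sample_length(chunk_len: int, extend_prob: float) -> int:
      # extend the sample by one more chunk with probability extend_prob ("x" above)
      k = 1
      while random.random() < extend_prob:
          k += 1
      return k * chunk_len

  lengths = [sample_length(chunk_len=1024, extend_prob=0.5) for _ in range(10_000)]
  print(sum(lengths) / len(lengths))   # mean tends to chunk_len / (1 - extend_prob) = 2048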

It seems that @leondz already has a working branch and @ArEnSc has refactored the code to make it easy to use.
@ArEnSc can you maybe open a PR and tag @leondz and add him as a co-author (together maybe with all co-authors involved in the integration) so that we can move the discussion there? This plan seems to be the most efficient path towards faster integration, unless I am missing something here (I did not follow the integration issues faced by @leondz). Let me know if you need any help here.

xloem commented

It looked to me like the leondz work did not get past stubbing.

I ran into a similar issue trying to integrate; I'm simply not familiar with all the repository's in-depth norms for naming and testing.

Feel free to still open a draft PR and I can review the early stage to give you pointers!
The transformers-cli add-model-like command is very good if a similar model exists, but I am not sure that is our case πŸ˜… So just push anything and we'll help with the missing files, naming, etc.! πŸ€—

@younesbelkada @ArthurZucker ok, I'll make a new PR sometime today.
I am going to pivot my repo to be a lightweight version with few dependencies, featuring optimizations from the community, sorta like a FastT5 implementation variant, considering we have Hugging Face dev support for this. The only issue I had previously when attempting a transformers integration was running the tests.

Super cool! We'll be more than happy to help you regarding tests πŸ’ͺ

Hey,

This integration went fine until two snags were hit:

  1. the code for reading input couldn't be reproduced
  2. the code for training couldn't be reproduced

I would love to see these stable & independent in their own branch. There was no hope of getting RWKV2 to pass the HF model implementation requirements (esp. the model weights precisely matching!) without these being established, but maybe things are better now.

Re: uptake - this model kicks ass, imo the restrictions have only been the difficulty of re-using/reproducing the codebase while it was under development, and that the paper hadn't been written. The math all checks out (I even wrote some tutorial slides for teaching the model) and the implementations have been elegant, it's just engineering issues in the way. Once a reproducible training codebase & paper are out, it's πŸš€ time!

-- also would be super cool to have integrated the fast RNN inference if that's still working, but again the implementation and interface was fluid last time I tried to integrate this, and you can't integrate a moving implementation.

Wow very cool @leondz !
I would also be very keen to have a look at the tutorial you made; we could ultimately have it on the HF blog to announce the release of this architecture (ofc once we figure out everything about the integration, and happy to help you on the post too). How does that sound?

It's absolutely BlinkDL's project, so up to them and they get the headline credit, but that sounds lovely - I'm down :)

It's absolutely BlinkDL's project, so up to them and they get the headline credit, but that sounds lovely - I'm down :)

Can you share your slides? :)

Consider this a community project, and we can build an ecosystem on top of RWKV, like what happened with Stable Diffusion.

I will focus on improving the algorithm & model - now training RWKV-4a with a single tiny extra attention (just a few extra lines compared with RWKV-4) to further improve some difficult zero-shot tasks (such as LAMBADA) for smaller models.

Hey,

This integration went fine until two snags were hit:

  1. the code for reading input couldn't be reproduced
  2. the code for training couldn't be reproduced

I would love to see these stable & independent in their own branch. There was no hope of getting RWKV2 to pass the HF model implementation requirements (esp. the model weights precisely matching!) without these being established, but maybe things are better now.

Re: uptake - this model kicks ass, imo the restrictions have only been the difficulty of re-using/reproducing the codebase while it was under development, and that the paper hadn't been written. The math all checks out (I even wrote some tutorial slides for teaching the model) and the implementations have been elegant, it's just engineering issues in the way. Once a reproducible training codebase & paper are out, it's πŸš€ time!

-- also would be super cool to have integrated the fast RNN inference if that's still working, but again the implementation and interface was fluid last time I tried to integrate this, and you can't integrate a moving implementation.

Can I also get the slides? Perhaps a Google Docs link would be quickest; there are a few parts of this architecture that are still fuzzy to me.

xloem commented
  1. the code for reading input couldn't be reproduced
  2. the code for training couldn't be reproduced

I wasn’t aware. It’s too bad we didn’t take these things farther; I was having the opposite issue. @ArEnSc , please let us know if there are any snags preventing opening a PR so somebody else can step in too.

  1. the code for reading input couldn't be reproduced
  2. the code for training couldn't be reproduced

I wasn’t aware. It’s too bad we didn’t take these things farther; I was having the opposite issue. @ArEnSc , please let us know if there are any snags preventing opening a PR so somebody else can step in too.

It's important to say that this was due to the pace and mode of development, not the model's quality!

Might not be fully helpful, but I have a repository with a bunch of different variations on inference

For example, https://github.com/harrisonvanderbyl/rwkv_chatbot/blob/main/src/model_run_onnx.py is a file where I have made the code compatible with the ONNX, TensorFlow, and IREE inference converters (with only some minor tweaking).

@ArthurZucker
Hey, I am getting issues setting up the dev environment.
I am on Python 3.8.10, updated to the latest pip3. I create a venv using 3.8.10 and then run the command below.
I am on macOS Monterey, M1 Pro.
Which version of Python should I be developing on?

 pip3 install -e ".[dev]"
ERROR: Could not find a version that satisfies the requirement tensorflow-text; extra == "dev" (from transformers[dev]) (from versions: none)
ERROR: No matching distribution found for tensorflow-text; extra == "dev

Hi @ArEnSc
Indeed, it's a bit tricky to install the dev environment on a Mac M1.
Could you please replace your setup.py with this one: https://gist.github.com/younesbelkada/ce24f0b517db46502792c4b638d4f5b9 and run your command again?

After that, you need to run pip3 install numpy --upgrade and everything should work fine

ArEnSc commented

@younesbelkada

(.env) michaelchung@michaels-mbp transformers % pip install -e ".[dev]"

Obtaining file:///Users/michaelchung/Code/transformers
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting packaging>=20.0
  Using cached packaging-22.0-py3-none-any.whl (42 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Using cached tokenizers-0.13.2.tar.gz (359 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting requests
  Using cached requests-2.28.1-py3-none-any.whl (62 kB)
Collecting numpy>=1.17
  Using cached numpy-1.23.5-cp38-cp38-macosx_11_0_arm64.whl (13.3 MB)
Collecting tqdm>=4.27
  Using cached tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting regex!=2019.12.17
  Using cached regex-2022.10.31-cp38-cp38-macosx_11_0_arm64.whl (287 kB)
Collecting filelock
  Using cached filelock-3.8.2-py3-none-any.whl (10 kB)
Collecting huggingface-hub<1.0,>=0.10.0
  Using cached huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting pyyaml>=5.1
  Using cached PyYAML-6.0-cp38-cp38-macosx_12_0_arm64.whl
Collecting pytest-xdist
  Using cached pytest_xdist-3.1.0-py3-none-any.whl (36 kB)
Collecting rjieba
  Using cached rjieba-0.1.11-cp36-abi3-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl (5.7 MB)
Collecting unidic>=1.0.2
  Using cached unidic-1.1.0.tar.gz (7.7 kB)
  Preparing metadata (setup.py) ... done
Collecting phonemizer
  Using cached phonemizer-3.2.1-py3-none-any.whl (90 kB)
Collecting jaxlib<=0.3.6,>=0.1.65
  Using cached jaxlib-0.3.5-cp38-none-macosx_11_0_arm64.whl (61.3 MB)
Collecting codecarbon==1.2.0
  Using cached codecarbon-1.2.0-py3-none-any.whl (135 kB)
Collecting pyctcdecode>=0.4.0
  Using cached pyctcdecode-0.4.0-py2.py3-none-any.whl (45 kB)
Collecting flake8>=3.8.3
  Using cached flake8-6.0.0-py2.py3-none-any.whl (57 kB)
Collecting sacremoses
  Using cached sacremoses-0.0.53.tar.gz (880 kB)
  Preparing metadata (setup.py) ... done
Collecting tensorflow-metal
  Using cached tensorflow_metal-0.7.0-cp38-cp38-macosx_12_0_arm64.whl (1.4 MB)
Collecting GitPython<3.1.19
  Using cached GitPython-3.1.18-py3-none-any.whl (170 kB)
Collecting datasets!=2.5.0
  Using cached datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting scikit-learn
  Using cached scikit_learn-1.2.0-cp38-cp38-macosx_12_0_arm64.whl (8.2 MB)
Collecting sudachidict-core>=20220729
  Using cached SudachiDict-core-20221021.tar.gz (9.0 kB)
  Preparing metadata (setup.py) ... done
Collecting sacrebleu<2.0.0,>=1.4.12
  Using cached sacrebleu-1.5.1-py3-none-any.whl (54 kB)
Collecting Pillow
  Using cached Pillow-9.3.0-cp38-cp38-macosx_11_0_arm64.whl (2.9 MB)
Collecting tf2onnx
  Using cached tf2onnx-1.13.0-py3-none-any.whl (442 kB)
Collecting sentencepiece!=0.1.92,>=0.1.91
  Using cached sentencepiece-0.1.97-cp38-cp38-macosx_11_0_arm64.whl (1.1 MB)
Collecting evaluate>=0.2.0
  Using cached evaluate-0.3.0-py3-none-any.whl (72 kB)
Collecting fugashi>=1.0
  Using cached fugashi-1.2.1.tar.gz (337 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  Γ— python setup.py egg_info did not run successfully.
  β”‚ exit code: 1
  ╰─> [8 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/private/var/folders/jn/8d33s3c55jv5pctdc6wdnm2h0000gn/T/pip-install-xf18599w/fugashi_18a210c9f68f4c1fb6ece4f85f9f7479/setup.py", line 15, in <module>
          output, data_files = check_libmecab()
        File "/private/var/folders/jn/8d33s3c55jv5pctdc6wdnm2h0000gn/T/pip-install-xf18599w/fugashi_18a210c9f68f4c1fb6ece4f85f9f7479/fugashi_util.py", line 58, in check_libmecab
          raise RuntimeError("Could not configure working env. Have you installed MeCab?")
      RuntimeError: Could not configure working env. Have you installed MeCab?
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

Γ— Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
(.env) michaelchung@michaels-mbp transformers % 

closer! but still problems

I think you need to install MeCab through brew here. Can you try to run:

brew install mecab
brew install mecab-ipadic

and then re-run pip install -e ".[dev]"?

I had the same issue when installing; you should make sure to install fugashi==1.1.2a6 (and ignore the MeCab part).
You can also follow the short guide from #18355

xloem commented

Is a full dev environment needed to start with? Personally it would be quite inspiring to see a PR even if it didn't pass tests.

@ArEnSc did you manage to open a PR? I think it's ok to leave it as a draft even if the tests don't pass (i.e. no need to install the dev env, at least for the beginning; in the worst case we can take over the PR). Let us know what you think!

ArEnSc commented

Yeah, hey, sorry guys! Probably sometime this week or today. My day job is iOS development, not MLE; I just moonlight on a side job in NLP and speech synthesis in the media creation domain. Looking to transition eventually; hopefully this PR will be proof of my capabilities, so I won't abandon it =)

ArEnSc commented

#20737 is the draft; I'll probably generate all the scaffolding soon.

xloem commented

There is recent active work on interfacing multiple backends to RWKV at https://github.com/harrisonvanderbyl/rwkv_chatbot/blob/main/src/rwkvops.py#L914 (see the list at the end of the file).
EDIT: dev discussion happens in the RWKV Discord, where unfortunately I am not active.

ArEnSc commented

Yeah, we will be looking into that as soon as I figure out how the architecture works from a high level. I might have some questions, but I am tracing the model now.

I have made a very simple and dumb wrapper for RWKV including RWKVModel.from_pretrained and RWKVModel.generate functions that could maybe serve as inspiration: RWKV.py

This depends on the rwkv library: pip install rwkv==0.0.6

I'd like to tag @zphang. He recently implemented LLaMA support in transformers. Maybe adding RWKV would interest him as well.

fblgit commented

This is by far one of the best models right now; the performance of the 7B is outstanding.
How come the best model is not supported by HF?

Because nobody tried implementing it?

fblgit commented
We want to have a positive impact on the AI field. We think the direction of more responsible AI is through openly sharing models, datasets, training procedures, evaluation metrics and working together to solve issues. We believe open source and open science bring trust, robustness, reproducibility, and continuous innovation. With this in mind, we are leading [BigScience](https://bigscience.huggingface.co/), a collaborative workshop around the study and creation of very large language models gathering more than 1,000 researchers of all backgrounds and disciplines.

That's HF's mission, so I was wondering how come HF has missed the best model in the industry. It makes me think about the bias behind what this "Open" platform says vs. what it does.

And because of that, I was wondering how come the HF teams are not giving a hand to port this in.
I saw the LLaMA integration going in at flash speed with HF coverage.. so why hasn't this??

There is already an open PR by @ArEnSc

Two things:
If there are open PRs, mention their numbers so we can keep track of what is stale, duplicate, etc.

LLaMA was so fast because people actively wanted to use it: Meta releases something, and HF jumps in line and puts a PR together to support it. Since RWKV is not that big, there's no support. I am waiting eagerly for support...

Hi there,
I am also super excited about this model. I think that PR will go stale, as there has been no activity for a while. If someone wants to take the lead on it, I would be happy to assist, together with @ArthurZucker!

fblgit commented

Well, I won't go into the politics of whether a big company should get community support or not, bearing in mind their resources and manpower.

Projects like this, which are highly relevant, get no support. It's trending on GitHub.. what else are we looking for?

#17230
#20809
#21875
#20737