Regarding the RUN_CUDA_RWKV6 part: it would be best to implement it in PyTorch, otherwise it is inconvenient to port #252
Comments
Thanks for the interest. Inference does not need CUDA (although prefill is faster with CUDA): https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v6_demo.py As for the chat demo, it uses \n\n as the stop token, because I replace every \n\n in the user's input with \n.
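For readers who want to mirror that convention, here is a minimal sketch. The helper names are made up for illustration; only the \n\n handling comes from the comment above:

```python
def sanitize_user_input(text: str) -> str:
    """Collapse every "\n\n" in the user's message into "\n", so that "\n\n"
    can only ever appear as the model's end-of-reply marker."""
    while "\n\n" in text:
        text = text.replace("\n\n", "\n")
    return text

def reply_is_finished(generated: str) -> bool:
    """Treat "\n\n" in the model output as the stop token, as the chat demo does."""
    return "\n\n" in generated
```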
Hi, this repo is currently maintained by me, and I'm working with the RWKV team, so you can treat it as an official version. The whole model is in PyTorch except for the wkv kernel. If you look at Transformers, the attention kernel is likewise written in CUDA/C++ inside PyTorch, or in Triton. The situation is the same for both RWKV and Transformers: expressing the same computation in native torch would be extremely slow, roughly 50x slower, because in torch's eager mode a 4096-token prefill launches on the order of 10,000 small kernels. So a fused kernel in CUDA or Triton is necessary. You can look at rwkv-fla for more details. Thank you! By the way, rwkv-kit can initialize RWKV v6 (0x60) from scratch.
You can also consider rwkv.cpp/llama.cpp, and we provide ONNX and pure-torch code as well: https://github.com/TorchRWKV/flash-linear-attention/blob/main/fla/ops/rwkv6/recurrent_naive.py
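For anyone who just wants to see what the fused kernel computes without touching CUDA, here is a rough pure-PyTorch sketch of the per-token RWKV-6 wkv recurrence, in the spirit of the recurrent_naive.py linked above. The shapes and the convention that w is already the per-token decay in (0, 1) are my assumptions; check the linked file for the exact interface:

```python
import torch

def naive_rwkv6_recurrence(r, k, v, w, u):
    """Sketch of the RWKV-6 wkv recurrence in plain PyTorch (not the fused kernel).
    Assumed shapes: r, k, w are (B, H, T, K); v is (B, H, T, V); u (the bonus for
    the current token) is (H, K). w is assumed to already be the per-token decay
    in (0, 1); real kernels may keep it in log space instead."""
    B, H, T, K = r.shape
    V = v.shape[-1]
    state = torch.zeros(B, H, K, V, dtype=torch.float32, device=r.device)
    out = torch.zeros(B, H, T, V, dtype=torch.float32, device=r.device)
    for t in range(T):
        # outer product k_t v_t^T for this token: (B, H, K, V)
        kv = k[:, :, t, :, None] * v[:, :, t, None, :]
        # the current token gets the extra "u" bonus before the state is updated
        out[:, :, t] = ((state + u[None, :, :, None] * kv) * r[:, :, t, :, None]).sum(dim=-2)
        # decay the state, then accumulate the new outer product
        state = state * w[:, :, t, :, None] + kv
    return out.to(v.dtype)
```

The sequential loop over T is exactly why this version is only useful as a readable reference or for porting: the fused CUDA/Triton kernels exist to avoid launching thousands of tiny ops like these during a long prefill.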
I looked at the direction of the paper and it is quite good, but the overall design is very unfriendly to anyone who wants to take the research further. People who want to use this framework generally hope to port it to edge devices, yet the core code is implemented in CUDA, which makes porting very troublesome and requires manually aligning the results. It seems every generation except v1 has been done this way?
I also tested the demo, and the recommended stop-token handling is not great either. For such a good theoretical framework, I suggest making the design easier for people to experiment with; only then does it have a real chance of being adopted in practice.
Just for reference.