When calling the Qwen/Qwen-1_8B-Chat model through LangChain, the following error appeared during a conversation:
ERROR: object of type 'NoneType' has no len()
Traceback (most recent call last):
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain/chains/base.py", line 385, in acall
    raise e
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain/chains/base.py", line 379, in acall
    await self._acall(inputs, run_manager=run_manager)
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain/chains/llm.py", line 275, in _acall
    response = await self.agenerate([inputs], run_manager=run_manager)
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain/chains/llm.py", line 142, in agenerate
    return await self.llm.agenerate_prompt(
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 506, in agenerate_prompt
    return await self.agenerate(
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 466, in agenerate
    raise exceptions[0]
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 569, in _agenerate_with_cache
    return await self._agenerate(
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain_community/chat_models/openai.py", line 519, in _agenerate
    return await agenerate_from_stream(stream_iter)
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 85, in agenerate_from_stream
    async for chunk in stream:
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain_community/chat_models/openai.py", line 490, in _astream
    if len(chunk["choices"]) == 0:
TypeError: object of type 'NoneType' has no len()
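For context, here is a minimal sketch of the kind of call that goes down this failing code path. The endpoint, port, API key, and model name are illustrative assumptions; in a Langchain-Chatchat deployment the model sits behind a local OpenAI-compatible API:

from langchain_community.chat_models import ChatOpenAI

# Assumed local OpenAI-compatible endpoint; adjust to wherever your
# model server actually listens.
llm = ChatOpenAI(
    model_name="Qwen-1_8B-Chat",
    openai_api_base="http://127.0.0.1:20000/v1",
    openai_api_key="EMPTY",  # placeholder; a local server ignores it
    streaming=True,          # the failing frames above are the streaming path (_astream)
)
print(llm.invoke("你好").content)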
Puzzled: every other LLM model runs fine, only Qwen fails.
I searched around a lot; opinions varied and nothing solved it.
So I read the traceback carefully. The last frame says the problem is at File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain_community/chat_models/openai.py", line 490, so let's open the file around line 490 and look at the source:
if not isinstance(chunk, dict):
    chunk = chunk.dict()
if len(chunk["choices"]) == 0:
    continue
choice = chunk["choices"][0]
So the error is most likely this chunk arriving without choices: chunk["choices"] is None, and calling len(None) raises the TypeError.
Let's print the chunk to see what it actually contains, by modifying the code in this file to:
if not isinstance(chunk, dict):
    chunk = chunk.dict()
print(f'chunk:{chunk}')
if len(chunk["choices"]) == 0:
    continue
choice = chunk["choices"][0]
Run again, and the chunk output is:
chunk:{'id': None, 'choices': None, 'created': None, 'model': None, 'object': None, 'system_fingerprint': None, 'text': '**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**\n\n(FlashAttention only supports Ampere GPUs or newer.)', 'error_code': 50001}
At last, the real error message: NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE: FlashAttention only supports Ampere GPUs or newer.
So it looks like the real problem lies in flash-attention.
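Incidentally, rather than leaving a print inside library code, the same spot could be patched to surface the server's error payload directly. A sketch, with the text and error_code field names taken from the chunk printed above (judging by the chatchat env name, the backend is likely a fastchat-style worker, whose error payloads look exactly like this):

if not isinstance(chunk, dict):
    chunk = chunk.dict()
# An error payload arrives with choices=None plus text/error_code fields;
# raise a readable error instead of crashing on len(None).
if chunk.get("choices") is None:
    raise ValueError(
        f"model worker returned an error: {chunk.get('text')} "
        f"(error_code={chunk.get('error_code')})"
    )
if len(chunk["choices"]) == 0:
    continue
choice = chunk["choices"][0]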
Checking the Tongyi Qianwen (Qwen) installation notes on Hugging Face:
Dependency
To run Qwen-1.8B-Chat, make sure the requirements above are met, then run the following pip command to install the dependencies:
pip install transformers==4.32.0 accelerate tiktoken einops scipy transformers_stream_generator==0.0.4 peft deepspeed
In addition, installing the flash-attention library (flash attention 2 is now supported) is recommended for higher efficiency and lower GPU memory usage:
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# The installs below are optional and may be quite slow.
# pip install csrc/layer_norm
# pip install csrc/rotary
Following the docs, flash-attention had been installed correctly, so the problem shouldn't be the installation itself.
An issue on QwenLM suggests uninstalling flash-attn: https://github.com/QwenLM/Qwen/issues/438
Then I found an explanation of this problem in the Hugging Face community (https://huggingface.co/Qwen/Qwen-7B-Chat/discussions/37):
flash attention is an optional component that accelerates model training and inference, and it only applies to NVIDIA GPUs with the Turing, Ampere, Ada, or Hopper architectures (e.g. H100, A100, RTX 3090, T4, RTX 2080). You can run inference with the model normally without installing flash attention.
Checking my own GPU against this list made everything clear: my GPU simply isn't supported by flash attention!
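To check this on your own machine, the compute capability reported by PyTorch tells you the architecture; a quick sketch (Ampere and newer report a major version >= 8, which is what the "Ampere GPUs or newer" error is really testing):

import torch

# CUDA compute capability: Ampere A100 -> (8, 0), RTX 3090 -> (8, 6);
# Turing cards such as the T4 / RTX 2080 report (7, 5).
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), (major, minor))
if major < 8:
    print("This GPU predates Ampere; flash-attention 2 will refuse to run.")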
So the solution is:
pip uninstall flash-attn
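Alternatively, if flash-attn needs to stay installed for other models, Qwen's trust_remote_code model files expose a use_flash_attn option that can be disabled per model. A sketch; treat the keyword as an assumption and verify it against your local modeling_qwen.py:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-1_8B-Chat",
    trust_remote_code=True,
    use_flash_attn=False,  # assumed switch: skip FlashAttention on unsupported GPUs
).eval()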