CUDA out of memory after setting compute_metrics on the Hugging Face Trainer

Published: December 25, 2023

1. Problem description

I was using the Hugging Face Trainer to fine-tune Llama 2 with LoRA. After I set the compute_metrics argument, the run failed with a CUDA out-of-memory error.

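import transformers

# model, train_args, train_data, test_data, data_collator and compute_metrics
# are defined earlier in the training script.
trainer = transformers.Trainer(
    model=model,
    args=train_args,
    train_dataset=train_data,
    eval_dataset=test_data,
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # the out-of-memory error appears once this is set
)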

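2. Cause

Once compute_metrics is set, the Trainer keeps the model outputs (logits and so on) for every sample in the evaluation set and concatenates them into a single tensor. This accumulation happens on the GPU, and the resulting huge tensors are moved to the CPU only at the very end. In many cases GPU memory runs out before that transfer ever happens.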
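To get a feel for the scale, here is a rough back-of-the-envelope estimate of the accumulated logits. The evaluation set size (1000 samples) and sequence length (512 tokens) are hypothetical numbers chosen for illustration, and the calculation assumes the logits are stored as 4-byte floats; 32000 is the Llama 2 vocabulary size.

```python
def logits_size_gib(num_samples, seq_len, vocab_size, bytes_per_value=4):
    """Approximate size of the concatenated logits tensor in GiB."""
    total_bytes = num_samples * seq_len * vocab_size * bytes_per_value
    return total_bytes / 1024**3


# Hypothetical evaluation set: 1000 samples of 512 tokens each.
size = logits_size_gib(num_samples=1000, seq_len=512, vocab_size=32000)
print(f"{size:.1f} GiB")  # prints "61.0 GiB"
```

Even a modest evaluation set can therefore need far more memory than a single GPU offers if the full logits are kept.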

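3. Solutions

(1) Set eval_accumulation_steps in TrainingArguments. It controls how many prediction steps the outputs are accumulated for before the tensors are moved to the CPU. The official documentation says: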

eval_accumulation_steps (int, optional) — Number of predictions steps to accumulate the output tensors for, before moving the results to the CPU. If left unset, the whole predictions are accumulated on GPU/NPU/TPU before being moved to the CPU (faster but requires more memory).
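A minimal configuration sketch is shown here. The output directory, the batch size, and the value 8 are placeholders picked for illustration, not values from the original setup; a smaller value offloads to the CPU more often, which lowers peak GPU memory at the cost of some speed.

```python
from transformers import TrainingArguments

train_args = TrainingArguments(
    output_dir="./out",              # hypothetical path
    per_device_eval_batch_size=4,    # hypothetical value
    eval_accumulation_steps=8,       # move accumulated outputs to the CPU every 8 eval steps
)
```

Note that this only moves the accumulated tensors into host RAM; if the full logits are very large, the CPU side can still run out of memory, which is where the second option helps.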

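(2) Pass a preprocess_logits_for_metrics function to the Trainer. It defines how the logits are processed after each evaluation step. If you do not need the full logits (for example, I only want to know which class was predicted), you can reduce them in this function, which shrinks the memory used when the results are concatenated. The official documentation says: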

preprocess_logits_for_metrics (Callable[[torch.Tensor, torch.Tensor], torch.Tensor], optional) — A function that preprocess the logits right before caching them at each evaluation step. Must take two tensors, the logits and the labels, and return the logits once processed as desired. The modifications made by this function will be reflected in the predictions received by compute_metrics.
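As a sketch of this second option, the function below keeps only the argmax over the vocabulary dimension, so the cached tensor shrinks from (batch, seq_len, vocab_size) to (batch, seq_len). The metric (token-level accuracy), the one-position shift for a causal language model, and the use of -100 as the ignored label value are assumptions made for illustration; adapt them to your own task.

```python
import numpy as np


def preprocess_logits_for_metrics(logits, labels):
    """Reduce logits to predicted ids before the Trainer caches them.

    `logits` is a torch.Tensor of shape (batch, seq_len, vocab_size),
    or a tuple whose first element is that tensor.
    """
    # Some models return a tuple (logits plus extra tensors); keep only the logits.
    if isinstance(logits, tuple):
        logits = logits[0]
    # Drop the vocabulary dimension: only the predicted id per position is kept.
    return logits.argmax(dim=-1)


def compute_metrics(eval_pred):
    """Token accuracy; assumes -100 marks label positions to ignore."""
    preds, labels = eval_pred  # preds are already ids, not logits
    preds = np.asarray(preds)[:, :-1]   # the prediction at position t targets token t+1
    labels = np.asarray(labels)[:, 1:]
    mask = labels != -100
    if not mask.any():
        return {"accuracy": 0.0}
    return {"accuracy": float((preds[mask] == labels[mask]).mean())}
```

Both functions are then passed to the Trainer as compute_metrics=compute_metrics and preprocess_logits_for_metrics=preprocess_logits_for_metrics.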

This post draws on https://discuss.huggingface.co/t/cuda-out-of-memory-when-using-trainer-with-compute-metrics/2941

Source: https://blog.csdn.net/weixin_44902962/article/details/135198185