【Error】While reproducing an experiment on AutoDL, I ran into the following error: RuntimeError: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.
The GPU is an RTX 3090 (24 GB), and the software environment follows the environment.yaml from instruct-pix2pix.
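The number in the error message, 11070, looks like CUDA's integer version encoding (1000 × major + 10 × minor), which would make CUDA 11.7 the newest runtime the installed driver supports. The following is a minimal sketch for decoding it under that assumption; the helper name `decode_cuda_version` is hypothetical and not part of PyTorch.

```python
from typing import Tuple


def decode_cuda_version(encoded: int) -> Tuple[int, int]:
    """Decode a CUDA integer version, assumed to be 1000 * major + 10 * minor."""
    major = encoded // 1000
    minor = (encoded % 1000) // 10
    return major, minor


if __name__ == "__main__":
    # 11070 is the value reported in the RuntimeError above.
    major, minor = decode_cuda_version(11070)
    print(f"Driver supports CUDA up to {major}.{minor}")
```

If the PyTorch build in the environment targets a newer CUDA release than the decoded one, that mismatch would explain the error.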
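【Cause】Running nvidia-smi reports information about the GPU, including the driver version, the CUDA version, and some device details.
Following the hint in the error message, I looked up a suitable GPU driver at http://www.nvidia.com/Download/index.aspx and confirmed that the driver really is too old [1][2]: at least version 535.146.02 is required, while the server only has 515.76.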
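The comparison between the installed and the required driver version can also be checked mechanically. This is a minimal sketch, assuming driver versions are dot-separated integers; the helper names `parse_driver_version` and `driver_is_sufficient` are hypothetical. Versions are compared as integer tuples because plain string comparison would order components such as 98 and 146 incorrectly.

```python
from typing import Tuple


def parse_driver_version(text: str) -> Tuple[int, ...]:
    """Turn a version string such as '535.146.02' into (535, 146, 2)."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_sufficient(installed: str, required: str) -> bool:
    """Return True if the installed driver is at least the required version."""
    return parse_driver_version(installed) >= parse_driver_version(required)


if __name__ == "__main__":
    # Versions taken from the text above: installed 515.76, required 535.146.02.
    print(driver_is_sufficient("515.76", "535.146.02"))
```

The installed version string can be obtained with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed in as `installed`.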
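【Solution】I followed "AutoDL私有云 | GPU驱动" to update the driver, but its first step (uninstalling the current driver) could not be executed. The driver can instead be uninstalled by following "How can I uninstall a nvidia driver completely ?".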
After uninstalling the old driver, download and install the new one: wget https://us.download.nvidia.com/XFree86/Linux-x86_64/535.98/NVIDIA-Linux-x86_64-535.98.run
The last step failed with the following error: ERROR: An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occurred that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
I went through a lot of material but could not resolve this [3].
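One way to see which NVIDIA kernel modules are still loaded, and whether anything holds a reference to them, is to read /proc/modules. The sketch below assumes the usual Linux layout of that file (module name, size, reference count, then further fields); the helper name `loaded_nvidia_modules` is hypothetical. It only reports the modules and does not try to unload anything.

```python
from typing import List, Tuple


def loaded_nvidia_modules(proc_modules_text: str) -> List[Tuple[str, int]]:
    """Return (module name, reference count) for each loaded nvidia* module."""
    found = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Expected fields: name, size, reference count, dependents, state, address.
        if len(fields) >= 3 and fields[0].startswith("nvidia"):
            found.append((fields[0], int(fields[2])))
    return found


if __name__ == "__main__":
    with open("/proc/modules") as fh:
        for name, refcount in loaded_nvidia_modules(fh.read()):
            print(f"{name}: reference count {refcount}")
```

A nonzero reference count for nvidia_uvm suggests that something is still using the module, which matches the condition the installer complains about.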
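Because this is a remote server, the driver cannot be installed locally, so I recommend switching to a machine with a newer driver version.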