Hi, I own a GeForce GTX TITAN X (bought directly from 10). nvidia-smi is giving me a warning:

WARNING: infoROM is corrupted at gpu 0000:03:00.0

Any suggestions on how I could check what might be happening?

Mon May  6 19:54:56 2019       
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX TIT...  Off  | 00000000:03:00.0  On |                  N/A |
| 22%   45C    P8    19W / 250W |    808MiB / 12212MiB |      4%      Default |
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0      1704      G   /usr/libexec/Xorg                             26MiB |
|    0      2053      G   /usr/bin/gnome-shell                          47MiB |
|    0      3578      G   /usr/libexec/Xorg                            279MiB |
|    0      7864      G   ...uest-channel-token=1###################    40MiB |
|    0     11693      G   ...uest-channel-token=17##################    44MiB |
|    0     24755      G   /opt/zoom/zoom                                32MiB |
|    0     25935    C+G   /opt/hfs17.0.416/bin/happrentice-bin         326MiB |
WARNING: infoROM is corrupted at gpu 0000:03:00.0

The inforom is a non-volatile storage device on the GPU. It is used to store various data. There is no public specification for its contents.

Corrupted means the inforom did not pass some sort of sanity check (e.g. checksum). Therefore the GPU driver won’t use or trust its contents.
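To illustrate what a checksum-style sanity check means in general, here is a minimal sketch. It is NOT the inforom format, which is unpublished; the record layout (payload followed by a 4-byte CRC-32) and the names `make_record` and `passes_sanity_check` are made up for illustration only.

```python
import zlib


def make_record(payload: bytes) -> bytes:
    """Build a hypothetical record: payload followed by its 4-byte CRC-32."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def passes_sanity_check(record: bytes) -> bool:
    """Return True if the trailing CRC-32 matches the payload before it."""
    if len(record) < 4:
        return False
    payload, stored = record[:-4], record[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored
```

A reader of such a record would recompute the checksum and refuse to trust the contents on a mismatch, which is the kind of behaviour described above.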

There is no publicly available utility to fix this. The card is damaged. Unless it is under warranty, there isn’t anything you can do to repair it. However, as you are aware, some aspects of the card functionality are still operational. There is no public specification for the behavior of the card with a corrupted inforom.

Thanks for the reply!

After some tests: this warning ONLY appears after Linux hibernation, and indeed after I wake the computer up I get some corrupted UI elements in a few applications. BUT before hibernation the card works 100% correctly and no warning is displayed.

Is there a way I could stress-test the card and confirm whether the hardware really is damaged, or whether there is simply a bug that corrupts a memory address during the hibernation process?
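This is not a stress test, but one way to confirm when the warning appears is to check the nvidia-smi output automatically before and after hibernating. Below is a rough sketch; the function names are my own, and the pattern is based only on the warning line quoted in this thread. I am not sure whether nvidia-smi prints the warning on stdout or stderr, so the sketch scans both.

```python
import re
import subprocess

# Pattern taken from the warning line quoted in this thread.
WARNING_RE = re.compile(r"WARNING: infoROM is corrupted at gpu (\S+)")


def find_inforom_warnings(text):
    """Return the bus IDs named in any infoROM corruption warnings in text."""
    return WARNING_RE.findall(text)


def check_gpu():
    """Run nvidia-smi and return the bus IDs it reports as corrupted."""
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=False
    )
    # Scan both streams, since the warning may go to either one.
    return find_inforom_warnings(result.stdout + "\n" + result.stderr)
```

Calling `check_gpu()` once after a fresh boot and once after resuming from hibernation should show whether the warning is tied to the hibernation cycle: an empty list means no warning was printed.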

I don’t have anything to suggest. It sounds like a software defect if it completely disappears when you reboot the system.

Yeah, it does sound like that, so it might just be an issue with the CUDA/NVIDIA driver itself. In that case, where should I submit a bug report?

I have had this issue on CentOS and Fedora 26, 27, 28, 29 with multiple different NVIDIA/CUDA drivers (the GUI is corrupted after hibernation, which might be related to the warning mentioned above).

Configuration setup: CentOS Linux release 7.6.1810 (Core) on a Precision T7610 system + driver 418.39 + NVIDIA Corporation GP102 TITAN Xp

No warning message was observed in the nvidia-smi output.

Steps taken to attempt a repro:

1. Opened vscode
2. Hibernated the system
3. Powered it back on
4. Ran nvidia-smi and found no warning message

[root@dhcp-10-24-141-60 ~]# nvidia-smi
Thu May  9 01:32:33 2019
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  TITAN Xp            Off  | 00000000:03:00.0  On |                  N/A |
| 23%   33C    P8    11W / 250W |    216MiB / 12192MiB |      0%      Default |

| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0     24537      G   /usr/bin/X                                   107MiB |
|    0     24761      G   /usr/bin/gnome-shell                          74MiB |
|    0     25492      G   …-token=8175359F555DE6C90C3E6E049C993347    28MiB |
|    0     25621      G   gnome-control-center                           3MiB |
[root@dhcp-10-24-141-60 ~]#

Please provide an NVIDIA bug report (it should be generated once you have hit the issue) and detailed steps to reproduce the issue locally.
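The bug report is normally produced by the `nvidia-bug-report.sh` script that ships with the driver; run as root, it writes `nvidia-bug-report.log.gz` to the current directory. A small sketch of a wrapper follows, assuming the script is on PATH. The helper names are hypothetical, and running the script directly from a root shell works just as well.

```python
import shutil
import subprocess


def bug_report_command(script_name="nvidia-bug-report.sh"):
    """Return the command to run, or None if the script is not on PATH."""
    path = shutil.which(script_name)
    return [path] if path else None


def generate_bug_report():
    """Run the bug report script; needs root to collect the full log."""
    cmd = bug_report_command()
    if cmd is None:
        raise FileNotFoundError("nvidia-bug-report.sh not found on PATH")
    subprocess.run(cmd, check=True)
```

Run it right after resuming from hibernation, while the warning is still showing, so that the log captures the failing state.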

