The other day, I was rudely interrupted by a VIDEO_TDR_FAILURE BSOD on my 32-bit Vista laptop. With everything else I needed to get done, I did not get a chance to investigate the failure until Santa gave us all a much-needed break.
One of the most impressive things about the new Vista Windows Display Driver Model (WDDM) drivers is that the display driver can crash and restart without taking the system down. I was shocked (and impressed at the same time) when I first heard about this. According to OCA data received by Microsoft, Vista is able to recover the desktop 93% of the time a Graphics Processing Unit (GPU) hang is detected.
TDR, which stands for Timeout Detection and Recovery, is a feature of the DirectX graphics kernel subsystem (dxgkrnl.sys). If the GPU, which is the heart of the graphics card, cannot be preempted from its current task by the Video Scheduler for 2 seconds (this can be overridden by the TdrDelay registry value under HKLM\System\CurrentControlSet\Control\GraphicsDrivers), it is considered hung. Once the GPU is determined to be hung, the system collects the state information necessary for post-mortem analysis. The system then gives threads 5 seconds (this can be overridden by the TdrDdiDelay registry value) to leave the driver. If that timeout is not sufficient, a VIDEO_TDR_FAILURE (0x116) bugcheck is reported. Otherwise the GPU is reset and the system attempts to restore the desktop to the state it was in before the GPU hang. All of this happens without a reboot; Vista logs event 4101 (Display driver <yourdisplaydriver> stopped responding and has successfully recovered) and shows the message in the system tray.
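To make that sequence concrete, here is a minimal Python sketch of the decision flow as I understand it from the description above. It is a toy model, not how dxgkrnl.sys is implemented: the function name, the enum, and the idea of passing in measured times are all my own assumptions, and the real recovery can fail for reasons this model ignores.

```python
from enum import Enum

# Default timeouts as described in the text; both can be overridden
# by registry values (TdrDelay and TdrDdiDelay).
DEFAULT_TDR_DELAY_S = 2      # how long the scheduler waits for preemption
DEFAULT_TDR_DDI_DELAY_S = 5  # how long threads get to leave the driver


class TdrOutcome(Enum):
    NO_HANG = "GPU preempted in time, nothing to do"
    RECOVERED = "GPU reset, desktop restored, event 4101 logged"
    BUGCHECK = "VIDEO_TDR_FAILURE bugcheck"


def classify_tdr(preempt_seconds, driver_exit_seconds,
                 tdr_delay=DEFAULT_TDR_DELAY_S,
                 tdr_ddi_delay=DEFAULT_TDR_DDI_DELAY_S):
    """Hypothetical helper: classify one GPU stall.

    preempt_seconds     -- time the GPU took to yield to the scheduler
    driver_exit_seconds -- time threads took to leave the driver
    """
    if preempt_seconds < tdr_delay:
        return TdrOutcome.NO_HANG
    # The GPU is considered hung; threads now get tdr_ddi_delay seconds.
    if driver_exit_seconds > tdr_ddi_delay:
        return TdrOutcome.BUGCHECK
    return TdrOutcome.RECOVERED


if __name__ == "__main__":
    for stall in [(0.5, 0.0), (3.0, 1.0), (3.0, 6.0)]:
        print(stall, "->", classify_tdr(*stall).value)
```

Passing larger values for tdr_delay and tdr_ddi_delay shows why raising the registry timeouts makes the same stall look harmless: the stall is still there, it is just no longer reported.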
In any case, WinDbg analysis of the dump produced the following:
One of the first things to note here is that FAULTING_IP is located in nvlddmkm, yet nvlddmkm is nowhere in the thread stack. The FAULTING_IP value is picked up from the second bugcheck parameter. The optional TDR_RECOVERY_CONTEXT pointer is passed in the first parameter, but unfortunately this structure is not documented. In my case, however, the memory at the context address was missing from the minidump, so it would not have been very useful even if the structure were known.
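For reference, here is a small Python sketch that labels the first three VIDEO_TDR_FAILURE arguments the way I read them in this post. The function name, the output wording, and the sample driver address are mine and purely illustrative. The only status code the sketch knows is 0xc00000b5 (STATUS_IO_TIMEOUT), which is the error code this dump carried in its third parameter.

```python
# The one NTSTATUS value this sketch knows about.
KNOWN_NTSTATUS = {0xC00000B5: "STATUS_IO_TIMEOUT"}


def describe_tdr_failure(param1, param2, param3):
    """Hypothetical helper: label the first three bugcheck arguments.

    param1 -- optional TDR_RECOVERY_CONTEXT pointer (0 if absent)
    param2 -- address inside the responsible display driver
    param3 -- error code of the last failed operation
    """
    if param1:
        context = f"TDR_RECOVERY_CONTEXT pointer: 0x{param1:08x}"
    else:
        context = "TDR_RECOVERY_CONTEXT pointer: not available"
    driver = f"address inside the responsible driver: 0x{param2:08x}"
    status = KNOWN_NTSTATUS.get(param3, "unknown to this sketch")
    error = f"last failed operation: 0x{param3:08x} ({status})"
    return [context, driver, error]


if __name__ == "__main__":
    # 0x8d6a7510 is a made-up address, not one taken from my dump.
    for line in describe_tdr_failure(0, 0x8D6A7510, 0xC00000B5):
        print(line)
```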
The third parameter, which is the error code of the last failed operation, is set to 0xc00000b5, or STATUS_IO_TIMEOUT. That seems to be typical of VIDEO_TDR_FAILURE BSODs. The driver had a version of 18.104.22.16863 with a timestamp of Fri Jun 15 18:20:38 2007. After searching Dell's site, I was pleased to see the following display driver update.
I ignored the "Optional" criticality and upgraded. The screen went blank during the installation and I feared I might be hitting this bug. Thankfully, the installer came back and prompted for a reboot. After the reboot I checked the driver version and it was exactly the same! So I did have the latest driver from nvidia, just as the Vista driver update had told me. I am surprised that nvidia's driver installer did not detect that and warn me about it.
Then I downloaded TechPowerUp GPU-Z 0.1.5. GPU-Z reported my GPU to be a G72M connected over PCI-Express with 128MB of memory. I then went ahead and downloaded nvidia nTune 5.05.47.0 to see if I could use anything in there to diagnose my problem [note that the version is 5.05.54.0 at the top of the download page, but the actual version on the installer is set to 5.05.47.0, so I do not know which one to believe]. In the nvidia Control Panel application I ran Perform Stability Test (located under System Stability) on all system components for 10 minutes. It is not clear what exactly these tests do, but a 3D game application was launched and shut down several times during the test, and the system passed. nTune also reported the GPU temperature and the RPMs of various fans.
Then I found myself looking at Video Scheduler counters in Performance Monitor (under Computer Management->System Tools->Reliability and Performance->Monitoring Tools).
After ogling the data for some time, I realized I had no idea what was OK in those counters for my system. While I was thinking about what I should try next, I came across some posts in the nvidia forums that suggested raising TdrDelay and TdrDdiDelay to 20 seconds and disabling GPU hang detection by setting TdrLevel to 0 (TdrLevelOff). While that is certainly a workaround, I do not wish to go down that path because a GPU hang is a critical failure that I do not want to mask.
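Before changing anything, it is at least worth checking whether any of these values are already set on a machine. The following is a read-only Python sketch, assuming the values live as DWORDs under the GraphicsDrivers key mentioned earlier. The helper names are my own, and the registry access is kept behind a small reader function so the reporting logic can be exercised without touching a real registry.

```python
import sys

GRAPHICS_DRIVERS_KEY = r"System\CurrentControlSet\Control\GraphicsDrivers"
TDR_VALUE_NAMES = ("TdrDelay", "TdrDdiDelay", "TdrLevel")


def read_tdr_overrides(read_value):
    """Collect the TDR values; read_value(name) returns an int or None."""
    return {name: read_value(name) for name in TDR_VALUE_NAMES}


def winreg_reader():
    """Build a reader backed by the real registry (Windows only)."""
    import winreg  # imported lazily so the sketch loads on any platform

    def read_value(name):
        try:
            with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                                GRAPHICS_DRIVERS_KEY) as key:
                value, _value_type = winreg.QueryValueEx(key, name)
                return value
        except FileNotFoundError:
            return None  # key or value absent: no override configured

    return read_value


def format_report(overrides):
    """Turn the collected values into printable lines."""
    lines = []
    for name, value in overrides.items():
        shown = "not set (system default applies)" if value is None else value
        lines.append(f"{name}: {shown}")
    return lines


if __name__ == "__main__" and sys.platform == "win32":
    for line in format_report(read_tdr_overrides(winreg_reader())):
        print(line)
```

The sketch only reads. A value that shows up here with a large timeout, or TdrLevel set to 0, would mean hang detection has already been relaxed or switched off on that machine.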
I am changing my crash dump options so that the next time a VIDEO_TDR_FAILURE BSOD happens, I have more memory to poke at. I am also trying to find a way to look at the GPU state in a live debugger or in a dump. Meanwhile, if you are experiencing blue screens similar to this, make sure to send them to Microsoft so that this is not ignored for long.
Merry Christmas and happy holidays.