[cudaErrorIllegalAddress] Karma XPU error

Member
14 posts
Joined: May 2018
To make a long story short, I'm working on quite a big project right now and realized that my GPU (RTX 4080 16GB) isn't being used at render time at all. Instead, when I first render after restarting Houdini, I get this error:

"KarmaXPU: device Type:Optix ID:0 has registered a critical error , so will now stop functioning. Future error messages will be suppressed"



After this error shows up once, my GPU no longer even appears in the list of XPU devices when rendering. Just 100% CPU usage. This remains until I restart Houdini, at which point it attempts to use my GPU only the first time a render initializes and eventually gives me the error above.

I've tried switching from Nvidia Game Ready Drivers to the latest Studio Driver (560.81). For the rest of my system, I'm on Windows 11 and using an AMD 7800X3D CPU, as well as 128GB of RAM. Full PC restart also doesn't fix the issue, and I'm getting the same error in the current daily Houdini build (20.5.328).

I don't have time right now to troubleshoot and find the exact source of the problem, so I'm throwing a hail mary here: Has anyone had the same issue, and if so, did you find a fix for it?
Edited by MCJamZam - Aug. 16, 2024 17:39:50
Staff
528 posts
Joined: Aug. 2019
If you open the display device and render stats in the viewport, you may get more details as to why it's failing. See "Display device and render stats in the viewport" here: https://www.sidefx.com/docs/houdini/solaris/karma_xpu.html#howto
Member
14 posts
Joined: May 2018
johnmather
If you open the display device and render stats in the viewport, you may get more details as to why it's failing. See "Display device and render stats in the viewport" here: https://www.sidefx.com/docs/houdini/solaris/karma_xpu.html#howto

Thank you, this is helping - my guess is it has something to do with running out of VRAM. My scene is using just barely more than 16GB of memory just for geometry instances, which is of course more than the VRAM on my GPU.

If you don't mind me asking, do you know if there's a way to break these render stats down further? Now that I know I'm using too much memory on geometry, I need to find out exactly which geometry is using the most memory. Rather than painstakingly going through my entire scene testing objects one by one, it'd be nice to get a list of memory usage per object, or something along those lines. (This might be something fairly basic in Solaris that I've totally missed so far, seeing as I'm still learning.)
Member
570 posts
Joined: Aug. 2014
I'm frequently encountering the same problem. I think I'm experiencing it since I upgraded from the current 278 production build to daily 328. It doesn't seem to be related to GPU running out of VRAM, because according to nvidia-smi, around the time when the Optix device fails there's still about 35% free VRAM available on it.
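
For anyone who wants to log VRAM around the moment the OptiX device fails, here is a minimal sketch, assuming `nvidia-smi` is on the PATH. The `--query-gpu` and `--format=csv,noheader,nounits` options are regular nvidia-smi options; the function names `parse_memory_csv` and `query_gpu_memory` are made up for illustration. Keep in mind that a single reading can miss a short allocation spike, so free VRAM in a snapshot may not by itself rule out an out-of-memory condition.

```python
import subprocess

# nvidia-smi query that prints one CSV row per GPU: index, used MiB, total MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse rows like '0, 5325, 8192' into dicts with a free-VRAM percentage."""
    rows = []
    for line in text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        used_mib, total_mib = int(used), int(total)
        rows.append({
            "gpu": int(index),
            "used_mib": used_mib,
            "total_mib": total_mib,
            "free_pct": 100.0 * (total_mib - used_mib) / total_mib,
        })
    return rows


def query_gpu_memory():
    """Run nvidia-smi once and return the parsed per-GPU memory rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)
```

Calling `query_gpu_memory()` in a loop with a short sleep while the render starts up would give a rough timeline to attach to a bug report.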

I'm using nvidia-driver 550.54.15 and an RTX 3070 running on Debian Bookworm.

Is it possible to restart the Optix device without resorting to restarting the whole program and reloading the scene? This would save me a lot of time.
Edited by ajz3d - Aug. 18, 2024 17:22:13
Member
570 posts
Joined: Aug. 2014
This shows right after the illegal address error:
KarmaXPU: Unable to create CUDA context for device 0 [CUDA_ERROR_MISALIGNED_ADDRESS] (maybe old driver? requires 535+)
Edited by ajz3d - Aug. 20, 2024 17:22:11
Staff
530 posts
Joined: May 2019
ajz3d
I think I'm experiencing it since I upgraded from the current 278 production build to daily 328.

Are you able to confirm that for us?

ajz3d
Is it possible to restart the Optix device without resorting to restarting the whole program and reloading the scene?

It depends on the error. Sadly, cudaErrorIllegalAddress requires a full restart of Houdini.
Have you tried restarting XPU (i.e., at the top right of the viewport, click the dropdown and choose "restart")?

ajz3d
I'm frequently encountering the same problem

Are you able to reliably reproduce this?
It would be great to get a repro scene + clear repro steps from you, so we can investigate.
Member
570 posts
Joined: Aug. 2014
brians
ajz3d
I think I'm experiencing it since I upgraded from the current 278 production build to daily 328.

Are you able to confirm that for us?
Hi Brians. Yes, I can confirm this. I installed daily 332 yesterday (which is now the new production build) while filling out a bug report about this particular issue, and this build also crashes OptiX on my end. So I reverted to 278, and there are no crashes at all, no matter what I do or how hard I try. Rock solid in this department.

brians
ajz3d
Is it possible to restart the Optix device without resorting to restarting the whole program and reloading the scene?

It depends on the error. Sadly, cudaErrorIllegalAddress requires a full restart of Houdini.
Have you tried restarting XPU (i.e., at the top right of the viewport, click the dropdown and choose "restart")?
Naturally, but it doesn't restart it. I also tried one of your suggestions from another thread, that is, switching to CPU and then back to XPU. The result is the same, unfortunately.

brians
ajz3d
I'm frequently encountering the same problem

Are you able to reliably reproduce this?
It would be great to get a repro scene + clear repro steps from you, so we can investigate.
Yes, I can reproduce it every single time. I'm preparing the package and will send it to support.
Edited by ajz3d - Aug. 21, 2024 10:14:21
Member
92 posts
Joined: Aug. 2017
brians
Sadly, cudaErrorIllegalAddress requires a full restart of Houdini.
Have you tried restarting XPU (i.e., at the top right of the viewport, click the dropdown and choose "restart")?



Is there any chance that the need to restart Houdini after cudaErrorIllegalAddress will go away any time soon? This is one of the few small annoyances of using XPU.
Member
570 posts
Joined: Aug. 2014
Brians, if you would like to inspect the scene and a video depicting the crash, the ticket is #156425.
Member
3 posts
Joined: June 2019
I noticed an interactivity issue while moving around the viewport with XPU when at least 50% of VRAM is in use, but only on the newest production build (.332); .278 doesn't have it.
(My very scientific method is to duplicate the test pig without instancing until it chugs in XPU.)
A friend with a different setup and 2x 4070 has the same issue.

Also, I would like to point out that XPU on 20.5 takes about 25% more VRAM compared to H20 with only textures/materials. I'm trying to narrow down the issue before sending a ticket, but I can't make sense of the inconsistency.
For example, I have a scene taking 5 GB of VRAM without materials on both versions; with materials it goes up to 6.2 GB on H20.0 and 8 GB on H20.5 while having the same look.
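
To put the figures from that one example scene side by side, here is a small worked calculation. It uses only the three numbers quoted above (5, 6.2 and 8 GB); the variable names are illustrative, and a single scene may of course not be representative.

```python
def percent_increase(old, new):
    """Relative increase from old to new, in percent."""
    return 100.0 * (new - old) / old


base_gb = 5.0          # scene without materials, same on both versions
h20_0_total_gb = 6.2   # with materials on H20.0
h20_5_total_gb = 8.0   # with materials on H20.5

# Increase in total VRAM between the two versions: roughly 29%.
total_increase = percent_increase(h20_0_total_gb, h20_5_total_gb)

# VRAM attributable to materials alone on each version.
materials_h20_0_gb = h20_0_total_gb - base_gb   # about 1.2 GB
materials_h20_5_gb = h20_5_total_gb - base_gb   # 3.0 GB

# Materials take about 2.5x as much VRAM on H20.5 in this scene.
material_ratio = materials_h20_5_gb / materials_h20_0_gb

print(f"total: +{total_increase:.1f}%, materials: x{material_ratio:.2f}")
```

So for this scene the total grows by roughly 29%, while the material-only portion grows about 2.5 times, which suggests the difference sits in how textures/materials are stored rather than in the geometry.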
Member
570 posts
Joined: Aug. 2014
Brians, have you by any chance had the opportunity to look into this issue and the scene I sent? I haven't heard from support for two weeks now, and the issue discussed here is pretty critical for me, because I can't upgrade to anything beyond the initial 278 production build without experiencing OptiX crashes. That of course means I have to struggle, on a daily basis, with a variety of bugs that have already been fixed in newer Houdini builds.
Edited by ajz3d - Sept. 8, 2024 15:42:18
Staff
530 posts
Joined: May 2019
We've had trouble reproducing your issue on our end.

ajz3d
I think I'm experiencing it since I upgraded from the current 278 production build to daily 328

If we could binary-search to find the exact version of Houdini that caused the issue, it would make it much easier to track down. Are you in a position to do that?
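
As a rough illustration of what that binary search over builds looks like, here is a minimal sketch. It assumes 278 is known good, 328 is known bad, every build number in between can be installed, and the problem stays present in all builds after it first appears. `first_bad_build` and the `is_bad` callback are hypothetical names; in practice `is_bad` means "install that build, run the repro scene, and report whether OptiX fails".

```python
def first_bad_build(good, bad, is_bad):
    """Return the lowest build number for which is_bad(build) is True.

    good: a build known to work; bad: a later build known to fail.
    Assumes every build after the first failing one also fails.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid    # failure already present at mid, look earlier
        else:
            good = mid   # mid still works, look later
    return bad


# Example with a made-up regression introduced in build 301.
tested = []

def fake_is_bad(build):
    tested.append(build)
    return build >= 301

result = first_bad_build(278, 328, fake_is_bad)
print(result, "found after testing", len(tested), "builds")
```

With 50 builds between the known-good and known-bad versions, this needs at most 6 installs instead of testing every daily.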
Member
570 posts
Joined: Aug. 2014
I can do that. But where can I find archived builds for download? The download page provides only the two most recent dailies and the two latest production builds.

EDIT:
Checked the FTP too, but there's only 20.5.333 available on it.
Edited by ajz3d - Sept. 10, 2024 07:20:56
Staff
530 posts
Joined: May 2019
It can be done via the launcher (specifically the --version argument)
https://www.sidefx.com/docs/houdini/ref/utils/launcher.html

Let me know if this works for you, if not I'll check with our release/web guys
Member
570 posts
Joined: Aug. 2014
brians
It can be done via the launcher (specifically the --version argument)
https://www.sidefx.com/docs/houdini/ref/utils/launcher.html

Let me know if this works for you, if not I'll check with our release/web guys
If it's not a problem, I'd rather use legacy installers.
Staff
530 posts
Joined: May 2019
We're working on a solution to make this happen. We might be in contact over email; give us a few days in any case. Thanks!
Staff
530 posts
Joined: May 2019
I think we'll organize some kind of FTP.
To be sure, Artur, you're on Linux?
Member
570 posts
Joined: Aug. 2014
Yes, on Debian Bookworm.
Staff
530 posts
Joined: May 2019
It's been arranged that you can download the builds via FTP. I've given details on the bug ticket. Let me know if they don't make their way through to you or if you have an issue with the FTP. It's best to communicate via the bug/ticket system going forward.
Thanks Artur
Member
570 posts
Joined: Aug. 2014
Thank you, Brian. I just checked my e-mail, but the details haven't reached me yet. I'll check again tomorrow, on a workday.