Karma XPU - dual 4090 RTX setup - performance issues

Member
28 posts
Joined: October 2015
I'm running 2 × RTX 4090 plus one 128-core Threadripper, but compared to running the same system with only one RTX 4090 I'm seeing only a 50% increase in speed. Is this normal, or is there anything I can do to squeeze more juice out of it? I'm currently running Houdini 19.5.716.

cheers
Edited by timjan - November 12, 2023 05:28:10
Staff
531 posts
Joined: May 2019
That is a beast of a CPU

You can enable/disable different types of devices
https://www.sidefx.com/docs/houdini/solaris/karma_xpu.html#disablingdevices

It would be great to get some render times from you for (eg)...
- CPUdevice=on GPUdevice0=off GPUdevice1=off
- CPUdevice=off GPUdevice0=on GPUdevice1=off
- CPUdevice=off GPUdevice0=off GPUdevice1=on
- CPUdevice=off GPUdevice0=on GPUdevice1=on
- CPUdevice=on GPUdevice0=on GPUdevice1=on

This way we can verify that the performance gain from using one or two GPUs is correct/expected.
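
As a rough sketch of how those timings could be read: if each device works independently, the ideal combined throughput is just the sum of the individual throughputs, so the measured combined time can be compared against that ideal. The function names and the numbers below are placeholders for illustration, not measurements from this thread's hardware.

```python
def ideal_combined_time(single_times):
    """Ideal render time (seconds) if devices that take `single_times`
    seconds on their own run fully in parallel with no overhead,
    i.e. their throughputs (1 / time) simply add up."""
    return 1.0 / sum(1.0 / t for t in single_times)


def scaling_efficiency(single_times, measured_combined_time):
    """Fraction of the ideal combined throughput actually achieved."""
    return ideal_combined_time(single_times) / measured_combined_time


# Placeholder timings in seconds (NOT real measurements):
gpu0_only = 100.0
gpu1_only = 100.0
both_gpus = 100.0 / 1.5  # "50% faster" than a single GPU

print(round(ideal_combined_time([gpu0_only, gpu1_only]), 2))            # 50.0
print(round(scaling_efficiency([gpu0_only, gpu1_only], both_gpus), 2))  # 0.75
```

With two identical cards, a 50% speedup corresponds to roughly 75% of the ideal two-GPU throughput under this simple model.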

thanks
Member
28 posts
Joined: October 2015
Yes, it's indeed powerful for all things CPU-related, and I'm very pleased with the system. Thanks for the update, I will check it out!
Member
14 posts
Joined: January 2017
If it works like Radeon ProRender, only the primary card (an RTX 4090 in my case; the secondary is a 7900 XTX, and yes, surprisingly, that actually works) is allowed to do work on samples above the minimum. Adaptive samples require on-the-spot decisions based on rays/photon casts from the rest of the scene, depending on how it's being done, whereas the non-adaptive samples only require the data resulting from those, which isn't really needed until all the samples are mixed together (so it probably doesn't need to be transferred to the other card at all). My guess is that it's too slow to be beneficial without something like NVLink, which NVIDIA conveniently killed off on everything below the $8500 L40 in Ada cards. That, or that's simply as much work as can be offloaded from the CPU. The 4090 only has 256 total FP64 cores out of the 16,000-something CUDA cores, and they're spread out among all SMs, which makes it impossible to get any kind of cache locality working with them, so anything that needs higher than FP32 precision is probably ending up on the CPU.
Member
14 posts
Joined: February 2022
GnomeToys
If it works like Radeon ProRender, only the primary card (an RTX 4090 in my case; the secondary is a 7900 XTX, and yes, surprisingly, that actually works) is allowed to do work on samples above the minimum. Adaptive samples require on-the-spot decisions based on rays/photon casts from the rest of the scene, depending on how it's being done, whereas the non-adaptive samples only require the data resulting from those, which isn't really needed until all the samples are mixed together (so it probably doesn't need to be transferred to the other card at all). My guess is that it's too slow to be beneficial without something like NVLink, which NVIDIA conveniently killed off on everything below the $8500 L40 in Ada cards. That, or that's simply as much work as can be offloaded from the CPU. The 4090 only has 256 total FP64 cores out of the 16,000-something CUDA cores, and they're spread out among all SMs, which makes it impossible to get any kind of cache locality working with them, so anything that needs higher than FP32 precision is probably ending up on the CPU.

Referring to the above quote, could anyone tell me where and under what circumstances Karma XPU uses higher than fp32 precision? Sorry, maybe it's basic knowledge, but I'm a bit lost on this point.

Thanks in advance if anyone can drop some info!
Staff
531 posts
Joined: May 2019
Polybud
Referring to the above quote, could anyone tell me where and under what circumstances Karma XPU uses higher than fp32 precision?

Sorry, I got lost reading GnomeToys' reply, but I'll try to briefly cover XPU's GPU/multi-device architecture, which will hopefully clear things up.

XPU treats each device (including the CPU device) as a separate entity. There is no memory sharing between devices. They do not know about each other or communicate. They each have a separate copy of the scene data.

XPU instructs each of them to render separate passes of the image (some will do this faster than others), which it receives and blends into the final image in whatever order they arrive.

This is a failsafe architecture: it doesn't matter what combination of devices someone has, or whether (e.g.) one of them fails, we still end up with the same final result.

For this to work, each type of device needs to produce the EXACT same result (including the CPU device). To this end we only use fp32 calculations across all devices.
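
To illustrate the idea (this is a toy sketch, not XPU's actual code): each pass is a complete, independent estimate of the image, and blending is a running average, so the order in which passes arrive does not change the result beyond floating-point rounding. `PassBlender` and `render_pass` are made-up names, and the "render" here is just seeded random noise.

```python
import math
import random


class PassBlender:
    """Running average of full-image passes, blended in arrival order."""

    def __init__(self, num_pixels):
        self.total = [0.0] * num_pixels
        self.count = 0

    def add_pass(self, pixels):
        for i, value in enumerate(pixels):
            self.total[i] += value
        self.count += 1

    def image(self):
        return [t / self.count for t in self.total]


def render_pass(pass_index, num_pixels):
    """Stand-in for one device rendering one pass. The result depends only
    on the pass index, never on which device happened to render it."""
    rng = random.Random(pass_index)
    return [rng.random() for _ in range(num_pixels)]


passes = [render_pass(i, 4) for i in range(8)]

in_order = PassBlender(4)
for p in passes:
    in_order.add_pass(p)

# Same passes arriving in a different order (e.g. a slower device finishing late).
out_of_order = PassBlender(4)
for p in reversed(passes):
    out_of_order.add_pass(p)

same = all(math.isclose(a, b)
           for a, b in zip(in_order.image(), out_of_order.image()))
print(same)  # True
```

The comparison uses `math.isclose` because float addition is not strictly associative; the blended images agree up to rounding.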
Edited by brians - February 24, 2024 02:19:45
Member
3 posts
Joined: January 2016
Hello!

I recently added a second identical GPU to my rig. Like @timjan, I'm finding that my Karma XPU renders are only ~60% faster. I had expected them to double in speed, as this is the behavior I've seen in Octane. Is this expected? CUDA usage is maxed out for both cards in the performance monitor.

Reading @brians' response: is there a setting that makes each card responsible for its own frame in a multi-frame sequence, rather than combining passes as described? Would this provide the expected performance gain?

Thanks,

Ry
Member
20 posts
Joined: February 2017
Hi Brian,

I have a similar setup with 2 × RTX 4090 (identical cards) and a 64-core AMD 7980X. Both 4090s are in x16 slots. I'm rendering GPU-only (Embree device disabled via environment variable). I'm also seeing reduced performance on one of the cards.

It manifests gradually as scenes get more complex. Simple scenes render 50:50 or 49:51, with almost the same number of passes on each card. Slightly more complex scenes render 45:55, and really complex scenes render 35:65. One card can do 100 passes and the other just 60.

I looked into the issue a fair bit but could not find the cause. It's not related to the OS (Windows 11); the same issue occurs on Linux, and with different drivers. It's not a PCIe slot/motherboard/hardware-level issue, because rendering on each card separately (via the OptiX device environment variable) results in the same render time and pass count. The issue only happens when both cards are rendering. Thermals are not an issue; both cards are water-cooled and hover around 70°C under load. GPU utilisation is not an issue either; when I check with GPU-Z, both cards are fully utilised. It's a mystery to me how two cards that are both at 100% utilisation can produce 100 and 60 passes. Is it possible that Karma's stats reporting is broken? But why would that vary between scenes? Too many questions...

This performance loss seems to be specific to the 20.5.x versions of Karma XPU. I can render a heavy scene on 20.0.653 with 48:52 utilisation, and the same file on 20.5.307 with 32:68 utilisation.

Is there anything else worth trying that will help us locate what's causing this?

Thanks!
Edited by psanitra - August 1, 2024 08:30:40
Staff
531 posts
Joined: May 2019
psanitra
Is there anything else worth trying that will help us locate what's causing this?

Sure.
Try disabling adaptive sampling.
You can do that by setting the Pixel Oracle to "uniform" in the Karma render settings, or by using this undocumented environment variable:
KARMA_XPU_DISABLE_PIXEL_ORACLE=1

Try that and see if you get different behavior.
After that, you can also try assigning additional threads to the CPU blending stage by increasing this envvar from its default of 1 to 2 or 3 (or whatever), e.g.
KARMA_XPU_NUM_PER_DEVICE_BLENDING_THREADS=3
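
If you launch renders from a script, one possible way to apply both variables is to build a modified environment and hand it to the process you start. This is only a sketch: `xpu_debug_env` is a made-up helper name, and the launch command itself is left out because it depends on your setup.

```python
import os


def xpu_debug_env(disable_pixel_oracle=True, blending_threads=3, base_env=None):
    """Return a copy of the environment with the two Karma XPU debugging
    variables mentioned above set. The caller's environment is not modified."""
    env = dict(os.environ if base_env is None else base_env)
    if disable_pixel_oracle:
        env["KARMA_XPU_DISABLE_PIXEL_ORACLE"] = "1"
    env["KARMA_XPU_NUM_PER_DEVICE_BLENDING_THREADS"] = str(blending_threads)
    return env


env = xpu_debug_env(base_env={})
print(env["KARMA_XPU_DISABLE_PIXEL_ORACLE"])             # 1
print(env["KARMA_XPU_NUM_PER_DEVICE_BLENDING_THREADS"])  # 3
```

The returned dictionary can be passed as `env=` to `subprocess.run` with whatever render command you normally use.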
Member
20 posts
Joined: February 2017
Thank you. I tried those environment variables and others (listed in the docs) and found out what the issue is. It seems to be connected to viewport/live rendering, meaning rendering directly in Houdini (in the Solaris desktop, for example). Performance really tanks when rendering like that. As a test scene I took the Warehouse from the content library, so it should be easy to replicate.

Rendering with the same settings and samples: to viewport 7:33 and to mplay 4:14. Utilisation to mplay is a perfect 50:50, but rendering in the viewport the performance is 31:68, or 160 vs 352 passes. That's close to half speed for just one of the cards. Images of both renders are attached. Hope that helps to find out what the issue is.
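
For reference, the arithmetic behind those numbers can be checked with a few lines of Python (the helper names are just for this example):

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def pass_share(passes):
    """Percentage of the total pass count rendered by each device."""
    total = sum(passes)
    return [100.0 * p / total for p in passes]


viewport = to_seconds("7:33")  # 453 seconds
mplay = to_seconds("4:14")     # 254 seconds

print(round(viewport / mplay, 2))  # 1.78
print(pass_share([160, 352]))      # [31.25, 68.75]
```

So the mplay render is about 1.78 times faster overall, and in the viewport render one card contributes about 31% of the passes.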


Thanks!
Edited by psanitra - August 6, 2024 15:39:06

Attachments:
LiveRender.jpg (649.4 KB)
RenderToMplay.jpg (480.4 KB)

Member
96 posts
Joined: August 2017
It's also interesting to see that the mplay render is significantly faster. That's not just one card doing 50% less than the other one; that alone would not result in a 1.8x speed increase.
Staff
531 posts
Joined: May 2019
ronald_a
It's also interesting to see that the mplay render is significantly faster

IPR is typically slower to render than offline (i.e. via mplay) for a few reasons:
- IPR has fewer threads available to it, and they are set to a lower priority (to allow Houdini to keep functioning)
- IPR has mechanisms in place to allow for fast viewport updates
- etc.

One thing, try setting the "IPR DownSample Factor" display option from 2 to 0, it may help with IPR rendering speeds.

psanitra
mplay 4:14. Utilisation to mplay is a perfect 50:50

This at least shows it has nothing to do with the internal XPU rendering code.

psanitra
rendering in the viewport the performance is 31:68, or 160 vs 352 passes.

I don't know why, but my guess is that OpenGL/Vulkan/UI is doing stuff on one of the GPUs.
Member
64 posts
Joined: March 2012
Do you use the husk command to render?

Reading @brians' response: is there a setting that makes each card responsible for its own frame in a multi-frame sequence, rather than combining passes as described? Would this provide the expected performance gain?

As subsampler mentioned, if you could render with one frame per GPU, the total rendering time might decrease.
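
A minimal sketch of the bookkeeping for that approach, assuming you start one render process per GPU yourself: frames are dealt out round-robin and each process is restricted to one card. `assign_frames` and `gpu_env` are made-up helper names. Using `CUDA_VISIBLE_DEVICES` (a standard CUDA variable) is an assumption here; whether Karma XPU honors it is not confirmed in this thread, and the per-device settings in the "disabling devices" docs linked earlier may be the better route.

```python
def assign_frames(first_frame, last_frame, num_gpus):
    """Deal frames out round-robin: with 2 GPUs, GPU 0 gets every other
    frame starting at first_frame, GPU 1 gets the frames in between."""
    jobs = {gpu: [] for gpu in range(num_gpus)}
    for offset, frame in enumerate(range(first_frame, last_frame + 1)):
        jobs[offset % num_gpus].append(frame)
    return jobs


def gpu_env(gpu_index):
    """Environment overrides for a process that should only see one GPU.
    ASSUMPTION: the renderer respects CUDA_VISIBLE_DEVICES."""
    return {"CUDA_VISIBLE_DEVICES": str(gpu_index)}


print(assign_frames(1, 6, 2))  # {0: [1, 3, 5], 1: [2, 4, 6]}
print(gpu_env(1))              # {'CUDA_VISIBLE_DEVICES': '1'}
```

Each per-GPU process would then render only its own frame list, so no pass blending between cards is involved.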
User Avatar
Member
96 posts
Joined: 8月 2017
Offline
brians
One thing, try setting the "IPR DownSample Factor" display option from 2 to 0, it may help with IPR rendering speeds.



where does one find that setting?
Staff
531 posts
Joined: May 2019
ronald_a
where does one find that setting?

In IPR (in solaris)
- make sure XPU is active
- click on the viewport
- press "d" to bring up the display options dialog
- the setting should be visible

Attachments:
ipr_downsample_factor.PNG (54.8 KB)

Member
20 posts
Joined: February 2017
brians
I don't know why, but my guess is that OpenGL/Vulkan/UI is doing stuff on one of the GPUs.

I can confirm that the Vulkan viewport (default in 20.5) is the cause of the performance loss when rendering in Solaris. After switching Houdini back to OpenGL, the cards run at 49:50 utilisation (Warehouse scene).
Member
172 posts
Joined: May 2021
Just to note, Karma is doing A LOT better after updates on Mac Silicon aside from the lack of Viewport COPS.

psanitra
I can confirm that the Vulkan viewport (default in 20.5) is the cause of the performance loss when rendering in Solaris.

Khronos has another API standard, other than Vulkan, called ANARI that Solaris would really benefit from. It seems to be the long-term API for Rendering with a lot of adoption from National Labs, Kitware, AMD, NVIDIA, Intel, and there are a few USD "Devices" as well as experimental USD Hydra Delegates.

It's meant to scale from laptops to HPC clusters.

https://www.khronos.org/anari/

Approximate quote: "In the keynote on Vulkan yesterday, it was said that Vulkan is not for the faint of heart. So that cued me up to say: 'ANARI, for the faint of heart'."

Edited by PHENOMDESIGN - August 7, 2024 12:42:04