Mantra vs AMD Threadripper 2990wx
- Priest_kod
- Member
- 6 posts
- Joined: October 2013
- Offline
First of all, the PC spec: Threadripper 2990WX, 128 GB memory, GTX Titan X. Nothing is overclocked; everything runs at stock settings.
Problem: Mantra renders slower when all 32 cores (64 threads) are active, but faster when only 16 cores (32 threads) are active. 32 cores: 2.54 min; 16 cores: 2.11 min. Meanwhile Arnold, in Maya 2018 Update 4, behaves more predictably: more cores, more speed. 32 cores: 1.26 min; 16 cores: 2.17 min. Screenshots and test scenes are attached.
UPD. There is an interesting situation with SMT (simultaneous multithreading). With SMT off and 32 active cores, Mantra is much faster: 1.53 min vs 2.54 min (SMT on).
With 16 active cores and SMT off: 2.30 min vs 2.11 min (SMT on).
Bottom line:
Mantra - 16 core, SMT on - 2.11 min.
Mantra - 16 core, SMT off - 2.30 min.
Mantra - 32 core, SMT on - 2.54 min.
Mantra - 32 core, SMT off - 1.53 min.
And after some tweaks in the BIOS (not direct CPU overclocking):
Mantra - 32 core, SMT on - 2.19 min.
Mantra - 32 core, SMT off - 1.35 min.
Edited by Priest_kod - November 17, 2018 22:11:28
- tomsvfx
- Member
- 54 posts
- Joined: August 2011
- Offline
- AdamJ
- Member
- 268 posts
- Joined: July 2005
- Offline
- malexander
- Staff
- 5207 posts
- Joined: July 2005
- Offline
There's a few things going on here that affect performance.
First, SMT does not double performance. You might get an extra 10% because the two threads utilize the core better, filling one thread's idle execution units with work from the other (maybe). The downside is that each thread also consumes cache and memory bandwidth, so twice as many threads means less cache per thread: more cache misses, and therefore more high-latency trips to main memory. And the more misses there are, the more memory bandwidth each thread uses, when you already have twice as many threads using it.
Second, a threaded job is made of two parts: the threaded part (A) and the single-threaded part (B). So if you're looking at total render time, it's
total time = time(A)/#threads + time(B)
As the number of threads increases from 1, the time taken to do A halves, halves again, and so on. Assuming ideal scaling (more on that in a moment), at 32 threads A takes about 3% of its single-thread time, and at 64 about 1.5% (but not really, because SMT isn't 2x faster). What happens is that the total time begins approaching time(B), the single-threaded part. You get a steep 1/#threads falloff that settles onto a plateau where adding threads doesn't really improve much anymore. This is known as Amdahl's law, and the only way around it is to optimize B as much as possible.
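To make the formula concrete, here is a minimal Python sketch of the ideal-scaling model. The 95 s / 5 s split between A and B is a made-up illustration, not something measured from mantra, and the function names are hypothetical.

```python
def total_time(time_a: float, time_b: float, threads: int) -> float:
    """Ideal-scaling estimate: total time = time(A)/#threads + time(B)."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    return time_a / threads + time_b


def speedup(time_a: float, time_b: float, threads: int) -> float:
    """Speedup over a single thread under the same model."""
    return total_time(time_a, time_b, 1) / total_time(time_a, time_b, threads)


if __name__ == "__main__":
    # Hypothetical split, NOT measured: 95 s of threaded work, 5 s single-threaded.
    time_a, time_b = 95.0, 5.0
    for n in (1, 2, 4, 8, 16, 32, 64):
        t = total_time(time_a, time_b, n)
        s = speedup(time_a, time_b, n)
        print(f"{n:3d} threads: {t:6.2f} s total, {s:5.2f}x speedup")
```

Under this model the speedup can never exceed (time(A) + time(B)) / time(B), no matter how many threads are added, which is the plateau described above.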
The last problem is thread contention. Anything that needs temporary exclusive access to a resource chips away at the speedup as threads wait for it. Even if the code got rid of every exclusive resource, there would still be waits at the system level. This problem also gets worse as the number of threads increases.
As a user, you can do a few things to mitigate this: 1) Run the optimum number of threads for the job (it often takes a bit of testing to determine this). 2) Run multiple mantra jobs with a lower thread count each, which reduces the performance loss from the third point (but still runs into the performance issues from the first). 3) Disable SMT.
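As a rough illustration of option 2, the sketch below reuses the simple time(A)/#threads + time(B) model to estimate the average time per finished frame when several renders run side by side. It is only a toy model: it ignores contention, cache pressure, and memory bandwidth (which is what actually hurts in practice), it assumes every job loads its own copy of the scene, and the 95 s / 5 s numbers and the function name are hypothetical.

```python
def seconds_per_frame(time_a: float, time_b: float,
                      total_threads: int, jobs: int) -> float:
    """Average wall-clock seconds per finished frame when `jobs` renders
    run concurrently, each with total_threads // jobs threads."""
    threads_per_job = total_threads // jobs
    if jobs < 1 or threads_per_job < 1:
        raise ValueError("need at least one job and one thread per job")
    # Each frame takes longer on fewer threads...
    frame_time = time_a / threads_per_job + time_b
    # ...but `jobs` frames finish in that same wall-clock window.
    return frame_time / jobs


if __name__ == "__main__":
    # Hypothetical split, NOT measured: 95 s threaded, 5 s single-threaded.
    for jobs in (1, 2, 4):
        avg = seconds_per_frame(95.0, 5.0, 32, jobs)
        print(f"{jobs} job(s) x {32 // jobs} threads: {avg:.2f} s per frame")
```

In this model the single-threaded part of one frame overlaps with the threaded part of another, so total throughput improves even though each individual frame renders more slowly. Whether that holds on a real machine depends on having enough RAM for several scene copies, so it is worth testing rather than assuming.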
Hope that helps!
- jsmack
- Member
- 8041 posts
- Joined: September 2011
- Offline
https://www.anandtech.com/show/13124/the-amd-threadripper-2990wx-and-2950x-review/8 [www.anandtech.com]
The 2990WX shows paradoxical performance in some tests vs the 2950X (32 vs 16 cores). This may be due to what twod said, but there are also factors related to Threadripper's power delivery and core interconnect that may result in less-than-ideal performance. The 2990WX has a 250 W TDP, which for this chip acts like a power budget shared between cores.
https://www.anandtech.com/show/13124/the-amd-threadripper-2990wx-and-2950x-review/12 [www.anandtech.com]
The more cores that are busy, the less power available to each core.
There might be something worth investigating further with Mantra, since most rendering tests did not show a paradoxical relationship between core count and performance with SMT enabled. I have heard of Threadripper underperforming on Windows with more than 32 threads. Are you able to test the platform under Linux?
Thanks for your testing, this information is helpful.
- Tom Freitag
- Member
- 84 posts
- Joined: July 2013
- Offline
Thanks for posting your 2990WX experiences, and thanks for the in-depth answers.
I am also deciding between the 2950X and the 2990WX. What I am also interested in is heavy RAM load (geometry) with Mantra, and how it would perform with pyro/FLIP at a low separation size (RAM-heavy simulations). Most render benchmarks are very lightweight.
Edited by Tom Freitag - October 3, 2018 11:26:59
- Priest_kod
- Member
- 6 posts
- Joined: October 2013
- Offline
- Lyr
- Member
- 66 posts
- Joined:
- Offline
- RobW
- Member
- 147 posts
- Joined: March 2014
- Offline
- The3dcreator
- Member
- 2 posts
- Joined: July 2015
- Offline
RobW, is this what you mean?
Dynamic Local Mode will be available starting October 29th. Besides games, I read somewhere that it would fix some earlier memory access issues.
https://community.amd.com/community/gaming/blog/2018/10/05/previewing-dynamic-local-mode-for-the-amd-ryzen-threadripper-wx-series-processors [community.amd.com]
Edited by The3dcreator - October 22, 2018 01:10:56
- anon_user_40689665
- Member
- 648 posts
- Joined: July 2005
- Offline
- Heileif
- Member
- 180 posts
- Joined: January 2015
- Offline
- psanitra
- Member
- 20 posts
- Joined: February 2017
- Offline
I have a 2990WX with 128 GB RAM and have been using it for about 2 months now, running sims and rendering 24/7. It's beastly, but not for every user. Windows performance is much worse than Linux in pretty much every task. I get 10-30% better speeds in pyro/FLIP/grains etc. on Linux Mint compared to Windows 10. Mantra also scales much better on Linux.
I have seen absolutely zero speed improvement after the Dynamic Local Mode update using Ryzen Master. That goes for benchmarks on sims and for rendering (Mantra, V-Ray, Corona).
Now to Mantra. Scaling with cores is bad, mostly on Windows. There is pretty much no difference between 32 and 64 cores. And I'll be bold: let's not blame this completely on the CPU. We know it has its limits in memory bandwidth and latency, plus the terrible Windows scheduler, but that's it. Other render engines scale much better with this CPU in the very same environment, but Mantra doesn't….
I have used V-Ray for 13 years, and it scales pretty much linearly on this CPU with normal scenes, with somewhat degraded performance on super heavy scenes (hundreds of millions of polys plus displacement, etc.).
A small benchmark scene with 6 cylinders like this should not see any performance degradation at all. I wish the devs would look into this and optimize Mantra more for multi-NUMA CPUs. Multiple NUMA nodes may well become the new trend, simply because raising core frequency has become very problematic.
By the way, your scene renders in 55 seconds on Linux Mint on my 2990WX at 3.8 GHz on all cores.
Edited by psanitra - November 12, 2018 16:35:05
- Priest_kod
- Member
- 6 posts
- Joined: October 2013
- Offline
- pbowmar
- Member
- 7046 posts
- Joined: July 2005
- Offline
On Linux (Centos 7.5) with 24 core (48 virtual) Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz rendering the file to a local disk to avoid network contention.
All times “Total Wall Clock”, all load times 3/4 of a second. Tile size 16 for all.
-48 threads: 01:11
-32 threads: 01:16
-24 threads: 01:22 0.293 secs/thread
-16 threads: 01:52 0.143 secs/thread
-12 threads: 02:27 0.08 secs/thread
In theory, running 2 frames on my local box is optimal with “hyperthreading” giving almost no help.
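For anyone who wants to quantify the scaling in the timings above, here is a small Python sketch that converts the wall-clock times to seconds and computes speedup and parallel efficiency relative to the 12-thread run. The choice of 12 threads as the baseline is an assumption (it is simply the smallest run listed), and the function names are made up for this example.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'mm:ss' wall-clock string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Thread count -> total wall clock, copied from the list above.
MEASURED = {48: "01:11", 32: "01:16", 24: "01:22", 16: "01:52", 12: "02:27"}


def scaling_table(measured: dict, baseline_threads: int) -> list:
    """Return (threads, seconds, speedup, efficiency) rows, where efficiency
    is the speedup divided by the ideal speedup over the baseline run."""
    base = to_seconds(measured[baseline_threads])
    rows = []
    for threads in sorted(measured):
        elapsed = to_seconds(measured[threads])
        speedup = base / elapsed
        efficiency = speedup / (threads / baseline_threads)
        rows.append((threads, elapsed, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for threads, elapsed, speedup, efficiency in scaling_table(MEASURED, 12):
        print(f"{threads:2d} threads: {elapsed:3d} s, "
              f"{speedup:.2f}x vs 12 threads, {efficiency:.0%} efficiency")
```

An efficiency near 100% means the extra threads are paying for themselves; a sharp drop past the physical core count is the pattern you would expect if SMT threads add little.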
FWIW
Cheers,
Peter B
Cheers,
Peter Bowmar
____________
Houdini 20.5.262 Win 10 Py 3.11
- pbowmar
- Member
- 7046 posts
- Joined: July 2005
- Offline
At home, Suse 42.3, with 4 core machine (8 virtual) Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz (i.e. it's ancient) I get:
-8 threads: 5:18 .02 secs/thread
So, it's cheaper for me to buy 5 more of these processors and a cheap mobo/ram combo than one of the E5-2687w cpus bare
Unless I'm miscalculating something but I don't think I am…
Cheers,
Peter B
- anon_user_40689665
- Member
- 648 posts
- Joined: July 2005
- Offline
Heileif
G.Skill Trident Z F4-3200C16Q-64GTZ 64GB (4x16GB) PC4-25600 (3200MHz)
What is your ram speed?
Interesting what a difference memory can make. Here are two 1950X Threadrippers with the same clock speed and almost the same amount of RAM, but different memory speeds:
link [browser.geekbench.com]
- Noboru_Garcia
- Member
- 75 posts
- Joined: July 2013
- Offline
pbowmar
On Linux (Centos 7.5) with 24 core (48 virtual) Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz rendering the file to a local disk to avoid network contention.
All times “Total Wall Clock”, all load times 3/4 of a second. Tile size 16 for all.
-48 threads: 01:11
-32 threads: 01:16
-24 threads: 01:22 0.293 secs/thread
-16 threads: 01:52 0.143 secs/thread
-12 threads: 02:27 0.08 secs/thread
In theory, running 2 frames on my local box is optimal with “hyperthreading” giving almost no help.
FWIW
Cheers,
Peter B
I thought changing the number of buckets to the number of threads was better for performance. Have you experimented?
- “spooky action at a distance”. Albert Einstein
- pbowmar
- Member
- 7046 posts
- Joined: July 2005
- Offline
Noboru Garcia
I thought changing the number of buckets to the number of threads was better for performance. Have you experimented?
Yes I tried a few different bucket sizes, it maybe made a second or two's difference, not enough to bother writing it down
- Mjag007
- Member
- 8 posts
- Joined: February 2015
- Offline