Mantra vs AMD Threadripper 2990wx

User Avatar
Member
6 posts
Joined: October 2013
Offline
First of all, the PC spec: Threadripper 2990WX, 128 GB memory, GTX Titan X. Nothing is overclocked; everything runs at stock settings.
Problem: Mantra renders slower when all 32 cores (64 threads) are active than when only 16 cores (32 threads) are active. 32 cores: 2.54 min; 16 cores: 2.11 min. In the same test, Arnold in Maya 2018 Update 4 behaves more predictably: more cores, more speed. 32 cores: 1.26 min; 16 cores: 2.17 min. Screenshots and test scenes are attached.

UPD. An interesting situation with SMT (simultaneous multithreading). With SMT off and 32 active cores, Mantra is much faster: 1.53 min vs 2.54 min (SMT on).
With 16 active cores and SMT off: 2.30 min vs 2.11 min (SMT on).

Bottom line.
Mantra - 16 core, SMT on - 2.11 min.
Mantra - 16 core, SMT off - 2.30 min.
Mantra - 32 core, SMT on - 2.54 min.
Mantra - 32 core, SMT off - 1.53 min.

And after some tweaks in the BIOS (not direct CPU overclocking):
Mantra - 32 core, SMT on - 2.19 min.
Mantra - 32 core, SMT off - 1.35 min.
Edited by Priest_kod - November 17, 2018 22:11:28

Attachments:
mantra_16core_default.png (628.1 KB)
mantra_32core_default.png (625.0 KB)
arnold_16core_default.png (1.6 MB)
arnold_32core_default.png (1.6 MB)
2990wx_arnold_test.mb (104.9 KB)
2990wx_mantra_test.hip (1.5 MB)

User Avatar
Member
54 posts
Joined: August 2011
Offline
Have you tried different bucket sizes?
That can sometimes change render speed too (I have noticed that in other render engines).
User Avatar
Member
268 posts
Joined: July 2005
Offline
Thanks for posting this! I'm looking into getting the 32-core, and it's great to see how it performs with Houdini.
User Avatar
Staff
5212 posts
Joined: July 2005
Offline
There are a few things going on here that affect performance.

First is that SMT is not a doubling of performance. You might get an extra 10% as the 2 threads are able to utilize the CPU better, filling in the idle units of one thread with work from another (maybe). The downside is that a thread also takes memory and memory bandwidth, so having twice as many threads means you have less cache to work with per thread: more cache misses, so more higher-latency calls to main memory. And the more misses, the more memory bandwidth each thread is using, and you already have twice as many threads using it.

Second is that a threaded job is made of parts - the threaded part (A), and the single threaded part (B). So if you're looking at total render time, it's

total time = time(A)/#threads + time(B)

As the number of threads doubles, the time taken to do A halves, halves again, etc. Assuming ideal scaling (more on that in a moment), at 32 threads A takes about 3% of the time it takes on 1 thread, and at 64 about 1.5% (but not really, 'cause SMT isn't 2x faster). What happens is that the total time begins approaching time(B), the single-threaded stuff. You get a steep initial falloff that settles onto a plateau where adding threads doesn't really improve much anymore. This is known as Amdahl's law, and the only way to get around it is to optimize B as much as possible.
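The formula above can be sketched in a few lines of Python. This is only an illustration of the ideal-scaling model: the 950 s / 50 s split between A and B is a made-up example, not a measurement of Mantra, and the function names are hypothetical.

```python
def total_time(time_a, time_b, threads):
    """Ideal-scaling model from above: time(A)/#threads + time(B)."""
    return time_a / threads + time_b


def speedup(time_a, time_b, threads):
    """How many times faster the job is than on a single thread."""
    return total_time(time_a, time_b, 1) / total_time(time_a, time_b, threads)


# Hypothetical job: 950 s of threadable work (A), 50 s of serial work (B).
TIME_A, TIME_B = 950.0, 50.0

for n in (1, 2, 4, 8, 16, 32, 64):
    t = total_time(TIME_A, TIME_B, n)
    print(f"{n:2d} threads: {t:7.1f} s, speedup {speedup(TIME_A, TIME_B, n):5.2f}x")
```

However many threads are added, the speedup in this model can never exceed (time(A) + time(B)) / time(B), which is 20x for the example numbers.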

The last problem is thread contention. Anything that needs exclusive temporary access to a resource can chip away at the speedup as threads wait. Even with all exclusive resources removed from the code, there will still be waits at the system level. This problem also gets worse as the number of threads increases.

As a user, you can do a few things to fix this. 1) Run the optimum number of threads for the job (it often needs a bit of testing to determine this). 2) Run multiple mantra jobs with a lower thread count, which reduces the performance loss from the third point (but still runs into the performance issues from the first). 3) Disable SMT.
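For point 1, the testing can be automated with a small timing harness. The sketch below is an assumption-laden outline, not a Mantra-specific tool: `placeholder_cmd` is a stand-in that just starts a Python interpreter, and you would replace it with your own render command line, passing `n` to the renderer's thread-count option.

```python
import subprocess
import sys
import time


def time_command(cmd):
    """Run cmd (a list of arguments) and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def sweep_thread_counts(build_cmd, thread_counts):
    """Time one run per thread count; build_cmd(n) returns the command to run."""
    return {n: time_command(build_cmd(n)) for n in thread_counts}


def fastest(timings):
    """Return the thread count with the lowest measured time."""
    return min(timings, key=timings.get)


def placeholder_cmd(n):
    # Hypothetical stand-in so the sketch runs anywhere; it ignores n.
    # Replace with your render command and its thread-count option.
    return [sys.executable, "-c", "pass"]


if __name__ == "__main__":
    timings = sweep_thread_counts(placeholder_cmd, [8, 16, 32, 64])
    print("fastest thread count:", fastest(timings))
```

Running each configuration more than once and keeping the best time helps to filter out disk caching and background load.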

Hope that helps!
User Avatar
Member
8045 posts
Joined: September 2011
Offline
https://www.anandtech.com/show/13124/the-amd-threadripper-2990wx-and-2950x-review/8 [www.anandtech.com]

The 2990WX shows paradoxical performance in some tests vs the 2950X (32 vs 16 cores). This may be due to what twod said, but there are also factors related to Threadripper's power delivery and core interconnect that may result in less than ideal performance. The 2990WX has a 250W TDP, which for this chip acts like a power budget shared between cores.

https://www.anandtech.com/show/13124/the-amd-threadripper-2990wx-and-2950x-review/12 [www.anandtech.com]

The more cores that are busy, the less power available to each core.

There might be something worth investigating further with Mantra, since most rendering tests did not show a paradoxical relationship between core count and render time with SMT enabled. I have heard of Threadripper under-performing on Windows with more than 32 threads. Are you able to test the platform under Linux?

Thanks for your testing, this information is helpful.
User Avatar
Member
84 posts
Joined: July 2013
Offline
Thanks for posting your 2990WX experiences and for the in-depth answers.
I am also deciding between the 2950 and the 2990. What I'm also interested in is heavy RAM load (geometry) with Mantra, and how it would perform with pyro/FLIP at a low separation size (RAM-heavy simulations). Most render benchmarks are very lightweight.
Edited by Tom Freitag - October 3, 2018 11:26:59
User Avatar
Member
6 posts
Joined: October 2013
Offline
jsmack
Are you able to test the platform under linux?
I only have a Windows 10 system.
Edited by Priest_kod - October 3, 2018 21:20:51
User Avatar
Member
66 posts
Joined:
Offline
Windows shits the bed with high thread processors.
You should really consider switching to Linux as your processor is being wasted in Windows.
Edited by Lyr - October 4, 2018 00:11:14
User Avatar
Member
147 posts
Joined: March 2014
Offline
Didn't AMD release a software suite for this? Basically sidestepping the Windows Taskmanager?
Pretty sure I read this somewhere….

rob
Apprentice Attribute / Houdini 17.0.381 / GTX 970 - driver 411.63
User Avatar
Member
2 posts
Joined: July 2015
Offline
RobW, is this what you mean?

Dynamic Local Mode is available starting October 29th. Besides games, I read somewhere that it would fix some earlier memory access issues.

https://community.amd.com/community/gaming/blog/2018/10/05/previewing-dynamic-local-mode-for-the-amd-ryzen-threadripper-wx-series-processors [community.amd.com]
Edited by The3dcreator - October 22, 2018 01:10:56
User Avatar
Member
648 posts
Joined: July 2005
Offline
I'm getting 1m44s in Linux, with an older 16-core 1950x, no OC.
User Avatar
Member
183 posts
Joined: January 2015
Offline
cpb
I'm getting 1m44s in Linux, with an older 16-core 1950x, no OC.
What is your RAM speed?

I got 2m 10s on my 1950X at stock.

RAM is 64 GB running at 2400 MHz. I know these CPUs can get a lot of boost from faster RAM.
User Avatar
Member
20 posts
Joined: February 2017
Online
I have a 2990WX with 128 GB RAM and have been using it for about 2 months now, 24/7 sims and rendering. It's beastly, but not for every user. Windows performance is much worse than Linux in pretty much every task. I get 10-30% better speeds in pyro/FLIP/grains etc. on Linux Mint compared to Win 10. Mantra speed scales much better on Linux too.

I have seen absolutely zero speed improvement after the Dynamic Local Mode update using Ryzen Master. That goes for benchmarks on sims and rendering (Mantra, V-Ray, Corona).

Now to Mantra. Scaling with cores is mostly bad on Windows: pretty much no difference between 32 and 64 threads. And I'll be bold: let's not blame this completely on the CPU. We know it has its limits in memory bandwidth and latency, plus the terrible Windows scheduler, but that's it. Other render engines scale much better with this CPU in the very same environment, but Mantra doesn't.

I have used V-Ray for 13 years, and it scales pretty much linearly on this CPU with normal scenes, with somewhat degraded performance on super heavy scenes (hundreds of millions of polys plus displacements etc.).

A small benchmark scene with 6 cylinders like this should not see performance degradation at all. I wish the devs could look into this and optimize Mantra more for multi-NUMA CPUs. It seems multiple NUMA nodes might become the new trend, simply because raising core frequency has become very problematic.

BTW, your scene renders on Linux Mint in 55 seconds on my 2990WX at 3.8 GHz on all cores.
Edited by psanitra - November 12, 2018 16:35:05
User Avatar
Member
6 posts
Joined: October 2013
Offline
psanitra, thank you very much for sharing your experience. I have been testing Linux Mint for several days, and I think I will use this OS for my workstation.
User Avatar
Member
7046 posts
Joined: July 2005
Offline
On Linux (Centos 7.5) with 24 core (48 virtual) Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz rendering the file to a local disk to avoid network contention.

All times “Total Wall Clock”, all load times 3/4 of a second. Tile size 16 for all.

-48 threads: 01:11
-32 threads: 01:16
-24 threads: 01:22 0.293 secs/thread
-16 threads: 01:52 0.143 secs/thread
-12 threads: 02:27 0.08 secs/thread

In theory, running 2 frames on my local box is optimal with “hyperthreading” giving almost no help.
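
One way to read the timings above is as speedup and parallel efficiency relative to the 12-thread run. The sketch below only does arithmetic on the wall-clock times quoted in this post; the choice of 12 threads as the baseline and the function names are assumptions made for illustration.

```python
# Wall-clock times quoted above, keyed by thread count (mm:ss).
measured = {48: "01:11", 32: "01:16", 24: "01:22", 16: "01:52", 12: "02:27"}


def to_seconds(mm_ss):
    """Convert an 'mm:ss' string to whole seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


def scaling_table(times, baseline=12):
    """Return (threads, speedup, efficiency) rows relative to the baseline run."""
    base = to_seconds(times[baseline])
    rows = []
    for threads in sorted(times):
        speedup = base / to_seconds(times[threads])
        ideal = threads / baseline  # speedup expected under perfect scaling
        rows.append((threads, speedup, speedup / ideal))
    return rows


for threads, speedup, efficiency in scaling_table(measured):
    print(f"{threads:2d} threads: {speedup:.2f}x vs 12 threads, "
          f"efficiency {efficiency:.0%}")
```

On these numbers, doubling from 24 to 48 threads only takes the render from 82 s to 71 s.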

FWIW

Cheers,

Peter B
Cheers,

Peter Bowmar
____________
Houdini 20.5.262 Win 10 Py 3.11
User Avatar
Member
7046 posts
Joined: July 2005
Offline
At home, Suse 42.3, with 4 core machine (8 virtual) Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz (i.e. it's ancient) I get:

-8 threads: 5:18 .02 secs/thread

So, it's cheaper for me to buy 5 more of these processors and a cheap mobo/ram combo than one of the E5-2687w cpus bare

Unless I'm miscalculating something but I don't think I am…

Cheers,

Peter B
Cheers,

Peter Bowmar
____________
Houdini 20.5.262 Win 10 Py 3.11
User Avatar
Member
648 posts
Joined: July 2005
Offline
Heileif
What is your ram speed?
G.Skill Trident Z F4-3200C16Q-64GTZ 64GB (4x16GB) PC4-25600 (3200MHz)

Interesting what a difference memory can make. Here are two 1950X Threadrippers, same speed, almost the same amount of RAM, different memory speed:
link [browser.geekbench.com]
User Avatar
Member
75 posts
Joined: July 2013
Offline
pbowmar
On Linux (Centos 7.5) with 24 core (48 virtual) Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz rendering the file to a local disk to avoid network contention.

All times “Total Wall Clock”, all load times 3/4 of a second. Tile size 16 for all.

-48 threads: 01:11
-32 threads: 01:16
-24 threads: 01:22 0.293 secs/thread
-16 threads: 01:52 0.143 secs/thread
-12 threads: 02:27 0.08 secs/thread

In theory, running 2 frames on my local box is optimal with “hyperthreading” giving almost no help.

FWIW

Cheers,

Peter B

I thought changing the number of buckets to the number of threads was better for performance. Have you experimented?
- “spooky action at a distance”. Albert Einstein
User Avatar
Member
7046 posts
Joined: July 2005
Offline
Noboru Garcia
I thought changing the number of buckets to the number of threads was better for performance. Have you experimented?

Yes, I tried a few different bucket sizes. It made maybe a second or two's difference, not enough to bother writing down.
Cheers,

Peter Bowmar
____________
Houdini 20.5.262 Win 10 Py 3.11
User Avatar
Member
8 posts
Joined: February 2015
Offline
Hello everyone,

So I'm in the same boat as some of the other people.

I'm about to do an upgrade and will pick between the 2950 and the 2990WX.

Most of the work will be sims and rendering on a Windows machine.

Any info would be great.