Multi-socket EPYC System can't reach above 50% load

ruxbat (Member, 7 posts, joined Dec. 2017)
Hi everyone,

I've built a dual-socket EPYC 9654 system. Running my workloads, I can't get above 50% load: all cores are active, but each only reaches about 50% usage. I tried turning off SMT in the BIOS (which halves the visible core count), but utilization stayed at 50%. I'm using the Pyro benchmark developed here (https://www.vfxarabia.co/post/houdini-benchmark-cores-vs-clockspeed-updated) as my baseline for testing. Since this is a dual-socket system, I'm wondering if there's something missing in my kernel config that's causing this. Any thoughts on where to start troubleshooting?

Some information about the system:

rux@rux ~ $ cat /proc/cpuinfo    (last of the 384 logical CPUs shown)
processor : 383
vendor_id : AuthenticAMD
cpu family : 25
model : 17
model name : AMD EPYC 9654 96-Core Processor
stepping : 1
microcode : 0xa101148
cpu MHz : 400.000
cache size : 1024 KB
physical id : 1
siblings : 192
core id : 79
cpu cores : 96
apicid : 415
initial apicid : 415
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
bogomips : 4802.25
TLB size : 3584 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 52 bits physical, 57 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

rux@rux ~ $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
400000
rux@rux ~ $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance
rux@rux ~ $ cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference
performance
rux@rux ~ $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
amd-pstate-epp

rux@rux ~ $ uname -a
Linux rux 6.6.52-gentoo-gentoo-dist #9 SMP PREEMPT_DYNAMIC Sun Oct 20 22:03:28 MDT 2024 x86_64 AMD EPYC 9654 96-Core Processor AuthenticAMD GNU/Linux

System Information
Operating System: Gentoo Linux
Kernel: Linux 6.6.52-gentoo-gentoo-dist x86_64
Model: Giga Computing MZ73-LM0-000
Motherboard: Giga Computing MZ73-LM0-000
BIOS: GIGABYTE R04_F32

CPU Information
Name: AMD EPYC 9654
Topology: 2 Processors, 192 Cores, 384 Threads
Identifier: AuthenticAMD Family 25 Model 17 Stepping 1
Base Frequency: 3.71 GHz
L1 Instruction Cache: 32.0 KB x 96
L1 Data Cache: 32.0 KB x 96
L2 Cache: 1.00 MB x 96
L3 Cache: 16.0 MB x 12

Memory Information
Size: 125 GB

The system is watercooled; the CPUs sit at 56 °C at idle (and while running these tests).
ruxbat (Member, 7 posts, joined Dec. 2017)
I am able to achieve 100% CPU usage with sysbench (sysbench cpu --threads=384 run).
ruxbat (Member, 7 posts, joined Dec. 2017)
Here's the Performance tab as well.

Attachments:
screenshot-2024-10-21_12-21-11.png (1.5 MB)

ruxbat (Member, 7 posts, joined Dec. 2017)
And here's what htop looks like while it's running.

Attachments:
screenshot-2024-10-21_12-23-40.png (2.7 MB)

ruxbat (Member, 7 posts, joined Dec. 2017)
For anyone else who stumbles across this: I asked SideFX what they make of it. Here's their response:

With 384 threads you need a rather large workload before they can all saturate. You may even get better throughput by running a lower thread count with the -j option and launching multiple Houdini instances, since synchronizing 384 threads itself takes time...

You can check whether we spin up all the threads using an Attribute Wrangle: our granularity is 1024 points by default, so you want at least 384 * 1024 points. I'd create 20 million or so, and have a kernel that does some lengthy math per point. That is, the SOP should take seconds to run; if you instead count on a SOP running many times to add up, you will hit the sync points and never fully launch all threads. (You also want to avoid arrays, as they may trigger memory allocation, which can in turn trigger syncing.)

Since my use case is to do tens of thousands of individual simulations in TOP / PDG, this solution is fine for me.
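For reference, here's a minimal sketch of the kind of wrangle test they describe; the loop length and attribute name are just illustrative. Drop it in an Attribute Wrangle (Run Over: Points) fed with 20 million or so points, e.g. from a densely resampled Grid:

// Heavy per-point kernel: each 1024-point batch should take long
// enough that all 384 threads stay busy between sync points.
// No arrays, so no per-point allocation that could force syncing.
float acc = 0;
for (int i = 1; i < 4000; i++)
    acc += sin(@P.x * i) * cos(@P.z + i);
f@heat = acc;

Sized like that, the SOP takes a few seconds to cook, which is long enough to watch whether all threads actually light up.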
Martin Winkler (Member, 1701 posts, joined March 2009)
One thing you should consider throwing some more money at is your memory. You've got 64 GB per socket, and that's... not great. You will likely run into trouble once you really put some parallel load on this box.
Martin Winkler
money man at Alarmstart Germany
ruxbat (Member, 7 posts, joined Dec. 2017)
Hey Martin! Absolutely not enough RAM to make the most of these cores, I agree! The DDR5 tax is strong, though; I think that 128 GB was almost $600. Plenty more slots to add as I scale up.
Member (13 posts, joined Jan. 2017)
ruxbat: "Plenty more slots to add as I scale up."

Since you're on a board with 12 slots per processor (which EPYC basically expects to be filled out), you're crippling the memory bandwidth and, even more so, the latency (already awful with DDR5) on an EPYC 9000 board. The processor caches won't be able to pull data from memory fast enough to keep all cores working most of the time. On EPYC 9000, both the main and SMT thread on each core need to be saturated with data every cycle in order to saturate all of the AVX2 / AVX-512 pipelines (whose instructions are especially long in bytes) and the huge number of integer units, and you need lots of memory bandwidth to avoid stalling the load/store units and the prefetchers keeping them all fed.
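For a rough sense of scale, assuming the DDR5-4800 these platforms are built around: one channel moves about 4800 MT/s × 8 bytes ≈ 38.4 GB/s, so a fully populated socket with 12 channels peaks around 460 GB/s. If that 64 GB per socket is sitting in just two DIMMs, it tops out near 77 GB/s, roughly a sixth of what those 96 cores were designed to be fed with.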

sysbench, being synthetic (and just meant as a general speed test IIRC; it's been a long time since I messed with Linux), probably doesn't generate enough memory traffic to overflow the CCX caches, and might not intend to. Most tests don't account for the large amounts of cache EPYC / TR Pro have, and tend not to be testing the right thing if they're meant to act like a "real world" program.

Try to find an all-purpose benchmark that operates on a lot of data, for R or Julia or some similar math framework, and I'm guessing you'll see the same usage pattern as Houdini.