Distributed sim slowness??

Member
8 posts
Joined: July 2005
Just to pop a question here: I set up a small queue and ran a distributed sim using the masterclass video as my guide. It ran fine, but it is very slow compared to simulating on a single machine.

I tested from 1M to 16M particles. The 16M distributed sim ran for 23 hours, while on a single machine the same sim runs about 20 minutes per 10 frames (out of 240), so roughly 8 hours total. It's true the machines running the distributed sim are a bit slower than my workstation, but it still seems very slow when you have four machines stacked up to run one sim.

Is there a threshold in particle count where a distributed sim starts to show an advantage? (The 1M sim is about 10x slower distributed, while the 16M sim seems to be only 3x slower.) Or is there something wrong in my setup?
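
My rough mental model is a fixed per-frame distribution overhead on top of the divided compute time. Here is a quick Python sketch of what I mean (every number in it is made up for illustration):

    # Toy model: distributed time = compute / machines + fixed per-frame overhead.
    # The 300s/frame overhead is invented for illustration, not measured.
    def dist_hours(single_hours, machines=4, frames=240, overhead_s_per_frame=300.0):
        compute = single_hours / machines                   # ideal parallel compute
        overhead = frames * overhead_s_per_frame / 3600.0   # slicing/network cost, in hours
        return compute + overhead

    print(dist_hours(8.0))   # -> 22.0, in the ballpark of my 23-hour run

A fixed overhead like that would dominate a small sim but get amortized on a big one, which would explain the 1M sim being ~10x slower while the 16M sim is only ~3x slower.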

Note, this is for the fluid sim only, no white water sim.
Staff
4199 posts
Joined: Sept. 2007
I think the threshold is generally when the sim runs out of memory on a single machine.

16 cores in a single computer will always be faster than 4x 4-core machines. I believe the SideFX tests were using a 10Gb network connection, but even that is going to have more latency than just having more cores in a single machine. Remember, depending on how much overlap you have between slices, there could potentially be many gigabytes of data being copied back and forth at each time step.
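
To put rough numbers on it, here is a back-of-envelope sketch (every value below is an assumption for illustration, not a measurement):

    # Back-of-envelope: time to exchange slice-overlap data per timestep.
    # All values below are assumptions for illustration only.
    overlap_particles = 2 * 10**6   # particles sitting in the overlap bands
    bytes_per_particle = 64         # position, velocity, other attributes
    link_gbps = 10.0                # 10Gb network

    gigabits = overlap_particles * bytes_per_particle * 8 / 1e9
    print("%.2f Gb per exchange, ~%.3f s on the wire" % (gigabits, gigabits / link_gbps))

Multiply that by substeps, machine pairs, and 240 frames and it adds up quickly, while RAM inside a single box moves the same data orders of magnitude faster.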

If simulation speed is the concern, cram the fastest, highest core count into a single box, with as much memory as you can. Other than that, distributed sims are currently going to help the shots that just won't fit in the memory of one machine. A faster network will make that process less painful, but it still won't match how fast a computer can talk to itself.
I'm o.d.d.
Member
8 posts
Joined: July 2005
I'm aware that more cores/GHz in a single box are better than the same amount combined across machines.
But in my test case, my machine has 12 cores, and the distributed sim machines have either 8 or 12 cores (two 8-core and two 12-core, all with the same 48GB of RAM). The only difference is that my workstation has a better Xeon, while the 8-core machines might be i7s with hyper-threading. I set it up this way because I wanted to see how the load balancing works.

I thought that only particles on the slice boundaries are transferred, not everything (when I check the tracker output, transfers take only milliseconds, maybe up to seconds; over 240 frames that adds up to just minutes out of the entire simulation).

If distributed sims are only intended to make sure oversized sims can actually be simulated, then I think they still need work, because tank_initial takes up quite a lot of time as well, even when simming on a single machine (if not using a flat tank).
Staff
6413 posts
Joined: July 2005
Something seems wrong with your setup…

Do you have boundary creation turned off? Most of the tank tools create boundary particles every frame across the entire sim.

Turn on Enable Performance Monitor Logging. You can send the resulting giant file through $HFS/houdini/python2.7libs/perfmon_sum.py to sort the timings by node and verify there are no surprises. Something should show up as significantly different between the two sims.
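
If that script isn't handy, a crude stand-in is easy to write. The sketch below assumes a simple "<node> <seconds>" line format, which may not match the actual log, so adapt the parsing to whatever the file really contains:

    # Crude stand-in for perfmon_sum.py: total the time per node, sorted descending.
    # ASSUMPTION: each line looks like "<node path> ... <seconds>"; the real
    # performance monitor log format may differ, so adjust the parsing as needed.
    import sys
    from collections import defaultdict

    totals = defaultdict(float)
    with open(sys.argv[1]) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue
            try:
                totals[parts[0]] += float(parts[-1])
            except ValueError:
                continue

    for node, secs in sorted(totals.items(), key=lambda kv: -kv[1]):
        print("%10.3f  %s" % (secs, node))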
Staff
6413 posts
Joined: July 2005
I just tried a simple 10MPart sim to see how the low range goes. I had mostly timed sims over 100MPart, so I didn't have a timing handy for it.

I got 38m08s for 240 frames on a single machine and 17m49s on four machines, so about a 2x speed-up. Not the 3.2x improvement I get with 180MPart sims, but definitely not a 3x *slowdown*.

Attached is the file I used. I think there must be some other difference in our setups.

One thing to try is turning off Global Pressure Solve. If that makes it significantly faster, it points to a network issue. Speaking technically, I saw significant slowdowns before I denagled the connection (disabled Nagle's algorithm, which batches small TCP packets at the cost of latency). It could be that the denagling is somehow failing on your platform and/or network setup.
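
For reference, this is what denagling a socket looks like in Python; this is just the standard socket call, not Houdini's internal code, and the host and port are hypothetical:

    import socket

    # Disable Nagle's algorithm so small messages are sent immediately
    # instead of being buffered for batching (lower latency, more packets).
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    s.connect(("tracker-host", 8000))   # hypothetical tracker address and port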

Attachments:
simpleslice.hip (2.0 MB)
