Distributed simulation node synchronisation issue

Member
1 posts
Joined: Oct. 2016
Hi,

We started using distributed simulations a while ago with good results, mostly because the current FLIP solver (especially since narrow band support was added) doesn't seem to scale very well to many cores, so it's more efficient to split the domain into multiple parts. (I know, ideally proper multithreading would be more efficient, but...)

The process is a slightly modified version of the Python script that comes with Houdini (just port numbers and a few small things changed), and the whole setup runs in Tractor (no HQueue involved, but I don't think that makes much difference).

This is what happens occasionally:
1. Sim tracker starts, the configured ports are open
2. Simulation jobs start
3. Simulation machines show up in the http stream and print tracker messages
4. All sim machines start cooking the first frame
5. Some machines progress to the second frame, some don't
6. Sync is lost and none of them progress any further
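
For anyone wanting to rule out step 1 from the sim machines' side, here is a minimal sketch of a TCP reachability check. It uses only the Python standard library and is not part of the tracker script that ships with Houdini; the host name and port numbers in the usage comment are placeholders, not values from my setup.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_ports(host, ports, timeout=2.0):
    """Return {port: bool} for each port in ports."""
    return {port: port_is_open(host, port, timeout) for port in ports}


# Hypothetical usage, run from each sim machine before the job starts:
#   check_ports("tracker-host", [8000, 9000])
```

This only shows that something is listening on the ports; it says nothing about whether the slices stay in sync afterwards.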

The logs otherwise look correct.

To get more information, I enabled
$HOUDINI_DISTRIBUTEDPROJECTION_DIAGNOSTICS
and noticed something:
The machines that never reach step 5 (the second frame of the sim) print this:

Number of connected components = 1

I assume what's happening is that these machines are waiting for incoming data but other nodes have already stepped to the second frame and the whole process gets stuck.
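
To spot this condition early rather than waiting for the job to hang, something along these lines could flag the slices that have fallen behind. It is only a sketch: it assumes you already have the last completed frame per slice (for example scraped from each job's log, which is not shown here), and the function name and slice numbering are made up for illustration.

```python
def find_lagging_slices(last_frame_by_slice):
    """Return {slice: frames_behind} for every slice behind the furthest one.

    last_frame_by_slice maps a slice number to the last frame that slice
    finished cooking. How those numbers are collected is left to the caller.
    """
    if not last_frame_by_slice:
        return {}
    front = max(last_frame_by_slice.values())
    return {
        slice_id: front - frame
        for slice_id, frame in last_frame_by_slice.items()
        if frame < front
    }


# Five slices, two of which are still stuck on the first frame.
status = {0: 2, 1: 1, 2: 2, 3: 1, 4: 2}
print(find_lagging_slices(status))  # {1: 1, 3: 1}
```

A wrapper could kill and requeue the whole job if the same slices stay in that dictionary for longer than a normal frame takes to cook.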

The setup:
- fairly standard river simulation
- with or without narrow band enabled
- 5 slices arranged in a single direction
- distributed pressure solve enabled
- Gas Net Slice Exchange: mostly default settings (other than tracking); I tried changing the overlap values, with little success
- no balancing node added

The issue happens somewhat haphazardly and is hard to reproduce, but some setups show it more consistently than others, and even small tweaks like changing the resolution in the scene file can “solve” (mask) it.

While it cannot be fully ruled out, I don't think it's bandwidth or system related either: the machines involved are in a single rack with about 500-800Mb/s of bandwidth available, which is about 10-20x what I see in use by running jobs. (I previously also tested on a single machine with internal bandwidth of about 25-30Gb/s.)

Does anyone have experience with this?
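
As a quick sanity check on those numbers, here is the arithmetic spelled out. The inputs are the rough figures quoted above, so the result is only a loose range for the traffic the running jobs generate, not a measurement.

```python
def implied_usage(available, headroom_factor):
    """Traffic implied by 'available is headroom_factor times what is used'."""
    return available / headroom_factor


# Available bandwidth of 500-800 (in the same Mb/s units as above),
# quoted as 10-20x the observed usage.
lowest = implied_usage(500, 20)   # least bandwidth, most headroom
highest = implied_usage(800, 10)  # most bandwidth, least headroom

print(lowest, highest)  # 25.0 80.0
```

So the jobs would be using somewhere between roughly 25 and 80 in those units, well below what the rack provides.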

Thanks,
Andras
Member
12 posts
Joined: Oct. 2016
Have you found a solution? We are looking to do the same.
Could you explain the procedure you use in more detail?

Thanks in advance.