flip distributed sim stuck

Member
52 posts
Joined: June 2012
Hi,

I'm trying to simulate FLIP slices through HQueue, with two machines to start. The simulation starts, then gets stuck after sometimes 1, sometimes 10, and more rarely about 30 frames. Most of the time it stops at 10. The result is exactly what I would expect, so I wonder what could be making Houdini get stuck like this.

I found a few similar posts online, but none with an answer. Any debugging pointers would be more than welcome.

Cheers
Chris

Attachments:
flipDistributed.v001_02.hip (2.2 MB)

Member
183 posts
Joined: Nov. 2008
See what the simtracker.py server tells you. It has a simple web interface. Open your browser and go to:

hostname_where_simtracker_is_running:PORT

where PORT is the port number simtracker.py is running on, plus 1.
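
To make the port arithmetic concrete, here is a minimal Python sketch. The function name, the host name and the port value are hypothetical, and it assumes the web interface is served over plain HTTP.

```python
def tracker_web_url(host: str, tracker_port: int) -> str:
    """Build the address of the tracker's web page.

    Assumption: the web interface listens on the tracker port + 1,
    as described above, and is reachable over plain HTTP.
    """
    if not 0 < tracker_port < 65535:
        raise ValueError(f"tracker port out of range: {tracker_port}")
    return f"http://{host}:{tracker_port + 1}"


# Hypothetical host and port, for illustration only.
print(tracker_web_url("farm01", 8000))
```

If the tracker was started on a different port, pass that port instead; the web page is then one port above it.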
Edited by Stalkerx777 - March 15, 2018 15:00:08
Aleksei Rusev
Sr. Graphics Tools Engineer @ Nvidia
Member
52 posts
Joined: June 2012
Thanks Aleksei for your answer,

I tried many things I could find in the documentation, forums and tutorials. Checking the tracker was one of them. Although I can see that it stops during syncing, that does not help me much in working out how to fix it.

I wonder if it could be a network issue where the router receives a flood of requests on some port and ends up blocking it. The router does not seem to give any warning about that, though.

I attached a screenshot of the tracker page. This time it got up to frame 26.

Attachments:
tracker-stuck.png (77.4 KB)

Staff
1274 posts
Joined: July 2005
Hello,

By any chance are you distributing the sim over Mac machines?

It's a known issue that distributed sims may get stuck randomly on Mac. A bug report has been filed with Apple that describes in detail the networking behavior that triggers the issue.

The only workaround that I know of is to use sim caches and restart the sim if it gets stuck.

Cheers,
Rob
Member
18 posts
Joined: July 2011
Hi,

We have a very similar issue running on CentOS 6.7 with Deadline. Here are a few observations on that:

- as Chris pointed out, the hang occurs when one peer jumps to the next frame without waiting for the others to finish the previous frame
- the number of slices seems to have an effect: the more slices you use, the earlier the simulation gets stuck
- the same simulation run several times always gets stuck at the same frame, and is likely to get stuck on the same slice
- the hang occurs only on fairly high-resolution simulations; raising particle separation resolves the issue
- turning Distributed Pressure Solve off seems to resolve the issue
- heartbeats continue after the simulation has stopped

We'll auto-restart from the checkpoints for now, but any hints on where to look for the cause would be much appreciated.
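
Since heartbeats keep arriving after the sim has stopped, a restart watchdog has to look at frame progress rather than at heartbeats. A minimal Python sketch of that idea follows. The class name and the timeout are hypothetical, and how the newest finished frame is obtained (for example from checkpoint files on disk) is left to the caller.

```python
from typing import Optional


class StallDetector:
    """Flags a sim as stalled when the newest finished frame stops advancing."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s            # seconds without progress before we call it stuck
        self.last_frame: Optional[int] = None
        self.last_change: float = 0.0

    def update(self, frame: int, now: float) -> bool:
        """Record the newest finished frame; return True if the sim looks stuck."""
        if self.last_frame is None or frame > self.last_frame:
            # Progress was made: remember when it happened.
            self.last_frame = frame
            self.last_change = now
            return False
        # No new frame: stalled once the timeout has elapsed.
        return now - self.last_change >= self.timeout_s
```

A farm script could call update() on a timer and, when it returns True, kill the job and resubmit it from the last checkpoint. The timeout needs to be comfortably longer than the slowest frame.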

-Pavel
Edited by nimnul - Aug. 6, 2018 03:01:46
Staff
6413 posts
Joined: July 2005
nimnul
as Chris pointed out, stuck occurs when one peer jumps to the next frame without waiting for the others to finish the previous frame

This is a sort of “split-brain” problem. This usually isn't a networking issue, but a logic issue in the distributed simulation.

A very common cause of this is when the substepping isn't synced between machines.

This doesn't explain:
nimnul
turning Distributed Pressure Solve off seems to resolve the issue
however, as that shouldn't materially affect whether substepping stays synced.

Does your FLIP sim use variable substepping? I.e., Min Substep < Max Substep?

If you can, try locking the substeps by setting those two equal; it may stop the issue.

Ideally, if you have a reproducible case that you can submit to support, we'd like to see it fail here, as it can be hard to figure out what is causing the computation to diverge.
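
As a rough illustration of the two checks above, here is a minimal Python sketch. The function names and the per-slice frame dictionary are hypothetical; it assumes you read the current frame of each slice by hand from the tracker page or the job logs, and it is not part of Houdini's API.

```python
from typing import Dict, List


def has_variable_substeps(min_substeps: int, max_substeps: int) -> bool:
    """True when the solver is free to pick a different substep count per frame."""
    return min_substeps < max_substeps


def slices_ahead(frame_by_slice: Dict[int, int]) -> List[int]:
    """Return the slices that have moved past the slowest slice's frame."""
    if not frame_by_slice:
        return []
    slowest = min(frame_by_slice.values())
    return sorted(s for s, f in frame_by_slice.items() if f > slowest)


# Hypothetical readings: slice 1 jumped to frame 27 while the others are on 26.
print(has_variable_substeps(1, 2))
print(slices_ahead({0: 26, 1: 27, 2: 26}))
```

A non-empty result from slices_ahead while the sim is hung is consistent with the split-brain pattern described above, though it does not say what caused the slices to diverge.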
Member
18 posts
Joined: July 2011
Hi Jeff,

Thank you for your response. The substep settings were indeed 1 < 2.
I'll try running it locked a couple of times.

-Pavel
Member
5 posts
Joined: June 2016
Did you resolve your issue? I have had no success with distributed FLIP systems. Grains and pyro work fine, but FLIP just sits in the queue and does nothing.
Member
31 posts
Joined: June 2010
nimnul
Hi,

We have a very similar issue running on CentOS 6.7 with Deadline. Here are a few observations on that:

- as Chris pointed out, stuck occurs when one peer jumps to the next frame without waiting for the others to finish the previous frame
- number of slices seem to have an effect: more slices you use makes the simulation stuck earlier
- same simulation being run several times always gets stuck at the same frame and likely to stuck on the same slice
- stuck occurs only on a fairly high resolution simulations, raising particle separation resolves the issue
- turning Distributed Pressure Solve off seems to resolve the issue
- heartbeats continue after simulation has stopped

We'll do the auto-restart from the checkpoints for now, but any hints on where to check for the possible reason would be much appreciated.

-Pavel

Can you guide me on how to do checkpoints on a distributed sim? I tried adding the $SLICE variable to the checkpoint name. When I want the sim to start from a given frame, I set the frame range to run from the last checkpoint to the end of the range. The job goes to the farm and does nothing: memory goes up to 20 GB, but CPU stays at 1%. Normally memory would be around 50 GB. Any advice on how to do this would be a great help. Thanks!
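
One thing worth checking when restarting is that the chosen frame has a checkpoint for every slice. Below is a minimal Python sketch of that check. The file naming pattern (name.slice<N>.<frame>.sim) is an assumption made for illustration, as is the function name; adapt the regular expression to whatever your checkpoint name actually expands to.

```python
import re
from typing import Dict, Iterable, Optional, Set

# Assumed naming: <name>.slice<N>.<frame>.sim (hypothetical pattern).
_CHECKPOINT = re.compile(r"\.slice(\d+)\.(\d+)\.sim$")


def last_common_checkpoint(filenames: Iterable[str], num_slices: int) -> Optional[int]:
    """Return the highest frame that has a checkpoint for every slice, or None."""
    frames_by_slice: Dict[int, Set[int]] = {}
    for name in filenames:
        match = _CHECKPOINT.search(name)
        if match:
            slice_id, frame = int(match.group(1)), int(match.group(2))
            frames_by_slice.setdefault(slice_id, set()).add(frame)
    if len(frames_by_slice) != num_slices:
        return None  # at least one slice has no checkpoint at all
    common = set.intersection(*frames_by_slice.values())
    return max(common) if common else None


files = ["flip.slice0.0025.sim", "flip.slice0.0026.sim", "flip.slice1.0025.sim"]
print(last_common_checkpoint(files, 2))
```

In this example slice 0 got one frame further than slice 1 before the hang, so the restart frame has to be the earlier one that both slices share.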
Member
27 posts
Joined: Oct. 2015
With distributed sims you have to set the "gas net slice exchange" DOP node to always update. (It is set to initial by default.)
Member
6 posts
Joined: March 2017
timjan
With distributed sims you have to set the "gas net slice exchange" DOP node to always update. (Its set to set to initial by default)

I tried everything, and finally your solution worked! Thanks! So the default setup created by the Distribute Particle Fluid shelf tool isn't right? That's weird.
Akshay Asarkar