Hi,
I'm trying to sim FLIP slices through HQueue, with two machines to start. The simulation starts, then gets stuck after sometimes 1, sometimes 10, and more rarely about 30 frames. Most of the time it stops at 10. The result up to that point is exactly what I would expect, so I wonder what could be happening that makes Houdini get stuck like this.
I found a few similar posts online but with no answers, so any debugging pointers would be more than welcome.
Cheers
Chris
flip distributed sim stuck
- chistof
- Member
- 52 posts
- Joined: June 2012
- Offline
- Stalkerx777
- Member
- 183 posts
- Joined: Nov. 2008
- Offline
See what simtracker.py server tells you. There is a simple web interface to it. Go to your browser and type:
hostname_where_simtracker_is_running : PORT
PORT should be the port number simtracker.py is running on + 1
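A minimal sketch of the port arithmetic described above, in case it helps. The host name and port value are hypothetical placeholders, and the http:// scheme is an assumption; only the "tracker port + 1" rule comes from the post.

```python
def tracker_web_url(host: str, tracker_port: int) -> str:
    """Build the address of the simtracker web page.

    The web interface is described as listening on the tracker port + 1.
    The http:// scheme is an assumption, not something stated in the thread.
    """
    return f"http://{host}:{tracker_port + 1}"


# Hypothetical values: substitute the machine and port simtracker.py runs on.
print(tracker_web_url("trackerhost", 8000))
```

Paste the printed address into a browser to open the tracker page.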
Edited by Stalkerx777 - March 15, 2018 15:00:08
Aleksei Rusev
Sr. Graphics Tools Engineer @ Nvidia
- chistof
- Member
- 52 posts
- Joined: June 2012
- Offline
Thanks Aleksei for your answer,
I tried everything I could find in the documentation, forums and tutorials. Checking the tracker was one of those things, but even though I can see that it stops during the syncing, that does not help me much in finding what I can do to sort it out.
I wonder if it could be a network issue where the router receives a flood of requests on some port and ends up blocking it. The router does not seem to give any warning about that though.
I attached a screenshot of the tracker page. This time it went up to frame 26.
- rvinluan
- Staff
- 1274 posts
- Joined: July 2005
- Offline
Hello,
By any chance are you distributing the sim over Mac machines?
It's a known issue that distributed sims may get stuck randomly on Mac. An Apple bug report has been filed that details the networking behavior that triggers the issue.
The only workaround that I know of is to use sim caches and restart the sim if it gets stuck.
Cheers,
Rob
- nimnul
- Member
- 18 posts
- Joined: July 2011
- Offline
Hi,
We have a very similar issue running on CentOS 6.7 with Deadline. Here are a few observations on that:
- as Chris pointed out, the hang occurs when one peer jumps to the next frame without waiting for the others to finish the previous frame
- the number of slices seems to have an effect: the more slices you use, the earlier the simulation gets stuck
- the same simulation run several times always gets stuck at the same frame, and is likely to get stuck on the same slice
- the hang occurs only on fairly high-resolution simulations; raising the particle separation resolves the issue
- turning Distributed Pressure Solve off seems to resolve the issue
- heartbeats continue after the simulation has stopped
We'll auto-restart from the checkpoints for now, but any hints on where to look for the possible cause would be much appreciated.
-Pavel
Edited by nimnul - Aug. 6, 2018 03:01:46
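For anyone wanting to automate the restart workaround, here is a rough sketch of the detection half only. Since heartbeats reportedly continue after the sim has stopped, this watches the output directory for fresh files instead. The directory layout and timeout are assumptions, and the actual restart step depends on your farm setup, so it is left out.

```python
import os
import time
from typing import Optional


def newest_mtime(directory: str) -> float:
    """Return the latest modification time of any file in directory (0.0 if none)."""
    newest = 0.0
    for entry in os.scandir(directory):
        if entry.is_file():
            newest = max(newest, entry.stat().st_mtime)
    return newest


def is_stalled(directory: str, timeout_s: float, now: Optional[float] = None) -> bool:
    """True if no file in directory has been written within the last timeout_s seconds.

    Note that an empty directory also counts as stalled, so start polling
    only after the first frame has been written.
    """
    if now is None:
        now = time.time()
    return now - newest_mtime(directory) > timeout_s
```

A cron job or farm-side script could poll `is_stalled` on the cache directory and requeue the job when it returns True. The timeout should be comfortably longer than your slowest frame.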
- jlait
- Staff
- 6413 posts
- Joined: July 2005
- Offline
nimnul
as Chris pointed out, the hang occurs when one peer jumps to the next frame without waiting for the others to finish the previous frame
This is a sort of “split-brain” problem. This usually isn't a networking issue, but a logic issue in the distributed simulation.
A very common cause of this is when the substepping isn't synced between machines.
This doesn't explain:
nimnul
turning Distributed Pressure Solve off seems to resolve the issue
however, as that shouldn't materially affect whether substepping stays synced.
Does your FLIP sim use variable substepping? I.e., Min Substep < Max Substep?
If you can, try locking the substeps by setting those equal; it may stop the issue.
Ideally, if you have a reproducible case that you can submit to support, we'd like to see it fail here, as it can be hard to figure out what is causing the computation to diverge.
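To illustrate why unsynced variable substepping could split peers apart, here is a toy sketch. It is not Houdini's actual substep logic: the CFL-style formula, the function name and all numbers are assumptions made up for the example. It only shows that a count derived from local data can differ per slice unless min and max are equal.

```python
import math


def substeps_for(max_speed, dt, voxel_size, cfl, min_sub, max_sub):
    """Toy CFL-style substep count, clamped to [min_sub, max_sub]."""
    wanted = math.ceil(max_speed * dt / (cfl * voxel_size))
    return max(min_sub, min(max_sub, wanted))


dt = 1.0 / 24.0
# Hypothetical peak particle speed seen locally by each of two slices.
slice_speeds = [3.0, 11.0]

# Variable substepping (min 1, max 4): each slice may pick its own count.
variable = [substeps_for(s, dt, 0.05, 1.0, 1, 4) for s in slice_speeds]

# Locked substepping (min == max == 2): every slice is forced to agree.
locked = [substeps_for(s, dt, 0.05, 1.0, 2, 2) for s in slice_speeds]

print("variable:", variable)
print("locked:  ", locked)
```

If the peers disagree on the number of substeps, one of them finishes the frame while the others are still waiting on an exchange, which would match the "one peer jumps ahead" symptom.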
- nimnul
- Member
- 18 posts
- Joined: July 2011
- Offline
- Daniel_Lefebre
- Member
- 5 posts
- Joined: June 2016
- Offline
- Htogrom
- Member
- 31 posts
- Joined: June 2010
- Offline
nimnul
Hi,
We have a very similar issue running on CentOS 6.7 with Deadline. Here are a few observations on that:
- as Chris pointed out, the hang occurs when one peer jumps to the next frame without waiting for the others to finish the previous frame
- the number of slices seems to have an effect: the more slices you use, the earlier the simulation gets stuck
- the same simulation run several times always gets stuck at the same frame, and is likely to get stuck on the same slice
- the hang occurs only on fairly high-resolution simulations; raising the particle separation resolves the issue
- turning Distributed Pressure Solve off seems to resolve the issue
- heartbeats continue after the simulation has stopped
We'll auto-restart from the checkpoints for now, but any hints on where to look for the possible cause would be much appreciated.
-Pavel
Can you guide me on how to do checkpoints on a distributed sim? I tried adding the $SLICE variable to the checkpoint name. When I want the sim to start from some frame, I set the frame range from the last checkpoint to the end of the range. The job goes to the farm and does nothing: memory goes to 20 GB, but CPU stays at 1%. Normally, memory would be around 50 GB. Any advice on how to do this would be a great help. Thanks!
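One thing that may be worth ruling out is that the restart frame has a checkpoint for every slice. Below is a small sketch that picks the latest frame present for all slices. The file naming pattern (name.sliceN.FRAME.sim) is purely hypothetical, so adjust the regular expression to whatever your checkpoint name with $SLICE actually produces. This does not by itself explain the idle job.

```python
import re
from collections import defaultdict
from typing import Iterable, Optional

# Hypothetical naming: e.g. "flip.slice1.0010.sim" -> slice 1, frame 10.
CHECKPOINT_RE = re.compile(r"slice(\d+)\.(\d+)\.sim$")


def latest_complete_frame(filenames: Iterable[str], num_slices: int) -> Optional[int]:
    """Return the highest frame that has a checkpoint for every slice, or None."""
    slices_by_frame = defaultdict(set)
    for name in filenames:
        match = CHECKPOINT_RE.search(name)
        if match:
            slice_id, frame = int(match.group(1)), int(match.group(2))
            slices_by_frame[frame].add(slice_id)
    complete = [f for f, s in slices_by_frame.items() if len(s) >= num_slices]
    return max(complete) if complete else None
```

Feeding it `os.listdir()` of the checkpoint directory gives a frame that all slices can resume from together, rather than the last frame written by whichever slice got furthest.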
- timjan
- Member
- 27 posts
- Joined: Oct. 2015
- Offline
- akshay_asarkar
- Member
- 6 posts
- Joined: March 2017
- Offline
timjan
With distributed sims you have to set the "gas net slice exchange" DOP node to always update. (It's set to initial by default.)
I tried everything, and finally your solution worked! Thanks! So the default setup created by the Distribute Particle Fluid shelf tool isn't right? That's weird.
Akshay Asarkar