Hi,
I'm currently trying to find out how I could run a FLIP simulation across multiple computers using Royal Render (I don't have access to HQueue, and my teachers don't want to install it just for simulation) or, if Royal Render can't do it, how to set up the distributed jobs without HQueue, based on the masterclass by Jeff Wagner (http://www.sidefx.com/index.php?option=com_content&task=view&id=1516&Itemid=9), which will be much more painful.
So, I've followed Jeff's steps until the end, just before he automates the distribution.
And here comes the trouble: :twisted:
I always get the same message,
Write Connect Failure. Error occured
My length 34
My adress “Computer name” by 8000
Message in error state after connection atempt.
I first suspected the tracker, but it can be accessed, even from another computer.
Any ideas where this could come from? Or could someone explain how I could use Royal Render to sim across multiple computers (as it doesn't take HQueue simulations into account)?
Thanks in advance
Manual Distributed Simulation tracker problems
- Matthew05
- Member
- 94 posts
- Joined: April 2011
- Offline
-
- pbowmar
- Member
- 7046 posts
- Joined: July 2005
- Offline
-
- claudiohick
- Member
- 36 posts
- Joined: May 2013
- Offline
-
- trojan_goat
- Member
- 41 posts
- Joined: June 2010
- Offline
I just dealt with this about a month ago. I did not use HQueue; instead I ran a script on each machine.
Using the same setup, make sure you disable Resize Container (slices don't work with resizing).
In your Houdini file, you have to set the host name of your tracker. You will find this under “DISTRIBUTE_pyro_CONTROLS”. You can probably just use localhost.
Make sure you run all the commands in a Houdini shell:
(houdini install path)\houdini\14.0.233\bin\hcmd.exe
When you have the shell open, you need to run a hython command to start the simtracker:
hython (houdini install path)\houdini\14.0.233\houdini\python2.7libs\simtracker.py 8000 9000
Once the sim tracker is running, you can open a Houdini shell on your remote machines and use this command:
hbatch -c "setenv SLICE=0; render /obj/distribute_uprespyro/saveslices;quit;" houdinifilepath\distributed_pyro.hip
You'll of course need to change the object and file paths to suit your needs, and you will also need to change the slice number on each computer.
If any one of the machines fails, the tracker will report the error.
~t.goat
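For anyone scripting the per-machine step above, here is a minimal Python sketch that only builds the hbatch command line for each slice, following the command quoted in the post. The function name `slice_commands` and the slice count of 4 are assumptions for illustration; the ROP path and .hip name are the ones from the post and must be changed for your scene.

```python
def slice_commands(hip_path, rop_path, num_slices):
    """Return one hbatch command line per slice, mirroring the command in the post."""
    commands = []
    for slice_index in range(num_slices):
        # Each machine gets its own SLICE value; everything else is identical.
        hscript = "setenv SLICE={}; render {}; quit;".format(slice_index, rop_path)
        commands.append('hbatch -c "{}" {}'.format(hscript, hip_path))
    return commands


if __name__ == "__main__":
    # Hypothetical values: substitute your own .hip file, ROP path and slice count.
    for line in slice_commands("distributed_pyro.hip",
                               "/obj/distribute_uprespyro/saveslices", 4):
        print(line)
```

Each printed line is meant to be run in a Houdini shell on a different machine, after the sim tracker has been started.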
-
- jlait
- Staff
- 6531 posts
- Joined: July 2005
- Online
-
- claudiohick
- Member
- 36 posts
- Joined: May 2013
- Offline
-
- jlait
- Staff
- 6531 posts
- Joined: July 2005
- Online
That error is generated by the Houdini session when it fails to connect to another machine.
In this case, the port 8000 means you are likely trying to connect to the machine running the tracker.
Is “Computer name” running the tracker?
You should be able to point a web browser to
http://Computer Name:9000
and see the tracker status there.
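If no browser is handy on the remote machine, the same reachability check can be sketched in Python (hython should work too). This only tests whether a TCP connection can be opened; it says nothing about the tracker itself. The helper name `can_connect` and the `tracker_host` value are assumptions, and ports 8000/9000 are the ones used earlier in this thread.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    tracker_host = "localhost"  # assumption: replace with the tracker machine's name
    for port in (8000, 9000):
        print(port, can_connect(tracker_host, port))
```

Run it from each machine that is supposed to simulate; a False for either port points at name resolution or a firewall rather than at Houdini.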
-
- claudiohick
- Member
- 36 posts
- Joined: May 2013
- Offline
Jeff, I can only access the tracker's status from the tracker machine itself.
I'm almost sure it's a permission/firewall issue. I just don't know which
service/port to enable. The tracker uses a different port every time.
I'm using Linux, btw.
Hqueue status: NODE02 computer
The Houdini 15.0.244.16 environment has been initialized.
ALF_PROGRESS 0%
Write Connect failure. code: Bad Address
My length 0
My address node03 by 35332
Write Connect failure. code: Bad Address
My length 31
My address node03 by 35332
Message in error state after connection attempt!
—- Pump enters error status —-
Write Connect failure. code: Bad Address
My length 140515278872504
My address node03 by 35332
Write Connect failure. code: Bad Address
My length 29
My address node03 by 35332
Message in error state after connection attempt!
Tracker message: NODE03
REFRESH(30 sec): http://node03:57854/
Active List
Job: flipsolver1_2: @1446651059.781678, (n: 2, a: 1, d: 0, e: 0)
Peer Info acquire->sync acquire->done sync->done
peer #3 (@1446651059.781682) - 192.168.0.11 : 45239 pending pending pending
__________________________________________________________________
Barriers
__________________________________________________________________
Done List
-
- claudiohick
- Member
- 36 posts
- Joined: May 2013
- Offline
The documentation shows something important. I'll try to fix this using static DNS.
Verify that network connections are possible between machines.
The client machines will communicate with the server and with the shared folder host machine. Check that every client machine can locate the HQueue Server machine by its domain (DNS) name. Similarly check that the clients can locate the Shared Folder Server machine by its DNS name.
Additionally check that the host names (or computer names) of the client machines match their DNS names. This is important for when the HQueue server needs to contact the clients.
-
- jlait
- Staff
- 6531 posts
- Joined: July 2005
- Online
So you got HQueue to work? Excellent. That is the best approach.
However, as you noticed, it allows the tracker to open any port rather than a fixed port. This is important to ensure more than one tracker can run on one machine, and that you don't get failures due to a port still being held from a previous simulation.
The client-to-client communication also allocates a port at run time. There is no fixed set of ports to open between the machines. There needs to be no firewall/shorewall between the machines that are simulating, or between them and the tracker. There MUST, of course, be a firewall between your machines and the rest of the internet!
If you can't access the tracker's status from other machines, then they will also be unable to access the tracker directly. This is the first thing to fix. Unfortunately there are many Linux configurations out there, so it is hard to provide more than general pointers.
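To illustrate why there is no fixed port list to hand to a firewall, here is a generic Python sketch of run-time port allocation: binding to port 0 asks the operating system for any free port. This is not Houdini's tracker code, just the general socket behaviour, and the helper name `open_listener_on_free_port` is made up for the example.

```python
import socket


def open_listener_on_free_port(host="127.0.0.1"):
    """Bind to port 0 so the OS chooses a free port; return (socket, port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, 0))
    listener.listen(1)
    return listener, listener.getsockname()[1]


if __name__ == "__main__":
    # Two listeners on the same machine get two different ports.
    first, first_port = open_listener_on_free_port()
    second, second_port = open_listener_on_free_port()
    print("OS-assigned ports:", first_port, second_port)
    first.close()
    second.close()
```

The port numbers differ between runs, which is why firewall rules between simulating machines have to be based on the machines rather than on individual ports.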
-
- claudiohick
- Member
- 36 posts
- Joined: May 2013
- Offline
OK!
The fix for the earlier error was to set up /etc/hosts with hostname/IP entries.
I no longer get error messages, but the job seems stuck in the same state.
tracker message:
Active List
Job: flipsolver1_2: @1446662733.185918, (n: 4, a: 1, d: 0, e: 0)
Peer Info acquire->sync acquire->done sync->done
peer #3 (@1446662733.185923) - 127.0.0.1 : 48640 pending pending pending
-
- jlait
- Staff
- 6531 posts
- Joined: July 2005
- Online
Only one machine successfully connected to the tracker. The only machine that connected was the machine running the tracker, thus the 127.0.0.1 loopback address.
The two problems I see are:
1) The other three machines aren't connecting/finding the tracker
2) The 127.0.0.1 suggests you have your own machine name in /etc/hosts as 127.0.0.1. Some linux machines are configured this way. If you have a line
127.0.0.1 mymachinename
it should be removed.
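A quick way to spot that kind of line is a small Python sketch that scans the hosts file for entries mapping the machine's own name to a loopback address. The function name `loopback_entries` is an assumption for illustration; it matches any 127.x.x.x address, since some distributions use 127.0.1.1 for the same purpose.

```python
import socket


def loopback_entries(hosts_text, machine_name):
    """Return hosts-file lines that map machine_name to a 127.x.x.x address."""
    hits = []
    for raw_line in hosts_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # ignore comments
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if address.startswith("127.") and machine_name in names:
            hits.append(raw_line.strip())
    return hits


if __name__ == "__main__":
    try:
        with open("/etc/hosts") as handle:
            text = handle.read()
    except OSError:
        text = ""  # no readable hosts file at that path (e.g. on Windows)
    for entry in loopback_entries(text, socket.gethostname()):
        print("loopback mapping:", entry)
```

If it prints anything on a simulation machine, that line is a candidate for the removal described above.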
-
- claudiohick
- Member
- 36 posts
- Joined: May 2013
- Offline
Perfect! It's working now. Just one final point.
I had to temporarily stop the firewall on all machines…
So I still have to open something in the firewall. Would it be a service or a port? Apparently the tracker always uses a random port, so I imagine it is a service that needs to be allowed, right?
Final tracker status:
Active List
Job: flipsolver1_228: @1446673101.530937, (n: 3, a: 2, d: 0, e: 0)
Peer Info acquire->sync acquire->done sync->done
peer #2 (@1446673101.530939) - 192.168.0.10 : 35440 pending pending pending
peer #0 (@1446673101.536830) - 192.168.0.11 : 48590 pending pending pending
Barriers
Done List
Job: flipsolver1_228 - Pressure Exchange: @1446673101.446718, (n: 3, a: 3, d: 3, e: 0)
acquire->sync: 0.001169s
sync->done: 0.008716s
Peer Info acquire->sync acquire->done sync->done
peer #1 (@1446673101.446720) - 192.168.0.3 : 47357 0.001167s 0.009683 0.008516
peer #2 (@1446673101.446785) - 192.168.0.10 : 35440 0.001102s 0.009733 0.008631
peer #0 (@1446673101.447858) - 192.168.0.11 : 48590 0.000029s 0.007951 0.007922
Job: flipsolver1_228 - Pressure Request: @1446673101.426048, (n: 3, a: 3, d: 3, e: 0)
acquire->sync: 0.001146s
sync->done: 0.012683s
Peer Info acquire->sync acquire->done sync->done
peer #2 (@1446673101.426050) - 192.168.0.10 : 35440 0.001144s 0.013817 0.012673
peer #1 (@1446673101.426124) - 192.168.0.3 : 47357 0.001070s 0.013174 0.012104
peer #0 (@1446673101.427141) - 192.168.0.11 : 48590 0.000053s 0.012087 0.012034
-
- jlait
- Staff
- 6531 posts
- Joined: July 2005
- Online
There are no specific ports, I'm afraid. I'm not sure how your firewall is configured, but all ports should be opened for peer-to-peer traffic. I suspect the pre-built “services” are to predefine port-sets.
It is very important you keep a firewall between your compute nodes and the rest of the internet, however. A usual configuration is your node machines are all behind a single router/firewall box but have free connection to each other.
-
- Leon_Y
- Member
- 7 posts
- Joined: April 2016
- Offline