TOP Deadline updates in Houdini 18.0.399 (new MQ)

Member | 571 posts | Joined: May 2017
The TOP Deadline scheduler has received some changes in Houdini 18.0.399.
You'll find the latest daily builds here: https://www.sidefx.com/download/daily-builds/#category-devel

New parms for the Message Queue (see the attached dl_newmq.png):


The main change is to the Message Queue (MQ) server mechanism we use for reporting results back from jobs running on the farm. There have been quite a few issues with it since we introduced it, and we have made some improvements to address them. For now we are trying these out with the TOP Deadline scheduler to get feedback.

There are now 3 MQ modes: Local, Farm, and Connect.
  • Local (default) creates the MQ server on the local machine (ideal if there is no firewall)
  • Farm creates an MQ job (with its own job settings) on the farm
  • Connect connects to an already running MQ server (only for the new MQ, see below)

Previously the TOP Deadline scheduler always created an MQ server running on the farm, either as a background process or as a task. Now it defaults to running the server locally on the submission machine (Local mode). This should reduce a lot of the MQ network and job scheduling issues.

If you do have firewalls between the farm machines and the submission machine, you can use the Farm mode to have the MQ server run on the farm, allowing the work tasks to report results to it. You can then open up limited ports through the firewalls from the MQ server to your submission machine, and specify those ports on the parm interface posted above. Note that the MQ server runs as a job, taking up a farm machine slot. To alleviate this, there are new parms to specify job settings just for the MQ job.

The above 2 modes work with the current MQ server. You can simply try that out and let us know how it goes.
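
As a rough decision aid, the guidance on modes can be sketched in Python. The function name and its two boolean inputs are hypothetical and not part of the scheduler; the returned strings simply mirror the three mode names.

```python
def suggest_mq_mode(firewall_between_farm_and_submitter,
                    shared_mq_server_running=False):
    """Suggest an MQ mode based on the guidance in this post.

    Hypothetical helper for illustration only. The real choice is made on
    the TOP Deadline scheduler's Message Queue parms.
    """
    if shared_mq_server_running:
        # Connect attaches to an already running (new) MQ server.
        return "Connect"
    if firewall_between_farm_and_submitter:
        # Farm runs the MQ server as its own job on the farm.
        return "Farm"
    # Local (the default) runs the MQ server on the submission machine.
    return "Local"
```

For example, `suggest_mq_mode(False)` gives "Local", while a firewalled setup without a shared server gives "Farm".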

If you would like to help test the new MQ server, please continue reading.

New MQ server
The new MQ server is wholly different from the current MQ. It uses a different network protocol, can run standalone, and supports relaying results from multiple PDG jobs. It is packaged as a standalone executable found at $HFS/bin/mqserver. While it is automatically managed in Local and Farm modes, in Connect mode, you'll need to manage it yourself.

To use it, first, set PDG_USE_PDGNET=1 in your environment before launching Houdini. This enables using the new MQ server with the TOP Deadline scheduler.
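
One way to set this without touching the system-wide environment is to launch Houdini from a small Python wrapper. This is only a minimal sketch: the `houdini` executable name is an assumption about what is on your PATH, and both function names are made up for illustration. Only the PDG_USE_PDGNET variable comes from this post.

```python
import os
import subprocess


def pdgnet_env(base_env=None):
    """Return a copy of the environment with the new MQ server enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["PDG_USE_PDGNET"] = "1"
    return env


def launch_houdini(executable="houdini"):
    # 'houdini' is an assumed executable name; adjust it for your install.
    return subprocess.Popen([executable], env=pdgnet_env())
```

The variable is set on a copy, so the parent shell's environment is left unchanged.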

Now try the Local and Farm modes via the TOP Deadline scheduler. If you have multiple TOP Deadline scheduler nodes, they will use the same MQ automatically (so there should only be 1 MQ job running on the farm for each graph).

Connect is a new mode that is only available with the new MQ server. It is meant for connecting to an existing MQ server and sharing it with other PDG jobs. To use it, you must start the MQ server manually on a machine that all farm machines can communicate with.

The new Connect mode and the standalone MQ server require a bit of setup.

Copy $HFS/bin/mqserver to the machine you want to run the server from (or launch it from your local machine).
Launch it with the following command:
mqserver -p 0 -n 128 -l 3 -w 0 128 result -s

After it launches, you'll see a log such as the following:
PDG_MQ 192.168.1.246 49175 49175 49176
## Message Queue Server Running

The address breakdown is as follows: IP, rpc port, relay port, http port

You'll need to enter the relay port into Relay Port on the TOP Deadline scheduler node's parm interface under Message Queue.
Similarly, enter the http port into the Task Callback Port.
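
If you start the server from a script, the PDG_MQ log line can be split into its fields so the right values end up in the right parms. This is a small sketch that assumes the line always has the four fields in the order given above (IP, rpc port, relay port, http port). The `MQAddress` and `parse_pdg_mq_line` names are made up for illustration.

```python
from collections import namedtuple

# Field order follows the breakdown above: IP, rpc port, relay port, http port.
MQAddress = namedtuple("MQAddress", "ip rpc_port relay_port http_port")


def parse_pdg_mq_line(line):
    """Parse a 'PDG_MQ <ip> <rpc> <relay> <http>' line into its fields."""
    parts = line.split()
    if len(parts) != 5 or parts[0] != "PDG_MQ":
        raise ValueError("not a PDG_MQ line: %r" % line)
    rpc_port, relay_port, http_port = (int(p) for p in parts[2:])
    return MQAddress(parts[1], rpc_port, relay_port, http_port)


addr = parse_pdg_mq_line("PDG_MQ 192.168.1.246 49175 49175 49176")
print("Relay Port:", addr.relay_port)
print("Task Callback Port:", addr.http_port)
```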

Now if you cook, it should use the MQ server. You'll see some logging about clients connecting and disconnecting.

We'll be improving the workflow and UX of this over time. The new MQ was released to get initial testing. Any help you can provide is appreciated. More TOP Deadline improvements are coming as well.

Attachments:
dl_newmq.png (20.2 KB)

Member | 571 posts | Joined: May 2017
Additional changes:
Jobs are automatically placed in a batch with a timestamped name. The batch contains 2 jobs with the following names:
* PDG TASKS (work item tasks)
* PDG MQ (mq job)
You can change the batch name and job names as you wish.

Removed running MQ as a background process.

Submit Graph as Job will now always use a local MQ (i.e. not a separate MQ job).

Work item tasks are scheduled right away instead of waiting for the MQ connection file first. This speeds up the initial Deadline scheduling time.

Improved handling of job cancellation and job failures.
Edited by seelan - March 4, 2020 14:16:53
Member | 120 posts | Joined: July 2005
Looking forward to trying this new MQ with HQueue, coming soon?

BTW I just got it (linux 18.0.401) up and running and I notice it sits at about 8-9% CPU while idle which seems excessive?

> mqserver -p 4999 -s -i 0.0.0.0
Edited by drew - March 10, 2020 01:51:41
Member | 120 posts | Joined: July 2005
A bit of extra info to the last post. Strace'ing indicates that a thread is being woken up, very often. Maybe that is the cause of the idle cpu load?

nanosleep({tv_sec=0, tv_nsec=100}, NULL) = 0
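
To get a feel for why such a short sleep could matter, here is a small Python illustration. It does not reproduce mqserver's code; it only assumes a thread that sleeps in a loop, and it counts how often that loop wakes up for the 100 ns interval seen in the strace output versus an arbitrary 10 ms comparison value.

```python
import time


def count_wakeups(sleep_seconds, window_seconds=0.05):
    """Count how often a sleep-in-a-loop wakes up within a time window."""
    wakeups = 0
    deadline = time.monotonic() + window_seconds
    while time.monotonic() < deadline:
        time.sleep(sleep_seconds)
        wakeups += 1
    return wakeups


fast = count_wakeups(100e-9)  # 100 ns, as in the strace output
slow = count_wakeups(10e-3)   # 10 ms, an arbitrary comparison value
print("wakeups with 100 ns sleep:", fast)
print("wakeups with 10 ms sleep:", slow)
```

The exact counts depend on the OS timer resolution, but the short interval wakes the loop far more often, which is consistent with the idle load described.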
Member | 571 posts | Joined: May 2017
Yes, that is a known issue and will be fixed, along with some other minor improvements to the MQ server. Thanks.
Member | 571 posts | Joined: May 2017
I'll be switching the TOP Deadline scheduler to use the new MQ server permanently in 2 weeks. Please test it in your farm setup before then so that we can deal with any issues in time.

All you really need to do for testing is to set PDG_USE_PDGNET=1 in your environment before launching Houdini.

Also, the mqserver's high cpu usage issue has been fixed and will be available in tomorrow's daily build.

Thanks.
Edited by seelan - April 8, 2020 11:47:55
Member | 7046 posts | Joined: July 2005
Hi, found this super useful thread. I'm hitting some issues that may be unrelated, but thought I'd give it a try.

I'm using 18.0.460 Win 10, I have 3 machines, 1x coord and 2x workers, shared Q: drive on all three, Deadline set up and working fine with Python test scripts on all machines.

I'm using Submit Graph As Job as I don't expect or really want interactivity, just execution of the static TOPs graph. Current test is a single ROP Geometry Output. This works fine with localscheduler.

I've set the PDG_USE_PDGNET env var to 1 on coord and client machines just to be safe.

I've tried both Local and Farm modes in the Message Queue section.

When I submit the job (Submit Graph As Job) it shows up correctly in Deadline, but errors on Workers with:
2020-10-06 17:51:12:  0: STDOUT: Running Houdini 18.0.460 with PID 4980
2020-10-06 17:51:12: 0: STDOUT: PDG: Loading .hip file q://aws_testGrid18.0.460py27_v01.hip.
2020-10-06 17:51:13: 0: STDOUT: PDG: Setting batch sub index to 0
2020-10-06 17:51:13: 0: STDOUT: PDG_START: noisegrid_ropfetch10;0
2020-10-06 17:51:34: 0: STDOUT: ERROR: The attempted operation failed.
2020-10-06 17:51:34: 0: STDOUT: Error: Python error: Traceback (most recent call last):
2020-10-06 17:51:34: 0: STDOUT: File "<stdin>", line 44, in <module>
2020-10-06 17:51:34: 0: STDOUT: File "q:\pdgtemp\4348\scripts\pdgcmd.py", line 409, in execStartCook
2020-10-06 17:51:34: 0: STDOUT: _invokeRpc(s, "start_cook_batch", item_name, subindex, theJobid)
2020-10-06 17:51:34: 0: STDOUT: File "q:\pdgtemp\4348\scripts\pdgcmd.py", line 258, in _invokeRpc
2020-10-06 17:51:34: 0: STDOUT: return _invokeRpcFn(fn, fn_name, *args)
2020-10-06 17:51:34: 0: STDOUT: File "q:\pdgtemp\4348\scripts\pdgcmd.py", line 262, in _invokeRpcFn
2020-10-06 17:51:34: 0: STDOUT: return _invokePDGNetRpcFn(fn, fn_name, *args)
2020-10-06 17:51:34: 0: STDOUT: File "q:\pdgtemp\4348\scripts\pdgcmd.py", line 276, in _invokePDGNetRpcFn
2020-10-06 17:51:34: 0: STDOUT: if fn.lasterror == 5 or fn.lasterror == 6:
2020-10-06 17:51:34: 0: STDOUT: AttributeError: 'PDGNetRPCMessage' object has no attribute 'lasterror'

Any thoughts?

Cheers,

Peter B
Edited by pbowmar - Oct. 6, 2020 14:00:58
Cheers,

Peter Bowmar
____________
Houdini 20.5.262 Win 10 Py 3.11
Member | 7046 posts | Joined: July 2005
I also found this line, did the env var from this thread change?

# New MQ using PDGNet for jobs
use_pdgnet = os.environ.get('PDG_JOBUSE_PDGNET', '0') == '1'
Member | 603 posts | Joined: Sept. 2016
pbowmar
I also found this line, did the env var from this thread change?

# New MQ using PDGNet for jobs
use_pdgnet = os.environ.get('PDG_JOBUSE_PDGNET', '0') == '1'

There are two env vars. PDG_JOBUSE_PDGNET is sent internally to the job to tell it to use PDGNet instead of the XMLRPC MQ. PDG_USE_PDGNET tells Houdini to enable that, in versions before it became the default.
Member | 603 posts | Joined: Sept. 2016
pbowmar
When I submit the job (Submit Graph As Job) it shows up correctly in Deadline, but errors on Workers with:
[...]
AttributeError: 'PDGNetRPCMessage' object has no attribute 'lasterror'

Could you try setting PDG_USE_PDGNET=0 in your Houdini environment? If it's a general networking issue, the error should show up in that mode as well.
Member | 7046 posts | Joined: July 2005
Exact same error with PDG_USE_PDGNET set to 0 on all machines…
Member | 7046 posts | Joined: July 2005
On these machines I do have the ability to open ports between them, in case that helps?
Member | 603 posts | Joined: Sept. 2016
pbowmar
Exact same error with PDG_USE_PDGNET set to 0 on all machines…

It shouldn't be the exact same error. I would expect the RPC to time out, but the message would not mention PDGNetRPCMessage, because that is specific to the 'PDGNET' mode.

In any case, for submit-as-job you need to ensure that the Task Callback Port is opened between farm machines, so that the RPCs can reach PDG.

https://www.sidefx.com/docs/houdini/tops/farm_troubleshooting.html#work-items-fail-to-report-results-due-to-connection-refused-or-time-out
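
A quick way to check whether a port is actually open between two machines is to try a TCP connection from a farm machine. The sketch below assumes the port accepts plain TCP connections; `port_is_reachable` is a made-up helper name, and the host and port you pass in are whatever your scheduler is configured to use.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Run it on a worker against the machine hosting PDG or the MQ server. A False result while the server is running points at a firewall or routing problem rather than at PDG itself.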
Member | 7046 posts | Joined: July 2005
Thanks Chris,

Yes, apologies, I misread the error, it is now:

2020-10-06 23:33:40:  0: STDOUT: 23:33:40: Failed RPC start_cook_batch with error: PDGnet RPC send-get-reply failed. (error 268435577: MQ error #268435577). Retry 1/4
2020-10-06 23:34:01: 0: STDOUT: 23:34:01: Failed RPC start_cook_batch with error: PDGnet RPC send-get-reply failed. (error 268435577: MQ error #268435577). Retry 2/4
2020-10-06 23:34:22: 0: STDOUT: 23:34:22: Failed RPC start_cook_batch with error: PDGnet RPC send-get-reply failed. (error 268435577: MQ error #268435577). Retry 3/4
2020-10-06 23:34:43: 0: STDOUT: 23:34:43: Failed RPC start_cook_batch with error: PDGnet RPC send-get-reply failed. (error 268435577: MQ error #268435577). Retry 4/4
2020-10-06 23:35:04: 0: STDOUT: 23:35:04: Failed RPC start_cook_batch with error: PDGnet RPC send-get-reply failed. (error 268435577: MQ error #268435577)
2020-10-06 23:35:04: 0: STDOUT: 23:35:04: Failed RPC to 10.99.29.159:1025: start_cook_batch ('noisegrid_ropfetch120', 0, '5f7cfeae8bff1d09f082e4e3')
2020-10-06 23:35:04: 0: STDOUT: ERROR: The attempted operation failed.
2020-10-06 23:35:04: 0: STDOUT: Error: Python error: Traceback (most recent call last):
2020-10-06 23:35:04: 0: STDOUT: File "<stdin>", line 17, in <module>
2020-10-06 23:35:04: 0: STDOUT: File "q:\pdgtemp\4320\scripts\rop.py", line 223, in preFrame
2020-10-06 23:35:04: 0: STDOUT: execStartCook(item_name, batch_index, server_addr)
2020-10-06 23:35:04: 0: STDOUT: File "q:\pdgtemp\4320\scripts\pdgcmd.py", line 502, in execStartCook
2020-10-06 23:35:04: 0: STDOUT: _invokeRpc(s, "start_cook_batch", item_name, subindex, theJobid)
2020-10-06 23:35:04: 0: STDOUT: File "q:\pdgtemp\4320\scripts\pdgcmd.py", line 297, in _invokeRpc
2020-10-06 23:35:04: 0: STDOUT: return _invokeRpcFn(fn, fn_name, *args)
2020-10-06 23:35:04: 0: STDOUT: File "q:\pdgtemp\4320\scripts\pdgcmd.py", line 303, in _invokeRpcFn
2020-10-06 23:35:04: 0: STDOUT: result = _invokePDGNetRpcFn(fn, fn_name, *args)
2020-10-06 23:35:04: 0: STDOUT: File "q:\pdgtemp\4320\scripts\pdgcmd.py", line 342, in _invokePDGNetRpcFn
2020-10-06 23:35:04: 0: STDOUT: raise RuntimeError(msg)
2020-10-06 23:35:04: 0: STDOUT: RuntimeError: Failed RPC to 10.99.29.159:1025: start_cook_batch ('noisegrid_ropfetch120', 0, '5f7cfeae8bff1d09f082e4e3')

Turning off Windows Defender Firewall allows everything to work, but I'm guessing my employer will not think that's a good idea

I did try the Task Callback Port and Relay Ports, which are opened in the AWS security group, but that didn't seem to help. Also the docs say that is only for “Connect” MQ mode, not Farm or Local. Looking at the error, it looks like my opening port 1025 should have helped? So it's not the AWS rule, but the Windows firewall itself?

So… still stumped, though I can turn off the Firewalls temporarily to get my testing done. The docs make it sound like the MQ is meant to avoid the Firewall issue, but I clearly don't understand it. I am in no way a firewall/networking expert, this is the first time in my career I've ever had to deal with a network that wasn't hermetically sealed
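
The endpoint the worker could not reach is in the "Failed RPC to" line of the log above, so it can be pulled out with a small script instead of by eye. This is only a sketch: the regular expression is an assumption based on the single log line shown here, and `failed_rpc_endpoint` is a made-up helper name.

```python
import re

# Pattern based on the one "Failed RPC to host:port:" line in the log above.
_FAILED_RPC = re.compile(r"Failed RPC to (?P<host>[\w.\-]+):(?P<port>\d+):")


def failed_rpc_endpoint(log_line):
    """Return (host, port) from a 'Failed RPC to' log line, or None."""
    match = _FAILED_RPC.search(log_line)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))


line = ("2020-10-06 23:35:04: 0: STDOUT: 23:35:04: Failed RPC to "
        "10.99.29.159:1025: start_cook_batch ('noisegrid_ropfetch120', 0, "
        "'5f7cfeae8bff1d09f082e4e3')")
print(failed_rpc_endpoint(line))
```

The host and port it returns are the ones that need to be reachable from the worker, through both the cloud security group and the OS firewall on the target machine.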

Cheers,

Peter B
Edited by pbowmar - Oct. 6, 2020 19:39:55
Member | 7046 posts | Joined: July 2005
Sigh. Opening ports 1024 and 1025 in Defender Firewall on the MQserver machine makes it work. I appreciate you helping with this, can I suggest that the docs be updated so it's clear that it's required to open those ports, whereas now the docs say “Only needed Type is set to Connect” which made me not pay attention since I was not using Connect

Cheers,

Peter B
Edited by pbowmar - Oct. 6, 2020 19:48:41
Member | 46 posts | Joined: July 2009
Hello,

Does using the MQ server still require one "manager" task running in Deadline?
I was hoping it's unnecessary when using an MQ server.

I submitted the following graph (see attached submitGraph.png).

In Deadline it shows like this (see attached deadline2.png).


The "stagemanger" task is blocking one machine.
I would like to submit only the usd and karma tasks and manage them via the MQ server.
All three DeadlineSubmitter are set to "connect". The MQ-server is running and logging jobs/machines.

I assume I'm doing something wrong or is that behaviour intended?

Thanks a lot in advance,
Cheers Louis
Edited by louisx - Sept. 23, 2021 06:20:31

Attachments:
deadline2.png (21.8 KB)
submitGraph.png (34.8 KB)
