Please help me understand PDGDeadline

Member
38 posts
Joined: Aug. 2017
Offline
Through trial and error I just got PDG to submit a job to Deadline using the Deadline scheduler, and it works. Things are being distributed and rendered.

From what I see in the Monitor, one worker picks up the job and then splits it into tasks for other workers to pick up. That first worker, however, keeps working on the initial job and doesn't switch to one of the tasks it created. It doesn't render anything, but it's not idle either.

I found a thread where someone's solution was to create a new worker on the same machine using a command deadlineworker -name "instance-01"
And it does work, but now we have two permanent workers on the same machine and it makes a mess.

So my questions are, what is happening and can we do something about it to make it cleaner?
Member
40 posts
Joined: May 2019
Offline
So the supervisor job stays open until the supervised jobs are completed. How is this surprising?
There are options to supervise from the *submitting Houdini*, and thus no additional job is created.

If you go to Deadline scheduler -> Message Queue -> Type and set it to Local, that should work.
Member
38 posts
Joined: Aug. 2017
Offline
@monomon Back when I wrote this topic I was taking my first steps into PDG Deadline. I didn't understand why we can't just operate normally without that additional monitor. Now I know just how much it brings to the table by managing things the way Deadline is not capable of. I learned it can cache substeps or run subsequent tasks as soon as the first frames of a simulation have finished (the sim being one big batch). And I'm sure there's more.

I appreciate your advice, but these days I have come to appreciate the monitor job. I only wish I knew how to assign it to a different Deadline group than the tasks job.
Member
40 posts
Joined: May 2019
Offline
Hey, thanks for your perspective. We are actually just getting into PDG.
Do you happen to know if *different* Deadline schedulers can be made as dependencies?
btw how do you go about setting frame dependencies as you mentioned?
Member
38 posts
Joined: Aug. 2017
Offline
Are you asking about submitting multiple separate dependent jobs to Deadline from PDG?
My solution is to add a "JobDependencies" key in the Job File Key-Values tab of the Deadline Scheduler and give it the id via Python. I automated it by rewriting the Deadline Scheduler's Submit As Job button function (available in its Python Module) so it stores the ID in a variable upon submission. And then I just chain-submit different schedulers through code.
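A minimal sketch of that chaining idea in plain Python, outside of Houdini. The `submit` callable is hypothetical and stands in for whatever actually submits the scheduler and returns the Deadline job ID; the sketch only assumes that `JobDependencies` takes a comma-separated list of job IDs.

```python
def with_dependencies(job_info, parent_job_ids):
    """Return a copy of job_info with JobDependencies set to the given IDs."""
    updated = dict(job_info)
    ids = [job_id for job_id in parent_job_ids if job_id]
    if ids:
        # Assumption: Deadline reads this key as a comma-separated ID list.
        updated["JobDependencies"] = ",".join(ids)
    return updated


def chain_submit(jobs, submit):
    """Submit jobs in order, making each one depend on the previous one.

    `submit` is a hypothetical callable: it takes a key-value dict and
    returns the ID of the job it created.
    """
    previous_id = None
    submitted = []
    for job_info in jobs:
        info = with_dependencies(job_info, [previous_id])
        previous_id = submit(info)
        submitted.append((previous_id, info))
    return submitted
```

The first job gets no `JobDependencies` key at all; every later job carries the ID of the job submitted just before it.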

As for the frame dependencies it seems to work like this by default. No need to change anything. I'm attaching a GIF of two ROP Geometry Output TOPs. The first is set to All Frames in One Batch, the second is set to 1 frame per batch. This setup translates to Deadline as well where this job first starts with a single task and then creates new tasks as the first frames finish caching.
Edited by alexmajewski - Aug. 6, 2024 06:47:25

Attachments:
pdg_dependent_cache.gif (83.3 KB)
pdg_deadline.gif (41.3 KB)

Member
40 posts
Joined: May 2019
Offline
Thanks again. This seems relevant to the workflow we want to have in the end.

I actually tested now, after reading this post:
https://www.sidefx.com/forum/topic/95136/
and it seems that adding one DL Scheduler for the overall process, then separate DL Schedulers for the sub-jobs, works!
For our use case it is preferable that it all runs on the farm, because people cannot leave their Houdini running overnight (and it might be unreliable).
Member
48 posts
Joined: Nov. 2017
Offline
Hey, I hope I can add further questions to this. I'm using the Deadline Scheduler for the first time in TOPs and I find the concept quite hard to grasp. Using 20.5.278, I want to render my Karma stage with several wedge options to control light rigs/shots.
I was expecting that, when cooking the output node, all work items would be submitted under a single job.

This is an extract from the deadlinescheduler docs.
With this scheduling type, the node also schedules one main job with a task for each work item generated

In my case, it creates a new deadline job housing a single frame.

Either something is broken, or I'm missing a setting to have a single job, with the work items as frames.



HIP file attached



Any help with this would be highly appreciated! I'm also on Houdini-and-Chill / Houdini Academy discord servers - arvidurs.

Attachments:
Screenshot 2024-08-11 221550.png (59.7 KB)
rnd_test_deadlineTOPS_v003.hip (793.1 KB)

--
IG: www.instagram.com/arvidschneider
YT: www.youtube.com/arvidschneider
X: www.twitter.com/arvidurs

Lighting Supervisor at Image Engine
Staff
1284 posts
Joined: July 2005
Offline
arvidurs
In my case, it creates a new deadline job housing a single frame.

Hello,

This is the expected behavior in Houdini 20.5 -- a job is created for each work item. Prior to H20.5, the Deadline scheduler created a single job containing many tasks, one task per work item. However, based on discussions with the AWS Thinkbox Deadline team, we were advised to switch to a single job per work item. This avoids a race condition in Deadline that caused PDG tasks to be randomly dropped from the farm. We received several reports from users about it.

There is a brief blurb about the behavior change in the H20.5 What's New help page (https://www.sidefx.com/docs/houdini/news/20_5/pdg.html):
Deadline Scheduler TOP now submits work items as jobs instead of tasks.


arvidurs
This is an extract from the deadlinescheduler docs.
With this scheduling type, the node also schedules one main job with a task for each work item generated

Oh thanks for pointing this out. That sentence is now outdated. I'll update the help page and add the information that I mentioned here.

Cheers,
Rob
Member
48 posts
Joined: Nov. 2017
Offline
Hi Rob! Thanks for clarifying this! It was driving me nuts.
Now my question to you, is it possible to let the user decide that? I mean, having an option somewhere that allows to spawn frames or jobs for work-items?
Because quite frankly having houdini deadline spawn 1000s of jobs is quite terrible if you ask me.
Especially in a shot context, imagine a shot having several render layers, lots of frames, this will become a nightmare really quick.

Is there a workaround for now, maybe using the 20.0 deadline scheduler plugin?

Arvid
Member
40 posts
Joined: May 2019
Offline
Thanks Rob. I think I understand the race condition (saving stale copies of the job concurrently), however wouldn't it be possible to batch work items and submit them at once as separate jobs? This behavior does seem odd to Deadline users.
Staff
1284 posts
Joined: July 2005
Offline
arvidurs
Hi Rob! Thanks for clarifying this! It was driving me nuts.
Now my question to you, is it possible to let the user decide that? I mean, having an option somewhere that allows to spawn frames or jobs for work-items?
Because quite frankly having houdini deadline spawn 1000s of jobs is quite terrible if you ask me.
Especially in a shot context, imagine a shot having several render layers, lots of frames, this will become a nightmare really quick.

Is there a workaround for now, maybe using the 20.0 deadline scheduler plugin?

Arvid

There's currently no built-in way to switch back into the old behaviour of one-task-per-work-item. We could add a toggle to the Deadline Scheduler TOP node and let the user decide with the caveat that choosing one-task-per-work-item could result in tasks mysteriously disappearing from the farm. I'm not sure how easy or difficult it would be to backport such a change to Houdini 20.5. I'd have to see first how safely the two modes can live together side-by-side in the code.

Hmmm. You might be able to use the 20.0 Deadline Scheduler in H20.5 to get back the old behaviour. You wouldn't need the PDGDeadline plugin. You would just need the Deadline Scheduler HDA and supporting Python modules. So you can try this:
  1. In Houdini 20.0, put down a Deadline Scheduler TOP node.
  2. Save the Deadline Scheduler to a new HDA on disk (i.e. RMB-click node -> Digital Assets -> Save a Copy). Save the .hda to your Houdini 20.5 user preferences directory (e.g. $HOME/houdini20.5/otls/deadlinescheduler.hda).
  3. Copy the Houdini 20.0 Deadline Scheduler Python modules, tbdeadline.py and tbdeadline_utils.py, from $HFS/houdini/pdg/types/schedulers and then paste the modules in your Houdini 20.5 user preferences directory (e.g. $HOME/houdini20.5/pdg/types/schedulers).
  4. In Houdini 20.5, put down a Deadline Scheduler TOP node but make sure it is the one that's saved in your user preferences directory.

I haven't tried the above steps myself so you may run into some issues but in theory it should work.
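Steps 2 and 3 could be scripted roughly like this. It is an untested-in-production sketch, and it assumes the HDA has already been saved from Houdini 20.0 (step 2's "Save a Copy" still has to be done by hand). The function name and its arguments are made up for illustration; only the file names and directory layout come from the steps above.

```python
import shutil
from pathlib import Path

# Module names as listed in step 3.
MODULES = ("tbdeadline.py", "tbdeadline_utils.py")


def backport_scheduler(hfs_20_0, saved_hda, prefs_20_5):
    """Copy the saved 20.0 HDA and scheduler modules into the 20.5 prefs dir.

    hfs_20_0   -- the Houdini 20.0 install directory ($HFS)
    saved_hda  -- the .hda saved from Houdini 20.0 in step 2
    prefs_20_5 -- the Houdini 20.5 user preferences directory
    """
    prefs = Path(prefs_20_5)
    otls_dir = prefs / "otls"
    sched_dir = prefs / "pdg" / "types" / "schedulers"
    otls_dir.mkdir(parents=True, exist_ok=True)
    sched_dir.mkdir(parents=True, exist_ok=True)

    copied = [shutil.copy2(saved_hda, otls_dir / "deadlinescheduler.hda")]
    source_dir = Path(hfs_20_0) / "houdini" / "pdg" / "types" / "schedulers"
    for name in MODULES:
        copied.append(shutil.copy2(source_dir / name, sched_dir / name))
    return [Path(p) for p in copied]
```

Existing files with the same names in the preferences directory are overwritten, so back them up first if you already have custom scheduler modules there.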

Cheers,
Rob
Staff
1284 posts
Joined: July 2005
Offline
monomon
Thanks Rob. I think I understand the race condition (saving stale copies of the job concurrently), however wouldn't it be possible to batch work items and submit them at once as separate jobs? This behavior does seem odd to Deadline users.

The race condition has to do with dynamically adding tasks to a live job, which is what the Houdini 20.0 PDG Deadline Scheduler did. Deadline's MongoDB backend is not equipped to handle adding tasks to a live job. If the live job tries to write to the MongoDB backend for any reason (e.g. writing progress updates) while PDG asks Deadline to add a task to the job, then the task never gets created, and you end up with missing tasks in the job. This is not the case when you add new jobs while other jobs are live.
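To make the failure mode concrete, here is a toy simulation of one way such a race can lose a task. It is not Deadline's code and makes no claim about how Deadline talks to MongoDB; the `JobStore` class and the whole-document read/write pattern are assumptions chosen purely for illustration.

```python
import copy


class JobStore:
    """Toy store that only supports whole-document reads and writes."""

    def __init__(self):
        self._doc = {"tasks": [], "progress": 0}

    def read(self):
        return copy.deepcopy(self._doc)

    def write(self, doc):
        self._doc = copy.deepcopy(doc)


def interleaved_update(store):
    """Interleave a progress update with a task insertion on one job."""
    progress_copy = store.read()    # live job reads, about to report progress
    scheduler_copy = store.read()   # scheduler reads, about to add a task

    scheduler_copy["tasks"].append("work_item_7")
    store.write(scheduler_copy)     # the new task is stored...

    progress_copy["progress"] = 50
    store.write(progress_copy)      # ...then the stale copy overwrites it
    return store.read()
```

After `interleaved_update(JobStore())` the progress value is stored but the task list is empty again: the added task is gone without any error being raised. Submitting a separate job never touches the live job's document, which is why that path is not affected.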

We thought about batching work items and submitting them as separate jobs rather than submitting one job per work item. The issue is that PDG does not submit work items to the farm in regular batches. It all depends on what work items are available at the time PDG is ready to submit work to the farm. It may have one work item available, one hundred work items available or anything in between. So you would end up with jobs on the farm containing a varying number of tasks. At least with one job per work item, it's consistent and you know that each job contains exactly one task and represents exactly one work item. That also makes it easier to set a meaningful label on the job; the label can be set to contain the info of the work item that the job represents.

I think moving forward adding a toggle to switch between one-job-per-work-item (new behaviour) and single-job-for-all-work-items (old behaviour) makes the most sense.

Cheers,
Rob
Member
48 posts
Joined: Nov. 2017
Offline
Again, thanks Rob for the in depth help and the backport steps. I'll give that a go.
Having a toggle would be great if that's technically doable!
Member
40 posts
Joined: May 2019
Offline
Awesome information here.

My point was more (and I have no idea how PDG works internally) about serially submitting batches, that are "debounced", accumulated over some time period. Then, something akin to the MQ server would serialize the submission, ensuring there are no concurrent submissions.
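A rough sketch of the kind of debouncing meant here, assuming nothing about PDG internals. The class name, the `submit_batch` callback and the explicit `now` argument are all hypothetical; the clock is passed in so the behaviour is easy to follow and test.

```python
class DebouncedSubmitter:
    """Accumulate work items and submit them once arrivals go quiet.

    submit_batch  -- hypothetical callable that receives a list of items
    quiet_seconds -- how long to wait after the last arrival before flushing
    """

    def __init__(self, submit_batch, quiet_seconds):
        self._submit_batch = submit_batch
        self._quiet_seconds = quiet_seconds
        self._pending = []
        self._last_add = None

    def add(self, work_item, now):
        """Queue an item and restart the quiet-period timer."""
        self._pending.append(work_item)
        self._last_add = now

    def poll(self, now):
        """Flush if the quiet period has elapsed; return how many were sent."""
        if not self._pending or now - self._last_add < self._quiet_seconds:
            return 0
        batch, self._pending = self._pending, []
        self._submit_batch(batch)
        return len(batch)
```

Because only `poll` ever calls `submit_batch`, and it is called from a single loop, two submissions can never overlap. Batch sizes still vary with whatever arrived during the window.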

I think there is no point in allowing behavior that is known to be broken. So I wouldn't expose the old behavior.

By the way is there a way to learn more about the compiled core of PDG? Maybe a header file, or documentation.
Staff
1284 posts
Joined: July 2005
Offline
monomon
Awesome information here.

My point was more (and I have no idea how PDG works internally) about serially submitting batches, that are "debounced", accumulated over some time period. Then, something akin to the MQ server would serialize the submission, ensuring there are no concurrent submissions.

I think there is no point in allowing behavior that is known to be broken. So I wouldn't expose the old behavior.

By the way is there a way to learn more about the compiled core of PDG? Maybe a header file, or documentation.

Sorry for the late reply.

I'll keep your points in mind but I may end up adding the toggle anyway. What's interesting is that the race condition only occurs for a subset of users. There are users that never hit the issue (perhaps dumb luck?). So I could see the toggle being useful for them in order to get back to the old behaviour without any downside.

I checked with our lead PDG developer and he pointed out that there are PDG headers in the HDK ($HFS/toolkit/include/PDG/*.h). There's also high level documentation in the HDK docs with a bit of information on different core constructs, cooking, etc and an example of writing a custom partitioner:
https://www.sidefx.com/docs/hdk/_h_d_k__op_basics__top_intro.html

The HDK headers also contain a lot of comments.

I hope that helps.

Cheers,
Rob
Member
85 posts
Joined: Nov. 2017
Offline
Is it possible that toggle gets added quickly? Current state is unworkable unfortunately
Staff
1284 posts
Joined: July 2005
Offline
HristoVelev
Is it possible that toggle gets added quickly? Current state is unworkable unfortunately

Hello,

Adding a toggle to restore the old task-per-frame behavior is something that we are considering, but it is not something that we can guarantee at this point, let alone add quickly. A lot of code was changed in the Deadline Scheduler to adapt to the new one-job-per-frame behavior, including parameter changes. There's risk involved in adding support for both behaviors. We are also weighing the feedback given by the Thinkbox team and other users on adding such a toggle.

Cheers,
Rob
Member
85 posts
Joined: Nov. 2017
Offline
Ok, I see. We've started on writing our own scheduler, based on the 20.5 one, let's see how that goes. Thanks!
Member
8886 posts
Joined: July 2007
Offline
Honestly, this sounds like an issue that would ideally be solved on the Deadline side, so that it can receive dynamic task creation calls, queue them properly, and create additional tasks within the job without race conditions.

I don't know much about Deadline (I may have to switch to it soon, though) or whether representing work items as tasks has other disadvantages, but creating a new job per work item sounds like a dirty workaround and a nightmare to manage.
Tomas Slancik
FX Supervisor
Method Studios, NY
Member
85 posts
Joined: Nov. 2017
Offline
Job per work item works, if you wrap all jobs in a partition. Then if this partition has an attribute about frame range, that can be passed to Deadline so it creates tasks per frame. We'll try writing that into our scheduler.
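As a rough illustration of that idea (not the scheduler code itself), the sketch below groups work items by partition and turns each partition's frames into a frame-list string. The function names and the `(partition_name, frame)` input shape are invented for the example, and it assumes Deadline accepts comma-separated ranges such as `1-3,5` in its `Frames` key. Negative frame numbers are not handled.

```python
def frame_list(frames):
    """Compress frame numbers into a string such as '1-3,5,7-8'."""
    ordered = sorted(set(frames))
    if not ordered:
        return ""
    ranges = []
    start = prev = ordered[0]
    for frame in ordered[1:]:
        if frame == prev + 1:
            prev = frame
            continue
        ranges.append((start, prev))
        start = prev = frame
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def jobs_from_partitions(work_items):
    """Build one job description per partition.

    work_items -- iterable of (partition_name, frame) pairs (hypothetical shape)
    """
    grouped = {}
    for name, frame in work_items:
        grouped.setdefault(name, []).append(frame)
    return {
        name: {"Name": name, "Frames": frame_list(frames)}
        for name, frames in grouped.items()
    }
```

With this shape, a shot with several render layers ends up as one job per layer, each with a task per frame, instead of one job per work item.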