Please help me understand PDGDeadline

Member
20 posts
Joined: Aug. 2017
Through trial and error I just got PDG to submit a job to Deadline using the Deadline scheduler, and it works. Things are being distributed and rendered.

And from what I see in the Monitor, it looks like one worker picks up the job, then splits it into tasks for other workers to pick up. That first worker however, it keeps on working on that initial job, it doesn't switch to one of those tasks it created. It doesn't render anything but it's not idle either.

I found a thread where someone's solution was to create a second Worker on the same machine using the command deadlineworker -name "instance-01".
And it does work, but now we have two permanent Workers on the same machine, and it makes a mess.

So my questions are, what is happening and can we do something about it to make it cleaner?
Member
39 posts
Joined: May 2019
So the supervisor job stays open until the supervised jobs are completed. How is this surprising?
There are options to supervise from the *submitting Houdini*, and thus no additional job is created.

If you go to Deadline scheduler -> Message Queue -> Type and set it to Local, that should work.
Member
20 posts
Joined: Aug. 2017
@monomon Back when I wrote this topic I was taking my first steps into PDG Deadline. I didn't understand why we couldn't just operate normally without that additional monitor job. Now I know just how much it brings to the table by managing things in ways Deadline is not capable of. I learned it can cache substeps, or run subsequent tasks as soon as the first frames of a simulation have finished (the sim being one big batch). And I'm sure there's more.

I appreciate your advice, but these days I've found appreciation for the monitor job. I only wish I knew how to assign it to a different Deadline group than the task jobs.
Member
39 posts
Joined: May 2019
Hey, thanks for your perspective. We are actually just getting into PDG.
Do you happen to know if *different* Deadline schedulers can be set up as dependencies?
By the way, how do you go about setting up the frame dependencies you mentioned?
Member
20 posts
Joined: Aug. 2017
Are you asking about submitting multiple separate dependent jobs to Deadline from PDG?
My solution is to add a "JobDependencies" key in the Job File Key-Values tab of the Deadline Scheduler and give it the job ID via Python. I automated it by rewriting the Deadline Scheduler's Submit As Job button function (available in its Python Module) so it stores the ID in a variable upon submission. And then I just chain-submit different schedulers through code.
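The chaining logic looks roughly like this in plain Python. To be clear, the helper names below are just for illustration and not actual Houdini or Deadline API; in practice the key-value pair ends up on the downstream scheduler's Job File Key-Values multiparm:

```python
# Hypothetical sketch of chain-submitting dependent Deadline Scheduler TOPs.
# record_submission() would be called from a patched Submit As Job callback;
# job_file_key_values() builds the extra key-value entry for the next
# scheduler so its Deadline job waits on the upstream job.

submitted_job_ids = {}  # scheduler name -> Deadline job ID


def record_submission(scheduler_name, job_id):
    """Store the Deadline job ID captured at submission time."""
    submitted_job_ids[scheduler_name] = job_id


def job_file_key_values(upstream_scheduler):
    """Key-values for a downstream scheduler's Job File Key-Values tab."""
    return {"JobDependencies": submitted_job_ids[upstream_scheduler]}


# Example: after the first scheduler submits, record its ID, then feed it
# to the second scheduler before chain-submitting it.
record_submission("sim_scheduler", "651a2b0c...")  # ID is a placeholder
print(job_file_key_values("sim_scheduler"))
```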

As for the frame dependencies, this seems to work by default; no need to change anything. I'm attaching a GIF of two ROP Geometry Output TOPs. The first is set to All Frames in One Batch, the second to 1 frame per batch. This setup translates to Deadline as well: the job first starts with a single task and then creates new tasks as the first frames finish caching.
Edited by alexmajewski - Aug. 6, 2024 06:47:25

Attachments:
pdg_dependent_cache.gif (83.3 KB)
pdg_deadline.gif (41.3 KB)

Member
39 posts
Joined: May 2019
Thanks again. This seems relevant to the workflow we want to have in the end.

I actually tested this now, after reading this post:
https://www.sidefx.com/forum/topic/95136/
and it seems that adding one Deadline Scheduler for the overall process, then separate Deadline Schedulers for the sub-jobs, works!
For our use case it is preferable that it all runs on the farm, because people cannot leave their Houdini running overnight (and it might be unreliable).
Member
48 posts
Joined: Nov. 2017
Hey, I hope I can add further questions to this. I'm using the Deadline Scheduler for the first time in TOPs and I find the concept quite hard to grasp. Using 20.5.278, I want to render my Karma stage with several wedge options to control light rigs/shots.
I was expecting that when cooking the output node, each work item would be submitted as a task under a single job.

This is an extract from the Deadline Scheduler docs:
With this scheduling type, the node also schedules one main job with a task for each work item generated

In my case, it creates a new deadline job housing a single frame.

Either something is broken, or I'm missing a setting that gives me a single job with the work items as frames.



HIP file attached
Any help with this would be highly appreciated! I'm also on the Houdini-and-Chill / Houdini Academy Discord servers - arvidurs.

Attachments:
Screenshot 2024-08-11 221550.png (59.7 KB)
rnd_test_deadlineTOPS_v003.hip (793.1 KB)

--
IG: www.instagram.com/arvidschneider
YT: www.youtube.com/arvidschneider
X: www.twitter.com/arvidurs

Lighting Supervisor at Image Engine
Staff
1267 posts
Joined: Jul. 2005
arvidurs
In my case, it creates a new deadline job housing a single frame.

Hello,

This is the expected behavior in Houdini 20.5 -- a job is created for each work item. Prior to H20.5, the Deadline scheduler created a single job containing many tasks, one task per work item. However, based on discussions with the AWS Thinkbox Deadline team, we were advised to switch to having a single job per work item. This was to avoid a race condition in Deadline that caused PDG tasks to be randomly dropped from the farm; we received several reports from users about it.

There is a brief blurb about the behavior change in the H20.5 What's New help page (https://www.sidefx.com/docs/houdini/news/20_5/pdg.html):
Deadline Scheduler TOP now submits work items as jobs instead of tasks.


arvidurs
This is an extract from the deadlinescheduler docs.
With this scheduling type, the node also schedules one main job with a task for each work item generated

Oh thanks for pointing this out. That sentence is now outdated. I'll update the help page and add the information that I mentioned here.

Cheers,
Rob
Member
48 posts
Joined: Nov. 2017
Hi Rob! Thanks for clarifying this! It was driving me nuts.
Now my question to you: is it possible to let the user decide? I mean, having an option somewhere that allows choosing between spawning tasks or jobs for work items?
Because quite frankly, having the Houdini Deadline scheduler spawn 1000s of jobs is quite terrible if you ask me.
Especially in a shot context: imagine a shot with several render layers and lots of frames; this will become a nightmare really quick.

Is there a workaround for now, maybe using the 20.0 deadline scheduler plugin?

Arvid
Member
39 posts
Joined: May 2019
Thanks Rob. I think I understand the race condition (saving stale copies of the job concurrently), but wouldn't it be possible to batch work items and submit each batch as a separate job? This behavior does seem odd to Deadline users.
Staff
1267 posts
Joined: Jul. 2005
arvidurs
Hi Rob! Thanks for clarifying this! It was driving me nuts.
Now my question to you, is it possible to let the user decide that? I mean, having an option somewhere that allows to spawn frames or jobs for work-items?
Because quite frankly having houdini deadline spawn 1000s of jobs is quite terrible if you ask me.
Especially in a shot context, imagine a shot having several render layers, lots of frames, this will become a nightmare really quick.

Is there a workaround for now, maybe using the 20.0 deadline scheduler plugin?

Arvid

There's currently no built-in way to switch back to the old behaviour of one-task-per-work-item. We could add a toggle to the Deadline Scheduler TOP node and let the user decide, with the caveat that choosing one-task-per-work-item could result in tasks mysteriously disappearing from the farm. I'm not sure how easy or difficult it would be to backport such a change to Houdini 20.5; I'd have to see first how safely the two modes can live side by side in the code.

Hmmm. You might be able to use the 20.0 Deadline Scheduler in H20.5 to get back the old behaviour. You wouldn't need the PDGDeadline plugin. You would just need the Deadline Scheduler HDA and supporting Python modules. So you can try this:
  1. In Houdini 20.0, put down a Deadline Scheduler TOP node.
  2. Save the Deadline Scheduler to a new HDA on disk (i.e. RMB-click node -> Digital Assets -> Save a Copy). Save the .hda to your Houdini 20.5 user preferences directory (i.e. $HOME/houdini20.5/otls/deadlinescheduler.hda).
  3. Copy the Houdini 20.0 Deadline Scheduler Python modules, tbdeadline.py and tbdeadline_utils.py, from $HFS/houdini/pdg/types/schedulers, and paste them into your Houdini 20.5 user preferences directory (i.e. $HOME/houdini20.5/pdg/types/schedulers).
  4. In Houdini 20.5, put down a Deadline Scheduler TOP node but make sure it is the one that's saved in your user preferences directory.

I haven't tried the above steps myself, so you may run into some issues, but in theory it should work.
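If it helps, step 3 can be scripted. This is an untested sketch; the paths you pass in are for your own installation, and the module names come from the steps above:

```python
import os
import shutil


def copy_scheduler_modules(hfs_200, prefs_dir):
    """Copy the H20.0 Deadline Scheduler Python modules (step 3 above)
    from the Houdini 20.0 install root into the H20.5 user preferences
    directory. Returns the list of copied module names."""
    src_dir = os.path.join(hfs_200, "houdini", "pdg", "types", "schedulers")
    dst_dir = os.path.join(prefs_dir, "pdg", "types", "schedulers")
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for module in ("tbdeadline.py", "tbdeadline_utils.py"):
        shutil.copy2(os.path.join(src_dir, module),
                     os.path.join(dst_dir, module))
        copied.append(module)
    return copied


# Example (placeholder paths; adjust for your machine):
# copy_scheduler_modules("/opt/hfs20.0", os.path.expanduser("~/houdini20.5"))
```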

Cheers,
Rob
Staff
1267 posts
Joined: Jul. 2005
monomon
Thanks Rob. I think I understand the race condition (saving stale copies of the job concurrently), however wouldn't it be possible to batch work items and submit them at once as separate jobs? This behavior does seem odd to Deadline users.

The race condition has to do with dynamically adding tasks to a live job, which is what the Houdini 20.0 PDG Deadline Scheduler did. Deadline's MongoDB backend is not equipped to handle adding tasks to a live job. If the live job writes to the MongoDB backend for any reason (e.g. writing progress updates) while PDG asks Deadline to add a task to the job, then the task never gets created, and you end up with missing tasks in the job. This is not the case when you add new jobs while other jobs are live.
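As a toy, deterministic illustration of that lost-update pattern (plain Python, not Deadline's actual code): two writers each read the job document, modify their own copy, and write the whole copy back, so the later write clobbers the earlier one.

```python
# Toy lost-update race: the stored "job" document is replaced wholesale
# by whichever writer saves last.

job = {"tasks": ["task0"], "progress": 0}

# Writer A (the live job) reads the document, intending to update progress.
copy_a = {"tasks": list(job["tasks"]), "progress": job["progress"]}

# Writer B (PDG adding a task) reads, appends a task, and writes back.
copy_b = {"tasks": list(job["tasks"]), "progress": job["progress"]}
copy_b["tasks"].append("task1")
job = copy_b  # task1 is now in the stored document...

# ...until writer A saves its stale copy with the progress update.
copy_a["progress"] = 50
job = copy_a  # task1 is gone

print(job["tasks"])  # ['task0'] -- the added task was silently dropped
```

Submitting whole new jobs sidesteps this because no two writers ever modify the same job document concurrently.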

We thought about batching work items and submitting them as separate jobs rather than submitting one job per work item. However, the issue is that PDG does not submit work items to the farm in regular batches. It all depends on what work items are available at the time PDG is ready to submit work to the farm. It may have one work item available, one hundred work items available or anything in between. So you would end up with jobs on the farm containing varying numbers of tasks. At least with one job per work item, it's consistent: you know that each job contains exactly one task and represents exactly one work item. That also makes it easier to set a meaningful label on the job; the label can contain the info of the work item that the job represents.

I think moving forward adding a toggle to switch between one-job-per-work-item (new behaviour) and single-job-for-all-work-items (old behaviour) makes the most sense.

Cheers,
Rob
Member
48 posts
Joined: Nov. 2017
Again, thanks Rob for the in-depth help and the backport steps. I'll give that a go.
Having a toggle would be great if that's technically doable!
Member
39 posts
Joined: May 2019
Awesome information here.

My point was more (and I have no idea how PDG works internally) about serially submitting batches that are "debounced", i.e. accumulated over some time period. Then something akin to the MQ server would serialize the submission, ensuring there are no concurrent submissions.
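Something like this hypothetical sketch (the class, names, and timing are made up for illustration, not anything PDG or Deadline actually ships): accumulate work items as they become available, and flush them as one batch once things have gone quiet, with a single poller serializing submissions.

```python
import time


class DebouncedSubmitter:
    """Accumulate work items; submit them as one batch after a quiet period.

    A single thread calling poll() serializes all submissions, so no two
    batches are ever submitted concurrently.
    """

    def __init__(self, quiet_period=0.5, submit=print):
        self.quiet_period = quiet_period  # seconds of quiet before flushing
        self.submit = submit              # callable that submits one batch
        self.pending = []
        self.last_add = None

    def add(self, work_item):
        self.pending.append(work_item)
        self.last_add = time.monotonic()

    def poll(self):
        """Call periodically from one thread; flushes when quiet."""
        if self.pending and time.monotonic() - self.last_add >= self.quiet_period:
            batch, self.pending = self.pending, []
            self.submit(batch)  # one job containing this batch's tasks
```

For example, `DebouncedSubmitter(quiet_period=0.5, submit=my_submit_fn)` would hand `my_submit_fn` a list of all work items that arrived within half a second of each other.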

I think there is no point in allowing behavior that is known to be broken. So I wouldn't expose the old behavior.

By the way, is there a way to learn more about the compiled core of PDG? Maybe a header file or documentation?
Staff
1267 posts
Joined: Jul. 2005
monomon
Awesome information here.

My point was more (and I have no idea how PDG works internally) about serially submitting batches, that are "debounced", accumulated over some time period. Then, something akin to the MQ server would serialize the submission, ensuring there are no concurrent submissions.

I think there is no point in allowing behavior that is known to be broken. So I wouldn't expose the old behavior.

By the way is there a way to learn more about the compiled core of PDG? Maybe a header file, or documentation.

Sorry for the late reply.

I'll keep your points in mind, but I may end up adding the toggle anyway. What's interesting is that the race condition only occurs for a subset of users; there are users that never hit the issue (perhaps dumb luck?). So I could see the toggle being useful for them in order to get back the old behaviour without any downside.

I checked with our lead PDG developer and he pointed out that there are PDG headers in the HDK ($HFS/toolkit/include/PDG/*.h). There's also high-level documentation in the HDK docs with a bit of information on different core constructs, cooking, etc., and an example of writing a custom partitioner:
https://www.sidefx.com/docs/hdk/_h_d_k__op_basics__top_intro.html

The HDK headers have a bunch of comments as well.

I hope that helps.

Cheers,
Rob