Q: telling PDG "failed" is actually "OK"


pbowmar: Member; 7046 posts; Joined: July 2005; Offline

April 18, 2019 2:05 p.m.

Hi,

Using HQueue, 17.0.229, I have a bunch of PDG tasks on HQueue and one of them failed there, likely due to a crappy old machine running out of RAM.

So I just rescheduled it after disabling the old machine, and the frame finished fine.

Sadly, PDG staunchly reports that frame as Failed and won't carry on.

What to do?

Cheers,

Peter B

Cheers,

Peter Bowmar
____________
Houdini 20.5.262 Win 10 Py 3.11


chrisgreb: Member; 603 posts; Joined: Sept. 2016; Offline

April 18, 2019 2:17 p.m.

You can set your ‘Cache Mode’ parm to ‘Read’ or ‘Automatic’ and re-cook your node. It should pick up the expected output and all your items as cooked right away.
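For readers new to the cache modes, the decision described here can be pictured roughly as below. This is a conceptual sketch in plain Python, not PDG's implementation: `resolve_cache` and the mode labels are hypothetical names, and the per-mode behaviour is an assumption based on the description above and the usual meaning of Read, Write and Automatic caching.

```python
import os


def resolve_cache(mode, expected_outputs):
    """Decide what happens to one work item before it is scheduled.

    mode is 'automatic', 'read' or 'write' (hypothetical labels).
    Returns 'cooked', 'schedule' or 'failed'.
    """
    if mode not in ("automatic", "read", "write"):
        raise ValueError("unknown cache mode: %r" % (mode,))
    if mode == "write":
        # Always run the job, even if the files are already there.
        return "schedule"
    # An item with no expected outputs has nothing to check, so it counts as a miss.
    hit = bool(expected_outputs) and all(os.path.exists(p) for p in expected_outputs)
    if hit:
        return "cooked"
    # On a miss, 'automatic' runs the job while 'read' reports a failure.
    return "schedule" if mode == "automatic" else "failed"
```

In the situation from the first post, the rescheduled HQueue job has already written the frame to disk, so a re-cook in either 'read' or 'automatic' would land in the 'cooked' branch without running anything.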


pbowmar: Member; 7046 posts; Joined: July 2005; Offline

April 18, 2019 3:09 p.m.

Genius!

However, what happens on a longer render when it's overnight? E.g. we have scripts or people that will fix the issue on the farm but wouldn't have access to the PDG graph directly to manually recook like that.

Any way to have it auto-recheck every 30 seconds or something?

Cheers,

Peter B



chrisgreb: Member; 603 posts; Joined: Sept. 2016; Offline

April 22, 2019 8:13 a.m.

When work items fail there's no way to make PDG try them again except by stopping the cook and starting a new cook.
But I think it would be a good RFE to have a mechanism for automatic retries during a cook.
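The periodic re-check asked about above could also live outside PDG: a small watcher that polls for the expected output files and, once they all exist, hands off to whatever starts the new cook. A minimal sketch, assuming the output paths are known in advance; `wait_for_outputs` is a hypothetical helper, not part of PDG or HQueue, and how the new cook gets triggered is deliberately left out.

```python
import os
import time


def wait_for_outputs(paths, interval=30.0, timeout=None,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until every path exists.

    Returns True once all paths exist, or False if `timeout` seconds
    pass first. With timeout=None it keeps polling indefinitely.
    """
    start = clock()
    while True:
        if all(os.path.exists(p) for p in paths):
            return True
        if timeout is not None and clock() - start >= timeout:
            return False
        # Wait before checking the disk again.
        sleep(interval)
```

A farm-side script could call this with the failed frame's output path and start a fresh cook when it returns True; with the Cache Mode set to 'Read' or 'Automatic', that cook would then pick up the existing file.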


jason_iversen: Member; 12671 posts; Joined: July 2005; Offline

April 22, 2019 3:32 p.m.

That would point back to the beta-era request to have TOPs able to continually attempt to solve the network, perhaps? I.e. this seems like a Re-Run Until Done variation on some kind of Re-Run Continually behaviour.

Edited by jason_iversen - April 22, 2019 15:35:12

Jason Iversen, Technology Supervisor & FX Pipeline/R+D Lead @ Weta FX
also, http://www.odforce.net


Andrew Graham: Member; 151 posts; Joined: Feb. 2009; Offline

Dec. 3, 2019 7:17 p.m.

This ability would be useful. So far I have been using PDG in interactive sessions, where failed frames are fine because you can just resubmit anything that is fast to execute. But anything that takes a long time, or that submits PDG on a remote system, will need some number of task retries before bailing out of anything affected or downstream. We also wouldn't want to exit simulations, for example, if a sibling task hits the max failure limit, so stopping the whole graph would be undesirable.

https://openfirehawk.com/
Support Open Firehawk - An open source cloud rendering project for Houdini on Patreon.
This project's goal is to provide an open source framework for cloud computing for heavy FX based workflows and allows end users to pay the lowest possible price for cloud resources.


chrisgreb: Member; 603 posts; Joined: Sept. 2016; Offline

Dec. 3, 2019 9:02 p.m.

Andrew Graham
But anything that takes a long time, or that submits PDG on a remote system, will need some number of task retries before bailing out of anything affected or downstream. We also wouldn't want to exit simulations, for example, if a sibling task hits the max failure limit, so stopping the whole graph would be undesirable.

FYI Local Scheduler now has ‘Exit code handling’ which can be used to retry, and Hqueue Scheduler has a ‘retries’ job parameter that can be set.
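As a rough picture of what retry-on-exit-code means at the level of a single task, here is a plain Python sketch. It is not how either scheduler is implemented: `run_with_retries`, `max_retries` and `retry_codes` are hypothetical names that do not correspond to actual Local Scheduler parms or HQueue job parameters.

```python
import subprocess
import sys


def run_with_retries(cmd, max_retries=3, retry_codes=None):
    """Run cmd, re-running it on failure up to max_retries extra times.

    retry_codes is a set of exit codes worth retrying; None means any
    non-zero exit code is retried. Returns (exit_code, attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        code = subprocess.run(cmd).returncode
        if code == 0:
            return code, attempts
        retryable = retry_codes is None or code in retry_codes
        # Give up on a non-retryable code or once the retry budget is spent.
        if not retryable or attempts > max_retries:
            return code, attempts
```

For example, `run_with_retries([sys.executable, "render_frame.py"], max_retries=2)` would run a hypothetical `render_frame.py` at most three times before reporting the last exit code.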


Andrew Graham: Member; 151 posts; Joined: Feb. 2009; Offline

Dec. 5, 2019 7:52 a.m.

That's good to know. So with HQueue, would it bail out on a sim if other tasks downstream are failing, or would that sim be safe to finish? It would be great to see this in Deadline too if it isn't already there.


chrisgreb: Member; 603 posts; Joined: Sept. 2016; Offline

Dec. 5, 2019 9:52 a.m.

Yes, for example if a partition contains a sim and other work items that fail before the sim is finished, the cook will carry on until all ready items are finished.


Marco_M: Member; 5 posts; Joined: Aug. 2013; Offline

June 6, 2020 4:56 p.m.

chrisgreb
You can set your ‘Cache Mode’ parm to ‘Read’ or ‘Automatic’ and re-cook your node. It should pick up the expected output and all your items as cooked right away.

What if we are using ROP Alembic? For example, when a heavy mesh is being exported to a single file and some frames are failing?

A workaround would be to export an Alembic sequence… But I'm not sure if there is a way to merge them together later or if we have to create another task just for this.

Marco Melantonio

FX/CFX Artist
https://www.endorfina.me/


chrisgreb: Member; 603 posts; Joined: Sept. 2016; Offline

June 7, 2020 3:17 p.m.

If you want to ignore failed items, you can use a Filter by Expression node with a Python expression that removes the failed ones:

> pdg.workItem().state != pdg.workItemState.CookedSuccess

However, every time you cook, the failed items will be re-run.

Edited by chrisgreb - June 7, 2020 15:32:10


mestela: Member; 1803 posts; Joined: May 2006; Offline

June 7, 2020 7:24 p.m.

chrisgreb
FYI Local Scheduler now has ‘Exit code handling’ which can be used to retry, and Hqueue Scheduler has a ‘retries’ job parameter that can be set.

Really need this for tractor too!

http://www.tokeru.com/cgwiki
https://www.patreon.com/mattestela
