Houdini 20.5 Executing tasks with PDG/TOPs

Custom File Tags and Handlers

PDG uses file tags to determine the type of an output file.

Overview

Output files and file attributes on PDG work items are assigned tags so that PDG can identify their file types. A tag determines which application is used to open a file from the attribute pane, as well as which cache handlers to run when checking for an existing file on disk. You can register custom tags with the pdg.TypeRegistry and use them with built-in nodes. You can also register custom cache handling functions written in Python to manually control PDG’s caching mechanism.

Custom tags added through the Type Registry automatically appear in the tag chooser drop-down on any nodes that have a parameter for setting the file tag.

Custom file tags

Custom file tags are registered through the PDG Type Registry. You can do so on demand from the Python Shell, or by adding a registration script to the $HOUDINI_PATH/pdg/types directory. When Houdini starts, PDG automatically loads all scripts and modules from the pdg/types directory on the Houdini search path. For example, you can create a script to load custom tags and save it as $HOME/houdiniX.Y/pdg/types/custom_tags.py. Inside the script file, you’ll need to define a registerTypes function, which is called automatically when PDG loads the script.

def registerTypes(type_registry):
    type_registry.addTag("file/geo/collision")
    type_registry.addExtensionTag(".py", "file/text/pythonscript")
    type_registry.addTagViewer("file/bgeo", "gplay")

There are two API methods that you can use to register a custom file tag:

  • The first is pdg.TypeRegistry.addTag, which directly adds the tag to the global list. No association is made with any file extensions.

  • The second method is pdg.TypeRegistry.addExtensionTag, which lets you use your custom tag with a particular file type. A mapping is made from the file extension to the tag in addition to adding the tag to the global list. PDG will then automatically use your custom tag for any files that have that extension, unless the call to add the file explicitly specifies a different tag.

You can also supply the name of a viewer application for a particular tag. The viewer application determines how to open file links in the work item’s pane. If you specify a viewer application, PDG will use it to open the file. Otherwise, it falls back to opening the directory that contains the file. You can add custom viewers using the pdg.TypeRegistry.addTagViewer method.
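One way to check a registration script before placing it in pdg/types is to call its registerTypes function with a stand-in object that records the calls. The sketch below assumes nothing about PDG beyond the three method names described above. RecordingRegistry is a hypothetical test helper, not part of the pdg module, and it only mimics the documented behavior.

```python
class RecordingRegistry:
    """Hypothetical stand-in for pdg.TypeRegistry that only records calls."""

    def __init__(self):
        self.tags = set()
        self.extension_tags = {}
        self.viewers = {}

    def addTag(self, tag):
        # Adds the tag to the global list without any extension mapping.
        self.tags.add(tag)

    def addExtensionTag(self, extension, tag):
        # Maps the extension to the tag and also adds the tag to the list.
        self.extension_tags[extension] = tag
        self.tags.add(tag)

    def addTagViewer(self, tag, viewer):
        # Remembers which viewer application was requested for the tag.
        self.viewers[tag] = viewer


def registerTypes(type_registry):
    type_registry.addTag("file/geo/collision")
    type_registry.addExtensionTag(".py", "file/text/pythonscript")
    type_registry.addTagViewer("file/bgeo", "gplay")


registry = RecordingRegistry()
registerTypes(registry)
print(sorted(registry.tags))
print(registry.extension_tags)
print(registry.viewers)
```

Running the script outside Houdini like this only confirms that the registration calls are well formed. It does not confirm that PDG accepts the tags or that the viewer application exists.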

Custom cache handlers

Many nodes in PDG support disk caching. If the expected output files for a work item already exist on disk, then that work item is able to cook from cache instead of re-running. You can enable caching per node, and you can configure it to either always read, always write, or read from cache only if the files exist. If you set the HOUDINI_PDG_CACHE_DEBUG environment variable before launching Houdini, PDG will print cache file debug information when a graph is cooking.

Internally, PDG verifies that cache files exist by checking if they're found on disk. This may not be suitable for all applications or all types of files. For example, this would not be suitable if your output files are stored on a cloud storage system or if you want to do extra file validation as part of the cache check.

As an alternative, you can verify the existence of cache files by registering a custom cache handler in the same way that custom file tags are registered (as described in the previous section). For example, you can create a script that defines your custom cache handlers and save it as $HOME/houdiniX.Y/pdg/types/custom_handlers.py.

import os
import pdg

def simple_handler(local_path, raw_file, work_item):
    # Log the path, then defer to the next matching handler
    print(local_path)
    return pdg.cacheResult.Skip

def custom_handler(local_path, raw_file, work_item):
    # Skip work items whose usecustomcaching attribute is 0
    if work_item['usecustomcaching'].value() == 0:
        return pdg.cacheResult.Skip
    try:
        # Treat an empty file as a cache miss
        if os.stat(local_path).st_size == 0:
            return pdg.cacheResult.Miss
        return pdg.cacheResult.Hit
    except OSError:
        # The file is missing or cannot be read
        return pdg.cacheResult.Miss

def registerTypes(type_registry):
    type_registry.registerCacheHandler("file/geo", custom_handler)
    type_registry.registerCacheHandler("file/usd", simple_handler)

Each cache handler is passed three arguments: the local path to the cache file, the raw pdg.File object, and the pdg.WorkItem that owns the file. The file object contains all of the metadata associated with the file, such as its raw path and file tag.

Warning

Do not modify the work item during the cache handler hook, and do not store the work item in a global variable and then try to access it outside of the handler method. This is invalid.

Cache handlers are registered for a particular file tag. In the above example, a file tagged as file/usd would first be checked using the simple_handler. Since that handler returns the pdg.cacheResult.Skip return code, the cache system then moves on to the next possible handler, which is file/geo. That handler verifies that the file has a non-zero size, but it only does so if the work item that owns the file has the usecustomcaching attribute set. If both handlers return Skip, then PDG’s built-in cache checking mechanism is used instead.

As soon as a handler returns pdg.cacheResult.Hit or pdg.cacheResult.Miss, handler evaluation stops and that result is used. The most specific matching tag pattern is always evaluated first.
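The evaluation order can be modeled in plain Python. This is a rough sketch of the rules described above, not PDG’s implementation. The CacheResult enum stands in for pdg.cacheResult so the code runs outside Houdini, and the helper names tag_chain and check_cache are hypothetical. It also assumes that "most specific first" means walking up the tag hierarchy one path segment at a time.

```python
from enum import Enum


class CacheResult(Enum):
    """Stand-in for pdg.cacheResult so the sketch runs outside Houdini."""
    Skip = 0
    Hit = 1
    Miss = 2


def tag_chain(tag):
    """Yield the tag and each of its parents, most specific first."""
    parts = tag.split("/")
    for end in range(len(parts), 0, -1):
        yield "/".join(parts[:end])


def check_cache(tag, local_path, handlers, builtin_check):
    """Return the first Hit or Miss; use the fallback if every handler skips."""
    for candidate in tag_chain(tag):
        handler = handlers.get(candidate)
        if handler is None:
            continue
        result = handler(local_path, None, None)
        if result is not CacheResult.Skip:
            return result
    return builtin_check(local_path)


calls = []


def specific_handler(local_path, raw_file, work_item):
    calls.append("file/geo/collision")
    return CacheResult.Skip


def general_handler(local_path, raw_file, work_item):
    calls.append("file/geo")
    return CacheResult.Hit


handlers = {
    "file/geo/collision": specific_handler,
    "file/geo": general_handler,
}
result = check_cache(
    "file/geo/collision", "wall.bgeo", handlers, lambda path: CacheResult.Miss
)
print(calls, result)
```

In this model the specific handler skips, so the more general file/geo handler decides the result and the fallback check is never reached.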

Note

You can register a handler for all file types by registering it with the top-level file tag.

Note

The cache handler will be called with the batch sub item if the file was generated as part of a batch. You can get the batch parent by looking at work_item.batchParent.

Custom file transfer handlers

Background

PDG has some default expectations and behaviors when it comes to file transfers. It assumes that all of your machines share a file system, and that those machines only work on tasks in a specific graph. For example, scheduler nodes customarily try to copy local files, such as the current .hip file, to the working directory specified on the scheduler. That directory may be a local file path or a path that’s accessible over an NFS mount point.

However, your setup may differ significantly from what PDG expects. For example, your worker machines might only be able to access data using an object storage system or a database, or you might not be able to share a mount point with the submitting machine.

In cases like these, you can create a custom PDG file transfer handler to implement your own file transfer logic.

Behavior

PDG exposes a file transfer hook that you can override with your own Python logic to perform custom file transfer behaviors. This hook has access to your work items, the scheduler associated with the file transfer, and all the information about the file being transferred. The file transfer handler can use whatever Python code is necessary to copy the file to its intended destination (for example, a series of remote machines).

A custom file transfer handler can choose to implement its own caching logic or decide to not handle a particular file. Because of this behavior, you can install multiple custom handlers and control them with the attributes from incoming work items. If none of your custom handlers want to handle a particular file transfer operation, PDG’s default handler will automatically try to copy the file directly into the associated scheduler’s working directory.

The file transfer handler system stores a thread-safe cache that maps local file paths to a user-defined cache ID. Each time a handler is asked to transfer a file, PDG passes it the last cache ID that was reported for that file, or 0 if the file does not have a stored ID. The handler can then use the ID to determine if the file needs to be copied again or if it should be treated as cached/already copied. The handler is responsible for the cache checking logic, but PDG stores the cache IDs so that the handler does not need global variables.
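The cache ID round trip can be illustrated with a small model that runs outside Houdini. This is a sketch under stated assumptions rather than PDG’s implementation: TransferCache and make_handler are hypothetical names, the strings "success" and "cached" stand in for transfer results, and a fake mod time table replaces real file system calls.

```python
import threading


class TransferCache:
    """Hypothetical model of the per-file cache ID store described above."""

    def __init__(self):
        self._ids = {}
        self._lock = threading.Lock()

    def transfer(self, local_path, handler):
        # Pass the last reported ID, or 0 if no ID is stored for the file.
        with self._lock:
            last_id = self._ids.get(local_path, 0)
        status, new_id = handler(local_path, last_id)
        # Only remember the ID when the handler actually copied the file.
        if status == "success":
            with self._lock:
                self._ids[local_path] = new_id
        return status


def make_handler(mtimes, copied):
    """Build a handler that uses a fake mod time as its cache ID."""
    def handler(local_path, cache_id):
        mtime = mtimes[local_path]
        if mtime == cache_id:
            return "cached", cache_id
        copied.append(local_path)
        return "success", mtime
    return handler


mtimes = {"scene.hip": 100}
copied = []
cache = TransferCache()
handler = make_handler(mtimes, copied)

statuses = [cache.transfer("scene.hip", handler)]
statuses.append(cache.transfer("scene.hip", handler))
mtimes["scene.hip"] = 200  # simulate editing the file
statuses.append(cache.transfer("scene.hip", handler))
print(statuses)
```

The handler keeps no state of its own. Everything it needs to decide between copying and skipping arrives through the cache ID argument, which is the design the section describes.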

Example implementation

The following code demonstrates an approximate Python implementation of the default transfer handler that is built into PDG.

Save it to a file (for example, $HOME/houdiniX.Y/pdg/types/custom_file_transfer.py) if you want to test it in a live Houdini session.

import pdg
import os
import shutil

def transferCopy(src_path, dst_path, file_object, work_item, scheduler, cache_id):
    dst_exists = os.path.exists(dst_path)

    # If the paths are the same and exist, return early
    if dst_exists and os.path.samefile(src_path, dst_path):
        return pdg.TransferPair.skip()

    # Unconditionally copy directories
    if os.path.isdir(src_path):
        try:
            shutil.copytree(src_path, dst_path)
            return pdg.TransferPair.success()
        except OSError:
            return pdg.TransferPair.failure()

    # if the old mod time matches the current one, skip the copy and
    # indicate that the transfer was cached
    src_mtime = int(os.path.getmtime(src_path))
    if dst_exists:
        if os.path.getsize(src_path) == os.path.getsize(dst_path):
            if src_mtime == cache_id:
                return pdg.TransferPair.cached()

    dst_dir = os.path.dirname(dst_path)

    # create intermediate directories as needed
    try:
        if not os.path.exists(dst_dir):
            os.makedirs(dst_dir)
    except OSError:
        pass

    # copy the file
    shutil.copyfile(src_path, dst_path)
    return pdg.TransferPair.success(src_mtime)

# Register the handler for all types of file
def registerTypes(type_registry):
    type_registry.registerTransferHandler("file", transferCopy)

You can also pass in a use_mod_time argument to the registration function in order to use the default local file mod time for caching. For example:

def registerTypes(type_registry):
    type_registry.registerTransferHandler("file", transferCopy, use_mod_time=True)

PDG will only invoke the handler when the file has not been transferred yet for a particular scheduler, or when the local file has a mod time that is newer than the time recorded at the last transfer.

Warning

Do not try to modify the work item passed to the custom handler function, and do not store the work item in a global variable and then try to access it outside of the scope of the function. This is invalid.

Custom file stat functions

Files stored as work item outputs or attributes have both a size field and an extra 64-bit integer hash that PDG uses to identify if the file is stale. For example, if you recook a File Pattern TOP node after modifying a file on disk, the work items that correspond to the modified files are automatically dirtied as part of the cook. By default, PDG stats the file to determine its size and uses the file’s mod time to set the extra hash field. However, you can register a custom stat function in Python to use a custom scheme.

Examples

As with cache handlers, you register custom stat functions based on the output file tag. The function can return either a single value or a pair of values. If a single value is returned, it is assigned to the file’s hash field and the file size is determined by stat-ing the file path on disk. If two values are returned, the file’s size field is set to the second value instead.

The following code demonstrates a CRC checksum for text files. Save it to a file (for example, $HOME/houdiniX.Y/pdg/types/custom_file_hash.py) if you want to test it in a live Houdini session.

import zlib

def crc_handler(local_path, raw_file, work_item):
    try:
        with open(local_path, 'rb') as local_file:
            return zlib.crc32(local_file.read())
    except OSError:
        return 0

def registerTypes(type_registry):
    type_registry.registerStatHandler("file/txt", crc_handler)

The custom stat function is invoked with three arguments: the local path to the file, the raw pdg.File object, and the pdg.WorkItem that owns the file. If the function returns a non-zero value, that value is stored as the file’s hash. If it returns a pair of non-zero values, the second value is assigned to the file’s size field. If the return value is exactly zero, then PDG checks for other matching stat functions (if any exist). If no other custom stat functions are found, then PDG falls back to the built-in implementation that uses the file’s mod time/size on disk.
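The examples on this page return a single hash value. A function that returns a pair could look like the following sketch. It assumes that the pair is an ordinary Python tuple in (hash, size) order, which matches the description above but is not shown in the original examples. The name crc_and_size_handler is illustrative.

```python
import os
import zlib


def crc_and_size_handler(local_path, raw_file, work_item):
    """Return (hash, size) for a file, or 0 to defer to other stat functions."""
    try:
        crc = 0
        with open(local_path, "rb") as local_file:
            # Read in 1 MiB chunks so large files are not loaded all at once.
            for chunk in iter(lambda: local_file.read(1 << 20), b""):
                crc = zlib.crc32(chunk, crc)
        size = os.path.getsize(local_path)
    except OSError:
        return 0
    # Zero values mean "not handled" per the text above, so defer for
    # empty files or for a checksum that happens to be zero.
    if crc == 0 or size == 0:
        return 0
    return crc, size
```

The function would be registered with registerStatHandler in the same way as the single-value version. Returning 0 for empty files hands them to the next matching stat function or to the built-in check.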

You can apply the function to specific node types by filtering on the node type in the function implementation. For example, the following stat function is registered for all types of files, but only handles work items in a File Pattern node:

import zlib

def crc_handler(local_path, raw_file, work_item):
    if work_item and work_item.node.type.typeName != 'filepattern':
        return 0

    try:
        with open(local_path, 'rb') as local_file:
            return zlib.crc32(local_file.read())
    except OSError:
        return 0

def registerTypes(type_registry):
    type_registry.registerStatHandler("file", crc_handler)

Warning

Do not modify the work item in the custom stat function, and do not store the work item in a global variable and then try to access it outside of the scope of the function. This is invalid.
