stamps faster than compiled sops? lies!


mestela: Member; 1802 posts; Joined: May 2006; Offline

April 17, 2017 10:39 p.m.

Some chat on the Discord forum led to this example: boxes on a curve, stamping the width vs. compile + foreach to set the width.

I'm clearly doing something wrong, because as the number of points increases (say, 1000+), the stamp is substantially faster than the compiled version; on this machine the stamp runs at 10fps, compiled runs at 2fps.

Ideas?

Attachments:
stamp_vs_compile.hip (83.2 KB)
compile_vs_stamp.jpg (33.2 KB)

http://www.tokeru.com/cgwiki
https://www.patreon.com/mattestela


jsmack: Member; 8042 posts; Joined: Sept. 2011; Offline

April 17, 2017 10:58 p.m.

I get 30 fps for the compiled, and 15 for the stamp in your scene.

Edit:
For comparison, I get 85 fps with Copy to Points, using {foo,1,1} as a scale attribute instead of stamping one by one.

Edited by jsmack - April 17, 2017 23:02:18


varomix: Member; 460 posts; Joined: July 2005; Offline

April 17, 2017 10:59 p.m.

Hi everybody
I've been testing and yes, the old copy stamping is faster than the compiled version

I'm using 2 Intel Xeon E5-2630 v3 Processors for a total of 32 threads, not latest gen but still good, or so I thought.

another user with an Intel i7-5930K Processor with 12 threads is getting way faster speeds, like 10x

so not sure what to say, downgrade to one processor?

varomix - Founder | Educator @ Mix Training
Technical Artist @ Meta Reality Labs


varomix: Member; 460 posts; Joined: July 2005; Offline

April 17, 2017 11:05 p.m.

we should be talking about cook time not viewport fps

these are my results

Attachments:
copyStamp_profile.png (30.2 KB)



tamte: Member; 8820 posts; Joined: July 2007; Offline

April 17, 2017 11:07 p.m.

12 fps stamp
5 fps uncompiled for each
21 fps compiled (4 cores)

so I guess the lie is that an uncompiled for each is as fast/slow as stamping. I've noticed in many other cases that this is not true, which is sad
H16.0.572

Tomas Slancik
FX Supervisor
Method Studios, NY


jsmack: Member; 8042 posts; Joined: Sept. 2011; Offline

April 17, 2017 11:08 p.m.

If you do trivial operations, then stamping can still be fast, but if you actually do something where parallelizing matters, then you can see massive gains. Also, stamping transformations is a bad example, since Copy to Points will handle that orders of magnitude faster. I have yet to find a test case where the compiled for loop version is slower than the alternatives, excluding cases where stamped transforms are replaced with Copy to Points, or where VEX parallelism is used (modifying all data at once).


anon_user_37409885: Member; 4189 posts; Joined: June 2012; Offline

April 17, 2017 11:14 p.m.

Compiled
-j1 = ~5fps
-j6 = ~13fps
-j12 = ~13 fps
-j24 = ~3 fps

Stamped
-jn ~5fps

Ubuntu - 2 * X5680 @ 3.33GHz × 24

Edited by anon_user_37409885 - April 17, 2017 23:25:29


tamte: Member; 8820 posts; Joined: July 2007; Offline

April 17, 2017 11:19 p.m.

comparing cook times 240f:
i7-6700HQ 2.6GHz 4Cores

Edited by tamte - April 17, 2017 23:19:26

Attachments:
stamp_compile_benchmark.png (22.5 KB)



jsmack: Member; 8042 posts; Joined: Sept. 2011; Offline

April 17, 2017 11:21 p.m.

60 frames:

Attachments:
cookstats.jpg (77.5 KB)


jsmack: Member; 8042 posts; Joined: Sept. 2011; Offline

April 17, 2017 11:24 p.m.

Artye wrote:
Compiled
-j1 = ~5fps
-j6 = ~13fps
-j12 = ~13 fps
-j24 = ~3 fps

Stamped
-jn ~5fps

Ubuntu

Looks like you found a threading bug, this must be why those guys are seeing crazy slow speeds.


anon_user_37409885: Member; 4189 posts; Joined: June 2012; Offline

April 17, 2017 11:42 p.m.

jsmack wrote:
Looks like you found a threading bug, this must be why those guys are seeing crazy slow speeds.

It'll be mighty tasty when fixed


mestela: Member; 1802 posts; Joined: May 2006; Offline

April 17, 2017 11:50 p.m.

A threading bug was what I suspected, but I wanted to get a few eyes on it to make sure I wasn't doing something really stupid (which is always highly likely).



jlait: Staff; 6413 posts; Joined: July 2005; Offline

April 18, 2017 1:46 p.m.

As mentioned, please use perfmonitor or wall clock time, not FPS. FPS not only includes draw time, but has a weird inverse relationship. It is too tempting to say: “I lost 5 FPS!” when that is a very different thing when going 15->10 versus 60->55.
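To make the inverse relationship concrete, here is a minimal Python sketch that converts the two FPS drops mentioned above into time per frame. The function names are made up for illustration; only the four FPS values come from the post.

```python
def frame_time_ms(fps: float) -> float:
    """Convert frames per second into milliseconds spent per frame."""
    return 1000.0 / fps


def added_cost_ms(fps_before: float, fps_after: float) -> float:
    """Extra milliseconds per frame when playback drops between two FPS values."""
    return frame_time_ms(fps_after) - frame_time_ms(fps_before)


# Both cases "lose 5 FPS", but the per-frame cost differs by roughly 22x.
slow_case = added_cost_ms(15, 10)
fast_case = added_cost_ms(60, 55)

print(f"15 -> 10 fps adds {slow_case:.1f} ms per frame")  # 33.3 ms
print(f"60 -> 55 fps adds {fast_case:.1f} ms per frame")  # 1.5 ms
```

Cook time in seconds or milliseconds adds up linearly, which is why it is the safer number to compare.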

Please do also keep in mind that we worked really hard to make copy/stamp fast! And we didn't spend any time trying to slow it down in order for new approaches to seem fast by comparison.

A quick test on my machine (6 cores, 12 threads) gives, for 240 frame playback:

compiled, no threading: 23.3sec
compiled, multithread: 15.3sec
Copy/stamp: 15.2 + 8.7 = 23.9 sec

So we see here only a very small threading performance gain, which makes sense as the workload is very trivial. This is why we expose things like Job Size in the Attribute Wrangle. We may need to have something like that added to make this easier to handle.

The massive drop at 24 threads suggests something is tripping over itself when the thread count gets high enough in this example.

In the attached I have added a Blocked variant that creates a block attribute to group points into sets of 256. Then it does a parallel foreach over these blocks, and only then iterates over each individual point. This should cut the threading overhead. Can you let me know if it improves on your 24 thread machines?
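The block attribute idea can be sketched outside Houdini in plain Python. This is only an illustration of the grouping arithmetic, not the contents of the attached file: the function names and the 1000-point count are assumptions, and only the block size of 256 comes from the post.

```python
BLOCK_SIZE = 256  # block size from the post


def block_ids(num_points: int, block_size: int = BLOCK_SIZE) -> list:
    """Give every point number a block id, like a per-point integer attribute."""
    return [ptnum // block_size for ptnum in range(num_points)]


def group_by_block(num_points: int, block_size: int = BLOCK_SIZE) -> dict:
    """Collect point numbers per block; each block is one unit of parallel work."""
    blocks = {}
    for ptnum, block in enumerate(block_ids(num_points, block_size)):
        blocks.setdefault(block, []).append(ptnum)
    return blocks


# Hypothetical point count, chosen only for the example.
blocks = group_by_block(1000)
print(len(blocks))     # 4 blocks instead of 1000 separate tasks
print(len(blocks[3]))  # the last block holds the remaining 232 points
```

The outer loop would run over the blocks in parallel, and the inner loop would visit the points of one block serially.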

Attachments:
stamp_vs_compile_blocked.hip (102.0 KB)


varomix: Member; 460 posts; Joined: July 2005; Offline

April 18, 2017 2:03 p.m.

These are my results from that new scene.

The new block example Jeff added is almost exactly as fast as the stamp version, which is nice because I like the stamp way; I'm used to it, and the new way is a little complex for new people.

So what are the magic ingredients that make it faster?

this is on a 32 threads machine

Edited by varomix - April 18, 2017 14:03:45

Attachments:
copyStamp_profileJL.png (35.7 KB)



jlait: Staff; 6413 posts; Joined: July 2005; Offline

April 18, 2017 2:45 p.m.

A hard problem with multithreading is picking “grain size”. This is the amount of work you run in each task. If you make the work too small, like adding two numbers together, you'll spend all your time in task management and end up way, way slower. So whenever you multithread in C++ you always have to think about grain size and ensure small datasets are batched together.

It is sort of like submitting jobs to a render farm. If your frames get short enough, it can become necessary to render 10 frames at once per job rather than each frame individually.
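As a rough sketch of the same batching idea, the Python below submits work to a thread pool either one item per task or in batches. It is not how Houdini schedules tasks internally; the function names, the worker count, and the 1000 items are assumptions for illustration. Python threads will not speed up this arithmetic either, so the example only shows how grain size changes the number of tasks that have to be managed.

```python
from concurrent.futures import ThreadPoolExecutor


def batches(items, grain_size):
    """Yield consecutive slices of at most grain_size items."""
    for start in range(0, len(items), grain_size):
        yield items[start:start + grain_size]


def process_batch(batch):
    """Trivial per-item work: add two numbers together."""
    return [a + b for a, b in batch]


def parallel_add(pairs, grain_size, max_workers=4):
    """Run process_batch over the batches in a pool; map() keeps input order."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk in pool.map(process_batch, batches(pairs, grain_size)):
            results.extend(chunk)
    return results


pairs = [(i, i + 1) for i in range(1000)]   # hypothetical workload
per_item = parallel_add(pairs, grain_size=1)    # 1000 tiny tasks
batched = parallel_add(pairs, grain_size=256)   # 4 larger tasks
print(per_item == batched)  # same answer, far fewer tasks to schedule
```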

The problem is that we have no idea from the outside what the size of your contained network is. If it is a big chunk of work, the task scheduling won't show up and you should see nice scaling. But with really simple examples the opposite happens.

What I have done in that example is manually batch 256 points at a time. With the given point counts, this also means only 6 threads really become active, saving your 32 thread machine from tripping over itself for no benefit.

That said, I will be looking closer at this high-thread behaviour. Currently the point() function has an unnecessary lock in this example that could also be responsible for things locking up.

I've submitted Bug: 82242 to track this.


anon_user_37409885: Member; 4189 posts; Joined: June 2012; Offline

April 18, 2017 3:08 p.m.

Using Perf Monitor. Playback of 240 frames -j24

New:
27.849sec

Old:
1min 14sec

CopySop:
48.791 sec

Playback of 240 frames -j6

New:
23.142 sec

Old:
24.437 sec

CopySop:
46.090 sec


anon_user_37409885: Member; 4189 posts; Joined: June 2012; Offline

April 18, 2017 4:02 p.m.

Hmmm, macOS is rubbish. H16.0.577, same machine as Ubuntu.

New: 1min 12sec
Old: 2min 14sec
CopyS: 44 sec


edward: Member; 7893 posts; Joined: July 2005; Offline

April 18, 2017 11:50 p.m.

I wonder if jemalloc improves this on OSX since we last tried in H14:

DYLD_INSERT_LIBRARIES=$HFS/../Libraries/libjemalloc.1.dylib houdini


edward: Member; 7893 posts; Joined: July 2005; Offline

April 18, 2017 11:52 p.m.

OR perhaps tbbmalloc_proxy:

DYLD_INSERT_LIBRARIES=$HFS/../Libraries/libtbbmalloc_proxy.dylib houdini

Edited by edward - April 18, 2017 23:52:39


tamte: Member; 8820 posts; Joined: July 2007; Offline

April 19, 2017 12:40 a.m.

Does anyone else see the non-compiled for each being much slower than Copy Stamp?
Since it's not always possible to compile, I think it's quite a big deal.

Attachments:
stamp_vs_compile_vs_notcompile.hip (133.8 KB)
stamp_compile_benchmark2.png (21.7 KB)

