stamps faster than compiled sops? lies!
- mestela
- Member
- 1802 posts
- Joined: May 2006
- Offline
Some chat on the Discord forum led to this example: boxes on a curve, stamping the width vs. compile + foreach to set the width.
I'm clearly doing something wrong, because as the number of points increases (say 1000+), the stamp is substantially faster than the compiled version; on this machine the stamp runs at 10fps, compiled runs at 2fps.
Ideas?
- jsmack
- Member
- 8042 posts
- Joined: Sept. 2011
- Offline
- varomix
- Member
- 460 posts
- Joined: July 2005
- Offline
Hi everybody
I've been testing and yes, the old copy stamping is faster than the compiled version.
I'm using 2 Intel Xeon E5-2630 v3 processors for a total of 32 threads; not the latest gen but still good, or so I thought.
Another user with an Intel i7-5930K processor with 12 threads is getting way faster speeds, like 10x.
So I'm not sure what to say. Downgrade to one processor?
varomix - Founder | Educator @ Mix Training
Technical Artist @ Meta Reality Labs
- varomix
- Member
- 460 posts
- Joined: July 2005
- Offline
- tamte
- Member
- 8820 posts
- Joined: July 2007
- Offline
- jsmack
- Member
- 8042 posts
- Joined: Sept. 2011
- Offline
If you do trivial operations, then stamping can still be fast, but if you actually do something where parallelizing matters, then you can see massive gains. Also, stamping transformations is a bad example, since the copy to points will handle that orders of magnitude faster. I have yet to find a test case where the compiled for loop version is slower than alternatives, excluding stamping transforms with copy to points, or using vex parallelism (modify all data at once.)
- anon_user_37409885
- Member
- 4189 posts
- Joined: June 2012
- Offline
- tamte
- Member
- 8820 posts
- Joined: July 2007
- Offline
- jsmack
- Member
- 8042 posts
- Joined: Sept. 2011
- Offline
- jsmack
- Member
- 8042 posts
- Joined: Sept. 2011
- Offline
- anon_user_37409885
- Member
- 4189 posts
- Joined: June 2012
- Offline
- mestela
- Member
- 1802 posts
- Joined: May 2006
- Offline
- jlait
- Staff
- 6413 posts
- Joined: July 2005
- Offline
As mentioned, please use perfmonitor or wall clock time, not FPS. FPS not only includes draw time, but also has a weird inverse relationship to the time actually spent. It is too tempting to say: “I lost 5 FPS!” when that is a very different thing going 15->10 versus 60->55.
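To make the inverse relationship concrete, here is a small Python sketch that converts those two FPS drops into time per frame. The helper names are made up for illustration; only the FPS values come from the post.

```python
def frame_time_ms(fps):
    """Time spent on one frame, in milliseconds, at a given playback rate."""
    return 1000.0 / fps


def added_cost_ms(fps_before, fps_after):
    """Extra milliseconds per frame implied by a drop in FPS."""
    return frame_time_ms(fps_after) - frame_time_ms(fps_before)


if __name__ == "__main__":
    for before, after in [(15, 10), (60, 55)]:
        extra = added_cost_ms(before, after)
        print(f"{before} -> {after} fps: +{extra:.1f} ms per frame")
```

Both cases are "5 FPS lost", but the first costs roughly 33 ms more per frame and the second roughly 1.5 ms.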
Please do also keep in mind that we worked really hard to make copy/stamp fast! And we didn't spend any time trying to slow it down in order for new approaches to seem fast by comparison.
A quick test on my machine (6 cores, 12 threads) gives, for 240 frame playback:
compiled, no threading: 23.3sec
compiled, multithread: 15.3sec
Copy/stamp: 15.2 + 8.7 = 23.9 sec
So we see here only a very small threading performance gain, which makes sense as the workload is very trivial. This is why we expose things like Job Size in the Attribute Wrangle. We may need to have something like that added to make this easier to handle.
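As a rough check on how small that gain is, the two compiled timings can be turned into a speedup and a scaling efficiency. This is only arithmetic on the numbers quoted above; the function names are hypothetical, and dividing by 6 assumes the physical core count is the fair baseline.

```python
def speedup(baseline_sec, improved_sec):
    """How many times faster the improved run is than the baseline."""
    return baseline_sec / improved_sec


def parallel_efficiency(baseline_sec, improved_sec, workers):
    """Fraction of ideal linear scaling achieved across `workers`."""
    return speedup(baseline_sec, improved_sec) / workers


if __name__ == "__main__":
    single, threaded = 23.3, 15.3  # seconds, from the timings above
    print(f"speedup: {speedup(single, threaded):.2f}x")
    print(f"efficiency on 6 cores: {parallel_efficiency(single, threaded, 6):.0%}")
```

That works out to about a 1.5x speedup, or roughly a quarter of ideal scaling on 6 cores.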
The massive drop at 24 threads suggests something is tripping over itself when the thread count gets high enough in this example.
In the attached I have added a Blocked variant that creates a block attribute to group points into sets of 256. Then it does a parallel foreach over these blocks, and only then iterates over each individual point. This should cut the threading overhead. Can you let me know if it improves on your 24 thread machines?
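The blocking idea can be sketched outside Houdini in plain Python. This is not the attached scene, just an illustration under stated assumptions: the block id is taken to be the point number divided by 256 (integer division), and `group_by_block` and `process_blocked` are hypothetical helpers.

```python
from collections import defaultdict

BLOCK_SIZE = 256  # points per block, as described in the post


def block_id(ptnum, block_size=BLOCK_SIZE):
    """Block that a point falls into, using integer division."""
    return ptnum // block_size


def group_by_block(num_points, block_size=BLOCK_SIZE):
    """Map each block id to the list of point numbers it contains."""
    blocks = defaultdict(list)
    for ptnum in range(num_points):
        blocks[block_id(ptnum, block_size)].append(ptnum)
    return dict(blocks)


def process_blocked(num_points, per_point):
    """Outer loop over blocks (the part that would run in parallel),
    inner serial loop over the points of each block."""
    results = {}
    for _block, ptnums in group_by_block(num_points).items():
        for ptnum in ptnums:
            results[ptnum] = per_point(ptnum)
    return results
```

With a made-up count of 1000 points, `group_by_block(1000)` gives 4 blocks, the last holding the remaining 232 points, so the outer loop schedules 4 pieces of work instead of 1000.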
- varomix
- Member
- 460 posts
- Joined: July 2005
- Offline
These are my results from that new scene.
The new block example Jeff added is almost exactly the same speed as the stamp version, which is nice because I like the stamp way; I'm used to it, and the new way is a little complex for new people.
So what's the magic ingredient that makes it faster?
This is on a 32-thread machine.
Edited by varomix - April 18, 2017 14:03:45
varomix - Founder | Educator @ Mix Training
Technical Artist @ Meta Reality Labs
- jlait
- Staff
- 6413 posts
- Joined: July 2005
- Offline
A hard problem with multithreading is picking “grain size”. This is the amount of work you run in each task. If you make the work too small, like adding two numbers together, you'll spend all your time in task management and end up way, way slower. So whenever you multithread in C++ you always have to think about grain size and ensure small datasets are batched together.
It is sort of like submitting jobs to a render farm. If your frames get short enough, it can become necessary to render 10 frames at once per job rather than each frame individually.
The problem is that we have no idea from the outside what the size of your contained network is. If it is a big chunk of work, the task scheduling won't show up and you should see nice scaling. But with really simple examples the opposite happens.
What I have done in that example is manually batch 256 points at a time. With the given point counts, this also means only 6 threads really become active, saving your 32 thread machine from tripping over itself for no benefit.
That said, I will be looking closer at this high-thread behaviour. Currently the point() function has an unnecessary lock in this example that could also be responsible for things locking up.
I've submitted Bug: 82242 to track this.
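Here is a minimal Python sketch of the grain size idea, using the standard `concurrent.futures` thread pool. It is an illustration, not how Houdini schedules work: `chunks`, `work`, and `run` are hypothetical names, the item count is made up, and Python threads will not actually speed up CPU-bound work like this. The point is only how many tasks the scheduler has to manage.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(items, grain_size):
    """Split items into consecutive batches of at most grain_size."""
    for start in range(0, len(items), grain_size):
        yield items[start:start + grain_size]


def work(x):
    """Deliberately trivial per-item work."""
    return x + 1


def run_batch(batch):
    """One task: process a whole batch serially."""
    return [work(x) for x in batch]


def run(items, grain_size, max_workers=4):
    """Process items in parallel, one task per batch.
    Returns the results in input order and the number of tasks submitted."""
    batches = list(chunks(items, grain_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields batch results in the same order as the batches.
        results = [y for part in pool.map(run_batch, batches) for y in part]
    return results, len(batches)


if __name__ == "__main__":
    items = list(range(1000))  # made-up item count
    for grain in (1, 256):
        _, tasks = run(items, grain)
        print(f"grain size {grain}: {tasks} tasks")
```

With a grain size of 1 the pool has to manage 1000 tiny tasks; with 256 it manages 4, and the results are identical.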
- anon_user_37409885
- Member
- 4189 posts
- Joined: June 2012
- Offline
- anon_user_37409885
- Member
- 4189 posts
- Joined: June 2012
- Offline
- edward
- Member
- 7893 posts
- Joined: July 2005
- Offline
- edward
- Member
- 7893 posts
- Joined: July 2005
- Offline
- tamte
- Member
- 8820 posts
- Joined: July 2007
- Offline