Help : CORRUPT .bgeo.sc FILES

Member
7 posts
Joined: April 2017
Hi All,
We are seeing intermittently corrupt .bgeo.sc files when running sims via Houdini Engine as a Deadline job.

No error message is displayed; the job continues and successfully writes the frames following the corrupt frame.
The job completes without error.

But when we try to reference the .bgeo.sc files, the corrupt frame cannot be loaded. The file size for the corrupt frame is also smaller than the frames around it.
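Since the corrupt frame is described as smaller than the frames around it, one rough way to flag candidates without opening each file is to compare every file's size against its neighbours. The following is only a sketch under stated assumptions: the function name, the `window` and `ratio` defaults, and the zero-padded file naming are all hypothetical, and a sim whose cache size legitimately changes quickly may produce false positives.

```python
from pathlib import Path
from statistics import median


def find_suspect_frames(cache_dir, pattern="*.bgeo.sc", window=2, ratio=0.5):
    """Return files that are much smaller than their neighbouring frames.

    Assumes zero-padded frame numbers, so sorting by name gives frame order.
    `window` and `ratio` are arbitrary illustrative defaults, not tuned values.
    """
    files = sorted(Path(cache_dir).glob(pattern))
    sizes = [f.stat().st_size for f in files]
    suspects = []
    for i, (path, size) in enumerate(zip(files, sizes)):
        # Sizes of up to `window` frames on each side, excluding this frame.
        neighbours = sizes[max(0, i - window):i] + sizes[i + 1:i + 1 + window]
        if neighbours and size < ratio * median(neighbours):
            suspects.append(path)
    return suspects


if __name__ == "__main__":
    import sys

    # Usage: python find_suspects.py /path/to/cache_dir
    for path in find_suspect_frames(sys.argv[1]):
        print("suspiciously small:", path)
```

A check like this could run as a post-job step so that a suspect frame is noticed when the job finishes rather than when the cache is first read.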

When this happens, the simulation is not usable, and the sim must be run again from the beginning, which is tremendously inconvenient.

We have not been able to determine any consistent cause for this. This may happen once a week, or sometimes 3-5 times a day. Sometimes we go days without seeing this, but it always returns.

Has anyone else experienced this? Any suggestions? I would really like to be able to trust our sim jobs more than I do, but now we have to watch them closely when the sim is for a shot with a deadline.

Thank you,
Fred
Edited by Fred P - September 3, 2022 16:12:26
Member
8037 posts
Joined: September 2011
Fred P
Has anyone else experienced this? Any suggestions? I would really like to be able to trust our sim jobs more than I do, but now we have to watch them closely when the sim is for a shot with a deadline.

Weird, never. Do you think you have an Isilon going bad?
Member
7 posts
Joined: April 2017
The server reports nothing unusual and has been working well. It is always possible, I suppose. I have suspected that it could be something network-related, a switching issue or perhaps a write cache issue on the server. But we have not found any proof of that yet.
Member
8037 posts
Joined: September 2011
Fred P
The server reports nothing unusual and has been working well. It is always possible, I suppose. I have suspected that it could be something network-related, a switching issue or perhaps a write cache issue on the server. But we have not found any proof of that yet.

Does the problem go away with 'background writes' disabled on the geometry ROP?
Member
7 posts
Joined: April 2017
We have seen the issue with background writes enabled and disabled.
Staff
528 posts
Joined: August 2019
Try setting the HOUDINI_BUFFEREDSAVE environment variable to 1. This will use more memory as the files will be written to RAM first, so you may need to take that into consideration depending on your file sizes.
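As a minimal sketch of how that suggestion might be applied to a farm job, the variable can be set in the environment of the process that launches Houdini. `HOUDINI_BUFFEREDSAVE` comes from the post above; the helper names and the `run_sim.py` script name are hypothetical, and how your Deadline setup passes environment variables to workers may differ.

```python
import os
import subprocess


def buffered_save_env(base_env=None):
    """Return a copy of the environment with HOUDINI_BUFFEREDSAVE set to 1."""
    env = dict(os.environ if base_env is None else base_env)
    env["HOUDINI_BUFFEREDSAVE"] = "1"
    return env


def run_sim(command):
    """Run a command (e.g. a hython invocation) with buffered saves enabled."""
    return subprocess.run(command, env=buffered_save_env(), check=True)


# Hypothetical usage; "run_sim.py" is a placeholder script name:
# run_sim(["hython", "run_sim.py"])
```

Setting the variable in the job's own environment, rather than globally on the machine, keeps the extra memory use limited to the jobs that need it.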
Member
7 posts
Joined: April 2017
We have enabled BUFFEREDSAVE and we are running some tests. As this is an intermittent problem, I will give it a couple of days to see if it happens. Thanks!
Member
21 posts
Joined: September 2015
Hi, we have exactly the same problem. Did you manage to solve this issue by setting the BUFFEREDSAVE variable?
Member
21 posts
Joined: September 2015
Guys, does anybody have a solution for this? It is starting to be a very, very unpleasant situation. Unfortunately, BUFFEREDSAVE didn't solve this issue for us.
Staff
528 posts
Joined: August 2019
Can you give us more information about your issue? Are you using Deadline? How is your infrastructure set up? Can you provide a corrupted file that we can take a look at?
Member
21 posts
Joined: September 2015
Sure. Pretty much the same description as the author's original post. We launch sims via Deadline. All our project data is stored on a huge Isilon NAS. The render slaves' OS is Rocky Linux 8.8. It happens randomly, so we can't reproduce it on purpose. I've sent a couple of corrupted files under this RFE https://www.sidefx.com/bugs/#/bug/135536 but I can deliver more if needed. Because Houdini doesn't give any error during the simulation, we only discover a particular corruption after the whole simulation is done and the cache is used.
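Because the corruption only shows up once the cache is read, one option is to try loading every frame right after the job finishes. The sketch below assumes it runs under hython, where the `hou` module is available, and uses `hou.Geometry.loadFromFile` to attempt the load; the function names are hypothetical, and the loader is passed in as a parameter so the same loop can be exercised outside Houdini.

```python
def find_unloadable(paths, load):
    """Try to load each file with `load`; return (path, error message) pairs."""
    bad = []
    for path in paths:
        try:
            load(path)
        except Exception as exc:  # any failure to load counts as suspect
            bad.append((path, str(exc)))
    return bad


def hou_load(path):
    """Load a geometry file with Houdini; raises if the file cannot be read."""
    import hou  # only importable inside Houdini or hython

    hou.Geometry().loadFromFile(str(path))


# Hypothetical usage under hython, with a placeholder cache location:
# from pathlib import Path
# frames = sorted(Path("/path/to/cache").glob("*.bgeo.sc"))
# for path, error in find_unloadable(frames, hou_load):
#     print("cannot load:", path, error)
```

Loading every frame costs time and memory on large caches, so this is a trade-off against finding a bad frame only after the full sim has completed.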

We are using several Houdini versions ranging from 19.5.xxx to 20.0.xxx. In the past we used HQueue on Windows (also storing outputs on the Isilon) and never saw this happen. As I've already mentioned, setting HOUDINI_BUFFEREDSAVE didn't help us.

Because we switched to Deadline and Linux at the same time, we cannot tell which of the two might be the cause.
Edited by adlabac - April 23, 2024 15:32:08
Member
7 posts
Joined: April 2017
adlabac
Hi, we have exactly the same problem. Did you manage to solve this issue by setting the BUFFEREDSAVE variable?

I am sorry to say we could not find an absolute cause or definite solution to this issue.

BUFFEREDSAVE seemed to help reduce the frequency of this issue for us. At the time, I honestly thought it could have been caused by our server's write cache not being able to keep up with the number of nodes creating the .bgeo.sc files, because this only happened when we had many machines simultaneously writing very large cache files.

We have not seen this issue since the original post, so I am afraid I can't share much more.

I do hope you get this resolved very soon; it is terrible when this is what slows down production.
Member
21 posts
Joined: September 2015
Thanks, Fred P, for the info. The SideFX guys are investigating it, so we will see if they come up with something.