What's your longest running time (without restart) in production/deployment?

I’m in the process of tracking down an issue* that, so far, I’ve only been able to confirm after running my applications for several days (seems like about 6 days is the trick, I’m trying to speed the failure up so I can debug faster).

This isn’t to point any fingers at Cinder, I’m assuming this issue is one I’ve introduced myself. (The Cinder library is great–even after weeks of running time, I see steady framerates, memory usage is flat and CPU usage is low… Thanks again to the maintainers and contributors!)

But I’m curious how long anyone out there has actually run their apps, without restart, in production. I once asked a partner/client to run an installation for a week, but later found out their technician would restart every AM anyway out of habit. (Despite ambitions, I’ve not yet had a chance to do a permanent or semi-permanent interactive.)

Also curious how many people choose to run at long intervals vs. people who intentionally trigger periodic restarts, and if anyone who has done long-running applications has run into any gotchas with any pieces of the Cinder lib.

*For the curious, on the chance that someone has a clue about the answer: I have two separate animation systems to handle two different types of animations. One updates “by hand”, in that I update variables myself using time deltas calculated and passed on from Cinder’s update. That system is working fine. The other system uses tweens on Cinder’s main app::timeline(); after a week, the animation values handled by that system begin to stutter. Again, the app seems to be fine otherwise: framerate is steady, and the other animation system remains smooth.

Hi,

great question! Here’s my two cents:

  • I once ran an installation for longer than 25 days and suddenly it started to misbehave. Menus would not close, timers were off. I was puzzled, because I could not reproduce the bug, until it finally dawned on me that the internal clock, which represented milliseconds since last boot, used a signed 32-bit integer. After roughly 25 days (2^31 ms is about 24.9 days), it wrapped around and became negative, but most of my code assumed the value would be positive. It’s stuff like that which is really hard to debug.
  • The Timeline class uses floats internally. As you may know, floating point numbers have the best resolution around zero. The larger the value, the lower the resolution becomes. If you add 0.1f to a value of 1.0f, the result will be a value very close to 1.1f. But if you add 0.1f to 1000000.0f, the result will be more like 1000000.7f (example, not for realsies). That’s why, after a while, Timeline-based applications begin to stutter. If you used doubles instead of floats, you probably would not have that problem, or at least it would take far longer before you’d see any stutters. So maybe it’s time for a rewrite, Cinder folks? *)
  • I also have the habit of restarting my installations at least once per 24 hours. This has nothing to do with Cinder, but more with Windows. It tends to slow down after a while and a fresh restart can solve most of your problems. Once a day is overkill, but once a week might not be enough. Although I have very limited experience with Linux, I think it does not exhibit the same problems and you can run it for far longer periods.
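
The wraparound described in the first point can be sketched in a few lines. This is only an illustration under assumptions: `elapsedMs`, `hasTimedOut` and `daysUntilSignedOverflow` are hypothetical helpers, not part of Cinder or of the installation in question. The idea is to read the tick counter as an unsigned 32-bit value and compare differences rather than absolute values, since unsigned subtraction is defined modulo 2^32.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: milliseconds elapsed between two readings of a
// 32-bit tick counter. Unsigned subtraction wraps modulo 2^32, so the
// difference stays correct even if the counter rolled over in between.
inline std::uint32_t elapsedMs(std::uint32_t startTicks, std::uint32_t nowTicks)
{
    return nowTicks - startTicks;
}

// Hypothetical timeout check built on elapsedMs(). It compares a duration,
// never an absolute tick value, so the sign of the raw counter is irrelevant.
inline bool hasTimedOut(std::uint32_t startTicks, std::uint32_t nowTicks,
                        std::uint32_t timeoutMs)
{
    // Assumption: timeouts are far shorter than the wrap period.
    assert(timeoutMs <= 0x7FFFFFFFu);
    return elapsedMs(startTicks, nowTicks) >= timeoutMs;
}

// Days until a signed 32-bit millisecond counter overflows (2^31 ms).
inline double daysUntilSignedOverflow()
{
    return 2147483648.0 / (1000.0 * 60.0 * 60.0 * 24.0);
}
```

For example, a timer started 256 ms before the counter wraps and checked 256 ms after it still reports 512 ms elapsed, where a signed comparison of the raw values would see time going backwards.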

-Paul

*) Alternatively, you can reset the Timeline, for example when all animations are finished. The clock will be reset to zero and gone are the stutters.

Edit:
1.0f + 0.1f = 1.10000002f
1000000.0f + 0.1f = 1000000.13f.
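
To see where those numbers come from, here is a small sketch that measures the gap between a float and the next representable float above it, using `std::nextafter` from the standard library. `floatStep` and `doubleStep` are hypothetical helper names, and the sketch assumes IEEE 754 single and double precision, which is what desktop compilers use in practice.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Gap between `value` and the next representable float above it.
inline float floatStep(float value)
{
    assert(std::isfinite(value));
    return std::nextafter(value, std::numeric_limits<float>::infinity()) - value;
}

// Same measurement for double, for comparison.
inline double doubleStep(double value)
{
    assert(std::isfinite(value));
    return std::nextafter(value, std::numeric_limits<double>::infinity()) - value;
}
```

Under that assumption, `floatStep(1.0f)` is about 1.19e-7, while `floatStep(1000000.0f)` is 0.0625, which is why 1000000.0f + 0.1f lands on 1000000.125. For a clock counted in seconds, the step is 0.03125 s from 262144 s (about 3 days) and doubles to 0.0625 s at 524288 s (about 6.07 days). Both are longer than a 60 fps frame of roughly 0.0167 s. `doubleStep(604800.0)`, one week in seconds, is still on the order of 1e-10 s.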

Thanks, your input is hugely helpful. I’ve eliminated floats in some of my long-running code in the past, so the Timeline issue makes sense-- in the near term, I’ll use the reset approach you mentioned.

It’s also great to confirm that, practically speaking, folks like yourself are using scheduled restarts. I was starting to feel/lean this way myself, but think I needed to hear someone else say it.

I’ll agree with everything Paul has written. You’re likely running into the outer limits of float precision; after a week I believe the gap between adjacent float values exceeds the duration of one frame, which is no good.

Regardless, getElapsedSeconds() needs to be changed to return a double; it’s high on my list for 0.9.2 changes. However, as Paul already pointed out, Timeline uses float internally, so you’ll need to correct for this based on how you’re using it. ci::Timer itself should be fine, and I’d recommend starting by maintaining your own rather than the App’s, and then correcting in some app-specific way.

Ultimately we may redesign Timeline to use either uint64_t or double internally, but for the time being you’ll have to do something like this.
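
One possible shape for such an app-specific correction, as a rough sketch rather than anything Cinder provides: keep absolute time in a double, compute per-frame deltas in double, and only let each animation see its own small local time. `FrameClock` and `LocalTween` are hypothetical names invented for this illustration, and the linear interpolation stands in for whatever easing the real tween uses.

```cpp
#include <cassert>

// Hypothetical frame clock: stores absolute time as a double and hands out
// per-frame deltas, so large absolute values never pass through a float.
class FrameClock {
public:
    explicit FrameClock(double startSeconds) : mLast(startSeconds) {}

    // Returns seconds since the previous tick.
    double tick(double nowSeconds)
    {
        assert(nowSeconds >= mLast);
        const double delta = nowSeconds - mLast;
        mLast = nowSeconds;
        return delta;
    }

private:
    double mLast;
};

// Hypothetical stand-in for a tween: linear interpolation that only tracks
// time since its own start, which stays small no matter how long the app runs.
class LocalTween {
public:
    LocalTween(float from, float to, double durationSeconds)
        : mFrom(from), mTo(to), mDuration(durationSeconds), mLocalTime(0.0)
    {
        assert(durationSeconds > 0.0);
    }

    // Advance by a delta computed in double precision, clamped at the end.
    void step(double deltaSeconds)
    {
        mLocalTime += deltaSeconds;
        if (mLocalTime > mDuration) mLocalTime = mDuration;
    }

    float value() const
    {
        const float t = static_cast<float>(mLocalTime / mDuration);
        return mFrom + (mTo - mFrom) * t;
    }

    bool isComplete() const { return mLocalTime >= mDuration; }

private:
    float mFrom, mTo;
    double mDuration, mLocalTime;
};
```

In a Cinder app, the `nowSeconds` argument could come from your own ci::Timer via `getSeconds()`, which returns a double; each frame you would call `tick()` once and pass the resulting delta to every active animation.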

And I’d agree with Paul that in general a periodic reboot is a good idea anyway - graphics driver resource leaks and the like.

-Andrew

If you’re having trouble convincing your client to implement scheduled downtime / restarts, take the route of being green / energy efficient with them. It sounds a lot better than “I don’t trust my application / OS to run for multiple days”, or “the animation engine I’m using loses floating point precision after x days”. :wink:

Well, even a Boeing 787 needs a reboot from time to time

P.

Hey,

I have multiple projects that have been running non-stop for several years now.
All of them reboot automatically at 5 AM.
I have one project where the reboot has a negative side effect: after the reboot, the PC sometimes hangs when starting the Cinder app. It only happens a few times per year and is probably hardware related.
Aside from that, Cinder apps run great with a daily reboot.
