Tips, Tricks and Techniques to Tame Timer Trouble
First published on the PARSE Software Devices website September 18th, 2006
© Copyright 2006 by Robert Krten, all rights reserved.
Synopsis
This article discusses several cases of timers as used in a large QNX Neutrino project.
The author's experiences with the evolution of the use of timers in this project are highlighted.
It seems so simple...
The project is a control program for a piece of hardware.
The main program is an event-driven set of state machines that performs actions based on
various events (such as I/O points changing, timers expiring, other modules completing tasks) arriving
into the main state machine.
It's in this context that I designed the initial timer functionality for this control program.
Three simple calls were provided:
- arm a timer in milliseconds (as a one-shot or as an auto-reload),
- cancel a timer, and
- tell the timing system to evaluate its timers.
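A minimal sketch of what such a three-call interface could look like in C. All names here (tmr_arm, tmr_cancel, tmr_evaluate, the fixed-size chain, the callback) are hypothetical and are not the project's actual code. The callback stands in for sending a pulse back to the control program, and the evaluate call assumes that exactly one 10ms tick has elapsed each time it runs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TMR_MAX     16   /* hypothetical capacity of the timer chain */
#define TMR_TICK_MS 10   /* assumed heartbeat period in milliseconds */

/* Stand-in for "send a pulse to the control program". */
typedef void (*tmr_expired_fn)(uint32_t id);

struct tmr {
    bool     in_use;
    uint32_t id;            /* 32-bit timer ID chosen by the caller */
    int32_t  remaining_ms;  /* counts down to zero */
    int32_t  reload_ms;     /* 0 = one-shot, otherwise auto-reload period */
};

static struct tmr chain[TMR_MAX];

/* Arm a timer; returns 0 on success, -1 if ms is invalid or the chain is full. */
int tmr_arm(uint32_t id, int32_t ms, bool auto_reload)
{
    if (ms <= 0)
        return -1;
    for (size_t i = 0; i < TMR_MAX; i++) {
        if (!chain[i].in_use) {
            chain[i].in_use = true;
            chain[i].id = id;
            chain[i].remaining_ms = ms;
            chain[i].reload_ms = auto_reload ? ms : 0;
            return 0;
        }
    }
    return -1;
}

/* Cancel every timer carrying this ID; returns how many were cancelled. */
int tmr_cancel(uint32_t id)
{
    int cancelled = 0;
    for (size_t i = 0; i < TMR_MAX; i++) {
        if (chain[i].in_use && chain[i].id == id) {
            chain[i].in_use = false;
            cancelled++;
        }
    }
    return cancelled;
}

/* Run once per heartbeat tick; returns the number of timers that expired. */
int tmr_evaluate(tmr_expired_fn on_expired)
{
    int fired = 0;
    for (size_t i = 0; i < TMR_MAX; i++) {
        if (!chain[i].in_use)
            continue;
        chain[i].remaining_ms -= TMR_TICK_MS;  /* assumes one full tick elapsed */
        if (chain[i].remaining_ms > 0)
            continue;
        if (on_expired != NULL)
            on_expired(chain[i].id);
        fired++;
        if (chain[i].reload_ms > 0)
            chain[i].remaining_ms = chain[i].reload_ms;  /* auto-reload */
        else
            chain[i].in_use = false;                     /* one-shot is done */
    }
    return fired;
}
```

In this sketch a caller would arm a 2-second one-shot with tmr_arm(id, 2000, false) and then rely on the heartbeat handler calling tmr_evaluate() on every tick.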
The basic idea was that when the control program required a timer (say for determining that a piece of
hardware did not get to a desired state within a certain period of time), it would simply arm a timer,
giving it the number of milliseconds from "now" when it should fire, and a 32-bit timer ID.
Then, the control program would return to its blocking point (a MsgReceive() call).
The control program had a regular heartbeat "tick" that would generate a pulse every 10ms or so, and that
was used for tickling the software watchdog process, as well as alerting the timer subsystem that
it was time to check the timer chains to see if any timers had expired.
If no timers in the chain had expired, nothing happened.
If a timer had expired, a pulse would be generated back to the control program, and would eventually
be handled the next time the control program hit its MsgReceive() rendezvous point.
It seems so simple, and yet there were many problems with this.
In this article, I'll discuss two scenarios with timers and blocking calls, and examine the problem
in depth as well as the solution.
The First Implementation and Problem
The first implementation of this timing system is the simplest.
Every time that the control program's "heartbeat tick" occurred, we assumed that 10ms
had gone by -- after all, the heartbeat tick was controlled by a timer_settime() function,
and it was programmed to 10ms.
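For reference, a periodic 10ms timer can be programmed with the POSIX timer calls roughly as follows. This is a portable sketch, not the project's code: the helper names heartbeat_spec and heartbeat_start are hypothetical, the choice of CLOCK_MONOTONIC is an assumption, and the caller supplies the notification. Under QNX Neutrino the sigevent would typically be set up to deliver a pulse (for example with SIGEV_PULSE_INIT()) rather than a signal.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stddef.h>
#include <time.h>

/* Build a periodic spec: first expiry after period_ms, then every period_ms. */
struct itimerspec heartbeat_spec(long period_ms)
{
    struct itimerspec its;

    its.it_value.tv_sec  = period_ms / 1000;
    its.it_value.tv_nsec = (period_ms % 1000) * 1000000L;
    its.it_interval      = its.it_value;   /* non-zero interval = auto-repeat */
    return its;
}

/*
 * Create and start the heartbeat timer.
 * sev describes how expiry is reported (signal, pulse, ...).
 * Returns 0 on success, -1 on failure with errno set by the failing call.
 */
int heartbeat_start(struct sigevent *sev, long period_ms, timer_t *out)
{
    struct itimerspec its = heartbeat_spec(period_ms);

    if (timer_create(CLOCK_MONOTONIC, sev, out) == -1)
        return -1;
    if (timer_settime(*out, 0, &its, NULL) == -1) {
        timer_delete(*out);
        return -1;
    }
    return 0;
}
```

Note that programming the timer this way only controls when notifications are generated; it says nothing about when the receiving program gets around to handling them.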
(The minor quibble about whether the kernel's timing system would give us exactly 10ms,
and not something like 9.999ms or 10.001ms was irrelevant -- the things we were timing were
in the hundreds-of-milliseconds to tens-of-seconds range, so a few microseconds either way was unimportant.)
Well, for a long time, this worked, or at least seemed to.
Occasionally, there were unexplained events, where it looked like the hardware had failed -- from
the point of view of the control program, it looked like the hardware didn't reach a particular state
within its allotted time because its associated timer had "popped" (timed out).
These events were rare, and thus were prioritized at the bottom of the work queue.
When we finally got around to analyzing the problem, it turned out that time was "running too fast".
This was a real head scratcher.
Surely, every time the timer ticked, 10ms had gone by, so therefore, simply subtracting 10ms from each
timer in the timer chain would be the correct thing to do, no ifs, ands, or buts about it, right?
Well, that's the way it was designed, but not the way it turned out to work "in system".
Consider the following sequence of operations:
- display message to operator
- arm timer for 2 seconds
- trigger hardware
- go back to MsgReceive()
One of the assumptions of the system is that all function calls are virtually non-blocking.
This means that when we issue the function call to display a message to the operator, it might block for a few
hundred microseconds, but it certainly wouldn't go away for "a long time" (e.g. tens of milliseconds).
However, like most cookbook :-) implementations using QNX Neutrino, this was a deeply-blocking system
(e.g., A sent a message to B and blocked, B passed the request on
to C and blocked, C performed the work, then unblocked B, which then
unblocked A. Effectively, A was "deeply blocked" on C).
The practical impact of this is that after we called the function to display a message to the operator,
we then armed the timer for 2 seconds (which really only put the value 2000
into the timer's timeout field).
Due to the implementation, the function to display the message may indeed have blocked for a long time
(without getting into the gory details, there's a serial protocol involved with timeouts and retries).
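The effect of a long block just before arming can be modelled with a few lines of C. This is a simplified simulation under stated assumptions, not a reproduction of the project's code: it assumes that heartbeat pulses queue up while the control program is blocked, that the whole backlog is handled as soon as the program returns to MsgReceive(), and that each handled tick subtracts a fixed 10ms from every armed timer. The function name and parameters are hypothetical.

```c
#define TICK_MS 10   /* assumed heartbeat period in milliseconds */

/*
 * Returns how much real time (ms) passes between arming a timer for
 * timeout_ms and that timer expiring, when the call made just before
 * arming blocked for blocked_ms and the ticks that queued up in the
 * meantime are all charged to the newly armed timer.
 */
long real_ms_until_expiry(long timeout_ms, long blocked_ms)
{
    long queued_ticks = blocked_ms / TICK_MS;  /* pulses waiting to be received */
    long remaining    = timeout_ms;            /* value stored when arming */
    long real_ms      = 0;                     /* real time since arming */

    /* Back at the receive point: the backlog drains while almost no real time passes. */
    while (queued_ticks > 0 && remaining > 0) {
        remaining -= TICK_MS;
        queued_ticks--;
    }

    /* From here on, each handled tick really does correspond to TICK_MS. */
    while (remaining > 0) {
        remaining -= TICK_MS;
        real_ms   += TICK_MS;
    }
    return real_ms;
}
```

In this model, real_ms_until_expiry(2000, 0) gives 2000, while real_ms_until_expiry(2000, 500) gives 1500: a display call that blocked for half a second makes the 2-second timer pop half a second early, which from the control program's point of view looks like time "running too fast".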