Obviously cron jobs are abundantly useful for so many things, all the way from basic housekeeping up to big application functionality.
They're also the source of plenty of flail. What do I mean?
- They are neither code nor data, so often get overlooked, or shonkily installed, by application deployment tools
- They run with a minimal environment that can catch out the unwary: scripts that work in an interactive shell sometimes don't from cron
- The default behaviour of mailing output to the cronjob owner generates large amounts of mail that gets ignored, filtered or bounced
- Jobs can fail silently and no-one notices until, say, you need to restore that backup that hasn't run for the last six months
- Jobs that helpfully append their output to a log commonly don't rotate that log
- It's easy to have jobs overlapping if they get stuck or take longer than expected to complete. This is a splendid way of wedging a machine.
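The minimal-environment point is easy to demonstrate. This is only an approximation — cron actually sets a handful of variables such as HOME and SHELL, and typically a bare PATH like /usr/bin:/bin — but `env -i` gives the flavour:

```shell
# Approximate cron's stripped-down environment with env -i: a command
# that your interactive shell finds via a customised PATH may not be
# found at all from cron.
env -i /bin/sh -c 'echo "cron-ish PATH: $PATH"'
echo "interactive PATH: $PATH"

# The usual fixes: set PATH at the top of the crontab, and/or use
# absolute paths in job lines. Illustrative crontab entries:
#   PATH=/usr/local/bin:/usr/bin:/bin
#   0 2 * * *  /usr/local/bin/backup.sh
```

The paths in the commented crontab lines are placeholders; the point is that a job should not rely on whatever environment happened to be set when you tested it by hand.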
The mail aspect is a particular peeve. In some jobs my mailbox has enjoyed several thousand cron-generated mails a day, and there's no way I can accurately look at each one and react to it. Mostly they contain expected output from successful job execution, so they're easy to skip. But I don't trust my eyes to get that right all the time.
One approach to this is to arrange for jobs to only send mail on error. This is an improvement, but can lull you into thinking that a job is happily succeeding when in fact it's either not running or the only-on-error logic is bust. Since cron jobs often cover essential system tasks like backing up, syncing data around and reporting, it's vital that they don't fail silently.
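The only-mail-on-error approach can be sketched as a small POSIX shell wrapper. This is illustrative rather than any particular job's scheme: the function name, recipient address and use of mail(1) are my assumptions.

```shell
# Run a job quietly: capture all output, discard it on success (so cron
# sends no mail), and mail the captured log only if the job fails.
MAILTO_ADDR=${MAILTO_ADDR:-ops@example.com}   # placeholder recipient

run_quietly() {  # usage: run_quietly <command> [args...]
    log=$(mktemp) || return 1
    if "$@" >"$log" 2>&1; then
        rm -f "$log"        # success: stay silent
        return 0
    else
        status=$?           # exit status of the failed job
        mail -s "cron job failed: $*" "$MAILTO_ADDR" <"$log"
        rm -f "$log"
        return "$status"
    fi
}
```

The remaining weakness is exactly the one above: if the wrapper itself breaks, or the job never runs at all, nothing tells you.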
I've worked somewhere that tackled this by collating cron-generated mails from diverse systems into a system mailbox and pattern matching them for failure signs. This seems slightly dubious -- it's fragile and labour intensive -- but at least the system also flagged if expected jobs failed to arrive and got our inboxes tamed.
To tackle these problems I find myself writing wrappers for cronjobs. I've written several variants to meet different situations' needs. Unhelpfully, I call them all cronwrap. These wrappers set out to:
- Engage the amazingly useful lockrun utility to guard against multiple execution of stuck crons
- Place cron output into timestamped logs that can be both aged out and made available to interested parties
- Hook into local monitoring systems:
  - On execution, update a run counter (SNMP data or some simple text file)
  - On failure, send an SNMP trap or leave some bait for Nagios. Also, update a fail counter
- If lockrun has prevented a job running owing to overlap, send an SNMP trap or similarly bait Nagios
- If required, send output by mail somewhere (sometimes this is necessary, even with the concerns listed above)
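Pulling those pieces together, a minimal cronwrap might look like the sketch below. I've used flock(1) where the list names lockrun — the same overlap guard, just more commonly installed — and the state directory, counter-file names and alert hooks are all illustrative assumptions, not the original tool.

```shell
# A sketch of a cronwrap: overlap guard, timestamped per-job logs, and
# simple run/fail/overlap counters that monitoring can poll.
CRONWRAP_DIR=${CRONWRAP_DIR:-/var/lib/cronwrap}   # assumed state dir

bump() {  # bump <dir> <counter>: increment a plain text-file counter
    n=$(cat "$1/$2" 2>/dev/null || echo 0)
    echo $((n + 1)) >"$1/$2"
}

cronwrap() {  # usage: cronwrap <jobname> <command> [args...]
    name=$1; shift
    dir=$CRONWRAP_DIR/$name
    mkdir -p "$dir" || return 1
    log=$dir/$(date +%Y%m%d-%H%M%S).log

    # Overlap guard: take a non-blocking lock on a per-job lockfile.
    exec 9>"$dir/lock" || return 1
    if ! flock -n 9; then
        bump "$dir" overlaps    # previous run still going: record it
        return 0                # (real wrapper: SNMP trap / Nagios bait)
    fi

    if "$@" >"$log" 2>&1; then
        bump "$dir" runs        # success: update the run counter
        return 0
    else
        bump "$dir" fails       # failure: update fail counter and alert
        return 1                # (real wrapper: SNMP trap / Nagios bait)
    fi
}

# A crontab entry might then read (illustrative):
#   17 3 * * *  /usr/local/bin/cronwrap nightly-backup /usr/local/bin/backup.sh
```

Aging out the timestamped logs then becomes a simple find job, and the counter files give monitoring something concrete to poll rather than a mailbox to grep.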
So, nothing surprising there. Using such wrappers helps keep cron jobs tamed and reliable, and it monitors them close to where the action occurs rather than mediating via SMTP.
This is hardly invention either; there's plenty of prior art, with different nuances in behaviour to meet the needs of different environments. Perhaps I'll merge the variants of my efforts and publish too.
What's curious is that this functionality isn't available inside the cron daemon (( To be clear, I'm talking about the BSD cron written by Paul Vixie. None of the variants I've seen address these concerns either. I'd love to know if there are any I've missed.)) itself. It is perfectly placed to catch exit status, divert output and know whether a job has overrun, and it would remove the need for all this additional monkeying to make jobs reliable and well behaved. If my C weren't just read-only I'd have a crack at it!
There, I've finally condensed all my cron rant into one sustained piece.
Update: I posted a cron wrapper at https://github.com/zomo/cronwrap.