Scheduler Fault Tolerance & Load Balancing

Obsidian Scheduler provides enterprise scheduling features while natively supporting pooling and clustering, or in other words, load balancing and fault tolerance. And it does so in a way that is painless and non-invasive: you don’t have to do anything. Load balancing and fault tolerance are built into every instance of Obsidian Scheduler, whether you run it inside the web admin app, embedded in your own application, as a standalone server or in any combination of these. This is critical for a scheduler, since software or hardware faults, unanticipated load and any number of other problems could cripple or bring down a scheduler instance and prevent critical jobs from firing. This is exactly where pooling and clustering fit so well.

In fact, we are so passionate about fault tolerance and load balancing that we don’t offer a single-node version of Obsidian. Every licence covers a minimum of two nodes, and your fully functional trial lets you run two nodes without any functional restriction. We want you to have at least a second instance running, so that your scheduled jobs run on time and a failure doesn’t prevent other scheduled items from completing or subsequent runs from firing.

Many enterprise server solutions support pooling and clustering, but they often rely on complex configuration strategies and/or communication between pool participants. Obsidian needs none of these. Every Obsidian Scheduler instance, of any type, automatically joins the existing pool/cluster, or establishes it if it is the first one on the scene. No extra configuration is required. No communication between servers is necessary: no multicast and no replication of data between servers. This means you can easily swap out hardware after a failure or add a new member to share the load. If you have standby hardware, you can even keep it running while it waits for a node licence to become available, and it will automatically take over as soon as one is.
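This post doesn’t describe how instances coordinate without talking to each other. One common way to get this behaviour is for every node to record a time-limited lease in a store that all nodes share, so a crashed node’s slot simply expires and can be claimed by whoever asks next. The sketch assumes that approach and is a hypothetical illustration, not Obsidian’s implementation: `SharedStore`, `try_claim`, the slot count and the lease length are invented for the example, and an in-memory dict stands in for what would in practice have to be an atomically updated shared database.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStore:
    """Hypothetical stand-in for a store every node can read and write.

    A real deployment would need atomic updates (e.g. a transactional
    database row per slot); this in-memory version only shows the logic.
    """
    licence_slots: int
    lease_seconds: float
    # slot index -> (node_id, lease expiry time)
    leases: dict = field(default_factory=dict)

    def try_claim(self, node_id: str, now: float) -> bool:
        """Renew this node's lease, or claim a free or expired slot.

        Returns True if the node holds a licence slot after the call.
        """
        # A node that already holds a live lease just extends it.
        for slot, (holder, expiry) in self.leases.items():
            if holder == node_id and expiry > now:
                self.leases[slot] = (node_id, now + self.lease_seconds)
                return True
        # Otherwise take the first slot that is unused or whose lease expired.
        for slot in range(self.licence_slots):
            current = self.leases.get(slot)
            if current is None or current[1] <= now:
                self.leases[slot] = (node_id, now + self.lease_seconds)
                return True
        # All slots are held by live nodes: stay on standby and retry later.
        return False


if __name__ == "__main__":
    store = SharedStore(licence_slots=2, lease_seconds=10.0)
    print(store.try_claim("a", 0), store.try_claim("b", 0))  # both nodes get a slot
    print(store.try_claim("standby", 0))                     # no free slot yet
    store.try_claim("a", 5)                                  # "a" renews; "b" has crashed
    print(store.try_claim("standby", 11))                    # "b"'s lease expired
```

Here two nodes hold the two licence slots and the standby node is refused until node `b` stops renewing its lease; on its next attempt after the lease expires, the standby takes over `b`’s slot without any message being exchanged between nodes. Choosing the lease length is a trade-off between how quickly a failed node is replaced and how often live nodes must write to the shared store.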

Obsidian also supports fault tolerance of individual jobs. A job may stall, fail to complete because its instance failed, fail with an exception, miss its scheduled run because no nodes were running, or conflict with another job. Obsidian provides recovery and tolerance mechanisms for all of these failure modes, and they are all configurable and managed via the web interface. You can even configure specialized job chaining based on the state of the source job in the chain. In an upcoming release, Obsidian will offer fully manageable workflows, handled within Obsidian itself, driven by source job state and/or its output, or indeed by any condition or criteria you may have. You can also use the web interface to subscribe to server and job events at a high level, or target just the events you are concerned with, so that you’re kept up to date without having to log in, parse and review log files, and so on.

We know that running software in production can be unpredictable and that, all too frequently, bad things happen. We want Obsidian Scheduler to keep you safe and to help you feel secure. Share your stories with us, or let us know if you can think of any other ways we can make Obsidian better able to adapt to scheduling problems.