This document details best practices for maximizing the uptime of applications running on ClustrixDB. It covers a wide range of topics, from environmental requirements to change management procedures, all of which ultimately affect the availability of your application. Many of these are standard best practices or concepts with which you are likely already familiar.
Designing for high availability minimizes risk with the following strategies:
The following best practices maximize high availability on ClustrixDB:
An operationally mature organization may have as many as four different environments:
For highest availability, a production environment may contain a pair of identical clusters which replicate in a Master-Master configuration. Only one cluster should actively take writes, with the other available for immediate failover (active/passive); alternatively, if there are distinct applications or databases, these may be configured such that one cluster is active for a set of applications, while the other cluster is active for another. Having both clusters take writes (active/active) is possible but presents a number of operational challenges (see Master-Master Replication). Managing cut-over of application load from one cluster to the other can be handled through the use of an external load balancer, or by reconfiguration of application servers.
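The active/passive cut-over decision described above can be sketched in a few lines. This is only an illustration, not ClustrixDB functionality: the `Cluster` type, the `healthy` flag (assumed to come from an external health check or load balancer probe), and the function name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cluster:
    name: str      # hypothetical label, e.g. "cluster-a"
    healthy: bool  # assumed result of an external health check


def choose_write_target(active: Cluster, standby: Cluster) -> Optional[Cluster]:
    """Return the single cluster that should take writes, or None if neither can."""
    if active.healthy:
        return active   # normal operation: keep writing to the active cluster
    if standby.healthy:
        return standby  # failover: direct writes to the passive cluster
    return None         # both unavailable: escalate, e.g. to disaster recovery


if __name__ == "__main__":
    target = choose_write_target(
        Cluster("cluster-a", healthy=False),
        Cluster("cluster-b", healthy=True),
    )
    print(target.name if target else "no healthy cluster")
```

Returning exactly one target (or none) mirrors the recommendation that only one cluster actively takes writes at a time; a load balancer or application configuration step would then act on the result.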
A disaster recovery environment is typically at a geographically distinct location and includes a cluster which replicates from the production environment, along with application servers that are able to provide site functionality (possibly with degraded performance) in case of site-wide failure of the production environment.
A staging environment is typically a scaled-down facsimile of the production environment, including comparable application servers and datasets. It is used to validate changes to the application software as well as upgrades to the ClustrixDB software. A crucial part of the staging environment is a test automation framework that exercises the application and database with a load approximating peak load in the production environment.
A development environment allows for more ad hoc development, where the risk of developers interfering with each other's work is inconsequential.
To achieve optimum uptime, Clustrix highly recommends the use of multiple clusters, which can provide the following:
To take full advantage of ClustrixDB's fault-tolerant architecture, the following environmental and provisioning requirements should be met:
The vast majority of software failures arise from changes in application behavior, whether due to a bug in the application itself or exposure of a bug in an underlying layer such as the database. Accordingly, the best practice is to first validate such changes thoroughly in non-production environments, and then roll them out carefully into production, with rollback plans in place to undo the changes in case of surprises.
The existence of development and staging environments allows an organization to roll out new applications and application changes in a safe manner.
Upgrading the software on your ClustrixDB cluster can be treated in much the same way as application code changes. While ClustrixDB software releases are thoroughly tested in-house, changes such as new compiler optimizations can have an unanticipated impact on customer workloads; validation of the new release on a staging cluster running a simulated workload allows discovery of such issues prior to production rollout.
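One way to make such staging validation concrete is to compare per-query latencies measured under the simulated workload on the current release against those measured on the new release. The sketch below assumes the measurements have already been collected into dictionaries keyed by a query label; the function name, the labels, and the 25% tolerance are illustrative assumptions, not part of ClustrixDB.

```python
from typing import Dict, List


def find_regressions(
    baseline_ms: Dict[str, float],
    candidate_ms: Dict[str, float],
    tolerance: float = 1.25,  # assumed threshold: flag anything over 25% slower
) -> List[str]:
    """Return labels of queries whose candidate latency exceeds baseline * tolerance."""
    regressed = []
    for query, old in baseline_ms.items():
        new = candidate_ms.get(query)
        # Queries missing from the candidate run are skipped, not flagged.
        if new is not None and new > old * tolerance:
            regressed.append(query)
    return sorted(regressed)


if __name__ == "__main__":
    baseline = {"checkout": 10.0, "search": 20.0}
    candidate = {"checkout": 11.0, "search": 30.0}
    print(find_regressions(baseline, candidate))
```

Any query flagged this way would be a candidate for investigation before the release is rolled out to production.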
In an ideal operational environment, customers can participate in the beta program, obtaining early release candidates for use in their development cluster. Once a qualified release is available, they can test it in their staging environment to eliminate problems evident under heavy workloads. When upgrading production, if a pair of clusters is available, one cluster should be upgraded first; that cluster can then run under load for a full day or week before the second cluster is upgraded, so that the second cluster, still running the prior stable release, remains available as a fallback.
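The staged production upgrade can be expressed as a simple gate: the second cluster is upgraded only after the first has run the new release under load for the chosen soak period without problems. The following is a minimal sketch under stated assumptions; the function name, the one-day default, and the `open_issues` count are hypothetical and would come from your own change management process.

```python
from datetime import datetime, timedelta


def second_cluster_upgrade_allowed(
    first_upgraded_at: datetime,
    now: datetime,
    soak: timedelta = timedelta(days=1),  # assumed soak period; a week is also common
    open_issues: int = 0,                 # issues attributed to the new release
) -> bool:
    """True once the first cluster has soaked long enough with no open issues."""
    soaked = now - first_upgraded_at >= soak
    return soaked and open_issues == 0


if __name__ == "__main__":
    upgraded = datetime(2024, 1, 1, 9, 0)
    print(second_cluster_upgrade_allowed(upgraded, datetime(2024, 1, 1, 18, 0)))
    print(second_cluster_upgrade_allowed(upgraded, datetime(2024, 1, 2, 9, 0)))
```

Until the gate opens, the second cluster keeps running the prior release and so remains a usable fallback.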
While eliminating single points of failure in your application stack is beyond the scope of this document, the following guidelines pertain specifically to how your application interacts with the ClustrixDB database layer:
ClustrixDB parallel fast backup splits the backup work across the nodes, so backup time remains nearly constant as your cluster grows. When planning your backup strategy, consider the following: