Maybe it’s because we sometimes get trapped in a bubble shaped by what we do day to day and the field we mix in, but I genuinely appreciate the chance to talk to customers and discuss something that changes their perception of the way we all do things.
Recently I was talking to a large client about High Availability (HA), a topic I’m sure we’ve all been over a million times… and we know the rules. Oracle helpfully published the Maximum Availability Architecture (MAA) for us, with RAC and Data Guard at the database level, WebLogic clustering and replication in the mid-tier, and beyond that multiple web tiers, Load Balancers, and so forth. Cloud doesn’t really change much: maybe we have Load Balancing as a Service (LBaaS) and the fact that everything is virtual thrown in, with Fault and Availability Domains to consider, but it’s generally the same design principle.
Going beyond that though… what about testing for failures? Sure, we all smoke test systems and do (hopefully) the basics of crashing nodes (physically on-premises, virtually in cloud) to simulate loss of nodes, but once we go into production do we ever test that the HA still holds true? Or do we just trust our design?
Speaking to my customer the other day, we chatted about the concept of Chaos Engineering, and what the perception of it would be in the Oracle world (and indeed whether we were really brave enough to see it through). Put simply, Chaos Engineering is based around periodically “faulting” elements of the wider system (entire servers/VMs, containers, physical elements, etc.), with the ultimate goal of gaining confidence that the always-on, fault tolerant design we put in place really does do what it says on the tin.
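Strictly speaking, the definition taken from Principles Of Chaos describes Chaos Engineering as the discipline of experimenting on a distributed system in order to build confidence in its capability to withstand turbulent conditions in production.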
So, the key word (for me at least) in here is distributed. As we move to cloud as the predominant paradigm for new deployments, and for an increasing amount of “lift and shift” (please excuse the marketing jargon!), developers, operators and architects are going to begin to expect all systems to be modelled as distributed, and concepts like Chaos Engineering are going to become part of deployment standards.
For now, I can definitely see our teams looking at it for the web tier and app tier, sporadically crashing Apache, nginx, WebLogic, JBoss, GlassFish, etc. to ensure we have confidence in the system. But taking down a RAC node in a database randomly? Or, even worse, a single instance Oracle database? Perhaps this is where we’ll start to see architectures change, with a fault tolerance tier in front of the RDBMS being considered. As distributed becomes the de facto standard, we need to consider our database deployments very carefully.
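To make the idea a little more concrete, here is a minimal sketch of a chaos experiment as a pure in-memory simulation in Python. Nothing here touches real infrastructure: `Node`, `LoadBalancer`, `steady_state_ok` and `run_experiment` are hypothetical names invented for illustration, and the "steady state" is simply assumed to mean that every probe request gets an answer. A real experiment would fault actual servers or containers and measure real traffic.

```python
import random


class Node:
    """Hypothetical stand-in for a web or app server (in-memory only)."""

    def __init__(self, name):
        self.name = name
        self.up = True

    def handle(self, request):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class LoadBalancer:
    """Tries each node in turn, skipping any that have failed."""

    def __init__(self, nodes):
        self.nodes = nodes

    def handle(self, request):
        for node in self.nodes:
            try:
                return node.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy nodes left")


def steady_state_ok(lb, probes=20):
    """Assumed steady-state hypothesis: every probe request is answered."""
    try:
        for i in range(probes):
            lb.handle(f"probe-{i}")
        return True
    except RuntimeError:
        return False


def run_experiment(lb, rng):
    """Fault one random node, check the hypothesis, then restore the node."""
    victim = rng.choice(lb.nodes)
    victim.up = False
    try:
        return victim.name, steady_state_ok(lb)
    finally:
        victim.up = True  # always undo the injected fault


if __name__ == "__main__":
    rng = random.Random(42)

    # A clustered tier with three interchangeable nodes.
    cluster = LoadBalancer([Node(f"app{i}") for i in range(1, 4)])
    for round_no in range(1, 4):
        victim, ok = run_experiment(cluster, rng)
        print(f"round {round_no}: faulted {victim}, steady state held: {ok}")

    # A single instance with nothing to fail over to.
    single = LoadBalancer([Node("db1")])
    victim, ok = run_experiment(single, rng)
    print(f"single instance: faulted {victim}, steady state held: {ok}")
```

In this toy model the three-node tier keeps answering whichever node is faulted, while the single instance fails its steady-state check every time, which is exactly the asymmetry between the app tier and a single instance database described above.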
Let me know your thoughts in the comments section below.