Nov 2024
5 Lessons Skydiving Teaches Us about IT Maintenance
Several years ago, before kids consumed my spare time, I used to hurl myself out of moving aircraft for fun. A lot of people think of skydivers as risk-taking adrenaline junkies, but I found them to have a very mature understanding of risk. They jump not because they ignore the risks, but because they have taken the time to understand and plan for them.
Technical Strategy Lead
With that in mind, I thought it would be fun to look at how applying lessons from my days as a skydiving student can help us better maintain our IT systems.
-
Make a Maintenance Plan
Before a skydiver gets on a plane they need a plan. What altitude will they get out? What manoeuvres will they undertake? At what altitude will they deploy their parachute? Where do they need to be when it opens? What is the wind speed and direction? What landing pattern must they use?
Once you exit the aircraft you are committed. If your canopy opens too far downwind of the drop zone, you’re not going to make it back, and your best-case scenario is now a beer fine. If you didn’t memorise the landing pattern, you risk a mid-air collision, which is… really bad.
All IT maintenance changes should have a plan. Good practice is to document that plan. The act of writing it down forces you to think it through, considering the potential problems and dependencies. Often when I plan out a change and read it back, I realise it’s quite a bit more complicated than it seemed in my head, and I’m going to need to allocate more time for the implementation.
The long walk of shame back to the drop zone after landing off offers a similar feeling to that of passing the desks of your colleagues who are unable to work because your change went awry or overran due to lack of planning.
-
Peer Review
Everybody makes mistakes. Our carefully crafted plan may be flawed. If we missed something during planning, we won’t be able to spot the mistake ourselves, because if we knew what was missing, we wouldn’t have missed it!
No skydiver, no matter how experienced, gets on a plane without someone else checking their kit. Is there a loose strap? Are your handles and pads in the correct places? Is there a bit of loose fabric from your pilot chute peeking out behind your back?
A second pair of eyes covers our blind spots. At Waterstons we have a colleague review all change plans because nobody is infallible, and everybody has something to learn.
-
Communicate
You’ve got a plan. It’s a good plan. It has been reviewed and is ready to go. Now share it with your colleagues!
In skydiving you may not know the other jumpers on your lift or what their plans are until you’re on the flight line together. You may be getting out at different altitudes or planning different exercises. When you’re flying through the air at 120 mph without brakes, and the only crumple zone is your face, the last thing you want to see is a canopy opening directly below your formation.
In the IT world the consequences may not be quite so severe, but you won’t win favour with your colleagues if you end up tripping over each other’s changes. At some point many of us will have experienced the frustration of being in the middle of a critical deployment only to lose our connection because someone else decided to do network maintenance in the same window; or we’ve been dragged into an all-hands major incident response only to discover it was planned maintenance all along. Sharing your plan is an easy way to avoid a whole lot of stress and wasted time. With distributed teams in particular, a centralised method of publishing planned IT maintenance is essential.
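To make the idea of a shared maintenance calendar a little more concrete, here is a minimal sketch in Python of the kind of check such a calendar could run: flag any two planned changes whose windows overlap. The `MaintenanceWindow` record, its fields, and the `find_conflicts` helper are hypothetical names invented for illustration, not a description of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class MaintenanceWindow:
    """A planned change as it might appear in a shared calendar (hypothetical schema)."""
    team: str
    description: str
    start: datetime
    end: datetime


def overlaps(a: MaintenanceWindow, b: MaintenanceWindow) -> bool:
    # Treat windows as half-open intervals [start, end): two windows clash
    # when each one starts before the other one ends.
    return a.start < b.end and b.start < a.end


def find_conflicts(windows: list[MaintenanceWindow]) -> list[tuple[MaintenanceWindow, MaintenanceWindow]]:
    # Compare every pair once and keep the pairs that overlap.
    return [(a, b) for a, b in combinations(windows, 2) if overlaps(a, b)]
```

A check like this only says that two changes share a time slot, not whether they actually affect each other. In practice it is a prompt for the two teams to talk, which is the point of publishing the plan in the first place.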
-
Plan for Failure
Sometimes things go wrong that are simply beyond your control. A hardware failure in the middle of a reboot, a line snapping when your canopy deploys… these things can and will happen.
Skydiving problems are categorised either as nuisance factors that can be remedied in the air, or malfunctions that necessitate an immediate cutting away of the failed canopy and deployment of the reserve. Skydivers train to recognise and respond to each before they are allowed in the air.
During IT maintenance, some issues can be resolved in-flight; but we should always have a back-out plan to restore service if things don’t go as expected. Never get yourself into trouble that you cannot extract yourself from. Never jump without a reserve parachute.
As with skydiving, it is important to bear in mind that you are operating within time constraints. Just because you know how to fix a problem in-flight doesn’t mean you should. A skydiver must maintain altitude awareness. If it will take them 20 seconds to fix a problem, and 10 seconds to reach the ground, they need to cut away and stop trying to fix it. While an IT professional doesn’t need to worry so much about the ground stopping their troubleshooting adventure, they do need to worry about their troubleshooting adventure stopping the business if it overruns past the end of their change window.
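The altitude-awareness rule translates into a simple piece of arithmetic. The sketch below is one possible way to express it, under an assumption of mine that is not spelled out above: if the fix attempt fails, you still need enough time left in the window to run the back-out plan. The function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def should_back_out(
    now: datetime,
    window_end: datetime,
    estimated_fix: timedelta,
    rollback_duration: timedelta,
) -> bool:
    """Return True when attempting the fix would leave no time to roll back.

    Assumption: the fix might not work, so it must be finished (or abandoned)
    early enough for the back-out plan to complete inside the change window.
    """
    # The latest moment at which a rollback can start and still finish in time.
    rollback_deadline = window_end - rollback_duration
    # If the fix attempt would run past that moment, stop fixing and back out.
    return now + estimated_fix > rollback_deadline
```

For example, with a window ending at 02:00, a 15-minute rollback, and the clock showing 01:30, a 20-minute fix would finish at 01:50, after the 01:45 rollback deadline, so the function returns `True`. A 10-minute fix would finish at 01:40 and is still safe to attempt. The estimates are only as good as your judgement on the night, so it is worth being pessimistic about both numbers.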
-
Learn from Failure
My first line manager had a catchphrase: “I don’t care if people make mistakes. I care if they don’t learn from their mistakes.” It’s important that when a change goes wrong, we take time after the dust has settled to review what happened and look for ways to improve. Incident reports help us to identify gaps, improve our processes, and make sure we don’t get bitten by the same problem again in the future.
In many scenarios and locations, a skydiver is required to equip a device called an AAD. This Automatic Activation Device tracks the skydiver’s altitude and rate of descent. If the AAD determines the user is descending too fast, too low, it fires a small explosive charge, propelling a blade that cuts through the closing loop of the reserve canopy, causing it to deploy automatically. It is a last-ditch attempt to save the life of a skydiver who has become incapacitated mid-air, or lost altitude awareness. My instructor told me that when an AAD fires, they investigate it like a fatality. From the skydiver’s perspective: their processes have failed, and somebody should have died. Lessons must be learnt and improvements made before it happens in a scenario where an AAD is unable to save them.
Where a significant business impact can occur, as is the case with security risks, partial failures should be treated with the same rigour as a major event. In our security incident log at Waterstons, we have a category for near misses. These are instances where no security breach occurred, but one or more technical controls or processes let us down. A strategy of defence in depth means that one software flaw or one human mistake should not be enough to leave us exposed, but even so we should proactively identify and remediate these failings to make sure all layers of our defence remain as effective as possible.