One of the things I have come to believe in since working in a full-stack role is the value of a postmortem. When something goes wrong, it is too easy to just fix it and move on to the next thing, without stopping to think about what actually went wrong and how your reactions influenced the end result. Skipping that step can stop you from identifying weaknesses in your systems and preventing the issue from recurring.
A recent incident at work highlighted this for me quite nicely. We have a number of shared web front ends, each of which runs a selection of workloads. Usually this is fine and doesn’t cause us any problems, but on this occasion one of the apps hosted on one of those boxes suffered an unexpected load spike and started to starve the other apps on the same box of resources - causing a cascade of alerts and error messages.
Fixing the immediate issue was nice and simple - we fixed some SQL that was returning too much data and everything was fine. However, one thing we identified in the postmortem that had not previously been considered was the effect that increased load and load spikes on a single application would have on the other occupants of the shared box. Some of the code hosted on these front end boxes is very much in the legacy category and offers limited (if any) ability to monitor it. Some of this can be countered by using secondary metrics such as process CPU and RAM usage, and IIS stats in the case of web apps, but nothing beats having an application expose proper metrics that let you know in real time how much load it is under.
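To give a flavour of what "proper metrics" could look like, here is a minimal sketch of an in-process load meter that counts how many requests an app has handled in the last minute. It is written in Python purely for illustration; it is not the code from our systems, and the class name `RequestRateMeter`, its methods, and the 60-second default window are all assumptions made up for this example.

```python
import threading
import time
from collections import deque


class RequestRateMeter:
    """Hypothetical sliding-window counter of recently handled requests."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock  # injectable so the meter can be tested without sleeping
        self._events = deque()  # timestamps of recorded requests, oldest first
        self._lock = threading.Lock()

    def record(self):
        """Call once per handled request."""
        with self._lock:
            self._events.append(self._clock())

    def count(self):
        """Return how many requests were recorded within the window."""
        now = self._clock()
        with self._lock:
            # Drop timestamps that have aged out of the window.
            while self._events and now - self._events[0] > self._window:
                self._events.popleft()
            return len(self._events)
```

The idea would be to call `record()` from the request handler and have whatever monitoring agent you use poll `count()`, so a load spike on one app shows up against that app rather than only as box-wide CPU pressure.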
With this in mind, we have made it more of a focus to improve monitoring of our legacy code base where possible. In many cases, this is actually fairly easy to improve on - just about anything is a step up from “email if something breaks”. If nothing else, writing the errors to a log makes it far easier to graph error rates over time than having to pull them from an email inbox!
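As a rough illustration of why a log is easier to work with than an inbox, the sketch below buckets error lines by minute so they can be fed straight into a graph. It assumes a hypothetical log format of `YYYY-MM-DD HH:MM:SS LEVEL message`; the function name `error_counts_per_minute` and that format are assumptions for this example, so adjust the parsing to whatever your application actually writes.

```python
from collections import Counter
from datetime import datetime


def error_counts_per_minute(lines):
    """Count ERROR lines per minute.

    Assumes each line looks like '2024-01-15 10:32:07 ERROR message'
    (a hypothetical format chosen for this sketch).
    """
    counts = Counter()
    for line in lines:
        parts = line.strip().split(" ", 3)
        if len(parts) < 3 or parts[2] != "ERROR":
            continue  # not an error line
        try:
            stamp = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        counts[stamp.replace(second=0)] += 1  # truncate to the minute
    return dict(sorted(counts.items()))
```

Passing it an open file, for example `error_counts_per_minute(open("app.log"))` with a path of your own, returns a mapping from each minute to the number of errors logged in it, ready to plot.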
Without performing the postmortem review of what happened and what we could learn, this glaring lack of monitoring would likely have continued to be missed. Now we are at least aware that there is a hole in what we can see of our systems, and we are trying to remediate it as quickly and effectively as possible.