I love reading postmortems. A good postmortem usually unveils a set of problems some of which you can have in your company as well. As they say: there is never a single root cause.
Here is a postmortem from Reddit about their Pi-day outage.
It has everything you love: complex systems, legacy software, processes that were not tested that well, sacred knowledge that is long gone, etc.
Don’t get me wrong, I’m saying that not to shame Reddit. In fact they did a great job highlighting all the problems. It’s much harder and takes more courage than just say: Calico broke - Calico bad.
Also, I have similar problems at my place as well and I bet you have too. This why it’s important to recognize the importance of such “low priority tech debt”. Cleaning that out may save your company’s ass someday.
#kubernetes #networking #postmortem