I've had what seems to be a disk crash. Luckily, I had some recent backups, so the data loss isn't bad, but there is still some. It exposes some soft spots in my processes, though, and the pain of spinning things back up means recovery will take some time.
This blog is minimally affected: while the host it is built on went away, the host it is deployed to did not, and the minimal data loss was easily recovered. But I shouldn't need the deployment target to recover the builder, so that weakness needs to be addressed.
In the short term, I have to build some ad-hoc backups to feel a bit more comfortable, but in the longer term I need a fully DR-resilient process, where I can rebuild everything from an existing backup and nothing is too precious to lose. I have ideas here, but I need to try and test them too. I think I'll also lean a bit more on the cloud and S3, where in the past I used Glacier in a targeted manner. I'm not sure I'll migrate to S3 fully, but the goal is to have something simple and robust, without worrying about storage details like structure and bills.
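To give a flavour of the kind of ad-hoc backup I mean (the function and paths here are hypothetical, just a sketch): a compressed tarball plus a checksum manifest already goes a long way, and the resulting pair can be pushed to S3 without the storage side needing to know anything about its structure.

```python
import hashlib
import tarfile
from pathlib import Path

def backup(src_dir: str, dest_tar: str) -> str:
    """Archive src_dir into dest_tar (gzip) and return its SHA-256 hex digest."""
    with tarfile.open(dest_tar, "w:gz") as tar:
        tar.add(src_dir, arcname=Path(src_dir).name)
    digest = hashlib.sha256(Path(dest_tar).read_bytes()).hexdigest()
    # Write a manifest next to the archive so integrity can be
    # verified before attempting a restore.
    Path(dest_tar + ".sha256").write_text(f"{digest}  {Path(dest_tar).name}\n")
    return digest
```

From there, uploading the archive and its manifest is a single copy to the bucket; the point is that the backup artifact is self-describing and verifiable wherever it lands.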
Along those lines, I have to think about encryption. I've built a simple system for encrypted backups, but I need to validate it and distribute it so that the loss of no single node can block recovery of the data. Right now, the pieces are distributed but the system itself is not; and I'm realizing that it's the keys that should be secret, not the system. I also need to look into things like key rotation.
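One way to get the "no single node can compromise or recover the data" property is to split the key itself into shares, so any single share is useless on its own. A minimal sketch (XOR-based n-of-n splitting; a real setup would more likely use Shamir's secret sharing for a k-of-n threshold):

```python
import secrets

def split_key(key: bytes, n: int) -> list[bytes]:
    """Split key into n shares; ALL n shares are required to reconstruct it."""
    # n - 1 shares are pure randomness; the last is the key XORed with them.
    shares = [secrets.token_bytes(len(key)) for _ in range(n - 1)]
    last = bytearray(key)
    for share in shares:
        for i, b in enumerate(share):
            last[i] ^= b
    return shares + [bytes(last)]

def join_key(shares: list[bytes]) -> bytes:
    """XOR all shares back together to recover the key."""
    key = bytearray(len(shares[0]))
    for share in shares:
        for i, b in enumerate(share):
            key[i] ^= b
    return bytes(key)
```

Each share goes to a different node; any subset short of all of them is statistically indistinguishable from random noise, which is exactly the "keys secret, system public" posture.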
Fun, fun. So, I think I might talk about aspects of this in the future. It will slow some things down as I do them manually, then automate, test, and build further resilience in. I hope it's worth it going forward.