COE = Correction of Error
My previous employer, Amazon, was a big proponent of doing blameless analysis of outages and figuring out what could be done to fix it. I recently had an outage on my servers and wanted to share what went wrong and the fix.
Summary
Starting Thursday until Friday, all TLS requests to a *.technowizardry.net domain would have failed due to a TLS certificate expiration error. Then on Friday, all DNS queries to a *.technowizardry.net zone failed which also caused mail delivery to fail too. This happened because cert-manager had created the acme-challenge TXT record, but the record was not visible to the Internet because the HE DNS was failing to perform an AXFR Zone Transfer from my authoritative DNS server. This was because PowerDNS was unable to bind to port :53 because systemd-resolved was already listening on that port.