Kubernetes certificate misery

Today I really had to exercise my certificate signing and debugging skills. It all started when I saw that a deployment would not run properly but was stuck in rollout status:

"waiting for deployment spec update to be observed...".

After reading all the system log files I could find, and after looking at the logs from the apiservers and controller-managers, I saw a lot of errors like:

"error retrieving resource lock kube-system/kube-controller-manager: Unauthorized"

This is a typical sign of an expired certificate in k8s. Kubernetes (kubeadm, to be precise) is supposed to renew its certificates automatically, since they normally expire within a year, but this only happens if you do a k8s upgrade within that period. If you miss that, all certificates except the CA certs expire and most things stop working. From reading https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/ you learn that you can renew all certificates manually by using the command (from kubeadm 1.20 on, it is spelled without "alpha"):

kubeadm alpha certs renew all

It really does renew them, but the problem is that when the whole of k8s is locked up, it does not pick up those updated certificates. The reason for this, I think, is that the running docker mounts, using overlayfs, miss these updates to the local /etc filesystem. To be really sure that the certs are updated, verify them with:

openssl x509 -in <cert> -noout -text
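
To check every certificate in one go rather than one file at a time, a small loop along these lines may help. It is only a sketch: the function name list_cert_expiry is made up for this example, and it assumes the kubeadm default directory /etc/kubernetes/pki and an openssl binary on the PATH.

```shell
# Hypothetical helper (not part of kubeadm): print the expiry date of every
# *.crt file below a directory and flag the ones that have already expired.
list_cert_expiry() {
    dir="${1:-/etc/kubernetes/pki}"   # kubeadm default location (assumption)
    find "$dir" -name '*.crt' | sort | while read -r crt; do
        # -enddate prints "notAfter=<date>"
        end="$(openssl x509 -in "$crt" -noout -enddate)"
        # -checkend 0 exits non-zero if the certificate is expired right now
        if openssl x509 -in "$crt" -noout -checkend 0 >/dev/null; then
            state="ok"
        else
            state="EXPIRED"
        fi
        printf '%-8s %s  %s\n' "$state" "$end" "$crt"
    done
}

# Example call:
#   list_cert_expiry /etc/kubernetes/pki
```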

What I ended up doing was to stop the kubelet service on all masters, forcibly stop all docker containers running kube-apiserver and etcd, and then restart the kubelet service. This revives the apiserver and etcd daemons, and they pick up the new certs.
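
Roughly, that restart sequence on a single master could look like the sketch below. It assumes docker as the container runtime, systemd managing kubelet, and kubelet's usual container naming scheme (k8s_<container>_<pod>_<namespace>_...). The function name and the DRY_RUN switch are made up for this example. By default it only prints what it would do.

```shell
# Sketch only: restart static control-plane containers on ONE master so they
# re-read the renewed certificates.
# DRY_RUN defaults to 1, which prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

restart_static_pods() {
    # Usage: restart_static_pods kube-apiserver etcd
    if [ "$DRY_RUN" = "1" ]; then
        echo "systemctl stop kubelet"
    else
        systemctl stop kubelet
    fi
    for name in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "docker ps -q --filter name=k8s_${name}_ | xargs -r docker kill"
        else
            # Kill the running container; kubelet starts a fresh one from the
            # static pod manifest once it is running again.
            docker ps -q --filter "name=k8s_${name}_" | xargs -r docker kill
        fi
    done
    if [ "$DRY_RUN" = "1" ]; then
        echo "systemctl start kubelet"
    else
        systemctl start kubelet
    fi
}

# Preview:     restart_static_pods kube-apiserver etcd
# For real:    DRY_RUN=0; restart_static_pods kube-apiserver etcd
```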

But wait, there is more: kube-scheduler and kube-controller-manager do not have discrete certificates in the /etc/kubernetes/pki folder. Instead they have their certs base64-encoded in their respective config files, /etc/kubernetes/controller-manager.conf and /etc/kubernetes/scheduler.conf. Those were expired as well.
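
To look at such an embedded certificate, it first has to be pulled out of the config file and decoded. A minimal sketch, assuming the kubeadm layout where each value sits on a single line such as "client-certificate-data: LS0tLS1C..." (the function name kubeconfig_field is made up for this example):

```shell
# Hypothetical helper: print one base64-encoded field of a kubeconfig file,
# decoded back to PEM.
kubeconfig_field() {
    conf="$1"    # e.g. /etc/kubernetes/scheduler.conf
    field="$2"   # e.g. client-certificate-data or client-key-data
    # Pick the value after "<field>:" and decode it
    awk -v key="${field}:" '$1 == key { print $2 }' "$conf" | base64 -d
}

# Example: extract the scheduler certificate and show when it expires
#   kubeconfig_field /etc/kubernetes/scheduler.conf client-certificate-data > scheduler.crt
#   openssl x509 -in scheduler.crt -noout -enddate
```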

Again I had to stop the kubelet daemon on all masters, extract the certificate data, and base64-decode it. Both certs then need to be renewed and signed by the k8s CA (/etc/kubernetes/pki/ca.{crt,key}). What I ended up doing was to import all of the above certs (and keys) into the superior certificate tool XCA.
Inside XCA I then renewed both certs (for all masters), exported the new certs, base64-encoded them and pasted them back into the above config files.
Make sure the certificate data starts with "LS0tLS1C...", which is the base64 encoding of "-----BEGIN ...".
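
The renewal itself was done in XCA. For completeness, here is a rough openssl-only sketch of the same steps, which is a different tool than the one used above and has not been verified against a live cluster. The function names are made up, and the key usage values are an assumption about what a kubeadm client certificate normally carries.

```shell
# Sketch: re-issue a client certificate with the same subject and key, signed
# by the cluster CA, then produce the base64 string for the kubeconfig file.
resign_client_cert() {
    old_crt="$1"; key="$2"; ca_crt="$3"; ca_key="$4"; new_crt="$5"
    days="${6:-365}"
    work="$(mktemp -d)"
    # Turn the old certificate back into a signing request (keeps the subject)
    openssl x509 -x509toreq -in "$old_crt" -signkey "$key" \
        -out "$work/renew.csr" || return 1
    # Extensions are not carried over by default, so set the client-auth
    # ones explicitly (assumed values, see lead-in)
    printf 'keyUsage=critical,digitalSignature,keyEncipherment\nextendedKeyUsage=clientAuth\n' \
        > "$work/ext.cnf"
    # Sign the request with the CA; the serial file lives in the temp dir
    openssl x509 -req -in "$work/renew.csr" -CA "$ca_crt" -CAkey "$ca_key" \
        -CAserial "$work/ca.srl" -CAcreateserial -days "$days" \
        -extfile "$work/ext.cnf" -out "$new_crt" || return 1
    rm -rf "$work"
}

# Encode a PEM file as one base64 line, as the kubeconfig fields expect
pem_to_b64() {
    base64 < "$1" | tr -d '\n'
}

# Example:
#   resign_client_cert scheduler.crt scheduler.key \
#       /etc/kubernetes/pki/ca.crt /etc/kubernetes/pki/ca.key scheduler-new.crt
#   pem_to_b64 scheduler-new.crt    # paste the result as client-certificate-data
```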

I finished up by forcibly deleting all running kube-controller-manager and kube-scheduler containers (and their corresponding pause containers) on all masters and restarting kubelet again. Note: don't do this on all nodes at the same time, but on one complete node at a time.

Finally, all kube-proxy pods needed to be restarted, by just deleting them and letting the daemon set re-create them. There is one for each node in the cluster.
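
With a working kubectl this can be done in one command. The sketch below assumes the kubeadm defaults, that is, the pods live in the kube-system namespace and carry the label k8s-app=kube-proxy. The KUBECTL variable is just a convenience added here so the command can be previewed first.

```shell
# Sketch: delete all kube-proxy pods so the DaemonSet recreates them.
# Set KUBECTL="echo kubectl" to only print the command.
KUBECTL="${KUBECTL:-kubectl}"

restart_kube_proxy() {
    $KUBECTL -n kube-system delete pod -l k8s-app=kube-proxy
}

# Example:
#   restart_kube_proxy
```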

I strongly recommend keeping all the keys and certificates from k8s in a tool like XCA.

That was basically it.