Kubernetes certificate misery

Today I really had to exercise my certificate signing and debugging skills. It all started when I saw that a deployment would not run properly but was stuck in rollout status:

"waiting for deployment spec update to be observed...".

After reading all the system log files I could find, and after looking at the logs from the apiservers and controller-managers, I saw a lot of errors like:

"error retrieving resource lock kube-system/kube-controller-manager: Unauthorized"

This is a typical sign of an expired certificate in k8s. Kubernetes is supposed to renew its certificates automatically (they normally expire within a year), but this only happens if you do a k8s upgrade within this period. If you miss that, all certificates except the CA certs expire and most things stop working. From reading https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/ you learn that you can renew all certificates manually by using the command:

kubeadm alpha certs renew all

It really does renew them, but the problem is that when the whole of k8s is locked up, it does not pick up those updated certificates. The reason for this, I think, is that the running docker mounts, using overlayfs, miss these updates to the local /etc filesystem. To be really sure that the certs are updated, verify them with:

openssl x509 -in <cert> -noout -text
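
To check every certificate in one go rather than one file at a time, a small loop over the pki directory can help. This is only a minimal sketch: it assumes the kubeadm default directory /etc/kubernetes/pki and that all certificates there end in .crt, and the function name list_cert_expiry is made up for this example.

```shell
#!/bin/sh
# Print the expiry date (notAfter) of every *.crt file below a directory.
# The default directory is an assumption (kubeadm layout); pass another as $1.
list_cert_expiry() {
    dir=${1:-/etc/kubernetes/pki}
    find "$dir" -name '*.crt' | sort | while read -r crt; do
        printf '%s: %s\n' "$crt" "$(openssl x509 -in "$crt" -noout -enddate)"
    done
}
```

Running list_cert_expiry on a master prints one line per certificate, so an expired notAfter date is easy to spot.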

What I ended up doing was to stop the kubelet service on all masters, forcibly stop all docker containers running kube-apiserver and kube-etcd, and then restart the kubelet service. This revives the apiserver and etcd daemons, and they pick up the new certs.

But wait, there is more: kube-scheduler and kube-controller-manager do not have discrete certificates in the /etc/kubernetes/pki folders. Instead they have their certs base64-encoded in their respective config files, /etc/kubernetes/controller-manager.conf and /etc/kubernetes/scheduler.conf. These had also expired.

Again I had to stop the kubelet daemon on all masters, extract the certificate data, and base64-decode it. Both certs then need to be renewed and signed by the k8s CA (/etc/kubernetes/pki/ca.{crt,key}). What I ended up doing was to import all of the above certs (and keys) into the superior certificate tool XCA.
Inside XCA I then renewed both certs (for all masters), extracted the new certs, base64-encoded them and pasted them back into the above config files.
Make sure the certificate data starts with "LS0tLS1C...", which is the base64 encoding of "-----BEGIN ...".
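
The extract and re-encode steps can also be done from the shell. The following is a minimal sketch, assuming the config file holds a single client-certificate-data entry and that the local base64 accepts --decode. The function names are hypothetical, and the signing step itself (done in XCA above) is not shown.

```shell
#!/bin/sh
# Print the PEM certificate embedded in a kubeconfig-style file ($1).
# Assumes exactly one "client-certificate-data:" line in the file.
extract_client_cert() {
    awk '$1 == "client-certificate-data:" { print $2 }' "$1" | base64 --decode
}

# Print a PEM file ($1) as one unwrapped base64 line, ready to paste back.
encode_cert_data() {
    base64 < "$1" | tr -d '\n'
}
```

For example, "extract_client_cert /etc/kubernetes/scheduler.conf > scheduler.crt" writes the old certificate to a file, and running encode_cert_data on the renewed certificate prints the value to paste back, which should start with LS0tLS1C.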

I finished up by forcibly deleting all running kube-controller-manager and kube-scheduler containers (and their corresponding pause containers) on all masters and restarting kubelet again. Note: don't do this on all nodes at the same time, but on one complete node at a time.

Finally, all proxy pods needed to be restarted by just deleting them and letting the daemon set re-create them. There is one for each node in the cluster.
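
Deleting the proxy pods can be done with one kubectl call. This is a sketch that assumes the pods live in the kube-system namespace and carry the label k8s-app=kube-proxy, which is what kubeadm sets up; check the labels in your own cluster first. The function name and the KUBECTL variable are made up for this example.

```shell
#!/bin/sh
# Delete all kube-proxy pods; the daemon set then creates fresh ones.
# KUBECTL is a hypothetical override, e.g. KUBECTL=echo for a dry run.
restart_kube_proxy() {
    "${KUBECTL:-kubectl}" --namespace kube-system delete pod --selector k8s-app=kube-proxy
}
```

Calling it with KUBECTL=echo first just prints the arguments, which is a cheap way to see what would be run.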

I strongly recommend that you keep all the keys and certificates from k8s in a tool like XCA.

That was basically it.

Enabling MFA on Microsoft SSTP VPN

By turning on MFA for the Microsoft SSTP VPN solution you greatly improve security, since it basically stops anyone on the internet from logging in with just a password; they will also need the second factor to get in. With O365 you can turn on MFA for most services including VPN (for a fee of course).

First follow this page: https://docs.microsoft.com/en-us/azure/active-directory/authentication/howto-mfa-nps-extension-vpn

Problems that can occur:

If the script .\AzureMfaNpsExtnConfigSetup.ps1 fails, complaining that it cannot connect to MSOnline, try testing that step separately by running from PowerShell:

Connect-MsolService

If that fails complaining about not being able to connect to the powershell gallery, try the below:

 [Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12

Another problem that often occurs on Windows servers is that IE, which is the default HTML browser, is locked down so hard that it can't open any pages at all. I suggest turning this off temporarily for administrators during installation. This is done via Server Manager -> "Configure this local server", where you turn off "IE Enhanced Security Configuration".

Something that Microsoft does not mention much is that you MUST use the authenticator app, and not one-time passwords: access has to be granted by accepting the request in the app. I struggled for months to get this working before I realized that this was the problem.

Also, the NPS must not also be running on another server than the machine running the RRAS. This is because the plugin for NPS otherwise causes an infinite loop of authentications.

Modern hostname lookups in Linux

Background

Finding the IP address for a host is not as simple as it used to be, when lookups were just done via the /etc/hosts file and by querying the name servers listed in /etc/resolv.conf. Nowadays the nsswitch library and the module libnss_resolve.so talk to the systemd-resolved daemon via the system DBus bus. It has become quite hard to follow and to find any problems.

This "nss-resolve"-module then communicates with the systemd-resolved daemon via DBus which in turn will use various ways of looking up a server.
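
Whether the nss-resolve module is in play at all can be read from the hosts line in /etc/nsswitch.conf. Below is a minimal sketch that prints the entries on that line in lookup order; the function name nss_hosts_modules is made up for this example.

```shell
#!/bin/sh
# Print the entries of the "hosts:" line in nsswitch.conf, one per line,
# in the order they are tried. Pass another file as $1 for testing.
nss_hosts_modules() {
    awk '$1 == "hosts:" { for (i = 2; i <= NF; i++) print $i }' "${1:-/etc/nsswitch.conf}"
}
```

If "resolve" shows up in the output, lookups made through the NSS library (for instance with "getent hosts kth.se") can end up at systemd-resolved.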

The resolvconf program

Some systems use this program (package openresolv in Debian), which is responsible for tracking all possible resolvers in use. When interfaces are brought up by DHCP, for instance, the DHCP scripts will add whatever DNS servers they receive and tell resolvconf about them.

Resolvconf keeps an updated database of all information under /run/resolvconf. In the subdirectory interfaces, the DNS information from each interface is kept in one file per interface.

When running "resolvconf -u", resolvconf reads all this data and updates /etc/resolv.conf. 

It then talks to the system DBus and inform about the changes to 

You can watch dbus-communication happen with the command "dbus-monitor --system".

systemd-resolved

This system too maintains possible resolver sources. This is also the process that listens on the address 127.0.0.53 on many systems, performing lookups and caching the results. The tool to communicate with this service is the command resolvectl. You can for instance flush all caches by running "resolvectl flush-caches".

Sources for resolv.conf

The file /etc/resolv.conf is normally just a symbolic link to a file maintained by systemd-resolved under /run/systemd/resolve, such as /run/systemd/resolve/stub-resolv.conf.
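
To see what a given machine actually uses, it helps to resolve the link and list the name servers in the file. This is a minimal sketch with made-up function names, assuming a readlink that supports -f.

```shell
#!/bin/sh
# Print the real file behind a resolv.conf path (default: /etc/resolv.conf).
resolv_conf_target() {
    readlink -f "${1:-/etc/resolv.conf}"
}

# Print the name servers listed in that file, one per line.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

If list_nameservers prints only 127.0.0.53, lookups that go via resolv.conf are sent to the systemd-resolved listener mentioned above.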

Some debugging tricks

This is a simple test to talk directly to the systemd-resolved sending a simple query via dbus-send:

dbus-send --system --dest=org.freedesktop.resolve1 --type=method_call --print-reply /org/freedesktop/resolve1 org.freedesktop.resolve1.Manager.ResolveHostname int32:0 string:kth.se int32:0 uint64:0

Tricks and Tips

Some DBus tricks:

List all connections to the system dbus:

dbus-send --system --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.ListNames

Returns a list of names

dbus-send --system --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.GetConnectionUnixProcessID string::1.4

Get PID for owner of a name

Getting private key password for winacme

Copying whole Let's Encrypt certificates, including private keys, created by winacme can be done quite simply, even though the Windows built-in certificate export does not support exporting private keys for those certs.

Just go to the ProgramData folder for winacme and copy the certs from the certificates subdirectory.

Then run winacme, type L to list certificates and select the number of the certificate you want to import. The page that then shows up lists the private key password. Then just import the certificate on the new computer and enter the password obtained above.
