Gluster cluster split in two?


It can happen that a Gluster cluster gets divided into two parts. I'm not talking about a volume split-brain here, but a whole cluster. Something might have gone wrong when probing a node, or, as in our case, the peer info file got corrupted when adding aliases for nodes (there seems to be a maximum name length for node names), which caused some nodes to believe they were in another cluster.

The solution is to first decide which nodes you consider to be the proper cluster. Running gluster peer status shows which other nodes are considered to be in the same group as the node you run the command on. Nodes in "Peer Rejected" state might think they are part of another cluster. If most of the nodes are in "Peer Rejected" state, you should probably run the command on one of those rejected nodes instead; from its point of view, most nodes will likely be in an ok state.

On each of the nodes in rejected state, run the following procedure:

  1. Stop glusterd
  2. Remove all files from /var/lib/glusterd except the glusterd.info-file
  3. Start glusterd again
  4. Run a gluster peer probe to a member node.
  5. Restart glusterd again
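
As a rough sketch, the procedure translates into the commands below on a rejected node (the member node name is a placeholder; double-check that glusterd.info is still in place before restarting):

    systemctl stop glusterd
    cd /var/lib/glusterd
    find . -mindepth 1 ! -name 'glusterd.info' -delete    # keep glusterd.info!
    systemctl start glusterd
    gluster peer probe <member-node-in-proper-cluster>
    systemctl restart glusterd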

Other lessons learned:

Do make sure that you save the glusterd.info file; otherwise a new one will be created and you will effectively have created a new node with the same name. To recover from that, stop the glusterd daemon on all nodes, remove the faulty uuid file from /var/lib/glusterd/peers and restart glusterd on all nodes again.
I did not find this error immediately and was struggling with a lot of locking errors in the glusterd.log file, and any "gluster volume status" command would just hang forever.
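
Roughly, on every node in the cluster (the uuid is whatever bogus peer showed up in gluster peer status):

    systemctl stop glusterd                    # on all nodes first
    rm /var/lib/glusterd/peers/<faulty-uuid>
    systemctl start glusterd                   # then on all nodes again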

Did your kubelet certificate expire in k8s?


For some reason the kubelet's self-signed certificate had expired in my cluster. That is the certificate for the kubelet's own API service, running on port 10250 (i.e. not the client cert that the kubelet uses to talk to the api-servers). It is supposed to be a self-signed certificate, but it was not renewed.

The problem was not very obvious, but we saw it when the metrics service did not work properly. It complained about expired certificates for port 10250 on the nodes.
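
A quick way to confirm the expiry is to inspect the serving certificate directly on a node (the path is the kubelet's default pki directory, mentioned below):

    openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet.crt
    # or remotely, against the kubelet port:
    echo | openssl s_client -connect <node>:10250 2>/dev/null | openssl x509 -noout -dates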

I could not find any article about how to re-create this certificate. Sure, kubeadm certs has a lot of renewal options, but as far as I could find out, none for the actual kubelet HTTPS port.

The solution turned out to be quite simple. Just remove the two files /var/lib/kubelet/pki/kubelet.crt and /var/lib/kubelet/pki/kubelet.key and restart the kubelet service with systemctl restart kubelet.

The kubelet will then generate new self-signed certs.
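
In other words, roughly this on the affected node (move the files aside instead of deleting them if you want a way back):

    rm /var/lib/kubelet/pki/kubelet.crt /var/lib/kubelet/pki/kubelet.key
    systemctl restart kubelet
    # a new self-signed kubelet.crt/kubelet.key pair should appear in /var/lib/kubelet/pki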

In the end though, this turned out not to be the problem. First, the metrics-server deployment needs to run with the container argument --kubelet-insecure-tls, at least if the kubelets run with self-signed certs.
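
As an illustration, assuming the standard metrics-server deployment in the kube-system namespace, the argument can be added with a JSON patch like this:

    kubectl -n kube-system patch deployment metrics-server --type=json \
      -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'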

Our root problem was that one api-server was running with faulty proxy settings, which caused its internal calls to the metrics server to fail.

How to pre-configure a Raspberry Pi image for Swedish wifi and enable ssh


This will set up a Raspbian image for a Swedish wifi network with ssh enabled. The steps are done on a Linux machine.

Prologue:

  • Map the partitions from the image with:
        kpartx -a <name-of-unpacked-raspian-image>.img
  • Run losetup to see which loop device was used. Let's say it was loop4 in this case
  • Mount the root partition of the image (second partition) under /mnt:
       sudo mount /dev/mapper/loop4p2 /mnt

Wifi:

  • Create the file /mnt/etc/wpa_supplicant/wpa_supplicant.conf with the following content:
    ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
    update_config=1
    country=SE
    network={
        ssid="<your-wifi-ssid>"
        psk="<your-wifi-password>"
    }
  • Replace the '1' with a '0' in all the files /mnt/var/lib/systemd/rfkill/platform-* to unblock the wifi radio (a one-liner for this follows below)
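
A sketch of the rfkill step (these state files normally contain just a single digit):

    sudo sed -i 's/^1$/0/' /mnt/var/lib/systemd/rfkill/platform-*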

Enable ssh daemon:

  • ln -s /lib/systemd/system/ssh.service /mnt/etc/systemd/system/
  • ln -s /lib/systemd/system/ssh.service /mnt/etc/systemd/system/multi-user.target.wants/

Epilogue:

  • Run: umount /mnt
  • Drop the devicemapper mappings:
        kpartx -d <name-of-unpacked-raspian-image>.img

Now your image file should be ready to write to a MicroSD card, and the Raspberry Pi should boot up directly onto the wifi network with ssh enabled.

Extra stuff

Enable camera:

  • Run the kpartx step again, but this time mount /dev/mapper/loop4p1 (the boot partition) as /mnt
  • Add the following lines to the "[all]" section of /mnt/config.txt:
    start_x=1
    gpu_mem=128
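
A sketch of the same camera step from the command line (this assumes "[all]" is the last section of config.txt, so a plain append lands in the right place):

    sudo mount /dev/mapper/loop4p1 /mnt
    printf 'start_x=1\ngpu_mem=128\n' | sudo tee -a /mnt/config.txt
    sudo umount /mnt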

Running out of ephemeral storage in kubernetes?


This is how I rescued running systems that were running out of ephemeral storage in Kubernetes.

The ephemeral storage is local temporary storage used by Kubernetes, most notably for the emptyDir kind of volume. Depending on the type of pods running, this might not require a lot of storage, but it can also need significant space. One particular case is if you run docker-in-docker: then all images pulled by that pod will be stored in the ephemeral space.
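
To get a feel for how much space this takes on a node, something like the following gives a quick overview (paths assume the default kubelet root dir):

    df -h /var/lib/kubelet          # filesystem holding the ephemeral storage
    du -sh /var/lib/kubelet/pods    # space used by pod data, including emptyDir volumes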

This is what I did on the relevant nodes (a command sketch follows the list):

  1. First I stopped further scheduling with: kubectl cordon <node>
  2. Then I made sure that none of the pods were needed anywhere in the system.
  3. Then I ran: kubectl drain --force --delete-emptydir-data --ignore-daemonsets <node>
  4. Stopped kubelet with: systemctl stop kubelet
  5. Stopped all containers with: docker kill $(docker ps -aq) and docker rm $(docker ps -aq)
  6. Unmounted all current mounts below /var/lib/kubelet.
  7. Created a new LVM volume to hold the ephemeral storage. Copied the current ephemeral storage from /var/lib/kubelet to the new volume and mounted the new volume there instead (cleaning out the old data first).
  8. Then started kubelet and uncordoned the node with systemctl start kubelet and kubectl uncordon <node>
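
Roughly, the commands looked like this (the volume group name vg0 and the size are examples; adjust to your setup):

    kubectl cordon <node>
    kubectl drain --force --delete-emptydir-data --ignore-daemonsets <node>
    systemctl stop kubelet
    docker kill $(docker ps -aq); docker rm $(docker ps -aq)
    umount $(mount | awk '/\/var\/lib\/kubelet/ {print $3}')   # unmount everything below /var/lib/kubelet
    lvcreate -L 100G -n kubelet vg0
    mkfs.ext4 /dev/vg0/kubelet
    mount /dev/vg0/kubelet /mnt
    cp -a /var/lib/kubelet/. /mnt/
    umount /mnt
    rm -rf /var/lib/kubelet/*
    mount /dev/vg0/kubelet /var/lib/kubelet                    # add to /etc/fstab as well
    systemctl start kubelet
    kubectl uncordon <node>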

Done

Number of gerrit-trigger connections keeps growing using helm jenkins


When using the gerrit-trigger plugin in Jenkins and wanting to configure everything from git, I've experienced that the ssh connections to the Gerrit server can grow until they eventually consume all available connections, when using the groovy script given as an example for setting up the gerrit-trigger plugin. Since gerrit-trigger does not yet support Jenkins Configuration as Code, it has to be set up with a groovy script run from JCasC.

I found a way of fixing this connection leak, which was caused by the configuration being reloaded quite often by the helm chart's config-reload sidecar. The key is to attach to the already existing server and stop it before reconfiguring, instead of adding a new server (and a new ssh connection) on every reload.
This is my code section to make it work in a helm deployment of Jenkins:

jenkins:
  JCasC:
    gerrit-trigger: |
      groovy:
        - script: >
            import jenkins.model.Jenkins;
            import net.sf.json.JSONObject;
            import com.sonyericsson.hudson.plugins.gerrit.trigger.GerritServer;
            if (Jenkins.instance.pluginManager.activePlugins.find { it.shortName == "gerrit-trigger" } != null)
            {
                println("JCasC Groovy: Setting up the gerrit-trigger server plugin");
                def gerritPlugin = Jenkins.instance.getPlugin(com.sonyericsson.hudson.plugins.gerrit.trigger.PluginImpl.class);
                // Attach to the existing server if it is already there, otherwise create a new one
                def serverName = "my-gerrit";
                GerritServer server;
                if (gerritPlugin.containsServer(serverName)) {
                    server = gerritPlugin.getServer(serverName);
                }
                else {
                    println("JCasC Groovy: Created new gerrit server ${serverName}");
                    server = new GerritServer(serverName);
                }
                // Stop the server (and its ssh connection) before applying the configuration
                server.stop();
                def config = server.getConfig();
                config.setGerritHostName("<gerrit-server>");
                config.setGerritSshPort(29418);
                config.setGerritUserName("<gerrit-ssh-user>");
                config.setGerritFrontEndURL("<your-gerrit-url>:8080");
                config.setGerritAuthKeyFile(new File("/var/jenkins_home/.ssh/id_rsa.<gerrit-ssh-user>"));
                config.setGerritEMail("<jenkins-email>");
                config.setNumberOfReceivingWorkerThreads(3);
                config.setNumberOfSendingWorkerThreads(1);
                config.setUseRestApi(false);
                server.setConfig(config);
                gerritPlugin.addServer(server);
                server.start();
                server.startConnection();
                println("JCasC Groovy: Setting up ${serverName} completed");
            }
