# Cluster architecture
See here for an explanation of the underlying infrastructure.
## Nomad, Consul and Vault setup notes
### Nomad and Consul setup on non-OpenStack machines
1. Add the HashiCorp apt repo to get modern Consul/Nomad packages (the Ubuntu-packaged ones are too old to work with our config):
    ```sh
    curl -fsSL https://apt.releases.hashicorp.com/gpg | apt-key add -
    cat << EOF > /etc/apt/sources.list.d/hashicorp.list
    # Added <date> by <you>.
    # Required for newer versions of nomad and consul -- Ubuntu repo versions are too old.
    deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main
    EOF
    apt update
    apt install nomad consul
    systemctl stop nomad consul  # for now
    mkdir /build/consul /build/nomad
    chown consul:consul /build/consul
    ```
    Ignore any errors about packages not being able to create directories in `/opt`. JAliEn stuff is mounted there, and Consul/Nomad would only create their default data directories there (which we don't use).
2. Use `consul.hcl` and `nomad.hcl` from Puppet (use the links or get them from `/etc/nomad.d` / `/etc/consul.d` on a Puppet-managed host), but adapt them by setting the correct IP address for `advertise_addr` and/or `bind_addr`, and by substituting Teigi secrets manually.

    Also, change the location of the Grid certificates if necessary, both for Nomad's multiple `tls` settings and for its `host_volumes` (see the sketch below). Nomad itself runs as root, so no filesystem ACL adjustments should be necessary (but as before, the CI should be allowed access to the certificates via the `mesos{ci,daq}` users).

    For `alissandra` machines, reserve half the resources so the CI system doesn't take up too much of the host.
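    A minimal sketch of the kind of adjustments involved -- the IP address, certificate paths and volume name below are illustrative placeholders, not our real values, which come from the Puppet templates:

    ```hcl
    # nomad.hcl (excerpt) -- illustrative only.
    advertise {
      http = "188.184.1.2"  # the host's correct public IP
      rpc  = "188.184.1.2"
      serf = "188.184.1.2"
    }

    tls {
      http      = true
      rpc       = true
      ca_file   = "/etc/grid-security/certificates/ca.pem"  # adjust to the local cert location
      cert_file = "/etc/grid-security/hostcert.pem"
      key_file  = "/etc/grid-security/hostkey.pem"
    }

    client {
      host_volume "grid-certs" {
        path      = "/etc/grid-security"  # adjust to the local cert location
        read_only = true
      }
    }
    ```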
3. Add a `server=/consul/137.138.62.94` line to `/etc/dnsmasq.conf` and restart `dnsmasq.service`. This is required for Nomad to look up Vault hosts at `vault.service.consul`.
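    To check that the forwarding works, one option is to resolve the Vault service through the local resolver (a sketch; assumes `dig` is installed):

    ```sh
    # Query the local dnsmasq explicitly; should print the IP(s) of the
    # Vault servers, resolved via Consul DNS:
    dig +short vault.service.consul @127.0.0.1
    ```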
4. Start the daemons:
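    A sketch, assuming the systemd units shipped with the HashiCorp packages:

    ```sh
    systemctl enable --now consul nomad
    ```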
    An autogenerated `nomad-client` entry for the host should now appear in Consul, and the new Nomad client should appear in the Nomad web UI.
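    To verify from the command line instead (a sketch using the standard CLIs):

    ```sh
    consul catalog services   # should now include nomad-client
    nomad node status         # the new client should be listed as ready
    ```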
## Vault settings applied to our cluster
### Allow Nomad clients access to secrets
Secret substitution happens on the Nomad client that receives a job. It has a `nomad-server` token, with which it has to get a temporary `nomad` token (which it in turn uses to fetch secrets). The two token roles are set up as follows:
```sh
vault write auth/token/roles/nomad-server allowed_policies='default, nomad, nomad-server' orphan=true
vault write auth/token/roles/nomad allowed_policies='default, nomad' orphan=true token_period=1d
```
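For reference, Nomad is then pointed at the `nomad` role through the `vault` stanza of its agent config. A sketch, assuming the classic token-based Nomad--Vault integration (the address is illustrative):

```hcl
# nomad.hcl (excerpt) -- illustrative only.
vault {
  enabled          = true
  address          = "https://vault.service.consul:8200"
  create_from_role = "nomad"  # temporary job tokens are derived from this role
}
```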
### Allow issuing essentially non-expiring tokens
By default, tokens expire after a month, which means the nomad cluster breaks and tokens have to be manually updated.
Change the default max expiry time to 100 years instead.
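One way to do this is to tune the maximum lease TTL of the token auth mount (a sketch; 876000h is 100 years):

```sh
vault auth tune -max-lease-ttl=876000h token/
```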
Tokens can still be revoked manually using any one of the following commands:
```sh
vault token revoke '<secret token here>'
# Or alternatively:
vault token revoke -accessor '<token accessor here>'
```
This will revoke any child tokens of the given token as well, unless the `-mode=orphan` option is given.
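For example, to revoke a token while leaving its child tokens alive:

```sh
vault token revoke -mode=orphan '<secret token here>'
```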
## Troubleshooting
### Leftover Consul services aren't deregistered
Sometimes, Consul can get into a state where services were registered but never deregistered. In particular, this happens when a host changes its public IP address -- in that case, the old service remains registered because a Consul agent with the same host name is still available, even though the service's public IP address is wrong.
If this happens, deregister the service with the wrong IP manually by running the following:
```sh
ssh host-with-changed-ip.cern.ch "CONSUL_HTTP_TOKEN=$(pass .../consul-token)" consul services deregister -id offending-service-id
```
Here, the `pass` command prints out your Consul token, `host-with-changed-ip` is the affected hostname, and `offending-service-id` is the ID of the old service with the wrong IP, e.g. `_nomad-client-xd2yun6h34z5nni56pdncbqb6b3mjfcf`.
It is critical that the `consul services deregister` command runs on the host that originally registered the service. If you run the plain `consul` command locally, e.g. against one of the Consul master nodes, it will fail with a "service not found" error, even if the ID is correct.
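If you don't know the offending service ID, one way to find it is to list the services registered with the agent on the affected host (a sketch; assumes the agent's HTTP API on the default port 8500, and `jq` installed locally):

```sh
# Service IDs are the keys of the returned JSON object:
ssh host-with-changed-ip.cern.ch \
    "curl -sH 'X-Consul-Token: <your consul token here>' http://127.0.0.1:8500/v1/agent/services" | jq keys
```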