The ALICE GitHub PR builder is implemented as a set of independent agents, each of which looks at a portion of the phase space of the PRs available for testing. This means that if there are 4 builders deployed, each one of them will build on average 1/4th of the PRs. This avoids the need for a centralised scheduling service. Moreover, since the checking of a PR is not considered done once the PR is built, but only when it is merged, builders keep retrying to build even PRs which were previously built successfully. This is because the boundary conditions of the PR check could have changed (e.g. a merge conflict might have appeared, or a previously unresponsive service might be back in production), and therefore the test of a given PR is considered complete only when the PR is merged.
By default the builders will behave in the following manner:
- Wait for a configurable number of seconds
- Check if there are one or more untested PRs. If yes, start building them and then go back to the starting point.
- Check if there are pull requests which were previously checked and had a failure. If yes, test one of them and then go back to the starting point.
- Check if there are pull requests which were previously checked and did not have a failure. If yes, test one of them and then go back to the starting point.
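For illustration, one iteration of this loop could be sketched roughly as follows (a simplified sketch only; the real logic lives in ali-bot/ci/build-loop.sh and ali-bot/ci/continuous-builder.sh, and WAIT_SECONDS, pick_pr and build_and_report are hypothetical names):
sleep "$WAIT_SECONDS"                    # wait for a configurable number of seconds
pr=$(pick_pr untested)                   # prefer PRs that were never tested
[ -z "$pr" ] && pr=$(pick_pr failed)     # then PRs whose previous check failed
[ -z "$pr" ] && pr=$(pick_pr succeeded)  # finally re-check previously successful PRs
[ -n "$pr" ] && build_and_report "$pr"   # build it, report the status, start over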
We use Nomad to deploy the builders on Linux and MacOS.
Essential operations guide#
See also: the essential CI operations guide for the ALICE Nomad deployment.
- Adding a package to be tested
- Setup your environment
- Listing active PR checkers
- Scaling the number of checkers
- Creating a new checker
- Updating the PR checker inner loop
- Restarting a checker
- Inspecting the checkers
- Monitoring the checkers
Adding a package to be tested#
Each checker deployed on Nomad can test multiple packages.
These are declared as *.env files in a subdirectory of ali-bot, following the pattern:
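# layout; the role, container and file names below are examples only
ali-bot/ci/repo-config/<mesos role>/<container>[-<suffix>]/<check>.env
ali-bot/ci/repo-config/mesosci/slc9-o2physics/DEFAULTS.env
ali-bot/ci/repo-config/mesosci/slc9-o2physics/o2.env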
The DEFAULTS.env files in the directory tree are special -- they are sourced before the applicable *.env file, and can provide default values for various variables.
The most relevant variables you can set through the *.env file are:
- ALIBUILD_DEFAULTS: what to pass to alibuild using its --defaults flag
- ALIBUILD_O2_TESTS, ALIBUILD_O2PHYSICS_TESTS, ALIBUILD_XJALIENFS_TESTS: set these if O2, O2Physics and/or xjalienfs unit tests should be run as part of the check
- CHECK_NAME: the display name for this check, shown on GitHub
- CI_NAME: an internal name for this check
- DEVEL_PKGS: a newline-separated list of repositories to install as "development packages" for the build
- DOCKER_EXTRA_ARGS: if running builds in Docker, any additional args to pass to the container; e.g. use this to mount a tmpfs if needed
- DONT_USE_COMMENTS: only update GitHub statuses after the check completes, instead of commenting on each pull request as the alibuild user
- INSTALL_ALIBOT, INSTALL_ALIBUILD: a slug (<org>/<repo>@<tag>#egg=<package>) specifying the ali-bot and alibuild versions to install for this check
- JOBS: how many CPU cores to use for building
- NO_ASSUME_CONSISTENT_EXTERNALS: auto-generate a build identifier to keep builds of different pull requests apart. This uses more disk space overall, but can speed up rebuilds, since build artifacts are kept for each pull request.
- PACKAGE: the package (as declared in alidist) to build for the check
- PR_BRANCH: look for pull requests against this branch
- PR_REPO: look for pull requests in this repository
- PR_REPO_CHECKOUT: check out the repository to be tested into this directory
- REMOTE_STORE: use this as alibuild's --remote-store, for caching
- TRUST_COLLABORATORS: automatically test PRs from people who have committed to the same repo before, without waiting for approval from someone else
- TRUSTED_USERS, TRUSTED_TEAM: trust pull requests from these users (i.e. test them without requiring approval)
- ONLY_RUN_WHEN_CHANGED: a space-separated list of paths. The check will only be run if the given pull request contains any changes under any of these paths. If not, the check will be skipped and marked as successful with an appropriate message.
If a checker for the required platform/container already exists, adding a new pull request check for a repository only requires adding a *.env file in the appropriate directory in ali-bot.
If you add a new check, make sure to update the appropriate repository's GitHub action that cleans up broken statuses to include your new check. For example, in alidist, update this file, or this file for O2.
Setup your environment#
Set up your Nomad environment as described here.
Job declarations for the pull request checkers are stored inside the ci/ subdirectory of the ci-jobs repository.
See this section of the ALICE Nomad docs for instructions on deploying CI builders.
Listing active PR checkers#
In order to see the list of running PR checkers, you can run:
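nomad job status | grep '^ci-'   # the grep filter is just one convenient way to select the CI jobs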
where the resulting job names will follow the convention ci-<mesos role>-<container>[-<suffix>].
This matches the subdirectories of ali-bot/ci/repo-config/.
For example, the ci-mesosci-slc9-o2physics checker will pick up its .env files from ali-bot/ci/repo-config/mesosci/slc9-o2physics/.
Each checker will have one or more instances with the same configuration. You can see these by running, e.g.:
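nomad job status ci-mesosci-slc9-o2physics   # using the example checker name from above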
You can check what each instance is doing by getting its allocation ID, then running:
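nomad alloc logs -f <alloc ID>   # one option: stream that instance's output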
Scaling the number of checkers#
To scale a checker, change the num_builders variable in its job declaration and redeploy it, as sketched below. Running builders should not be affected by this operation: it will only add builders if you increased num_builders, or delete some if you decreased it.
The builders know how many there are of the same type through the config/workers-pool-size file. Nomad will automatically update this file for you when you deploy the updated job declaration.
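A minimal sketch of the workflow, assuming the checker is declared in ci-jobs/ci/mesosci-slc9-o2physics.yaml (adapt the file name to your checker):
# edit the YAML file and change num_builders, then:
levant render -var-file mesosci-slc9-o2physics.yaml | nomad job plan - # make sure scheduling is fine
levant render -var-file mesosci-slc9-o2physics.yaml | nomad job run -  # apply the new pool size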
Creating a new checker#
In order to add a new checker, you need to add both a Nomad job declaration by creating a new YAML file in ci-jobs/ci/
.
Assuming you want to add a checker that runs two parallel builders as the mesosci user and compiles software using an slc9 container, create a file called ci-jobs/ci/mesosci-slc9-foobar.yaml containing the following:
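# Illustrative sketch only: the real key names are defined by the Nomad job
# template in ci-jobs, so copy an existing YAML file from ci-jobs/ci/ and adapt it.
mesos_role: mesosci      # the role/user the builders run as
container: slc9-foobar   # the build container, plus an optional suffix
num_builders: 2          # how many parallel builder instances to run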
You can deploy the checkers declared using this file by running:
levant render -var-file mesosci-slc9-foobar.yaml | nomad job plan - # make sure scheduling is fine
levant render -var-file mesosci-slc9-foobar.yaml | nomad job run - # actually redeploy the job
You also need to tell the new checker what it should build and where it should look for incoming pull requests.
Do this by creating configuration files in ali-bot/ci/repo-config/mesosci/slc9-foobar/ (assuming the same parameters as above).
The easiest thing to do is to copy an existing .env file that you want to mimic and change the CI_NAME and CHECK_NAME variables.
For example, if you want to test PRs in the https://github.com/AliceO2Group/AliceO2 repository by compiling O2Suite and running O2's and O2Physics' unit tests, adapt the following:
CI_NAME=build_O2_o2
CHECK_NAME=build/O2/o2
PACKAGE=O2Suite
ALIBUILD_DEFAULTS=o2
PR_REPO=AliceO2Group/AliceO2
PR_REPO_CHECKOUT=O2
PR_BRANCH=dev
TRUST_COLLABORATORS=true
DONT_USE_COMMENTS=1
DEVEL_PKGS="$PR_REPO $PR_BRANCH $PR_REPO_CHECKOUT
AliceO2Group/O2Physics master
alisw/alidist master"
ALIBUILD_O2_TESTS=1
ALIBUILD_O2PHYSICS_TESTS=1
The .env files are in shell syntax. They are sourced by the CI builders, but some utilities (list-branch-pr and ci-status-overview) also parse them using Python's shlex library, so try not to use overly complex Bash features.
Try to use only literal strings for the PR_REPO, PR_BRANCH, CHECK_NAME, TRUST_COLLABORATORS, TRUSTED_USERS and TRUSTED_TEAM variables, since the Python tools parse these. For other variables, feel free to use variable substitution with the usual shell syntax.
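For example (a contrived illustration of the distinction):
# Fine: literal strings for the variables the Python tools parse.
PR_REPO=AliceO2Group/AliceO2
PR_BRANCH=dev
# Fine: shell substitution in a variable only the builders themselves source.
DEVEL_PKGS="$PR_REPO $PR_BRANCH $PR_REPO_CHECKOUT"
# Avoid: shlex does not expand variables, so the Python tools would see the
# literal, unexpanded string instead of the intended value.
# CHECK_NAME="build/$PR_REPO_CHECKOUT/o2"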
Updating the PR checker inner loop#
Sometimes one might need to update the inner loop of the PR checking, without having to restart the checker itself.
This is done automatically for any updates of the ali-bot/ci/build-loop.sh and ali-bot/ci/build-helpers.sh files.
Upgrades will be picked up at the next iteration of the PR builder, so you might still have to wait for the current one to finish before you can see your updates deployed.
Updates to ali-bot/ci/continuous-builder.sh are also picked up, but less frequently.
Restarting a checker#
In some cases, builders need to be restarted.
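One way to do this, assuming you have the allocation ID of the instance (obtained as described above), is:
nomad alloc restart <alloc ID>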
This will redeploy the same configuration, but the ali-bot scripts will be taken from the master branch and continuous-builder.sh will be run again.
Inspecting the checkers#
You can stream the logs of any CI builder instance easily -- see here for how to do this.
If you want to interactively log in to any instance, run:
nomad alloc exec <alloc ID> sh -c 'export TERM=xterm-256color HOME=$NOMAD_TASK_DIR PS1="\\u@\\h \\w \\\$ "; cd; exec bash -i'
This will give you a shell in the builder's home directory, where it stores its workarea.
Log files are in the $NOMAD_ALLOC_DIR/logs directory.
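For example, to list them from outside the allocation, reusing the exec mechanism shown above:
nomad alloc exec <alloc ID> sh -c 'ls "$NOMAD_ALLOC_DIR/logs"'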
The checkers react to some special files, which can be created in their home directory. In particular:
- config/silent with non-empty content means that the checker will not report issues to the GitHub PR page associated with the checks being done.
- config/debug will force additional debugging output.
- config/workers-pool-size will force the number of checkers currently active. This is normally created and updated automatically by Nomad.
- config/worker-index will force the index of the current instance of the checker.
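For example, to toggle these on a running instance (using the same exec mechanism as above; the builder's home directory is $NOMAD_TASK_DIR, as set in the login command):
nomad alloc exec <alloc ID> sh -c 'cd "$NOMAD_TASK_DIR" && mkdir -p config && touch config/debug'     # extra debugging output
nomad alloc exec <alloc ID> sh -c 'cd "$NOMAD_TASK_DIR" && mkdir -p config && echo 1 > config/silent' # stop reporting to GitHub PRs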
Monitoring the checkers#
The CI system is monitored using a dedicated Grafana dashboard.
Builders are monitored in Monalisa. In particular, you can use aliendb9 and look at the github-pr-checker/github.com/github-api-calls metric to know how many API calls are being done by the system.
You can also get a detailed view of the activity of the builders in our Build Infrastructure Cockpit. If you are interested in extending the reports, please contact us.
Some of the CI machines are also included in the IT Grafana dashboard. Note that this also includes machines which do not run CI builds.
There is also a dashboard that shows the status of each check and pull request. In case of outages or larger problems, this can be useful to find failed PRs, or to find out when the problems started (since PRs are sorted newest-first).
Troubleshooting#
Empty logs#
Empty logs can happen when the build fails in some pre-alibuild step and the harvesting script is not able or not allowed to fetch the logs. In particular, due to the criticality of some operations, logs which might contain sensitive information are not retrieved, to avoid exposing them to unprivileged ALICE users. Maintainers of the build infrastructure can follow the instructions in the report to retrieve them.
That said, the vast majority of "missing logs" issues are due to one of the following two items:
- A wrong tag / commit / URL when checking out the code. Review any tag and commit you changed in your PR to make sure they actually exist. If needed, check all the modified tags by hand with git clone -b <tag> <url>.
- Issues with infrastructure backends not directly under our control, in particular GitLab and GitHub. You can check whether there were any known issues by going to either the GitHub status page or the CERN Service Status Board.
Failing that (or if you are unsure), feel free to contact us directly, either on Mattermost or by adding a comment to your PR.