Spurious "Cannot connect to the Docker daemon" failures in CI
For a long time we have been seeing spurious CI failures of the form:
Resolving secrets 00:00
Preparing the "docker" executor 06:09
Using Docker executor with image registry.gitlab.haskell.org/ghc/ci-images/x86_64-linux-deb10:8d0224e6b2a08157649651e69302380b2bd24e11 ...
ERROR: Preparation failed: adding cache volume: set volume permissions: create permission container for volume "runner-xwc7hlhu-project-2970-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70": Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (linux_set.go:95:120s)
This appears to be a Docker or GitLab Runner bug.
Relevant upstream issues:
- https://gitlab.com/gitlab-org/gitlab-runner/-/issues/26441
- https://gitlab.com/gitlab-org/gitlab-runner/-/issues/27518 suggests that jobs should be automatically restartable when they fail due to runner failure
- https://gitlab.com/gitlab-org/gitlab-runner/-/issues/21387 suggests that the issue may be due to the docker daemon blocking on IO
- https://gitlab.com/gitlab-org/gitlab-runner/-/issues/4578
- https://gitlab.com/gitlab-org/gitlab-runner/-/issues/4443
- https://gitlab.com/gitlab-org/gitlab-runner/-/issues/3612 has quite a few other mentions
- https://gitlab.com/gitlab-org/gitlab-runner/-/issues/2890
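
Until the underlying bug is fixed, one possible mitigation along the lines of the restartability suggestion above is GitLab CI's `retry` keyword. The fragment below is only a sketch: the template name `.retry-on-runner-failure` and the job name `example-job` are hypothetical, and it assumes that GitLab classifies this preparation failure as `runner_system_failure`, which has not been verified here.

```yaml
# Hypothetical hidden template; jobs opt in via `extends`.
.retry-on-runner-failure:
  retry:
    max: 2                      # GitLab allows at most 2 retries
    when:
      # Assumption: "Preparation failed ... Cannot connect to the Docker
      # daemon" is reported under this failure category.
      - runner_system_failure

# Hypothetical job showing how the template would be used.
example-job:
  extends: .retry-on-runner-failure
  script:
    - echo "build steps go here"
```

This would only hide the symptom: a retried job can still land on the same runner and fail the same way, so it is not a substitute for tracking down the cause.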