
ERROR: Job failed (system failure): Error: No such container: <container_id>

Summary

It happens at random (for about 10% of the jobs), 2-5 minutes into the job. The job simply fails with this error. Restarting the job once or twice generally solves the problem.

System: CentOS 7

Runner: 10.2.0 / 10.4.0 (DiD)

Docker: 17.06.2-ce

Gitlab: 10.4.2

Example

The job starts normally, without much of a problem.

[Screenshot: b101error]

But sometimes it just fails with:

ERROR: Job failed (system failure): Error: No such container: 858565f6d59f647f9546f785e539884259de09d3364b804b57f7fe5e618f22c9

Note that the container ID is not present anywhere in the logs, not even in the Docker logs.

Proposal

  • Use Docker volumes: instead of getting the volume mappings from previously used containers, we will use the volumes directly. This will help in 80% of the cases (see the sketch after this list).
  • Make container watching more reliable: this will solve the scenario where the system is under load, we start the container, and the script finishes executing before we have started watching the container. By the time we start watching it, the container itself is already gone, which results in a 404.
  • Retry the stage when the container is not found: the Docker executor will retry the stage from the beginning if the container was removed for some reason. This will make our executor a lot more resilient to the ephemeral nature of containers.
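
A minimal sketch of the first item, assuming the Docker Go SDK (github.com/docker/docker/client); the volume name, image, and container name are made up for illustration, and the exact types and the ContainerCreate signature vary slightly between SDK versions (newer ones take an extra platform argument):

```go
package main

import (
	"context"
	"log"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/api/types/mount"
	"github.com/docker/docker/api/types/volume"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()

	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		log.Fatal(err)
	}

	// Create a named volume directly instead of relying on a helper
	// container and VolumesFrom. The volume name is illustrative only.
	vol, err := cli.VolumeCreate(ctx, volume.VolumeCreateBody{Name: "runner-builds-cache"})
	if err != nil {
		log.Fatal(err)
	}

	// Mount the volume straight into the build container.
	resp, err := cli.ContainerCreate(ctx,
		&container.Config{Image: "alpine:latest", Cmd: []string{"sh", "-c", "ls /builds"}},
		&container.HostConfig{
			Mounts: []mount.Mount{{Type: mount.TypeVolume, Source: vol.Name, Target: "/builds"}},
		},
		nil, // networking config; newer SDKs also take a platform argument here
		"example-build-container")
	if err != nil {
		log.Fatal(err)
	}

	if err := cli.ContainerStart(ctx, resp.ID, types.ContainerStartOptions{}); err != nil {
		log.Fatal(err)
	}
}
```

Because the volume exists independently of any container, losing a previous build container no longer takes the volume reference away with it.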

Dev logs

2020-04-01
2020-04-02
2020-04-03
  • Work on how we manage volumes, so that we create volumes directly instead of creating containers and taking volumes from them: !1989 (61b396e7)
2020-04-06
  • Finish !1989 (merged) and get it ready for review. This should fix the ContainerCreate scenario.
  • Open !1990 (merged) to help with the issue where the container might have exited and been cleaned up already before we actually start watching it (see the sketch after this list).
  • Start work on !1995 (merged) to retry a stage if it failed because the container was not found.
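
A hedged sketch of the watching fix from !1990, assuming the Docker Go SDK: the idea is to register the wait before starting the container, so a container that exits almost immediately under load is still observed instead of turning into a 404 once it has been cleaned up. Names and wiring are illustrative, not the MR's actual code:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// startAndWait registers the wait before starting the container, so even a
// container that exits right away is caught by the wait instead of being
// looked up after the daemon has already removed it.
func startAndWait(ctx context.Context, cli *client.Client, id string) (int64, error) {
	// WaitConditionNextExit resolves on the container's next exit, so it is
	// safe to set up before ContainerStart.
	statusCh, errCh := cli.ContainerWait(ctx, id, container.WaitConditionNextExit)

	if err := cli.ContainerStart(ctx, id, types.ContainerStartOptions{}); err != nil {
		return 0, err
	}

	select {
	case err := <-errCh:
		return 0, err
	case status := <-statusCh:
		return status.StatusCode, nil
	}
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		log.Fatal(err)
	}
	// "some-container-id" is a placeholder for a container created earlier.
	code, err := startAndWait(context.Background(), cli, "some-container-id")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("exit code:", code)
}
```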
2020-04-07

Keep working on !1995 (merged) so that, if a container is removed mid-execution, we restart the stage (for example before_script + script) from the beginning, making us more resilient to the ephemeral nature of Docker containers.
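
A simplified sketch of that retry behaviour in Go, assuming the Docker SDK's not-found helpers; runStage, the stage name, and the always-failing placeholder are illustrative and are not the runner's actual code:

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/client"
	"github.com/docker/docker/errdefs"
)

// runStage is a stand-in for executing one build stage (for example
// before_script + script) inside a container. Here it always reports the
// container as missing so the retry path below is exercised.
func runStage(ctx context.Context, stage string) error {
	return errdefs.NotFound(fmt.Errorf("No such container: %s", "858565f6d59f"))
}

// runStageWithRetry restarts the whole stage from the beginning when the
// container disappears mid-execution (Docker answers with a 404 / not found).
func runStageWithRetry(ctx context.Context, stage string) error {
	const maxAttempts = 3 // we try up to 3 times, as in the screenshots below

	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = runStage(ctx, stage); err == nil || !client.IsErrNotFound(err) {
			// Success, or a failure unrelated to the container being removed.
			return err
		}
		fmt.Printf("container gone, restarting stage %q (attempt %d/%d)\n", stage, attempt, maxAttempts)
	}
	return err
}

func main() {
	if err := runStageWithRetry(context.Background(), "build_script"); err != nil {
		fmt.Println("stage failed:", err)
	}
}
```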

Container removed during execution with docker rm -f $CONTAINER_ID:

[Screenshot: Screen_Shot_2020-04-07_at_18.21.21]

Container removed multiple times; we retry up to 3 times:

[Screenshot: Screen_Shot_2020-04-07_at_18.19.20]

This should be ready for review tomorrow: the test on Windows is not behaving as expected, and I had a ton of problems trying to figure out how to create an integration test for this.

2020-04-08

All fixes have been implemented and are in the review stage. To reiterate what we are doing to fix this issue:

  • Use Docker volumes: instead of getting the volume mappings from previously used containers, we will use the volumes directly. This will help in 80% of the cases.
  • Make container watching more reliable: this will solve the scenario where the system is under load, we start the container, and the script finishes executing before we have started watching the container. By the time we start watching it, the container itself is already gone, which results in a 404.
  • Retry the stage when the container is not found: the Docker executor will retry the stage from the beginning if the container was removed for some reason. This will make our executor a lot more resilient to the ephemeral nature of containers.

In the current issue description we have:

Every time we get a 404, we will print all the container IDs that the Runner has and all the container IDs that the container daemon has; this way we get a "snapshot" of the current state of things when this kind of issue presents itself. The retry logic should be created using the retry library.

We will not do this: if Docker returns a 404, logging the containers will not help us debug the problem. With the fixes above, we confirmed that these are mostly logic bugs on our end and not Docker misbehaving. I will update the issue description with the new proposal.

Regressions

All the regressions that the fixes for this issue have caused:

  1. #25438 (closed)
  2. #25428 (closed)
  3. gitlab#215037 (closed)
  4. #25440 (closed)
  5. #25432 (closed)