Make it configurable when to retry a job (e.g. only on system failure)

Description

Sometimes our builds fail because of a system runner failure. Even though the Docker connection is much more stable now (#2408 (closed)), the setup still fails occasionally and we get a "system runner failure".

It would be nice to have an automatic way to retry such failures, but only those failures.

I know I can set up a general retry, but I explicitly only want system failures to be retried.

The benefit of retrying only system failures is this: we have a policy that specs should always work reliably. If a spec fails only some of the time, it is a real failure, and automatically retrying it until it passes is no solution.

But if it is a system runner failure, 99.9 % of the time I hit the "retry" button and it works, because the cause is something temporary like ...

 Job failed (system failure): Error response from daemon: error creating overlay mount to /var/lib/docker/overlay2

I want to reduce the noise for our developers, so that when they see a "job failed" message they can (mostly) be sure it is because of a spec, not because of a system failure.

Proposal

Make it possible to configure retry. If there are only two different failure types (system and script), one could add an optional when key to retry:

retry:
  count: 2
  when:
    - system_failure
    - script_failure

It should still be allowed to pass only a number to retry, so the old behaviour stays (retry always). This means the above is equivalent to:

retry: 2
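
For the use case described in this issue, a job might then be configured roughly as follows. This is only a sketch of the proposed syntax: the job name and script are hypothetical placeholders, and the count and when keys and the system_failure value are assumed to be adopted as named in this proposal.

```yaml
# Hypothetical job; the name and script are placeholders.
rspec:
  script:
    - bundle exec rspec
  retry:
    count: 2             # proposed key: retry at most two times
    when:
      - system_failure   # proposed value: retry only system (runner) failures
```

With such a configuration, a failing spec (script failure) would fail the job immediately, while a temporary runner problem would be retried without anyone having to hit the "retry" button.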

Edited Aug 16, 2018 by Markus Doits