taskreaper: Wait for tasks to stop running #1948
Merged
Currently, the task reaper only waits for a task's desired state to go
past "running" before deleting it. This can cause problems with rolling
updates when task-history-limit is set to 1. The old task can get
deleted by the reaper before the agent has a chance to update its actual
state to "shutdown", so the update never sees that the task has
completed and has to wait for a timeout.
This fixes it by making the task reaper only delete tasks whose desired
and actual states have both moved past "running". It's also necessary to keep
slots in the "dirty" list until there is only one task in that slot with
desired state or actual state <= "running", so that the old task still
gets cleaned up once the actual state moves. Finally, "deleteTasks" is
changed to a map so that a task which is both part of a dirty slot and
orphaned won't cause two delete attempts (one of which would fail).
Note that this means tasks on unavailable nodes will stay around for
a while, until the "orphaned" state is reached.
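
The rules described above can be illustrated with a minimal Go sketch. This is not the code from this PR: the `TaskState` constants, the `Task` struct, and the helpers `pastRunning`, `slotStillDirty`, and `collectDeletions` are hypothetical names made up for illustration, and the sketch ignores the history limit and task ordering entirely. It only shows the three ideas: delete only when both states are past "running", keep a slot dirty while more than one task is not yet past "running", and use a map so a task is queued for deletion once.

```go
package main

import "fmt"

// TaskState is a simplified, hypothetical stand-in for the real task state
// enum; only the relative ordering of the values matters here.
type TaskState int

const (
	StateNew TaskState = iota
	StateRunning
	StateShutdown
	StateOrphaned
)

// Task is a hypothetical, pared-down task record.
type Task struct {
	ID           string
	DesiredState TaskState
	ActualState  TaskState
}

// pastRunning reports whether both the desired and the actual state have
// moved past "running", i.e. whether the task is safe to delete.
func pastRunning(t Task) bool {
	return t.DesiredState > StateRunning && t.ActualState > StateRunning
}

// slotStillDirty reports whether a slot has to stay in the dirty list:
// it does as long as more than one task in the slot has a desired or
// actual state <= "running".
func slotStillDirty(tasks []Task) bool {
	notDone := 0
	for _, t := range tasks {
		if !pastRunning(t) {
			notDone++
		}
	}
	return notDone > 1
}

// collectDeletions gathers task IDs into a set, so a task that shows up both
// in a dirty slot and in the orphaned list is only queued once.
func collectDeletions(slotTasks, orphaned []Task) map[string]struct{} {
	deleteTasks := make(map[string]struct{})
	for _, t := range slotTasks {
		if pastRunning(t) {
			deleteTasks[t.ID] = struct{}{}
		}
	}
	for _, t := range orphaned {
		deleteTasks[t.ID] = struct{}{}
	}
	return deleteTasks
}

func main() {
	// Old task: asked to shut down, but the agent has not reported it yet.
	old := Task{ID: "old", DesiredState: StateShutdown, ActualState: StateRunning}
	cur := Task{ID: "new", DesiredState: StateRunning, ActualState: StateRunning}

	fmt.Println(pastRunning(old), slotStillDirty([]Task{old, cur}))

	// The agent reports the shutdown; the old task becomes deletable.
	old.ActualState = StateShutdown
	fmt.Println(pastRunning(old), slotStillDirty([]Task{old, cur}))
	fmt.Println(len(collectDeletions([]Task{old, cur}, []Task{old})))
}
```

In this sketch the old task is not deletable while its actual state is still "running", which is the window in which the update would otherwise miss the completion.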
Tested by vendoring into docker and using the repro steps from moby/moby#28291.
cc @aluzzardi @dongluochen