-
Notifications
You must be signed in to change notification settings - Fork 40.5k
fix #51135 make CFS quota period configurable #63437
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix #51135 make CFS quota period configurable #63437
Conversation
@kubernetes/sig-node-pr-reviews @derekwaynecarr |
The value used today is the default in many OS distributions. There are many users running with 100ms today that could see a change in behavior. I was not at Kubecon, so I missed some context for the discussion. I prefer we have a kubelet flag to tweak the desired cfs_period_us setting on a linux host rather than hard-coding. Red Hat has customers that disable CFS quota entirely via the existing flag. I think its reasonable to pair that flag with the additional option to tweak the default CFS period. I do not think we need to let it be tweaked per pod. /hold |
@derekwaynecarr The |
Yes
…On Fri, May 4, 2018 at 5:37 PM Davanum Srinivas ***@***.***> wrote:
@derekwaynecarr <https://github.com/derekwaynecarr> The "cpu-cfs-quota"
command line flag in kubelet right? So we would have a
"cpu-cfs-quota-period" to go along with it?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#63437 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AF8dbPLiCmJFAt46M0b_94jS34wGFFxYks5tvMohgaJpZM4Ty3li>
.
|
@szuecs looking forward to an update in this PR with the suggestion above |
see my comment here for why a flag is preferred. Not all nodes are tuned for latency. |
Thanks for all your comments. I will work on it probably next week, we have a short week, so I am not sure if takes more than next week. |
@derekwaynecarr I added the cli flag and config option. I hope that I added it everywhere, where it is needed. I tested, that:
Let me know if I have to change anything. |
/hold |
@szuecs i think you need to set the default value in |
/ok-to-test |
@derekwaynecarr @vishh Do we count this as a new feature? if so, Does this feature need an alpha feature gate? |
@derekwaynecarr @vishh Also, do we leave the current default as-is? (when nothing is specified in the command line) |
So, please squash the commits (for easier review), rebase to master and we should get all green (cross my fingers) thanks, |
8bba451
to
f0b2f0d
Compare
/test all [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue (batch tested with PRs 63437, 68081). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md. |
@szuecs: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@derekwaynecarr @dims @timothysc @szuecs thanks! |
This has broken the release blocking alphafeatures test suite: https://k8s-testgrid.appspot.com/sig-release-1.12-blocking#gce-cos-1.12-alphafeatures The kubelet is crashlooping: Tracking bug for release-blocking failures: #68313 |
@dashpole @derekwaynecarr @dims please give this your immediate attention re: #68313. |
@@ -154,6 +155,9 @@ func SetDefaults_KubeletConfiguration(obj *KubeletConfiguration) { | |||
if obj.CPUCFSQuota == nil { | |||
obj.CPUCFSQuota = utilpointer.BoolPtr(true) | |||
} | |||
if obj.CPUCFSQuotaPeriod == nil && obj.FeatureGates[string(features.CPUCFSQuotaPeriod)] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if this actually works... We generally do not feature gate the defaulting of flags. We generally only place the feature gate around code we don't want to run, rather than around configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does look like a problem. The flag should always default, it just shouldn’t have been used. Apologies for missing in review
Fix is out: #68386 |
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md. Remove feature gate from kubelet defaulting **What this PR does / why we need it**: Fixes a release-blocking test: https://k8s-testgrid.appspot.com/sig-release-1.12-blocking#gce-cos-1.12-alphafeatures Regression added by #63437 This solution was discussed on slack in the sig-release channel This should be targeted for 1.12 **Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*: Issue ##68313 **Special notes for your reviewer**: /hold testing to make sure this fixes the issue Using: `make test-e2e-node FOCUS=ImageGCNoEviction SKIP= PARALLELISM=1 REMOTE=true TEST_ARGS='--feature-gates=CustomCPUCFSQuotaPeriod=true'` to reproduce the issue, as it runs a test with the feature gate enabled. **Release note**: ```release-note NONE ``` /assign @dims @derekwaynecarr /sig node /kind bug /priority critical-urgent
ping @Pingan2017 |
Adds a monzo.com/cpu-period resource, which allows tuning the period of time over which the kernel tracksw CPU throttling. In upstream Kubernetes versions pre-1.12, this is not tunable and is hardcoded to the kernel default (100ms). We originally introduced this after seeing long GC pauses clustered around 100ms [1], which was eventually traced to CFS throttling. Essentially it's recommended for very latency sensitive & bursty workloads (like HTTP microservices!) it's recommended to set the CFS quota period lower. We mostly set ours at 5ms across the board. See [2] and [3] for further discussion in the Kubernetes repository. This is fixed in upstream 1.12 via a slightly different path [4]; the period is now tunable via a kubelet CLI flag. This doesn't give us as fine-grained control, but we can still set this and optimise for the vast majority of our workloads. [1] golang/go#19378 [2] kubernetes#51135 [3] kubernetes#67577 [4] kubernetes#63437
Adds a monzo.com/cpu-period resource, which allows tuning the period of time over which the kernel tracksw CPU throttling. In upstream Kubernetes versions pre-1.12, this is not tunable and is hardcoded to the kernel default (100ms). We originally introduced this after seeing long GC pauses clustered around 100ms [1], which was eventually traced to CFS throttling. Essentially it's recommended for very latency sensitive & bursty workloads (like HTTP microservices!) it's recommended to set the CFS quota period lower. We mostly set ours at 5ms across the board. See [2] and [3] for further discussion in the Kubernetes repository. This is fixed in upstream 1.12 via a slightly different path [4]; the period is now tunable via a kubelet CLI flag. This doesn't give us as fine-grained control, but we can still set this and optimise for the vast majority of our workloads. [1] golang/go#19378 [2] kubernetes#51135 [3] kubernetes#67577 [4] kubernetes#63437 Squashed commits: commit 61551b0 Merge: a446c68 de2c6cb Author: Miles Bryant <[email protected]> Date: Wed Mar 13 16:16:17 2019 +0000 Merge pull request #2 from monzo/v1.9.11-kubelet-register-cpu-period Register nodes with monzo.com/cpu-period resource commit de2c6cb Author: Miles Bryant <[email protected]> Date: Wed Mar 13 15:14:58 2019 +0000 Register nodes with monzo.com/cpu-period resource We have a custom pod resource which allows tuning the CPU throttling period. Upgrading to 1.9 causes this to break scheduling logic, as the scheduler and pod preemption controller takes this resource into account when deciding where to place pods, and which pods to pre-empt. This patches the kubelet so that it registers its node with a large amount of this resource - 10000 * max number of pods (default 110). We typically run pods with this set to 5000, so this should be plenty. commit a446c68 Author: Miles Bryant <[email protected]> Date: Tue Jan 29 16:43:03 2019 +0000 ResourceConfig.CpuPeriod is now uint64, not int64 Some changes to upstream dependencies between v1.7 and v1.9 mean that the CpuPeriod field of ResourceConfig has changed type; unfortunately this means the Monzo CFS period patch doesn't compile. This won't change behaviour at all - the apiserver already validates that `monzo.com/cpu-period` can never be negative. The only edge case is if someone sets it to higher than the int64 positive bound (this will result in an overflow), but I don't think this is worth mitigating commit 1ead2d6 Author: Oliver Beattie <[email protected]> Date: Wed Aug 9 22:57:53 2017 +0100 [PLAT-713] Allow the CFS period to be tuned for containers
Our php container started seeing problems connected with CPU limits. We started profiling and saw that We are hosted on GKE, is there an option to disable cpu limits or reduce CFS period there? Does anyone has idea how to fix this problem on GKE? |
@@ -220,6 +220,8 @@ type KubeletConfiguration struct { | |||
// cpuCFSQuota enables CPU CFS quota enforcement for containers that | |||
// specify CPU limits | |||
CPUCFSQuota bool | |||
// CPUCFSQuotaPeriod sets the CPU CFS quota period value, cpu.cfs_period_us, defaults to 100ms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm super confused about this one and can't figure this one out by myself: cpu.cfs_period_us
default limit called "100ms" but actually is microseconds:
cpu.cfs_period_us
: the length of a period (in microseconds)
In the change I'm commenting author clearly acted under assumptions that milliseconds are used in k8s and now it's all over the test cases and comments. But I wonder where that 1000x disperancy between k8s value and Linux value comes from? And does it exist, or k8s default is actually a microsecond, same as linux? Unless we multiple that default value of QuotaPeriod
by 1000 somewhere, converted to time.Duration
it's 100 microseconds and not milliseconds.
cc @vishh @sjenning and especially @hjacobs: I hope you all have some context about it and can tell me why I'm wrong and k8s code comments are right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes cfs_period_us is in µs, that's why it has _us
suffix
Maybe https://go.dev/play/p/65fG1OmLFYN helps to understand.
As far as I remember, I think the time.Duration is only used if you want to set it via kubelet flag/config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment in that MR says "ms". In contrast, I think it better be "microseconds". I can send MR to change it as well as explicit millisecond mentions in the current code, whereas it should be microseconds. I'll ping you in it.
In this MR, you wrote milliseconds duration in multiple places in defaults and tests CPUCFSQuotaPeriod: metav1.Duration{Duration: 100 * time.Millisecond},
, I wonder if that should be changed to microseconds to match the actual default value of 100 microseconds?
Also, the discussion which led to this MR has multiple people talking about changing the value from 100 milliseconds to 5/25 milliseconds for testing. Still, it doesn't make sense unless everyone were altering the source code in microseconds, thinking it's milliseconds. @hjacobs even did a talk about it talking about the default value as if it was in milliseconds. Your code introducing milliseconds and not microseconds (so 1000x) contributed to my confusion, and I wonder if it's intentional or if you are mistaken, and it should be changed microseconds to change the real default.
The only way current code would make sense to me is if the value you feed to k8s (100 milliseconds) is divided by 1000 at some point in time and then matches the real Linux default of 100 microseconds, but I haven't found such a place yet, and I would doubt that's the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@szuecs, the simplest form of the question I can form is the following.
Is the 100ms value of CPUCFSQuota
result in the same k8s behaviour as the default Linux value of 100 microseconds, or is it 1000 times higher as its milliseconds versus microseconds in Linux?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wrong, and the original code was correct. 100ms is indeed a default value in Linux, and I got confused by conversions in two places in code that had no comments about them. PR #112123 adds comments to the code added here.
// we set the period to 100ms by default | ||
period = quotaPeriod | ||
if !utilfeature.DefaultFeatureGate.Enabled(kubefeatures.CPUCFSQuotaPeriod) { | ||
period = quotaPeriod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@szuecs that's precise place which worries me: previously we used quotaPeriod
as-is, which, translated to time.Duration
, equals to 100 microseconds. And now we set it to cpuCFSQuotaPeriod
, which according to code and comments is 100 millisecond default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I don't see how this code can be worrying, because it shows the same assignment as before. If not the feature gate is set, set it to quotaPeriod as before. see https://github.com/kubernetes/kubernetes/pull/63437/files/588d2808b77d11f235b6eba5c21bcaa89a2f7804#r932625167
- if you enable the featuregate you can set to the value you like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ms" mentioned across the code alongside the tests with milliseconds and not microseconds confused me. Now I see it's still microseconds, and comments should be altered, and maybe tests as well. I'll prepare the PR, thanks for the help!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @paskal , makes sense to update, if it's wrong :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created #111520 about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wrong, and the original code was correct. 100ms is indeed a default value in Linux, and I got confused by conversions in two places in code that had no comments about them. PR #112123 adds comments to the code added here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I read the kernel documentation at least 5 times when I did investigation and the code here.
const ( | ||
// Taken from lmctfy https://github.com/google/lmctfy/blob/master/lmctfy/controllers/cpu_controller.cc | ||
minShares = 2 | ||
sharesPerCPU = 1024 | ||
milliCPUToCPU = 1000 | ||
|
||
// 100000 is equivalent to 100ms | ||
quotaPeriod = 100 * minQuotaPeriod | ||
quotaPeriod = 100000 | ||
minQuotaPeriod = 1000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quotaPeriod calculation was inlined:
100 * minQuotaPeriod = 100 * 1000 = 100000
What this PR does / why we need it:
This PR makes it possible for users to change CFS quota period from the default 100ms to some other value between 1µs and 1s.
#51135 shows that multiple production users have serious issues running reasonable workloads in kubernetes. The latency added by the 100ms CFS quota period is adding way too much time.
Which issue(s) this PR fixes:
Fixes #51135
Special notes for your reviewer:
Release note: