You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently, the scheduler picks up next to-run g in the following
delicate order:
1 locked g.
2 trace reader g.
3 GC backgroud mark worker g.
4 global runq if schedtick%61 == 0.
5 local runq including timers.
6 findrunnable loop.
6.1 finalizer g.
6.2 local runq including timers.
6.3 global runq, no schedtick checked.
6.4 non-blocking netpoll(0) if netpollWaiters > 0 and no other M polling.
6.5 steal any timers and runnable g from other P. spinning state.
6.6 GC idle mark worker g.
6.7 recheck all other P's local runq, idle GC marker and timers.
non-spinning state, P is idle.
6.8 blocking netpoll(delay) until next earliest timer or IO ready.
if any other M already polling, no blocking netpoll. M parked.
If found a runnable g in any step, all steps following it will be skipped.
Suppose there are many runnable gs, scheduler would almost find a g before
trying findrunnable, so netpoll left untouched. Normal Ms have no opportunity
to unpark netpoll waiters g even though IO readied. The special M sysmon polls
network if not polled for more than 10ms, which is the only opportunity to
ready any netpoll waiters g into the global runq. This causes a false positive
timeout problem.
I did a simple test to demostrate this problem. Dialing a localhost TCP
listening port costs about 10ms. This is obviously not credible. See file
runtime/sysmon_test.go for detail test code.
Normal Ms will sleep in netpoll waiting for timer and IO ready if no g
found at last. The special M sysmon simply sleeps 20us~10ms and does
a non-blocking netpoll if needed. This CL let sysmon sleep in netpoll,
do a blocking netpoll like what normal Ms do. So netpoll waiters g will
be readied into runq as soon as IO readied.
With this change, the above mentioned test shows that dialing to localhost
TCP listening port costs mostly less than 1ms. Its much more reasonable than
10ms. Detailed test results for the old and new versions are as follows:
$ ../../bin/go test -v -run TestSysmonReadyNetpollWaitersASAP
=== RUN TestSysmonReadyNetpollWaitersASAP
sysmon_test.go:105: dialed 85 times within 1.000029211s
sysmon_test.go:106: timeBucket count percent
sysmon_test.go:108: [ 0, 1)ms 1 1.18%
sysmon_test.go:108: [ 1, 2)ms 0 0.00%
sysmon_test.go:108: [ 2, 3)ms 0 0.00%
sysmon_test.go:108: [ 3, 4)ms 0 0.00%
sysmon_test.go:108: [ 4, 5)ms 0 0.00%
sysmon_test.go:108: [ 5, 6)ms 0 0.00%
sysmon_test.go:108: [ 6, 7)ms 0 0.00%
sysmon_test.go:108: [ 7, 8)ms 0 0.00%
sysmon_test.go:108: [ 8, 9)ms 1 1.18%
sysmon_test.go:108: [ 9,10)ms 2 2.35%
sysmon_test.go:108: [10,11)ms 32 37.65%
sysmon_test.go:108: [11,12)ms 49 57.65%
--- FAIL: TestSysmonReadyNetpollWaitersASAP (1.11s)
$ ../../bin/go test -v -run TestSysmonReadyNetpollWaitersASAP
=== RUN TestSysmonReadyNetpollWaitersASAP
sysmon_test.go:105: dialed 2369 times within 1.000021368s
sysmon_test.go:106: timeBucket count percent
sysmon_test.go:108: [ 0, 1)ms 2209 93.25%
sysmon_test.go:108: [ 1, 2)ms 13 0.55%
sysmon_test.go:108: [ 2, 3)ms 33 1.39%
sysmon_test.go:108: [ 3, 4)ms 68 2.87%
sysmon_test.go:108: [ 4, 5)ms 8 0.34%
sysmon_test.go:108: [ 5, 6)ms 9 0.38%
sysmon_test.go:108: [ 6, 7)ms 6 0.25%
sysmon_test.go:108: [ 7, 8)ms 4 0.17%
sysmon_test.go:108: [ 8, 9)ms 7 0.30%
sysmon_test.go:108: [ 9,10)ms 4 0.17%
sysmon_test.go:108: [10,11)ms 5 0.21%
sysmon_test.go:108: [11,12)ms 3 0.13%
--- PASS: TestSysmonReadyNetpollWaitersASAP (1.11s)
0 commit comments