From: merch-redmine@...
Date: 2020-08-31T21:38:15+00:00
Subject: [ruby-core:99809] [Ruby master Bug#14997] Socket connect timeout exceeds the timeout value for

Issue #14997 has been updated by jeremyevans0 (Jeremy Evans).

Status changed from Open to Closed

I believe this timeout issue is now solved by the Socket.tcp :resolv_timeout option, introduced in commit:6382f5cc91ac9e36776bc854632d9a1237250da7.
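
For illustration, here is a minimal sketch of passing that option alongside `connect_timeout`. The wrapper name `open_with_timeouts` and the 100ms defaults are assumptions made for this example, and `:resolv_timeout` requires a Ruby version that includes the commit above.

```ruby
require "socket"

# Hypothetical wrapper, not part of Ruby's socket library.
# Both timeouts are in seconds; the 0.1 defaults are placeholders.
def open_with_timeouts(host, port, resolv_timeout: 0.1, connect_timeout: 0.1)
  # resolv_timeout bounds name resolution; connect_timeout bounds connecting.
  Socket.tcp(host, port,
             resolv_timeout: resolv_timeout,
             connect_timeout: connect_timeout)
end
```

Called without a block, `Socket.tcp` returns the connected socket, so the caller is responsible for closing it.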

----------------------------------------
Bug #14997: Socket connect timeout exceeds the timeout value for 
https://bugs.ruby-lang.org/issues/14997#change-87330

* Author: maciej.mensfeld (Maciej Mensfeld)
* Status: Closed
* Priority: Normal
* ruby -v: 2.5.1
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
Consider a domain that resolves to multiple IPs (4 in the following example):

```
dig debug-xyz.elb.us-east-1.amazonaws.com a

; <<>> DiG 9.10.3-P4-Ubuntu <<>> debug-xyz.elb.us-east-1.amazonaws.com a
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54375
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;debug-xyz.elb.us-east-1.amazonaws.com. IN A

;; ANSWER SECTION:
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.86.79
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.109.24
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.119.55
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.71.167

;; Query time: 4 msec
;; SERVER: 172.31.0.2#53(172.31.0.2)
;; WHEN: Tue Aug 14 13:46:18 UTC 2018
;; MSG SIZE  rcvd: 132
```

When `connect_timeout` is set to some value N, the overall time spent on non-responsive endpoints that don't immediately raise an exception can reach `N * 4`, because each resolved address is tried in turn with its own timeout.

This can disrupt some time-sensitive systems.

We've experienced it with the following setup:

- TCP server (event machine) behind an AWS NLB
- TCP server process goes down behind NLB but NLB is still responsive
- Socket connect_timeout is set to 100ms
- AWS NLB keeps the connection in the waiting state hoping that the service behind it will get back to normal (but it doesn't)
- Ruby times out after 100ms
- Ruby tries to connect to the next IP from the pool (AWS NLB again)
- Because the name resolves to 4 hosts, the overall timeout is 400ms.
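
One possible way to enforce a "hard" overall limit is to share a single time budget across all resolved addresses. The following is only a sketch: `connect_with_deadline` is a hypothetical helper, not something provided by Ruby's socket library, and it does not bound the time spent in name resolution itself.

```ruby
require "socket"

# Hypothetical helper: tries each resolved address in turn, but gives every
# attempt only the time that is left in one shared budget (in seconds).
def connect_with_deadline(host, port, total_timeout)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + total_timeout
  last_error = nil

  Addrinfo.foreach(host, port, nil, :STREAM) do |addr|
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    break if remaining <= 0 # budget exhausted, stop trying further addresses

    begin
      # Addrinfo#connect raises Errno::ETIMEDOUT when the timeout elapses.
      return addr.connect(timeout: remaining)
    rescue SystemCallError => e
      last_error = e
    end
  end

  # No address succeeded within the budget.
  raise(last_error || Errno::ETIMEDOUT)
end
```

With a budget of 0.1, the setup described above would give up after roughly 100ms in total rather than 100ms per address, at the cost of later addresses getting less time (or none at all).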

I'm not sure whether this should be classified as a bug or a feature, but I believe it should definitely be documented, or there should be an option to enforce a "hard" overall limit.

Here's the code actually responsible for this behavior: https://github.com/ruby/ruby/blob/trunk/ext/socket/lib/socket.rb#L631-L664

