Skip to content

Always skip randomly failing OCI8 extauth tests #9524

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 29 commits into from

Conversation

mvorisek
Copy link
Contributor

fix #8348 (comment)

The issue is weird as it happends only sometimes in quite repeatable environment like GH Actions.

I guess the anything connection string is the problem, sometimes it is resolved immediatelly, sometimes the resolving (or the connection) to the hostname ends with a timeout. I was however not able to reproduce it locally.

These tests was marked as slow tests. There are not slow. There are either fast or hit the problem ;-)

@mvorisek mvorisek marked this pull request as draft September 11, 2022 22:17
@@ -70,7 +66,7 @@ var_dump($c);

echo "Test 7\n";

$c = oci_connect('/', '', 'anything', null, OCI_CRED_EXT);
$c = oci_connect('/', '', 'localhost', null, OCI_CRED_EXT);
Copy link
Contributor Author

@mvorisek mvorisek Sep 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when localhost is used instead of anything, tests fail, so it cannot be really "anything".

CI https://github.com/php/php-src/runs/8295659688?check_suite_focus=true#step:10:1952

@@ -30,4 +30,4 @@ runs:
--offline \
--show-diff \
--show-slow 1000 \
--set-timeout 120
--set-timeout 1200
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/php/php-src/runs/8296934151?check_suite_focus=true#step:11:3157 shows when the issue is hit, there is no OCI8 timeout or the timeout is higher than 10 minutes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmb69
Copy link
Member

cmb69 commented Sep 12, 2022

Thank you for working on this!

FWIW, I just learned that as of Oracle 19c it is possible to specify CONNECT_TIMEOUT and TRANSPORT_CONNECT_TIMEOUT in the connection string.

@mvorisek
Copy link
Contributor Author

mvorisek commented Sep 12, 2022

Hi, @cjbj, I am reaching you as you are the author of these extauth tests.

Can you please explain what resources

$c = oci_connect('/', '', 'x.cz', null, OCI_CRED_EXT);

the test is connecting to and why the elapsed time varies from a second to 10 minutes?

CI results:

https://github.com/mvorisek/php-src/runs/8301382524?check_suite_focus=true#step:9:115

 ========DIFF========
--
     Test 7
     
     Warning: oci_connect(): ORA-12154: %s in %s on line %d
040+ elapsed: 13 secs
040- elapsed: %c secs
     array(4) {
       ["code"]=>
       int(12154)
--
========DONE========
FAIL Test External Authentication errors with oci_connect [ext/oci8/tests/extauth_01.phpt] 

@cjbj
Copy link
Contributor

cjbj commented Sep 12, 2022

FWIW, I just learned that as of Oracle 19c it is possible to specify CONNECT_TIMEOUT and TRANSPORT_CONNECT_TIMEOUT in the connection string.

See Oracle Database 21c Easy Connect Plus

@cjbj
Copy link
Contributor

cjbj commented Sep 12, 2022

Can you please explain what resources

$c = oci_connect('/', '', 'x.cz', null, OCI_CRED_EXT);

the test is connecting to and why the elapsed time varies from a second to 10 minutes?

These tests are supposed to fail to connect, so they were originally written with connection strings like 'anything'. Maybe with the closer-to-real connection string x.cz there is some DNS lookup that is taking time?

@mvorisek
Copy link
Contributor Author

Can you please explain what resources

$c = oci_connect('/', '', 'x.cz', null, OCI_CRED_EXT);

the test is connecting to and why the elapsed time varies from a second to 10 minutes?

These tests are supposed to fail to connect, so they were originally written with connection strings like 'anything'. Maybe with the closer-to-real connection string x.cz there is some DNS lookup that is taking time?

What I see is random ~200s / connect timeout. I cannot explain these timeouts. I tried real x.cz domain, but the problem remained. The problem must be burried somewhere else and I would even say it is some burried bug. Therefore I am asking you if the external auth tries to connect to some other host as a fallback or if you can somehow help identifying/fixing the issue.

The only solution I have now is to change it to localhost which then returns ORA-01017: invalid username/password; logon denied or remove the "Test 7" - "Test 10" completely. Not sure if the tests makes sense then, also, it can fail in other CI environments then.

@cmb69
Copy link
Member

cmb69 commented Sep 12, 2022

@mvorisek, have you tried something like anything?connect_timeout=60&transport_connect_timeout=30?

@mvorisek
Copy link
Contributor Author

@mvorisek, have you tried something like anything?connect_timeout=60&transport_connect_timeout=30?

I will once the previous tests finish (they take like 3h / commit, later, I optimized them to test OCI8 ext only).

My question to you it, what shoult this solve?

a) the timeouts are random, shouldn't we first confirm if such insane behavior is valid, even with stable DNS names like x.cz (resolves to empty host).
b) if we will add timeout config, do we still test the same behaviour? with this, CI will be producing two type of results (can be adjusted), but the long random timeouts will be kept (to the configured timeout) :)

So if we cannot find the real issue, I am more inclined to remove (comment out) these tests for now.

@mvorisek
Copy link
Contributor Author

@mvorisek, have you tried something like anything?connect_timeout=60&transport_connect_timeout=30?

@cmb69 tried - https://github.com/php/php-src/actions/runs/3039225236/jobs/4893908509#step:11:224 - has no effect

@mvorisek
Copy link
Contributor Author

mvorisek commented Sep 12, 2022

I was able to reproduce the problem in CI /w strace: https://github.com/mvorisek/php-src/actions/runs/3039764841/jobs/4895065748

relevant logs:

strace-short-timeout.txt

strace-long-timeout.txt

@cjbj can you please check the snlinGetAddrInfo libclntsh.so code?

a) it seems the Oracle Instant Client is looping massively when resolving the host
b) in the 2nd log, search for first "Timeout" (l26855) and notice, the looping continues, but with ~5 secs delays - this is the cause of connect sometimes taking randomly ~200 secs (instead of < 1 sec, even for exactly the same host).

The Oracle Instant Client has probably some condition in the code wrong.

Code to reproduce: just php -r 'oci_new_connect("/", "", "anything", null, OCI_CRED_EXT);' and luck - rerun in loop until you hit the problem (the elapsed time is > 10 secs)

@mvorisek mvorisek marked this pull request as ready for review September 12, 2022 20:49
@mvorisek mvorisek changed the title Fix OCI8 extauth tests Always skip randomly failing OCI8 extauth tests Sep 12, 2022
@mvorisek
Copy link
Contributor Author

mvorisek commented Sep 12, 2022

@cmb69 PR done, the 3 extauth tests cannot be fixed, they need to be skipped until the Oracle Instant Client is fixed. Tested with hundreds of CI runs, this PR fixes the random CI failures. Before merge, please squash the changes into one commit.

@cmb69 cmb69 closed this in 9a73ec0 Sep 13, 2022
@cmb69
Copy link
Member

cmb69 commented Sep 13, 2022

Thank you!

@mvorisek mvorisek deleted the fix_extauth_tests branch September 13, 2022 13:59
@cjbj
Copy link
Contributor

cjbj commented Sep 16, 2022

@cjbj can you please check the snlinGetAddrInfo libclntsh.so code?

I'll point our network people at this and let the experts discuss it.

Thanks for all your work on this PR.

@cjbj
Copy link
Contributor

cjbj commented Sep 19, 2022

@cjbj can you please check the snlinGetAddrInfo libclntsh.so code?

I'll point our network people at this and let the experts discuss it.

@mvorisek The reply is "The looping and timeout behavior would depend on the setting in /etc/resolv.conf. Here it looks like the default timeout is 5 seconds and there might be multiple DNS servers specified so after each timeout the next DNS host would be tried. The content of resolv.conf can be seen to verify the number of dns servers. All the looping would be inside getaddrinfo call."

@mvorisek
Copy link
Contributor Author

5 second timeout is expected. But what I have experienced, and the logs shows it as well, is massive looping/retries with ~200 seconds timeout total.

According to the https://man7.org/linux/man-pages/man5/resolver.5.html there should be 2 retries per DNS server by default. With 3 DNS servers, this will imply 5 x 2 x 3 = 30 seconds timeout. Maybe GH Actions has some special configuration or resolv patched for less failures, but I doubt so.

@cjbj Is the Oracle Instant Client really calling the getaddrinfo no more than once?

@cjbj
Copy link
Contributor

cjbj commented Sep 21, 2022

@mvorisek how long does nslookup take in that environment? You could turn on Oracle network tracing to check if snlinGetAddrInfo is being called multiple times. (Ping me if you need steps). Our net team say it's basically a DNS setup issue as DNS lookup shouldn't hang.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants