Implement frameless internal function calls #12461

iluuu1994 · 2023-10-17T16:44:32Z

Coercion. Coercion happens in-place. However, since args to frameless functions aren't copied, coercing them in-place can escape the function boundary (for VAR|CONST). We should either copy these values in the handler for these op types, or handle the coercion and freeing in the functions. Edit: This was only relevant to strings for the current frameless handlers. It was solved by creating a temporary zval that is used if the parameter is not a string. After the function returns, the zval must be cleaned up if it was used, which unfortunately requires manual handling in the frameless handler.
~~We probably need observer support~~ Frameless calls are disabled when there's an observer extension for now
JIT support
- Tracing
- Function
Try arity-completion (provide default-args for handlers with larger arity)
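
The temporary-value approach from the coercion item above can be sketched roughly as follows. This is a simplified model, not the engine's code: the `value` struct stands in for a zval, and `arg_as_string` and `frameless_strlen` are hypothetical names chosen for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a zval: either an integer or a heap string. */
typedef enum { VAL_INT, VAL_STR } val_type;

typedef struct {
    val_type type;
    long     num;  /* valid when type == VAL_INT */
    char    *str;  /* valid when type == VAL_STR */
} value;

/* Return a string view of `arg` without modifying `arg` itself.
 * If coercion is needed, the result is stored in `tmp` and `*used_tmp`
 * is set so the caller knows it has to release the temporary. */
const char *arg_as_string(const value *arg, value *tmp, int *used_tmp)
{
    if (arg->type == VAL_STR) {
        *used_tmp = 0;
        return arg->str;
    }
    char buf[32];
    snprintf(buf, sizeof buf, "%ld", arg->num);
    size_t len = strlen(buf) + 1;
    tmp->str = malloc(len);
    if (tmp->str == NULL) {
        abort();
    }
    memcpy(tmp->str, buf, len);
    tmp->type = VAL_STR;
    tmp->num = 0;
    *used_tmp = 1;
    return tmp->str;
}

/* Frameless-style handler: it receives the caller's operand directly,
 * so it must not coerce the operand in place. */
size_t frameless_strlen(const value *arg)
{
    value tmp;
    int used_tmp;
    const char *s = arg_as_string(arg, &tmp, &used_tmp);
    size_t len = strlen(s);
    if (used_tmp) {
        free(tmp.str); /* manual cleanup of the temporary */
    }
    return len;
}
```

The caller's operand keeps its original type after the call, at the cost of the explicit cleanup step in every handler that may coerce.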

dstogov

I like the idea and the implementation is mostly good.
I don't like the compile time restrictions. I think it shouldn't be a big problem to get rid of them.

I would also recommend adding frameless handlers for trivial and often used functions: trim(), ord(), chr(), strtolower().

Zend/zend_API.c

Zend/zend_frameless_function_handler_lists.c

Zend/zend_vm_def.h

Zend/zend_vm_gen.php

As requested by Dmitry in GH-12461 for easier reviewing.

staabm · 2023-11-01T15:46:58Z

Do these frameless functions influence how stack limit works (e.g. max callstack size exceeded)?

iluuu1994 · 2023-11-01T16:04:40Z

@staabm This doesn't significantly affect stack limits. Recursion with VM reentry is subject to C stack overflow. Userspace recursion doesn't have a stack limit so much as a memory limit.

dstogov

I don't see major problems.
I would prefer some better naming than "frameless" and "flf".

JIT support shouldn't be a big problem. See comments.
In case of an observer, you may disable compilation to ZEND_FRAMELESS_ICALL_*.

Zend/Optimizer/sccp.c

Zend/Optimizer/zend_dfg.c

Zend/zend_vm_def.h

Zend/zend_compile.h

Zend/zend_execute_API.c

Zend/zend_frameless_function.c

ext/opcache/jit/zend_jit_trace.c

Zend/zend_vm_def.h

nielsdos · 2023-11-11T21:36:31Z

I wonder whether the reverse way should be done for RCN marking: instead of marking what args may increase the refcount, mark what args will never increase the refcount and assume RCN otherwise. That's safer in case the implementer forgets. OTOH it's more boilerplate. Just something to think about.

iluuu1994 · 2023-11-11T21:39:43Z

@nielsdos I agree. I will change this so that omitting the flags sets it to -1. Btw, any idea what's going on with the LLVM script?
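
The "safe by default" marking discussed here could look roughly like the sketch below. It assumes the flags form a bitmask with one bit per argument, so that -1 (all bits set) means "any argument may have its refcount increased". That interpretation, the `flf_info` struct, and the `FLF_*` macros are assumptions made for illustration, not names from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-function metadata. Bit i set means argument i
 * may gain a reference inside the handler. */
typedef struct {
    const char *name;
    uint32_t    rc_mask;
} flf_info;

/* No annotation given: every bit is set, the same bit pattern as -1. */
#define FLF_DEFAULT(fname)        { (fname), UINT32_MAX }
/* Opt-out annotation: clear only the bits the implementer vouches for. */
#define FLF_NEVER_RC(fname, safe) { (fname), UINT32_MAX & ~(uint32_t)(safe) }
#define FLF_ARG(i)                (UINT32_C(1) << (i))

/* Conservative query: arguments outside the mask width count as unsafe. */
bool flf_arg_may_gain_ref(const flf_info *info, unsigned arg)
{
    return arg >= 32 || ((info->rc_mask >> arg) & 1u) != 0;
}
```

With this layout, forgetting the annotation only costs an optimization opportunity, whereas the opt-in layout would silently produce wrong refcount assumptions.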

nielsdos · 2023-11-11T21:46:35Z

> Btw, any idea what's going on with the LLVM script?

Huh, it worked yesterday. It complains about the distro version being 22.04.3, but that is the same version as yesterday...
Looking at the script's code, it fails at a wget to https://apt.llvm.org/jammy/, but that page loads locally for me...

iluuu1994 · 2023-11-11T21:57:21Z

@nielsdos It gave back a 403 just 10 minutes ago. I suppose they have (or had) server issues.

Zend/zend_builtin_functions.c

bwoebi · 2023-11-13T12:19:22Z

Zend/zend_builtin_functions.c

@@ -979,20 +971,41 @@ ZEND_FUNCTION(property_exists)
 	}
 	RETURN_FALSE;
 }
+
+/* {{{ Checks if the object or class has a property */
+ZEND_FUNCTION(property_exists)


To reduce repetition, is it realistic to write a macro like:

#define ZEND_FUNCTION_CALL_FRAMELESS(name, minArgs, maxArgs) \
	ZEND_FUNCTION(name) { \
		uint32_t args = ZEND_NUM_ARGS(); \
		if (args > maxArgs || args < minArgs) { /* handle arg count mismatch */ } \
		switch (args) { \
			/* here, do some macro magic to only compile the cases between min and maxargs, to avoid undefined function references */ \
			case 0: ZEND_FRAMELESS_FUNCTION_NAME(name, 0)(return_value); break; \
			case 1: ZEND_FRAMELESS_FUNCTION_NAME(name, 1)(return_value, EX_VAR(0)); break; \
			case 2: ZEND_FRAMELESS_FUNCTION_NAME(name, 2)(return_value, EX_VAR(0), EX_VAR(1)); break; \
			case 3: ZEND_FRAMELESS_FUNCTION_NAME(name, 3)(return_value, EX_VAR(0), EX_VAR(1), EX_VAR(2)); break; \
		} \
	}

This would make it more effortless and straightforward. It also probably avoids compiling the body of the function twice (and does not require an extra inline function either).
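
The dispatch idea behind the proposed macro can be shown without any engine types. The sketch below is a standalone model: `demo_max_1` to `demo_max_3` stand in for the per-arity handlers that `ZEND_FRAMELESS_FUNCTION_NAME(name, N)` would name, and `demo_max_call` plays the role of the generated generic entry point. All of these names are hypothetical.

```c
/* Fixed-arity handlers, one per supported argument count. */
long demo_max_1(long a) { return a; }
long demo_max_2(long a, long b) { return a > b ? a : b; }
long demo_max_3(long a, long b, long c)
{
    return demo_max_2(demo_max_2(a, b), c);
}

#define DEMO_MIN_ARGS 1u
#define DEMO_MAX_ARGS 3u

/* Generic entry point: validates the argument count once, then forwards
 * to the handler for that arity. Returns 0 on success and -1 on an
 * argument count mismatch. */
int demo_max_call(const long *args, unsigned argc, long *result)
{
    if (argc < DEMO_MIN_ARGS || argc > DEMO_MAX_ARGS) {
        return -1;
    }
    switch (argc) {
        case 1: *result = demo_max_1(args[0]); break;
        case 2: *result = demo_max_2(args[0], args[1]); break;
        case 3: *result = demo_max_3(args[0], args[1], args[2]); break;
    }
    return 0;
}
```

Only the cases between the minimum and maximum arity are written out here by hand, which is the part the comment suggests generating with "macro magic".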

dstogov

See the review comments.

Zend/zend_frameless_function.h

ext/opcache/jit/zend_jit_ir.c

Zend/zend_vm_def.h

Zend/zend_compile.c

dstogov · 2023-12-18T21:10:29Z

It looks like my idea with cache slot sharing doesn't make sense. In most cases after JMP_FRAMELESS, we will go to FRAMELESS_ICALL, so this really just introduces an additional load. Probably, it's better to revert the last commit. Sorry.

I like the idea of the patch and don't see technical problems, but unfortunately I don't see any visible speed improvement. Do you see any?

Symfony demo looks even a bit slower for me. This may be caused by unrelated code locality changes, but also because of the hot code size increase (PHP becomes more than 64K larger). I tried to inline ZEND_FRAMELESS handlers using ZEND_VM_HOT_HANDLER but this didn't make the benchmarks faster.

I noticed, the patch somehow enables more ASSIGN -> QM_ASSIGN optimizations (callgrind on Wordpress without JIT shows significantly less calls to ASSIGN_SPEC_CV_VAR/TMP). I didn't find the reason. Do you know this?

Also JMP_FRAMELESS switching may lead to code explosion.

<?php
namespace T;

function foo($a, $b) {
	return strpos(trim($a . "foo"), $b);
}

php -d opcache.opt_debug_level=0x20000 test.php

T\foo:
     ; (lines=26, args=2, vars=2, tmps=3)
     ; (after optimizer)
0000 CV0($a) = RECV 1
0001 CV1($b) = RECV 2
0002 JMP_FRAMELESS 8 string("t\\strpos") 0016
0003 INIT_NS_FCALL_BY_NAME 2 string("T\\strpos")
0004 JMP_FRAMELESS 24 string("t\\trim") 0010
0005 INIT_NS_FCALL_BY_NAME 1 string("T\\trim")
0006 T2 = CONCAT CV0($a) string("foo")
0007 SEND_VAL_EX T2 1
0008 V2 = DO_FCALL_BY_NAME
0009 JMP 0012
0010 T3 = CONCAT CV0($a) string("foo")
0011 V2 = FRAMELESS_ICALL_1(trim) T3
0012 SEND_VAR_NO_REF_EX V2 1
0013 SEND_VAR_EX CV1($b) 2
0014 V2 = DO_FCALL_BY_NAME
0015 RETURN V2
0016 JMP_FRAMELESS 40 string("t\\trim") 0022
0017 INIT_NS_FCALL_BY_NAME 1 string("T\\trim")
0018 T3 = CONCAT CV0($a) string("foo")
0019 SEND_VAL_EX T3 1
0020 V3 = DO_FCALL_BY_NAME
0021 JMP 0024
0022 T4 = CONCAT CV0($a) string("foo")
0023 V3 = FRAMELESS_ICALL_1(trim) T4
0024 V2 = FRAMELESS_ICALL_2(strpos) V3 CV1($b)
0025 RETURN V2

So, I'm not sure if this should be merged or not.
Do you see any remaining opportunities for improvement? (maybe restricting arguments to CV and CONST and disabling specialization to reduce the code size).
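
The growth visible in the dump above can be modelled with a small recurrence. In the dump, strpos(trim(...)) is two calls deep and the innermost CONCAT is emitted four times (0006, 0010, 0018, 0022), because each JMP_FRAMELESS compiles its arguments once for the fallback branch and once for the frameless branch. The function name below is hypothetical, and the model only counts copies of the innermost expression rather than total opcodes.

```c
/* Number of times the innermost argument expression is emitted when
 * every call at every nesting level gets both a fallback branch and a
 * frameless branch. `depth` is the number of nested calls and must stay
 * below the bit width of unsigned long to avoid overflow. */
unsigned long innermost_copies(unsigned depth)
{
    if (depth == 0) {
        return 1; /* the expression itself, compiled once */
    }
    /* Arguments are compiled once per branch, so the count doubles. */
    return 2 * innermost_copies(depth - 1);
}
```

Under this model the count doubles with each additional level of nesting.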

iluuu1994 · 2023-12-18T21:38:17Z

@dstogov

> Probably, it's better to revert the last commit. Sorry.

No worries.

> I noticed, the patch somehow enables more ASSIGN -> QM_ASSIGN optimizations (callgrind on Wordpress without JIT shows significantly less calls to ASSIGN_SPEC_CV_VAR/TMP). I didn't find the reason. Do you know this?

I don't know for sure, but it might be related to the fact that FRAMELESS_ICALL_n produce IS_TMP_VAR results rather than IS_VAR. Possibly, this enables more optimizations.

> Do you see any remaining opportunities for improvement?

In terms of speed this implements all the related ideas I had. I saw better results without JMP_FRAMELESS for Symfony, with calls adjusted to be fully qualified (fqn-functions branch). I will try to create some benchmarks tomorrow to see how much, and whether reducing specializations might help, potentially reducing instruction cache misses.

I saw a big improvement for frameless function calls in tight, hot loops. This might be good for some algorithms, but is admittedly more synthetic. I will try to give some concrete numbers here too.

> Also JMP_FRAMELESS switching may lead to code explosion.

Oof, I agree. Nesting frameless calls is bad. I haven't considered that.

@dstogov How do you measure real performance? I usually use hyperfine but on my computer the deviation can be quite large (at least 0.5%) which makes it hard to measure these kinds of optimizations.

dstogov · 2023-12-19T07:56:11Z

> > I noticed, the patch somehow enables more ASSIGN -> QM_ASSIGN optimizations (callgrind on Wordpress without JIT shows significantly less calls to ASSIGN_SPEC_CV_VAR/TMP). I didn't find the reason. Do you know this?

> I don't know for sure, but it might be related to the fact that FRAMELESS_ICALL_n produce IS_TMP_VAR results rather than IS_VAR. Possibly, this enables more optimizations.

It may make sense to investigate this and enable the optimization for even more cases.

> > Do you see any remaining opportunities for improvement?

> In terms of speed this implements all the related ideas I had. I saw better results without JMP_FRAMELESS for Symfony, with calls adjusted to be fully qualified (fqn-functions branch). I will try to create some benchmarks tomorrow to see how much, and whether reducing specializations might help, potentially reducing instruction cache misses.

Yes. I think the improvement may be achieved by reducing the code size.

> I saw a big improvement for frameless function calls in tight, hot loops. This might be good for some algorithms, but is admittedly more synthetic. I will try to give some concrete numbers here too.

Yes, please do.

> > Also JMP_FRAMELESS switching may lead to code explosion.

> Oof, I agree. Nesting frameless calls is bad. I haven't considered that.

Nothing is wrong. This is a rare case, and I'm not sure if the implementation may be improved to generate code for N*2 calls instead of N*N.

> @dstogov How do you measure real performance? I usually use hyperfine but on my computer the deviation can be quite large (at least 0.5%) which makes it hard to measure these kinds of optimizations.

For now I run just a single thread php -T1,1000 on symfony demo and wordpress (the deviation is really high; it's possible to reduce it by limiting CPU frequency scaling. It's also possible to pin the process to a specific CPU core through taskset -c 1).
I may rerun the same benchmarks in a multi back-end web environment through httpbench.

iluuu1994 · 2023-12-19T12:11:31Z

Disabling TurboBoost helps. Conveniently, this doesn't require a restart.

echo "1" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

I also used tuna and taskset to dedicate a specific CPU to the benchmark. This gives me deviations of <0.25%. It's not perfect, but better.

iluuu1994 · 2023-12-20T13:05:05Z

Ok, here are the results.

| | master (s) | direct-calls (s) | diff (%) | direct-calls-unspecialized (s) | diff (%) | unshared-cache (s) | diff (%) | unshared-cache-unspecialized (s) | diff (%) |
|---|---|---|---|---|---|---|---|---|---|
| bench.php | 44.31177 | 43.837141 | -1.0711 | 43.822742 | -1.1036 | 44.574883 | 0.5937 | 43.815165 | -1.1207 |
| bench.php+jit | 14.999982 | 14.990234 | -0.0649 | 15.027456 | 0.1831 | 15.064637 | 0.4310 | 14.993221 | -0.0450 |
| symfony | 30.243476 | 30.05604 | -0.6197 | 30.121743 | -0.4025 | 29.904663 | -1.1202 | 29.908698 | -1.1069 |
| symfony+jit | 29.249913 | 28.96279 | -0.9816 | 29.011227 | -0.8160 | 28.743947 | -1.7298 | 28.729461 | -1.7793 |
| symfony+fqn | 30.252826 | 29.894223 | -1.1853 | 30.029101 | -0.7395 | 29.795143 | -1.5128 | 29.801449 | -1.4920 |
| symfony+fqn+jit | 29.2487 | 28.941631 | -1.0498 | 28.884016 | -1.2468 | 28.906286 | -1.1706 | 29.034026 | -0.7339 |
| average | | | -0.8287 | | -0.6875 | | -0.7514 | | -1.0463 |

From this benchmark, it does seem that the version without the shared cache slot, and without specializations offers the best trade-offs. The improvement isn't huge but it's measurable. Most significantly, it seems the code is faster in all variants, including with non-fqn function calls (i.e. JMP_FRAMELESS). bench.php was performed with more iterations for higher accuracy.

As for hot loops, I tested this by running sebastianbergmann/diff on a 5 000 line file, but without sebastianbergmann/diff#118, and by implementing a specialized version of max(a2).

| | master (ms) | unshared-cache-unspecialized (ms) | diff (%) |
|---|---|---|---|
| plain | 741.5 | 620.5 | -16.3182 |
| jit | 335.3 | 307.8 | -8.2016 |
| fqn | 670.5 | 574.6 | -14.3027 |
| fqn+jit | 319.3 | 298.5 | -6.5142 |

As expected, the difference is bigger when frameless calls may be used in hot loops.

Just to clarify, master here refers to the base of the branch (c00b0c7477 when tests were performed).
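
For readers checking the numbers: the diff (%) columns are consistent with the relative change against master, and the unlabeled last row of the first table with the arithmetic mean of each diff column. The helper names below are hypothetical; the inputs in the usage note are taken from the tables above.

```c
#include <stddef.h>

/* Relative change of `candidate` against `base`, in percent.
 * Negative values mean the candidate run took less time. */
double diff_percent(double base, double candidate)
{
    return (candidate - base) / base * 100.0;
}

/* Arithmetic mean of n values; returns 0.0 for an empty input. */
double mean(const double *xs, size_t n)
{
    if (n == 0) {
        return 0.0;
    }
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += xs[i];
    }
    return sum / (double)n;
}
```

For example, `diff_percent(44.31177, 43.837141)` gives roughly -1.0711 (bench.php, direct-calls), and `diff_percent(741.5, 620.5)` gives roughly -16.3182 (the plain hot-loop run).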

@dstogov Do you think this is enough to justify the added complexity?

dstogov · 2023-12-25T09:41:34Z

> @dstogov Do you think this is enough to justify the added complexity?

I don't have a strong opinion, but I don't object. I see - this improves performance!

@bwoebi @derickr this may cause trouble for debuggers/profilers, and extending this with support for the observer API may remove all the gain. Will you need any special support?

derickr · 2024-01-31T09:54:48Z

It doesn't seem that this patch introduces any breaking changes to Xdebug. At least, all the tests pass (although ABI changed, so I had segfaults before I recompiled Xdebug).

dstogov · 2024-02-05T08:03:56Z

Let's land this.
Let me know if I should take a final look after the merge conflict resolution.

Opcodes grow exponentially for nested calls.

iluuu1994 · 2024-02-05T14:04:37Z

@dstogov There were only conflicts in generated files. I did add a CG(context) flag that avoids compiling nested JMP_FRAMELESS branches to avoid blowing up opcode count, as you've hinted to me when we last discussed this PR.
64ec25c

If that looks fine to you, I'll merge this PR.
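
A rough model of how such a guard limits the growth, under stated assumptions: the `compile_ctx` struct and its `in_jmp_frameless_branch` field are hypothetical stand-ins for the CG(context) flag mentioned above, and the sketch assumes that a call compiled while the flag is set is emitted as a plain call with its arguments compiled once. It counts how often the innermost argument expression is emitted.

```c
#include <stdbool.h>

/* Hypothetical compile context; only the flag matters for this sketch. */
typedef struct {
    bool in_jmp_frameless_branch;
} compile_ctx;

/* Returns how many times the innermost argument expression is emitted
 * for `depth` nested calls. With `use_guard` false the flag is ignored,
 * which models the behaviour before the fix. */
unsigned long emit_nested_call(compile_ctx *ctx, unsigned depth, bool use_guard)
{
    if (depth == 0) {
        return 1; /* the leaf expression itself */
    }
    if (use_guard && ctx->in_jmp_frameless_branch) {
        /* Already inside a branch: plain call, arguments compiled once. */
        return emit_nested_call(ctx, depth - 1, use_guard);
    }
    bool saved = ctx->in_jmp_frameless_branch;
    ctx->in_jmp_frameless_branch = true;
    /* Arguments are compiled once for each of the two branches. */
    unsigned long fallback  = emit_nested_call(ctx, depth - 1, use_guard);
    unsigned long frameless = emit_nested_call(ctx, depth - 1, use_guard);
    ctx->in_jmp_frameless_branch = saved;
    return fallback + frameless;
}
```

In this model only the outermost call is duplicated, so the count stays at two regardless of nesting depth, at the price of nested calls not taking the frameless fast path.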

dstogov · 2024-02-06T06:30:59Z

go forward

iluuu1994 · 2024-02-06T16:43:40Z

@dstogov Great! Thank you for all your reviews!

MaxKellermann · 2024-07-05T10:19:43Z

This PR adds struct forward declarations, but a supermajority of voters said that it should not be "allowed to forward-declare structs/unions/typedefs": https://wiki.php.net/rfc/include_cleanup
As it violates the ruling of an RFC vote, this PR should not have been merged.

nielsdos · 2024-07-05T10:21:30Z

> This PR adds struct forward declarations, but a supermajority of voters said that it should not be "allowed to forward-declare structs/unions/typedefs": https://wiki.php.net/rfc/include_cleanup
> As it violates the ruling of an RFC vote, this PR should not have been merged.

Secondary votes only matter when the primary vote is accepted, otherwise they take no effect.

MaxKellermann · 2024-07-05T10:33:15Z

> Secondary votes only matter when the primary vote is accepted, otherwise they take no effect.

That is formally correct, but it ignores the will of the PHP community, even if it is not formally binding.
But if you want to talk about formal correctness: this PR neither adds a test, nor fixes a bug, nor implements an RFC. That alone means it should not have been merged, for formal reasons.

nielsdos · 2024-07-05T10:40:31Z

I suppose you refer to this part of the contributing guidelines?

> PHP welcomes pull requests to add tests, fix bugs and to implement RFCs. Please be sure to include tests as appropriate!

You're right that it doesn't list anything other than tests, fixes and RFCs. However, "welcomes" doesn't mean "allows only".

MaxKellermann · 2024-07-05T10:51:33Z

> I suppose you refer to this part of the contributing guidelines?

That, and: "New features require an RFC" (README.md). This PR clearly implements a new feature.

> However, "welcomes" doesn't mean "allows only".

If you look at just CONTRIBUTING.md, you can argue that, because the wording there is indeed weak; but it was argued differently when it was about my code submissions (which, unlike this PR, did not implement a new feature; it did not even add any actual code; thus the stronger wording in README.md did not even apply to me).

nielsdos · 2024-07-05T11:04:58Z

The wording in README seems to be inconsistent with the actual applied practices and should be corrected.

An RFC is required when it's a non-self-contained feature or when there is opposition against the change, although I admit that is subjective. E.g. looking at the mailing list, you can every so often see an email asking if anyone opposes a feature, usually followed by a two-week wait. When someone raises an issue, the author definitely has to go through the RFC process, even if it's a small self-contained feature.

The most recent source talking about RFC-requiring vs non-RFC-requiring features is this: https://wiki.php.net/rfc/release_cycle_update

MaxKellermann · 2024-07-05T11:22:14Z

But the (strong) wording in README is consistent with the (weak) wording in CONTRIBUTING. If README needs corrections, then CONTRIBUTING needs corrections as well, which is something I also tried (to allow PRs exactly like this one without an RFC!), but failed.

My real point here (before you dragged me into lawyering) was: when I submitted struct forward declarations, the whole idea was rejected, there was a supermajority vote against it, and my already-merged commits were even reverted (by one guy, without a vote). Now this PR does essentially the same, and surprisingly the exact same guy approves it. That is not just subjective (which would be okay), but also inconsistent and arbitrary.

We can agree or disagree on coding styles, that's okay, that's personal taste. But if, after a long discussion, a certain coding style is ultimately rejected (and even reverted), it should be rejected for everybody. It was not rejected here, that's why I complained.

nielsdos · 2024-07-05T11:40:05Z

> But the (strong) wording in README is consistent with the (weak) wording in CONTRIBUTING. If README needs corrections, then CONTRIBUTING needs corrections as well, which is something I also tried (to allow PRs exactly like this one without an RFC!), but failed.

I agree both places need their wordings adjusted.

> My real point here (before you dragged me into lawyering) was: when I submitted struct forward declarations, the whole idea was rejected, there was a supermajority vote against it, and my already-merged commits were even reverted (by one guy, without a vote).

I think this happened when I first started contributing, or at least before I started following discussions closely, so I don't know/remember all the details here. I can only say that the secondary votes of your RFC specified how to implement the include cleanup, and as part of that struct forward declarations were rejected. However, secondary votes only matter if primary votes are accepted. Note that even if the primary vote was accepted, my understanding is that it would only apply to include cleanup, so forward declarations would've only been disallowed for include cleanup in that case.

> Now this PR does essentially the same, and surprisingly the exact same guy approves it. That is not just subjective (which would be okay), but also inconsistent and arbitrary.

I can't comment on this part though.

> We can agree or disagree on coding styles, that's okay, that's personal taste. But if, after a long discussion, a certain coding style is ultimately rejected (and even reverted), it should be rejected for everybody. It was not rejected here, that's why I complained.

The include cleanup commits were reverted because you were forced to go through the RFC process after some developers didn't like it. Although the RFC process isn't ideal for these cases because it's a purely technical change instead of a user-facing change.

That said, coding style should be enforced consistently yes.

Bilge · 2024-11-25T02:16:49Z

Zend/Optimizer/zend_call_graph.h

@@ -38,6 +38,7 @@ struct _zend_call_info {
 	bool               send_unpack;  /* Parameters passed by SEND_UNPACK or SEND_ARRAY */
 	bool               named_args;   /* Function has named arguments */
 	bool               is_prototype; /* An overridden child method may be called */
+	bool               is_frameless; /* A frameless function sends arguments through operands */


Do we really mean operands here and not opcodes?

Yes, we do mean operands.
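
The distinction can be illustrated with a toy opline model. In the opcode dump earlier in the thread, a regular call is spread over INIT_NS_FCALL_BY_NAME, SEND_* and DO_FCALL_BY_NAME, while `V2 = FRAMELESS_ICALL_2(strpos) V3 CV1($b)` carries both arguments as operands of a single opline. The `opline` struct, the opcode enum and the emit functions below are simplified, hypothetical stand-ins and not the engine's zend_op.

```c
#include <stddef.h>

typedef enum {
    OP_INIT_FCALL,
    OP_SEND_VAL,
    OP_DO_FCALL,
    OP_FRAMELESS_ICALL_2
} opcode;

/* Simplified opline: operands are variable slot numbers, -1 if unused. */
typedef struct {
    opcode op;
    int    op1;
    int    op2;
    int    result;
} opline;

/* Regular call: each argument travels through its own SEND opcode.
 * `out` must have room for 4 oplines; returns the number written. */
size_t emit_regular_call2(opline *out, int a, int b, int result)
{
    out[0] = (opline){ OP_INIT_FCALL, -1, -1, -1 };
    out[1] = (opline){ OP_SEND_VAL, a, -1, -1 };
    out[2] = (opline){ OP_SEND_VAL, b, -1, -1 };
    out[3] = (opline){ OP_DO_FCALL, -1, -1, result };
    return 4;
}

/* Frameless call: both arguments are operands of the call opline itself.
 * `out` must have room for 1 opline; returns the number written. */
size_t emit_frameless_call2(opline *out, int a, int b, int result)
{
    out[0] = (opline){ OP_FRAMELESS_ICALL_2, a, b, result };
    return 1;
}
```

In this model the arguments of a frameless call never appear in separate SEND oplines, which is what the comment on `is_frameless` describes.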

github-actions bot added Category: Build System Category: Engine Extension: standard Category: Optimizer labels Oct 17, 2023

dstogov reviewed Oct 23, 2023

iluuu1994 force-pushed the direct-calls branch 2 times, most recently from 60101e6 to bb24c28 Compare October 27, 2023 10:00

iluuu1994 added a commit that referenced this pull request Oct 27, 2023

Split complex regexes to multiple lines in zend_vm_gen.php

964e9d8

As requested by Dmitry in GH-12461 for easier reviewing.

iluuu1994 force-pushed the direct-calls branch from bb24c28 to 9708901 Compare October 27, 2023 16:17

github-actions bot added the Extension: opcache label Oct 28, 2023

iluuu1994 force-pushed the direct-calls branch from 6db666c to 4dfd4df Compare October 31, 2023 20:44

iluuu1994 mentioned this pull request Oct 31, 2023

Split strtr zpp #12583

Merged

iluuu1994 force-pushed the direct-calls branch from 75c9f74 to fc6452a Compare November 1, 2023 14:14

dstogov reviewed Nov 2, 2023

github-actions bot added the Extension: zend_test label Nov 2, 2023

nielsdos reviewed Nov 4, 2023

bwoebi reviewed Nov 4, 2023

iluuu1994 force-pushed the direct-calls branch 3 times, most recently from dfc04c0 to 76ac618 Compare November 10, 2023 15:24

github-actions bot added the Extension: pcre label Nov 10, 2023

bwoebi reviewed Nov 13, 2023

bwoebi reviewed Nov 13, 2023

iluuu1994 changed the title ~~[PoC] Implement stackless internal function calls~~ Implement stackless internal function calls Nov 13, 2023

dstogov reviewed Dec 18, 2023

iluuu1994 force-pushed the direct-calls branch from 33e41e7 to 4e73214 Compare December 21, 2023 10:18

iluuu1994 force-pushed the direct-calls branch from 4e73214 to b183904 Compare February 5, 2024 13:38

Implement stackless internal function calls

8a9d933

iluuu1994 force-pushed the direct-calls branch from b183904 to 8a9d933 Compare February 5, 2024 13:46

Avoid nesting jmp_frameless branches

64ec25c

Opcodes grow exponentially for nested calls.

iluuu1994 mentioned this pull request Feb 5, 2024

Implement "support doc comments for internal classes and functions" #13266

Merged

iluuu1994 closed this in 631bc81 Feb 6, 2024

javiereguiluz mentioned this pull request Mar 8, 2024

Change the way built-in functions are resolved #13632

Closed

Bilge reviewed Nov 25, 2024
