Implement frameless internal function calls #12461

Closed
wants to merge 2 commits into from

iluuu1994
Member

@iluuu1994 iluuu1994 commented Oct 17, 2023

  • Coercion. Coercion happens in-place. However, since args to frameless functions aren't copied, coercing them in-place can escape the function boundary (for VAR|CONST). We should either copy these values in the handler for these op types, or handle the coercion+freeing in the functions. Edit: This was only relevant to strings for the current frameless handlers. It was solved by creating a temporary zval that may be used if the parameter is not a string. After the function, the zval must be cleaned up if used, which unfortunately requires manual handling in the frameless handler.
  • We probably need observer support. For now, frameless calls are disabled when there's an observer extension.
  • JIT support
    • Tracing
    • Function
  • Try arity-completion (provide default-args for handlers with larger arity)
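
The temporary-zval approach from the coercion item can be illustrated outside of php-src. The following is a minimal sketch in plain C, assuming a hypothetical `value` type standing in for a zval and hypothetical helpers `arg_as_string` and `frameless_strlen`; none of these names exist in the engine. It shows the argument being read through a pointer, converted into a temporary only when it is not a string, and the temporary being freed by the handler afterwards.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a zval: either an integer or a C string. */
typedef enum { VAL_INT, VAL_STRING } val_type;

typedef struct {
    val_type type;
    long     num;   /* valid when type == VAL_INT */
    char    *str;   /* valid when type == VAL_STRING */
} value;

/* Hypothetical helper: returns a string view of `arg` without modifying it.
 * If `arg` is not a string, the converted string is stored in `tmp` and
 * *tmp_used is set so the caller knows it has to free it. */
static const char *arg_as_string(const value *arg, value *tmp, int *tmp_used)
{
    if (arg->type == VAL_STRING) {
        *tmp_used = 0;
        return arg->str;
    }
    char buf[32];
    snprintf(buf, sizeof buf, "%ld", arg->num);
    tmp->type = VAL_STRING;
    tmp->num = 0;
    tmp->str = malloc(strlen(buf) + 1);
    assert(tmp->str != NULL);
    strcpy(tmp->str, buf);
    *tmp_used = 1;
    return tmp->str;
}

/* Frameless-style handler: it receives the caller's value by pointer (no
 * copy), so coercing `arg` in place would be visible to the caller. */
static size_t frameless_strlen(const value *arg)
{
    value tmp;
    int tmp_used;
    const char *s = arg_as_string(arg, &tmp, &tmp_used);
    size_t len = strlen(s);
    if (tmp_used) {
        free(tmp.str);  /* manual cleanup of the temporary */
    }
    return len;
}
```

Calling `frameless_strlen` with an integer value leaves that value an integer in the caller, which is the property the in-place coercion would have broken.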

Member

@dstogov dstogov left a comment

I like the idea and the implementation is mostly good.
I don't like the compile time restrictions. I think it shouldn't be a big problem to get rid of them.

I would also recommend adding frameless handlers for trivial and often used functions: trim(), ord(), chr(), strtolower().

@iluuu1994 iluuu1994 force-pushed the direct-calls branch 2 times, most recently from 60101e6 to bb24c28 Compare October 27, 2023 10:00
iluuu1994 added a commit that referenced this pull request Oct 27, 2023
As requested by Dmitry in GH-12461 for easier reviewing.
@iluuu1994 iluuu1994 mentioned this pull request Oct 31, 2023
@staabm
Contributor

staabm commented Nov 1, 2023

Do these frameless functions influence how stack limit works (e.g. max callstack size exceeded)?

@iluuu1994
Member Author

@staabm This doesn't significantly affect stack limits. Recursion with VM reentry is subject to C stack overflow. Userspace recursion doesn't have a stack limit so much as a memory limit.

Member

@dstogov dstogov left a comment

I don't see major problems.
I would prefer some better naming against "frameless" and "flf".

JIT support shouldn't be a big problem. See comments.
In case of an observer you may disable compilation to ZEND_FRAMELESS_ICALL_*.

@iluuu1994 iluuu1994 force-pushed the direct-calls branch 3 times, most recently from dfc04c0 to 76ac618 Compare November 10, 2023 15:24
@nielsdos
Copy link
Member

I wonder whether the reverse way should be done for RCN marking: instead of marking what args may increase the refcount, mark what args will never increase the refcount and assume RCN otherwise. That's safer in case the implementer forgets. OTOH it's more boilerplate. Just something to think about.

@iluuu1994
Member Author

@nielsdos I agree. I will change this so that omitting the flags sets it to -1. Btw, any idea what's going on with the LLVM script?
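
The "safe by default" marking discussed here can be sketched as a bit mask whose unspecified value is -1, so that every argument is assumed to possibly gain references (RCN) unless declared otherwise. This is a minimal sketch under stated assumptions: `flf_info`, `FLF_RC_UNSPECIFIED`, and both helper functions are hypothetical names for illustration, not the php-src API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the php-src API. */
#define FLF_RC_UNSPECIFIED UINT32_MAX   /* "-1": every bit set */

typedef struct {
    const char *name;
    uint32_t    arity;
    uint32_t    may_inc_refcount; /* bit i set: arg i may gain references */
} flf_info;

/* The implementer lists the arguments that are never retained; every other
 * argument keeps its bit set and is therefore treated as RCN. */
static uint32_t flf_mask_from_never_retained(uint32_t never_retained_bits)
{
    return ~never_retained_bits;
}

/* Returns 1 if argument `arg_index` has to be assumed to gain references. */
static int flf_arg_may_inc_refcount(const flf_info *info, uint32_t arg_index)
{
    assert(arg_index < 32);
    return (int)((info->may_inc_refcount >> arg_index) & 1u);
}
```

With this layout, a handler whose author forgot the annotation ends up with `FLF_RC_UNSPECIFIED`, which is pessimistic but never unsafe.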

@nielsdos
Member

Btw, any idea what's going on with the LLVM script?

Huh, it worked yesterday. It complains about the distro version being 22.04.3, but that is the same version as yesterday...
Looking at the script's code, it fails at a wget to https://apt.llvm.org/jammy/, but that page loads locally for me...

@iluuu1994
Member Author

@nielsdos It gave back a 403 just 10 minutes ago. I suppose they have (or had) server issues.

@@ -979,20 +971,41 @@ ZEND_FUNCTION(property_exists)
}
RETURN_FALSE;
}

/* {{{ Checks if the object or class has a property */
ZEND_FUNCTION(property_exists)
Member

@bwoebi bwoebi Nov 13, 2023

To reduce repetition, is it realistic to write a macro like:

#define ZEND_FUNCTION_CALL_FRAMELESS(name, minArgs, maxArgs) \
ZEND_FUNCTION(name) { \
    uint32_t args = ZEND_NUM_ARGS(); \
    if (args > maxArgs || args < minArgs) { /* handle arg count mismatch */ } \
    switch (args) { \
        /* here, do some macro magic to only compile the cases between min and maxargs, to avoid undefined function references */ \
        case 0: ZEND_FRAMELESS_FUNCTION_NAME(name, 0)(return_value); break; \
        case 1: ZEND_FRAMELESS_FUNCTION_NAME(name, 1)(return_value, EX_VAR(0)); break; \
        case 2: ZEND_FRAMELESS_FUNCTION_NAME(name, 2)(return_value, EX_VAR(0), EX_VAR(1)); break; \
        case 3: ZEND_FRAMELESS_FUNCTION_NAME(name, 3)(return_value, EX_VAR(0), EX_VAR(1), EX_VAR(2)); break; \
    } \
}

Making it more effortless and straightforward. Also probably avoids compiling the body of the function twice (not requiring an extra inline function too).

@iluuu1994 iluuu1994 changed the title from "[PoC] Implement stackless internal function calls" to "Implement stackless internal function calls" Nov 13, 2023
Member

@dstogov dstogov left a comment

See the review comments.

@dstogov
Member

dstogov commented Dec 18, 2023

It looks like my idea with cache slot sharing doesn't make sense. In most cases after JMP_FRAMELESS, we will go to FRAMELESS_ICALL, so this really just introduces an additional load. Probably, it's better to revert the last commit. Sorry.

I like the idea of the patch and don't see technical problems, but unfortunately I don't see any visible speed improvement. Do you see?

Symfony demo looks even a bit slower for me. This may be caused by unrelated code locality changes, but also because of the hot code size increase (PHP becomes more than 64K larger). I tried to inline ZEND_FRAMELESS handlers using ZEND_VM_HOT_HANDLER but this didn't make the benchmarks faster.

I noticed, the patch somehow enables more ASSIGN -> QM_ASSIGN optimizations (callgrind on Wordpress without JIT shows significantly less calls to ASSIGN_SPEC_CV_VAR/TMP). I didn't find the reason. Do you know this?

Also JMP_FRAMELESS switching may lead to code explosion.

<?php
namespace T;

function foo($a, $b) {
	return strpos(trim($a . "foo"), $b);
}

php -d opcache.opt_debug_level=0x20000 test.php

T\foo:
     ; (lines=26, args=2, vars=2, tmps=3)
     ; (after optimizer)
0000 CV0($a) = RECV 1
0001 CV1($b) = RECV 2
0002 JMP_FRAMELESS 8 string("t\\strpos") 0016
0003 INIT_NS_FCALL_BY_NAME 2 string("T\\strpos")
0004 JMP_FRAMELESS 24 string("t\\trim") 0010
0005 INIT_NS_FCALL_BY_NAME 1 string("T\\trim")
0006 T2 = CONCAT CV0($a) string("foo")
0007 SEND_VAL_EX T2 1
0008 V2 = DO_FCALL_BY_NAME
0009 JMP 0012
0010 T3 = CONCAT CV0($a) string("foo")
0011 V2 = FRAMELESS_ICALL_1(trim) T3
0012 SEND_VAR_NO_REF_EX V2 1
0013 SEND_VAR_EX CV1($b) 2
0014 V2 = DO_FCALL_BY_NAME
0015 RETURN V2
0016 JMP_FRAMELESS 40 string("t\\trim") 0022
0017 INIT_NS_FCALL_BY_NAME 1 string("T\\trim")
0018 T3 = CONCAT CV0($a) string("foo")
0019 SEND_VAL_EX T3 1
0020 V3 = DO_FCALL_BY_NAME
0021 JMP 0024
0022 T4 = CONCAT CV0($a) string("foo")
0023 V3 = FRAMELESS_ICALL_1(trim) T4
0024 V2 = FRAMELESS_ICALL_2(strpos) V3 CV1($b)
0025 RETURN V2

So, I'm not sure if this should be merged or not.
Do you see any remaining opportunities for improvement? (may be restricting arguments to CV and CONST and disabling specialization to reduce the code size).
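
The growth visible in the dump above can be approximated with a small counting model. This is a sketch under assumptions, not a description of the compiler: it assumes each unqualified namespaced call emits one fallback branch and one frameless branch, and that each branch recompiles its argument expression, including any nested call. The second function assumes a variant in which only the outermost call branches; both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Call instructions (DO_FCALL_BY_NAME + FRAMELESS_ICALL_n) emitted for
 * `depth` nested calls when every nesting level emits both branches. */
static uint64_t call_sites_branch_every_level(unsigned depth)
{
    assert(depth < 63);            /* keep the doubling within 64 bits */
    uint64_t total = 0;
    uint64_t copies = 1;           /* how often this level gets compiled */
    for (unsigned level = 0; level < depth; level++) {
        total += 2 * copies;       /* one fallback + one frameless call per copy */
        copies *= 2;               /* both branches recompile the nested argument */
    }
    return total;
}

/* Assumed alternative: only the outermost call branches, and each of its two
 * branches contains the `depth` calls once, in a straight line. */
static uint64_t call_sites_branch_outermost_only(unsigned depth)
{
    return 2 * (uint64_t)depth;
}
```

For the two-level `strpos(trim(...))` example the first function gives 6, matching the three DO_FCALL_BY_NAME and three FRAMELESS_ICALL_n instructions in the dump; at depth 3 the model gives 14 versus 6.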

@iluuu1994
Member Author

@dstogov

Probably, it's better to revert the last commit. Sorry.

No worries.

I noticed, the patch somehow enables more ASSIGN -> QM_ASSIGN optimizations (callgrind on Wordpress without JIT shows significantly less calls to ASSIGN_SPEC_CV_VAR/TMP). I didn't find the reason. Do you know this?

I don't know for sure, but it might be related to the fact that FRAMELESS_ICALL_n produce IS_TMP_VAR results rather than IS_VAR. Possibly, this enables more optimizations.

Do you see any remaining opportunities for improvement?

In terms of speed this implements all the related ideas I had. I saw better results without JMP_FRAMELESS for Symfony, with calls adjusted to be fully qualified (fqn-functions branch). I will try to create some benchmarks tomorrow to see how much, and whether reducing specializations might help, potentially reducing instruction cache misses.

I saw a big improvement for frameless function calls in tight, hot loops. This might be good for some algorithms, but is admittedly more synthetic. I will try to give some concrete numbers here too.

Also JMP_FRAMELESS switching may lead to code explosion.

Oof, I agree. Nesting frameless calls is bad. I haven't considered that.

@dstogov How do you measure real performance? I usually use hyperfine but on my computer the deviation can be quite large (at least 0.5%) which makes it hard to measure these kinds of optimizations.

@dstogov
Member

dstogov commented Dec 19, 2023

I noticed, the patch somehow enables more ASSIGN -> QM_ASSIGN optimizations (callgrind on Wordpress without JIT shows significantly less calls to ASSIGN_SPEC_CV_VAR/TMP). I didn't find the reason. Do you know this?

I don't know for sure, but it might be related to the fact that FRAMELESS_ICALL_n produce IS_TMP_VAR results rather than IS_VAR. Possibly, this enables more optimizations.

It may make sense to investigate this and enable the optimization for even more cases.

Do you see any remaining opportunities for improvement?

In terms of speed this implements all the related ideas I had. I saw better results without JMP_FRAMELESS for Symfony, with calls adjusted to be fully qualified (fqn-functions branch). I will try to create some benchmarks tomorrow to see how much, and whether reducing specializations might help, potentially reducing instruction cache misses.

yes. I think the improvement may be achieved through reducing the code size.

I saw a big improvement for frameless function calls in tight, hot loops. This might be good for some algorithms, but is admittedly more synthetic. I will try to give some concrete numbers here too.

yes. please do.

Also JMP_FRAMELESS switching may lead to code explosion.

Oof, I agree. Nesting frameless calls is bad. I haven't considered that.

Nothing is wrong. This is a rare case, and I'm not sure if the implementation may be improved to generate code for N2 calls instead of NN.

@dstogov How do you measure real performance? I usually use hyperfine but on my computer the deviation can be quite large (at least 0.5%) which makes it hard to measure these kinds of optimizations.

For now I just run a single-threaded php -T1,1000 on the Symfony demo and WordPress. The deviation is really high; it can be reduced by limiting CPU frequency scaling, and the process can be tied to a specific CPU core with taskset -c 1.
I may rerun the same benchmarks in multi back-end web environment through httpbench.

@iluuu1994
Member Author

Disabling TurboBoost helps. Conveniently, this doesn't require a restart.

echo "1" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

I also used tuna and taskset to dedicate a specific CPU to the benchmark. This gives me deviations of <0.25%. It's not perfect, but better.

@iluuu1994
Member Author

iluuu1994 commented Dec 20, 2023

Ok, here are the results.

| | master (s) | direct-calls (s) | diff (%) | direct-calls-unspecialized (s) | diff (%) | unshared-cache (s) | diff (%) | unshared-cache-unspecialized (s) | diff (%) |
|---|---|---|---|---|---|---|---|---|---|
| bench.php | 44.31177 | 43.837141 | -1.0711 | 43.822742 | -1.1036 | 44.574883 | 0.5937 | 43.815165 | -1.1207 |
| bench.php+jit | 14.999982 | 14.990234 | -0.0649 | 15.027456 | 0.1831 | 15.064637 | 0.4310 | 14.993221 | -0.0450 |
| symfony | 30.243476 | 30.05604 | -0.6197 | 30.121743 | -0.4025 | 29.904663 | -1.1202 | 29.908698 | -1.1069 |
| symfony+jit | 29.249913 | 28.96279 | -0.9816 | 29.011227 | -0.8160 | 28.743947 | -1.7298 | 28.729461 | -1.7793 |
| symfony+fqn | 30.252826 | 29.894223 | -1.1853 | 30.029101 | -0.7395 | 29.795143 | -1.5128 | 29.801449 | -1.4920 |
| symfony+fqn+jit | 29.2487 | 28.941631 | -1.0498 | 28.884016 | -1.2468 | 28.906286 | -1.1706 | 29.034026 | -0.7339 |
| mean | | | -0.8287 | | -0.6875 | | -0.7514 | | -1.0463 |

From this benchmark, it does seem that the version without the shared cache slot, and without specializations offers the best trade-offs. The improvement isn't huge but it's measurable. Most significantly, it seems the code is faster in all variants, including with non-fqn function calls (i.e. JMP_FRAMELESS). bench.php was performed with more iterations for higher accuracy. As for hot loops I tested this by running sebastianbergmann/diff on a 5 000 line file, but without sebastianbergmann/diff#118, and by implementing a specialized version of max(a2).

| | master (ms) | unshared-cache-unspecialized (ms) | diff (%) |
|---|---|---|---|
| plain | 741.5 | 620.5 | -16.3182 |
| jit | 335.3 | 307.8 | -8.2016 |
| fqn | 670.5 | 574.6 | -14.3027 |
| fqn+jit | 319.3 | 298.5 | -6.5142 |

As expected, the difference is bigger when frameless calls may be used in hot loops.

Just to clarify, master here refers to the base of the branch (c00b0c7477 when tests were performed).
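
For readers who want to recompute the diff (%) columns, the figures are consistent with the relative change of each branch against master, and the last row of the first table with the arithmetic mean of each diff column. A minimal sketch, assuming that reading; the function names `relative_diff_percent` and `mean_of` are hypothetical and not part of any benchmark tooling used in this PR.

```c
#include <assert.h>
#include <stddef.h>

/* Relative change of `candidate` against `base`, in percent.
 * Negative values mean the candidate took less time than the base. */
static double relative_diff_percent(double base, double candidate)
{
    assert(base != 0.0);
    return (candidate - base) / base * 100.0;
}

/* Arithmetic mean of `count` values; returns 0.0 for an empty input. */
static double mean_of(const double *values, size_t count)
{
    double sum = 0.0;
    for (size_t i = 0; i < count; i++) {
        sum += values[i];
    }
    return count ? sum / (double)count : 0.0;
}
```

For example, `relative_diff_percent(741.5, 620.5)` is roughly -16.32, in line with the "plain" row of the hot-loop table.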

@dstogov Do you think this is enough to justify the added complexity?

@dstogov
Member

dstogov commented Dec 25, 2023

@dstogov Do you think this is enough to justify the added complexity?

I don't have a strong opinion, but I don't object. I see that this improves performance!

@bwoebi @derickr this may cause trouble for debuggers/profilers, and extending this with support for the observer API may remove all the gain. Will you need any special support?

@derickr
Member

derickr commented Jan 31, 2024

It doesn't seem that this patch introduces any breaking changes to Xdebug. At least, all the tests pass (although ABI changed, so I had segfaults before I recompiled Xdebug).

@dstogov
Member

dstogov commented Feb 5, 2024

Let's land this.
Let me know, if I should take a final look after the merge conflict resolution.

Opcodes grow exponentially for nested calls.
@iluuu1994
Member Author

@dstogov There were only conflicts in generated files. I did add a CG(context) flag that avoids compiling nested JMP_FRAMELESS branches to avoid blowing up opcode count, as you've hinted to me when we last discussed this PR.
64ec25c

If that looks fine to you, I'll merge this PR.

@dstogov
Member

dstogov commented Feb 6, 2024

go forward

@iluuu1994 iluuu1994 closed this in 631bc81 Feb 6, 2024
@iluuu1994
Member Author

@dstogov Great! Thank you for all your reviews!

@MaxKellermann
Contributor

This PR adds struct forward declarations, but a supermajority of voters said that it should not be "allowed to forward-declare structs/unions/typedefs": https://wiki.php.net/rfc/include_cleanup
As it violates the ruling of an RFC vote, this PR should not have been merged.

@nielsdos
Member

nielsdos commented Jul 5, 2024

This PR adds struct forward declarations, but a supermajority of voters said that it should not be "allowed to forward-declare structs/unions/typedefs": https://wiki.php.net/rfc/include_cleanup
As it violates the ruling of an RFC vote, this PR should not have been merged.

Secondary votes only matter when the primary vote is accepted, otherwise they take no effect.

@MaxKellermann
Contributor

Secondary votes only matter when the primary vote is accepted, otherwise they take no effect.

That is formally correct, but it ignores the will of the PHP community, even if it is not formally binding.
But if you want to talk about formal correctness: this PR neither adds a test, nor fixes a bug, nor implements an RFC. That alone means it should not have been merged, for formal reasons.

@nielsdos
Member

nielsdos commented Jul 5, 2024

I suppose you refer to this part of the contributing guidelines?

PHP welcomes pull requests to add tests, fix bugs and to implement RFCs. Please be sure to include tests as appropriate!

You're right that it doesn't list anything other than tests, fixes and RFCs. However, "welcomes" doesn't mean "allows only".

@MaxKellermann
Contributor

I suppose you refer to this part of the contributing guidelines?

That, and: "New features require an RFC" (README.md). This PR clearly implements a new feature.

However, "welcomes" doesn't mean "allows only".

If you look at just CONTRIBUTING.md, you can argue that, because the wording there is indeed weak; but it was argued differently when it was about my code submissions (which, unlike this PR, did not implement a new feature; they did not even add any actual code; thus the stronger wording in README.md did not even apply to me).

@nielsdos
Member

nielsdos commented Jul 5, 2024

The wording in README seems to be inconsistent with the actual applied practices and should be corrected.

An RFC is required when it's a non-self-contained feature or when there is opposition against the change, although I admit that is subjective. E.g. looking at the mailing list, you can every so often see an email asking if anyone opposes a feature, usually followed by a two-week waiting period. When someone raises an issue, they definitely have to go through the RFC process even if it's a small self-contained feature.

The most recent source talking about RFC-requiring vs non-RFC-requiring features is this: https://wiki.php.net/rfc/release_cycle_update

@MaxKellermann
Contributor

But the (strong) wording in README is consistent with the (weak) wording in CONTRIBUTING. If README needs corrections, then CONTRIBUTING needs corrections as well, which is something I also tried (to allow PRs exactly like this one without an RFC!), but failed.

My real point here (before you dragged me into lawyering) was: when I submitted struct forward declarations, the whole idea was rejected, there was a supermajority vote against it, and my already-merged commits were even reverted (by one guy, without a vote). Now this PR does essentially the same, and surprisingly the exact same guy approves it. That is not just subjective (which would be okay), but also inconsistent and arbitrary.

We can agree or disagree on coding styles, that's okay, that's personal taste. But if, after a long discussion, a certain coding style is ultimately rejected (and even reverted), it should be rejected for everybody. It was not rejected here, that's why I complained.

@nielsdos
Member

nielsdos commented Jul 5, 2024

But the (strong) wording in README is consistent with the (weak) wording in CONTRIBUTING. If README needs corrections, then CONTRIBUTING needs corrections as well, which is something I also tried (to allow PRs exactly like this one without an RFC!), but failed.

I agree both places need their wordings adjusted.

My real point here (before you dragged me into lawyering) was: when I submitted struct forward declarations, the whole idea was rejected, there was a supermajority vote against it, and my already-merged commits were even reverted (by one guy, without a vote).

I think this happened when I first started contributing, or at least before I started following discussions closely, so I don't know/remember all the details here. I can only say that the secondary votes of your RFC specified how to implement the include cleanup, and as part of that struct forward declarations were rejected. However, secondary votes only matter if primary votes are accepted. Note that even if the primary vote was accepted, my understanding is that it would only apply to include cleanup, so forward declarations would've only been disallowed for include cleanup in that case.

Now this PR does essentially the same, and surprisingly the exact same guy approves it. That is not just subjective (which would be okay), but also inconsistent and arbitrary.

I can't comment on this part though.

We can agree or disagree on coding styles, that's okay, that's personal taste. But if, after a long discussion, a certain coding style is ultimately rejected (and even reverted), it should be rejected for everybody. It was not rejected here, that's why I complained.

The include cleanup commits were reverted because you were forced to go through the RFC process after some developers didn't like it. Although the RFC process isn't ideal for these cases because it's a purely technical change instead of a user-facing change.

That said, coding style should be enforced consistently yes.

@@ -38,6 +38,7 @@ struct _zend_call_info {
bool send_unpack; /* Parameters passed by SEND_UNPACK or SEND_ARRAY */
bool named_args; /* Function has named arguments */
bool is_prototype; /* An overridden child method may be called */
bool is_frameless; /* A frameless function sends arguments through operands */

Do we really mean operands here and not opcodes?

Member

Yes, we do mean operands.
