Skip to content

tests/extmod/vfs_fat_finaliser.py: Fix on unix qemu-arm (WIP) #7659

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

dpgeorge
Copy link
Member

No description provided.

@dpgeorge
Copy link
Member Author

Since #7649 was merged the following (unrelated?) test now fails on unix qemu-arm:

--- /home/runner/work/micropython/micropython/tests/results/extmod_vfs_fat_finaliser.py.exp	2021-08-14 07:57:14.855392574 +0000
+++ /home/runner/work/micropython/micropython/tests/results/extmod_vfs_fat_finaliser.py.out	2021-08-14 07:57:14.855392574 +0000
@@ -2,4 +2,4 @@
 x0
 x1
 x2
-x3
+

FAILURE /home/runner/work/micropython/micropython/tests/results/extmod_vfs_fat_finaliser.py

And the following (also unrelated?) on Windows (AppVeyor) release x86:

--- "C:/projects/micropython/tests/results\\micropython_extreme_exc.py.exp"	2021-08-14 07:58:33.538441300 +0000
+++ "C:/projects/micropython/tests/results\\micropython_extreme_exc.py.out"	2021-08-14 07:58:33.538441300 +0000
@@ -1,5 +1,7 @@
 Exception()
 TypeError(
-TypeError(
-RuntimeError(
-Exception('my exception',)
+Traceback (most recent call last):
+  File "micropython/extreme_exc.py", line 166, in <module>
+  File "micropython/extreme_exc.py", line 131, in main
+MemoryError: memory allocation failed, allocating 160 bytes
+CRASH
\ No newline at end of file
FAILURE C:/projects/micropython/tests/results\micropython_extreme_exc.py

@stinos
Copy link
Contributor

stinos commented Aug 14, 2021

Happened to me before that seemingly unrelated changes cause a finaliser test to fail. Assuming that's because binary layout changed just enough for certain pieces of code to now require more on the Python/C stack to push everything off, or a similar principle.

But that MemoryError is something different.

@jimmo
Copy link
Member

jimmo commented Aug 14, 2021

I can replicate this locally, but only when run via run-test.py (extra logging output showing the command line).

$ MICROPY_MICROPYTHON="../ports/unix/micropython-coverage" ./run-tests.py extmod/vfs_fat_finaliser.py
['qemu-arm', '../ports/unix/micropython-coverage', '-X', 'emit=bytecode', 'extmod/vfs_fat_finaliser.py']

fails, with

$ cat results/extmod_vfs_fat_finaliser.py.out 
MemoryError
x0
x1
x2

If I run

$ qemu-arm ../ports/unix/micropython-coverage -X emit=bytecode extmod/vfs_fat_finaliser.py
MemoryError
x0
x1
x2
x3

(which is the correct output).

Disabling MICROPY_PY_FSTRINGS makes it work in both cases, but with f-strings disabled and changing the test case to

N = 4
def foo():
    for i in range(N):
        n = "x%d" % i
        f = vfs.open(n, "w")
        f.write(n)
        f = None  # release f without closing
        [0, 1, 2, 3]  # use up Python stack so f is really gone
foo()

makes it fail in the same way (only in run-test.py, but fine if run directly). Adding some more locals makes it pass. So this agrees with what @stinos is saying... will require a bit more work to trace the gc collection and figure out exactly what's making the GC think the block is still live.. Different alignments result in different files not being finalised.

@dpgeorge Given that I can make this test fail with f-strings disabled, I don't think there's a new regression here. It might be worth just disabling this test for now on unix_qemu_arm.

@stinos
Copy link
Contributor

stinos commented Aug 15, 2021

But that MemoryError is something different.

Had a look into what extreme_exc.py actually does (only reproduces on Appveyor VMs, not on my machine), and after all it isn't a different problem: when mp_obj_exception_print tries to allocate memory for the string that leads to a gc_collect since there is no memory anymore (as intended) but the lst object doesn't get sweeped hence the failure. Using

    for i in range(len(lst)):
      lst[i] = None

instead of the current lst[0][0] = None on line 129 is a workaround.

Note: the test code says 'should use the heap and truncate the message' but that's a bit confusing; what is being tested there is that when the MemoryError gets allocated it actually doesn't use the heap (because that is exhausted) but instead falls back to using the emergency exception buffer instead, right?

@iabdalkader
Copy link
Contributor

iabdalkader commented Aug 15, 2021

Is this related ? Can't find any other reason why PRs are failing...

image

image

@jimmo
Copy link
Member

jimmo commented Aug 15, 2021

Is this related ? Can't find any other reason why PRs are failing...

Yes. I am going to fix this today...didn't get any time over the weekend.

@jimmo
Copy link
Member

jimmo commented Aug 16, 2021

(Note: For consistently, I'm going to refer to the stack "backwards", i.e. by memory layout -- the bottom of the stack is the lowest memory address and the most recently pushed items, and the top is MP_STATE_THREAD(stack_top) and the oldest items).

So the exact cause of the vfs_fat_finalizer.py failure is that the fourth file is being allocated on the heap exactly at the location of a previous allocation, and this allocation is still referenced higher in the stack. Specifically the situation I saw is that was previously an allocation for a parse node chunk allocated by parser_alloc during the compilation process. After compilation the stack pointer goes back up, but then by chance when the stack pointer goes down (and eventually ends up at gc_collect) it never happens to overwrite that word. Importantly, it's definitely not the program state holding the reference to the variable (i.e. it's not currently referenced in a register or anywhere near the bottom of the stack, rather something that's much closer to the top of the stack (257 / ~1400 words).

This is of course a known limation of the MicroPython garbage collector -- it has false positives. The problem goes away with very small changes to the layout, it's just unlucky that the change to enable f-strings made the lexer struct bigger etc.

The simplest fix is to just explicitly NULL this pointer in parser_alloc but that's just playing whack-a-mole.

So I think the best way to fix this is to ensure that the allocation for the file is at an address in the heap that will not have been used for a previous allocation, and the easiest way to do that is to allocate more blocks than were used in the parse/compile stage. Note that these must be made of lots of single-block allocations rather than e.g. bytearray(1024) in order to move the gc_last_free_atb_index.

Adding

for i in range(1024):
    []

fixes the test.

But wait a minute... that's pretty unlucky to have the stack grow back down and not manage to overwrite that value. And... what was that about a stack depth of 1400 words. And how come it takes 20 seconds of CPU time to objdump this binary. And why is the stack pointer at gc_helper_collect_regs_and_stack equal to 0xfffee474 but the dangling heap reference was at 0x fffef350 (nearly 4kiB away on a program that has no recursion).

It seems that arm-linux-gnueabi-gcc (at least up to 10.3.0 used in the ubuntu:rolling image) cannot do the optimisation required for error text compression to work, so every function that uses MP_ERROR_TEXT not only has the giant strcmp list, but also needs 1856 bytes of stack.

So the second fix is to disable MICROPY_ROM_TEXT_COMPRESSION on this compiler, which will mean the stack is more likely to overwrite these dangling pointers.

I checked the arm-linux-gnueabi-gcc for ubuntu:bionic, ubuntu:focus (ubuntu:latest), ubuntu:hirsuite (ubuntu:rolling) and they all miss this optimisation:

  arm-linux-gnueabi-gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
  arm-linux-gnueabi-gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
  arm-linux-gnueabi-gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0

Note that ubuntu:bionic's arm-none-eabi-gcc, even at a much older gcc version, has no trouble with it.

arm-none-eabi-gcc (15:6.3.1+svn253039-1build1) 6.3.1 20170620

jimmo added a commit to jimmo/micropython that referenced this pull request Aug 16, 2021
This compiler is unable to optimise out the giant strcmp match generated
by MP_MATCH_COMPRESSED.

See github.com/micropython/pull/7659#issuecomment-899479793

Signed-off-by: Jim Mussared <[email protected]>
jimmo added a commit to jimmo/micropython that referenced this pull request Aug 16, 2021
Prevents the finaliser from being missed if there's a dangling reference
on the stack to one of the blocks for the files (that this test checks
that they get finalised).

See github.com/micropython/pull/7659#issuecomment-899479793

Signed-off-by: Jim Mussared <[email protected]>
@stinos
Copy link
Contributor

stinos commented Aug 16, 2021

Hmm, that's peculiar. But good find!

@dpgeorge
Copy link
Member Author

Great find!

Is there a way we can detect if the compiler cannot optimise away the strcmp?

@jimmo
Copy link
Member

jimmo commented Aug 16, 2021

Great find!

Is there a way we can detect if the compiler cannot optimise away the strcmp?

I knew you'd ask that. :)

The simplest thing I can think of is to pick a symbol (e.g. in this case mp_call_function_n_kw, but anything that uses MP_ERROR_TEXT) and ensure it's less than N bytes.

@jimmo
Copy link
Member

jimmo commented Aug 16, 2021

More specifically

$(CROSS_COMPILE)nm -S ports/unix/micropython-coverage | grep " mp_call_function_n_kw" | cut -d' ' -f2

gives 00004e2c vs 00000148

@iabdalkader
Copy link
Contributor

iabdalkader commented Aug 16, 2021

at least up to 10.3.0 used in the ubuntu:rolling image

Is this due to a recent change in Micropython or something to do with Ubuntu gcc configuration, because I'm on 10.2.1 and it can optimize away the strcmp tree. Maybe this can be helpful:

Configured with: /mnt/workspace/workspace/GCC-10-pipeline/jenkins-GCC-10-pipeline-48_20201124_1606180641
/src/gcc/configure --target=arm-none-eabi --prefix=/mnt/workspace/workspace/GCC-10-pipeline/jenkins-GCC-10-pipeline-
48_20201124_1606180641/install-native --libexecdir=/mnt/workspace/workspace/GCC-10-pipeline/jenkins-GCC-10-
pipeline-48_20201124_1606180641/install-native/lib --infodir=/mnt/workspace/workspace/GCC-10-pipeline/jenkins-
GCC-10-pipeline-48_20201124_1606180641/install-native/share/doc/gcc-arm-none-eabi/info --mandir=/mnt/workspace
/workspace/GCC-10-pipeline/jenkins-GCC-10-pipeline-48_20201124_1606180641/install-native/share/doc/gcc-arm-none-
eabi/man --htmldir=/mnt/workspace/workspace/GCC-10-pipeline/jenkins-GCC-10-pipeline-48_20201124_1606180641
/install-native/share/doc/gcc-arm-none-eabi/html --pdfdir=/mnt/workspace/workspace/GCC-10-pipeline/jenkins-GCC-10-
pipeline-48_20201124_1606180641/install-native/share/doc/gcc-arm-none-eabi/pdf --enable-languages=c,c++ --enable-
plugins --disable-decimal-float --disable-libffi --disable-libgomp --disable-libmudflap --disable-libquadmath --disable-libssp 
--disable-libstdcxx-pch --disable-nls --disable-shared --disable-threads --disable-tls --with-gnu-as --with-gnu-ld --with-newlib
 --with-headers=yes --with-python-dir=share/gcc-arm-none-eabi --with-sysroot=/mnt/workspace/workspace/GCC-10-
pipeline/jenkins-GCC-10-pipeline-48_20201124_1606180641/install-native/arm-none-eabi --build=x86_64-linux-gnu 
--host=x86_64-linux-gnu --with-gmp=/mnt/workspace/workspace/GCC-10-pipeline/jenkins-GCC-10-pipeline-
48_20201124_1606180641/build-native/host-libs/usr --with-mpfr=/mnt/workspace/workspace/GCC-10-pipeline/jenkins-
GCC-10-pipeline-48_20201124_1606180641/build-native/host-libs/usr --with-mpc=/mnt/workspace/workspace/GCC-10-
pipeline/jenkins-GCC-10-pipeline-48_20201124_1606180641/build-native/host-libs/usr --with-isl=/mnt/workspace
/workspace/GCC-10-pipeline/jenkins-GCC-10-pipeline-48_20201124_1606180641/build-native/host-libs/usr --with-
libelf=/mnt/workspace/workspace/GCC-10-pipeline/jenkins-GCC-10-pipeline-48_20201124_1606180641/build-native
/host-libs/usr --with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm' --with-pkgversion='GNU Arm Embedded Toolchain 10-2020-q4-major' --with-multilib-list=rmprofile,aprofile

@jimmo
Copy link
Member

jimmo commented Aug 16, 2021

because I'm on 10.2.1

@iabdalkader Is this arm-linux-gnueabi-gcc or arm-none-eabi-gcc ? I only see the issue on the former.
(I see target=arm-none-eabi in that snippet)

@iabdalkader
Copy link
Contributor

because I'm on 10.2.1

@iabdalkader Is this arm-linux-gnueabi-gcc or arm-none-eabi-gcc ? I only see the issue on the former.
(I see target=arm-none-eabi in that snippet)

Ah I see, it's arm-none-eabi-gcc sorry.

@dpgeorge
Copy link
Member Author

Importantly, it's definitely not the program state holding the reference to the variable (i.e. it's not currently referenced in a register or anywhere near the bottom of the stack, rather something that's much closer to the top of the stack

So this issue would be fixed also by #4723.

@dpgeorge
Copy link
Member Author

The simplest thing I can think of is to pick a symbol (e.g. in this case mp_call_function_n_kw, but anything that uses MP_ERROR_TEXT) and ensure it's less than N bytes.

Ok. Do you think it's worth adding this check and printing a warning? If you're building uPy to actually use then it's most likely you will notice it being very large, but only if you are familiar with what its "usual" size is (which is not going to be well known on a new target, eg RISC-V).

dpgeorge pushed a commit that referenced this pull request Aug 17, 2021
This compiler is unable to optimise out the giant strcmp match generated
by MP_MATCH_COMPRESSED.

See github.com//pull/7659#issuecomment-899479793

Signed-off-by: Jim Mussared <[email protected]>
dpgeorge pushed a commit that referenced this pull request Aug 17, 2021
Prevents the finaliser from being missed if there's a dangling reference
on the stack to one of the blocks for the files (that this test checks
that they get finalised).

See github.com//pull/7659#issuecomment-899479793

Signed-off-by: Jim Mussared <[email protected]>
@jimmo
Copy link
Member

jimmo commented Aug 17, 2021

So this issue would be fixed also by #4723.

Yes. I believe so.

Do you think it's worth adding this check and printing a warning?

Yes, I think so. The main concern is that it would have to run on every build, but that's probably OK. I will send a PR.

@jimmo
Copy link
Member

jimmo commented Aug 18, 2021

With the Windows/AppVeyor failure, it's the same issue (dangling stack pointer into the heap). I was suprised (given how the test works where it creates a linked list filling the entire heap) that a single pointer could pin the entire list. However what happens is that the vast majority of the list is allocated starting at the current heap mark, then a handful more get allocated after a collection, right in the area where previous allocations have been. This maximises the likelyhood that one of the most recent allocations will be where a potential dangling pointer might be.

********LhLhLhLhLh*****LhLhLhLhLhLhLhLhLhLhLhLhLhLhLhLhLhLhLhLhLhLhLhLhLhLhLhLh
           ^    ^   ^  ^
           A    B   C  D

In the above diagram, A is the dangling pointer, B is the current lst, C is the bytecode, D is the first lst allocation. So we only get to free the two lists to the right of A.

So the fix suggested by @stinos is good. Also just adding a gc.collect() before the first list element is allocated will also solve it, as it will ensure that the most recent allocation is at the end of the heap.

However, I was curious as to why the f-string PR causes this (i.e. why is the heap in a different enough state to make this problematic), and realised that it leaves the fstring_args vstr for the GC, when all the other lexer & parser state is explcitly free'd. So #7679 fixes this, and also fixes this test in the process.

I think the test should also be fixed but @dpgeorge noticed that this test isn't actually testing the code path that it's supposed to (because it's triggering the emergency exception buf, not heap allocation), so for now we'll leave it for and re-write the test.

Thanks to @stinos for the advice on how to debug this on AppVeyor via RDP.

@dpgeorge dpgeorge force-pushed the extmod-vfs-fat-finaliser-test-unix-qemu-arm branch from 000ae2d to 04f11b0 Compare August 19, 2021 07:33
@dpgeorge
Copy link
Member Author

Fixed in 91a99fc and 5555f14

This now passes all CI.

@dpgeorge dpgeorge closed this Aug 19, 2021
@dpgeorge dpgeorge deleted the extmod-vfs-fat-finaliser-test-unix-qemu-arm branch August 19, 2021 08:49
@dpgeorge dpgeorge added the tests Relates to tests/ directory in source label Aug 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests Relates to tests/ directory in source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants