# Debugging with Swarming

This document outlines how to debug a test failure on a specific builder
configuration on Swarming using the [UTR tool](../../tools/utr/README.md)
without needing to repeatedly upload new CL revisions or do CQ dry runs. This
tool automatically handles steps like replicating the right GN args, building &
uploading the test isolate, and triggering & collecting the swarming test
tasks.

[TOC]

## Overview & Terms

*Swarming* is a system operated by the infra team that schedules and runs tasks
under a specific set of constraints, like "this must run on a macOS 10.13 host"
or "this must run on a host with an Intel GPU". It is somewhat similar to part
of [Borg], or to [Kubernetes].

An *isolate* is an archive containing all the files needed to do a specific task
on the swarming infrastructure. It contains binaries as well as any libraries
they link against and any support data. An isolate can be thought of as a
tarball, but held by the CAS server and identified by a digest of its contents.
The isolate also includes the command(s) to run, which is why the command is
specified when building the isolate, not when executing it. See the
[infra glossary](../infra/glossary.md) for the definitions of these terms and
more.

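To make "identified by a digest of its contents" concrete, here is a minimal
Python sketch of content addressing. It is a toy model only: the function name
`content_digest`, the file names, and the hashing layout are all assumptions
made for illustration, and the real CAS digest format is different. The point
it shows is that the same files always produce the same identifier, while any
change to a file produces a new one.

```python
import hashlib


def content_digest(files: dict[str, bytes]) -> str:
    """Return a digest that depends only on file paths and contents.

    Toy model of content addressing; NOT the real CAS digest format.
    """
    hasher = hashlib.sha256()
    # Sort by path so the digest does not depend on insertion order.
    for path in sorted(files):
        data = files[path]
        # Length-prefix each field so that field boundaries are unambiguous.
        for field in (path.encode("utf-8"), data):
            hasher.update(len(field).to_bytes(8, "big"))
            hasher.update(field)
    return hasher.hexdigest()


# Hypothetical isolate contents: a test binary plus one support file.
isolate_a = {"out/some_test": b"\x7fELF...", "testdata/a.txt": b"hello"}
isolate_b = {"testdata/a.txt": b"hello", "out/some_test": b"\x7fELF..."}
isolate_c = {"out/some_test": b"\x7fELF...", "testdata/a.txt": b"hello!"}

print(content_digest(isolate_a) == content_digest(isolate_b))  # True
print(content_digest(isolate_a) == content_digest(isolate_c))  # False
```
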
Normally, when you do a CQ dry run, something like this happens:

```
for type in builders_to_run:
  targets = compute_targets_for(type)
  isolates = use_swarming_to_build(type, targets) # uploads isolates for targets
  wait_for_swarming_to_be_done()

  for isolate in isolates:
    use_swarming_to_run(type, isolate) # downloads isolates onto the bots used
  wait_for_swarming_to_be_done()
```

When you do a CQ retry on a specific set of bots, that simply constrains
`builders_to_run` in the pseudocode above. However, if you're trying to rerun a
specific target on a specific bot, because you're trying to reproduce a failure
or debug, doing a CQ retry will still waste a lot of time: the retry will still
build and run *all* targets, even if it's only for one bot.

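The difference in work can be sketched with a small Python model. This is an
illustration under stated assumptions, not how the CQ is implemented: the
function names and the placeholder targets are made up, and only the builder
name and `chrome_public_unit_test_apk` come from this document. It counts the
(builder, target) pairs that each approach has to build and run.

```python
def cq_retry_work(targets_by_builder, builders_to_run):
    """Return every (builder, target) pair a CQ retry would build and run."""
    return [
        (builder, target)
        for builder in builders_to_run
        for target in targets_by_builder[builder]
    ]


def targeted_work(builder, target):
    """Return the single (builder, target) pair a targeted rerun needs."""
    return [(builder, target)]


# Hypothetical target list; a real builder's list comes from its configuration.
targets_by_builder = {
    "android-arm64-rel": [
        "chrome_public_unit_test_apk",
        "placeholder_target_1",
        "placeholder_target_2",
    ],
}

# Restricting the retry to one bot still covers all of that bot's targets.
retry = cq_retry_work(targets_by_builder, ["android-arm64-rel"])
targeted = targeted_work("android-arm64-rel", "chrome_public_unit_test_apk")
print(len(retry), len(targeted))  # 3 1
```
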
Fortunately, you can manually invoke some steps of this process. What you really
want to do is:

```
isolate = use_swarming_to_build(type, target) # can't do this yet, see below
use_swarming_to_run(type, isolate)
```

or perhaps:

```
isolate = upload_to_cas(target_you_built_locally)
use_swarming_to_run(type, isolate)
```

## A concrete example

Here's how to run `chrome_public_unit_test_apk` on Android devices. By using the
config of the `android-arm64-rel` trybot, we can run it on Pixel 3 XLs running
Android Pie.

```sh
$ vpython3 tools/utr \
    -p chromium \
    -B try \
    -b android-arm64-rel \
    -t "chrome_public_unit_test_apk on Android device Pixel 3 XL" \
    compile-and-test
```

You can find the UTR invocation for any test on the build UI under the step's
"reproduction instructions" (displayed by clicking the page icon in the UI).

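If you rerun the same test against several builders, it can help to assemble
the command line programmatically. The following is a minimal Python sketch
that only builds and prints the invocation from the example in this section;
it does not run anything. The helper `utr_command` and its parameter names
(`project`, `bucket`, `builder`, `test`, `mode`) are hypothetical labels chosen
for this sketch; only the flags `-p`, `-B`, `-b`, `-t` and the
`compile-and-test` argument come from the example. See the UTR README for the
authoritative option list.

```python
import shlex


def utr_command(project: str, bucket: str, builder: str, test: str,
                mode: str = "compile-and-test") -> list[str]:
    """Build the argv for a UTR invocation like the example (does not run it)."""
    return [
        "vpython3", "tools/utr",
        "-p", project,
        "-B", bucket,
        "-b", builder,
        "-t", test,
        mode,
    ]


argv = utr_command(
    project="chromium",
    bucket="try",
    builder="android-arm64-rel",
    test="chrome_public_unit_test_apk on Android device Pixel 3 XL",
)
# shlex.join quotes the test name, which contains spaces, so the printed
# line can be pasted into a shell at the root of a checkout.
print(shlex.join(argv))
```
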
## Other notes

If you are looking at a Swarming task page, be sure to check the bottom of the
page, which gives you commands to:

* Download the contents of the isolate the task used
* Reproduce the task's configuration locally
* Download all output results from the task locally

[borg]: https://ai.google/research/pubs/pub43438
[kubernetes]: https://kubernetes.io/
[swarming bot list]: https://chromium-swarm.appspot.com/botlist

To find the repo checkout, GN args, and other settings needed for a local
compile, use
[how to repro bot failures](../testing/how_to_repro_bot_failures.md)
as a reference.