# Debugging with Swarming

This document outlines how to debug a test failure on a specific builder
configuration without needing to repeatedly upload new CL revisions or do CQ dry
runs.

[TOC]

## Overview & Terms

*Swarming* is a system operated by the infra team that schedules and runs tasks
under a specific set of constraints, like "this must run on a macOS 10.13 host"
or "this must run on a host with an Intel GPU". It is somewhat similar to part
of [Borg], or to [Kubernetes].

An *isolate* is an archive containing all the files needed to do a specific task
on the swarming infrastructure. It contains binaries as well as any libraries
they link against and any support data. An isolate can be thought of as a
tarball, but one held by the "isolate server" and identified by a hash of its
contents. The isolate also includes the command(s) to run, which is why the
command is specified when building the isolate, not when executing it.
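
To make "identified by a hash of its contents" concrete, here is a minimal
sketch of content addressing. `content_hash` is a hypothetical helper, and the
use of SHA-1 over a single byte string is an assumption made for illustration
(it matches the 40-hex-digit length of the hashes shown elsewhere in this doc);
the isolate server's real scheme may differ.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a hex digest that identifies `data` purely by its contents."""
    # Assumption: SHA-1, chosen only because it yields 40 hex digits.
    return hashlib.sha1(data).hexdigest()


first = content_hash(b"unit_tests binary, version 1")
again = content_hash(b"unit_tests binary, version 1")
changed = content_hash(b"unit_tests binary, version 2")

print(first == again)    # identical contents map to the same identifier
print(first == changed)  # any change in contents yields a new identifier
```

The practical consequence is that rebuilding a target with different inputs
produces a different identifier, so a hash always refers to one exact set of
files.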

Normally, when you do a CQ dry run, something like this happens:

```
for type in builders_to_run:
  targets = compute_targets_for(type)
  isolates = use_swarming_to_build(type, targets) # uploads isolates for targets
  wait_for_swarming_to_be_done()

  for isolate in isolates:
    use_swarming_to_run(type, isolate) # downloads isolates onto the bots used
  wait_for_swarming_to_be_done()
```

When you do a CQ retry on a specific set of bots, that simply constrains
`builders_to_run` in the pseudocode above. However, if you're trying to rerun a
specific target on a specific bot, because you're trying to reproduce a failure
or debug, doing a CQ retry will still waste a lot of time: the retry builds and
runs *all* targets, even if only for one bot.

Fortunately, you can manually invoke some steps of this process. What you really
want to do is:

```
isolate = use_swarming_to_build(type, target) # can't do this yet, see below
use_swarming_to_run(type, isolate)
```

or perhaps:

```
isolate = upload_to_isolate_server(target_you_built_locally)
use_swarming_to_run(type, isolate)
```

## The easy way

A lot of the steps described in this doc have been bundled up into two
tools. Before using either of these you will need to
[authenticate](#authenticating).

### run-swarmed.py

Much of the logic below is wrapped up in `tools/run-swarmed.py`, which you can
run like this:

```
$ tools/run-swarmed.py $outdir $target
```

See the `--help` option of `run-swarmed.py` for more details about that script.

### mb.py run

Similar to `tools/run-swarmed.py`, `mb.py run` bundles much of the logic into a
single command line. Unlike `tools/run-swarmed.py`, `mb.py run` allows the user
to specify extra arguments to pass to the test, but has a messier command line.

To use it, run:

```
$ tools/mb/mb.py run \
    -s --no-default-dimensions \
    -d pool $pool \
    $criteria \
    $outdir $target \
    -- $extra_args
```
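
If you script this often, the command line above can be assembled
programmatically. The following is a sketch only: `mb_run_command` is a
hypothetical helper, it uses just the flags shown above, and the sample values
(`out/Release`, `--gtest_filter=Foo.*`) are placeholders rather than anything
this doc prescribes. It builds and prints the command without running it.

```python
import shlex


def mb_run_command(outdir, target, pool, dimensions=None, extra_args=None):
    """Assemble an `mb.py run` argument list (hypothetical helper)."""
    cmd = ["tools/mb/mb.py", "run", "-s", "--no-default-dimensions",
           "-d", "pool", pool]
    # Each selection criterion becomes one `-d name value` triple.
    for name, value in (dimensions or {}).items():
        cmd += ["-d", name, value]
    cmd += [outdir, target]
    # Everything after `--` is passed through to the test.
    if extra_args:
        cmd += ["--"] + list(extra_args)
    return cmd


cmd = mb_run_command("out/Release", "unit_tests", "Chrome",
                     dimensions={"os": "Linux"},
                     extra_args=["--gtest_filter=Foo.*"])
print(shlex.join(cmd))
```

To actually run it, you could pass `cmd` to `subprocess.run(cmd, check=True)`
from the root of a checkout. `shlex.join` requires Python 3.8 or newer.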

## A concrete example

Here's how to run `chrome_public_test_apk` on a bot with a Nexus 5 running KitKat.

```sh
$ tools/mb/mb.py run \
    -s --no-default-dimensions \
    -d pool Chrome \
    -d device_os_type userdebug -d device_os KTU84P -d device_type hammerhead \
    out/Android-arm-dbg chrome_public_test_apk
```

This assumes you have an `out/Android-arm-dbg/args.gn` like this:

```
ffmpeg_branding = "Chrome"
is_component_build = false
is_debug = true
proprietary_codecs = true
strip_absolute_paths_from_debug_symbols = true
symbol_level = 1
system_webview_package_name = "com.google.android.webview"
target_os = "android"
use_goma = true
```

## Bot selection criteria

The examples in this doc use `$criteria`. To figure out what values to use, you
can go to an existing swarming run
([recent tasks page](https://chromium-swarm.appspot.com/tasklist)) and
look at the `Dimensions` section. Each of these becomes a `-d dimension_name
dimension_value` in your `$criteria`. Click on `bots` (or go
[here](https://chromium-swarm.appspot.com/botlist)) to be taken to a UI that
allows you to try out the criteria interactively, so that you can be sure that
there are bots matching your criteria. Sometimes the web page shows a
human-friendly name rather than the name required on the command line. [This
file](https://cs.chromium.org/chromium/infra/luci/appengine/swarming/ui2/modules/alias.js)
contains the mapping to human-friendly names. You can test your command line by
entering `dimension_name:dimension_value` in the interactive UI.
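
The mapping from dimensions to `$criteria` is mechanical, so it can be sketched
in a few lines. Both helpers below are hypothetical names introduced for
illustration; the sample dimensions are the ones used in the concrete example
earlier in this doc.

```python
def criteria_args(dimensions):
    """Turn {name: value} dimensions into `-d name value` arguments."""
    args = []
    for name, value in dimensions.items():
        args += ["-d", name, value]
    return args


def ui_filters(dimensions):
    """Render the same dimensions as `name:value` filters for the bot list UI."""
    return [f"{name}:{value}" for name, value in dimensions.items()]


dims = {
    "device_os_type": "userdebug",
    "device_os": "KTU84P",
    "device_type": "hammerhead",
}
print(" ".join(criteria_args(dims)))  # paste into the command line as $criteria
print(ui_filters(dims))               # paste one at a time into the bot list UI
```

Keeping a single dictionary as the source of truth makes it harder for the
command line and the filters you tested in the UI to drift apart.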

## Building an isolate

At the moment, you can only build an isolate locally, like so (commands you type
begin with `$`):

```
$ tools/mb/mb.py isolate //$outdir $target
```

This will produce some files in `$outdir`. The most pertinent two are
`$outdir/$target.isolate` and `$outdir/$target.isolated`. If you've already
built `$target`, you can save some CPU time and run `tools/mb/mb.py` with
`--no-build`:

```
$ tools/mb/mb.py isolate --no-build //$outdir $target
```
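
As a sketch of how the two variants relate, the hypothetical helpers below
assemble either command and name the two files to expect afterwards. They only
build strings; nothing is executed, and `out/Release` is a placeholder output
directory.

```python
def mb_isolate_command(outdir, target, no_build=False):
    """Assemble an `mb.py isolate` argument list (hypothetical helper)."""
    cmd = ["tools/mb/mb.py", "isolate"]
    if no_build:
        # Skip the build when the target is already up to date.
        cmd.append("--no-build")
    cmd += ["//" + outdir, target]
    return cmd


def isolate_outputs(outdir, target):
    """Return the paths of the .isolate and .isolated files for a target."""
    return (f"{outdir}/{target}.isolate", f"{outdir}/{target}.isolated")


print(" ".join(mb_isolate_command("out/Release", "unit_tests", no_build=True)))
print(isolate_outputs("out/Release", "unit_tests"))
```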

Support for building an isolate using swarming, which would allow you to build
for a platform you can't build for locally, does not yet exist.

## Authenticating

You may need to log in to `https://isolateserver.appspot.com` before running the
tools and commands in this doc:

```
$ python tools/swarming_client/auth.py login \
    --service=https://isolateserver.appspot.com
```

Use your google.com account for this.

## Uploading an isolate

You can then upload the resulting isolate to the isolate server:

```
$ tools/swarming_client/isolate.py archive \
    -I https://isolateserver.appspot.com \
    -i $outdir/$target.isolate \
    -s $outdir/$target.isolated
```

The `isolate.py` tool will emit something like this:

```
e625130b712096e3908266252c8cd779d7f442f1 unit_tests
```

Do not press Ctrl-C after it prints this, even if it seems to hang for a
minute; just let it finish.
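
If you capture that output in a script, the hash can be pulled out with a small
parser. This is a sketch under assumptions: `parse_archive_output` is a
hypothetical helper, and it assumes each result line has the form
`<40 lowercase hex digits> <target>`, which is inferred from the single sample
line above rather than from any documented output format.

```python
HEX_DIGITS = set("0123456789abcdef")


def parse_archive_output(output):
    """Return {target: hash} for lines shaped like '<hash> <target>'."""
    hashes = {}
    for line in output.splitlines():
        parts = line.split()
        # Assumption: a result line is exactly a 40-hex-digit hash and a name.
        if len(parts) == 2 and len(parts[0]) == 40 and set(parts[0]) <= HEX_DIGITS:
            hashes[parts[1]] = parts[0]
    return hashes


sample = "e625130b712096e3908266252c8cd779d7f442f1 unit_tests\n"
print(parse_archive_output(sample)["unit_tests"])
```

Lines that do not match the assumed shape (progress messages, warnings) are
ignored rather than treated as errors.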

## Running an isolate

Now that the isolate is on the isolate server with hash `$hash` from the
previous step, you can run on bots of your choice:

```
$ tools/swarming_client/swarming.py trigger \
    -S https://chromium-swarm.appspot.com \
    -I https://isolateserver.appspot.com \
    -d pool $pool \
    $criteria \
    -s $hash
```

There are two more things you need to fill in here. The first is the pool name;
you should pick "Chrome" unless you know otherwise. The pool is the collection
of hosts from which swarming will try to pick bots to run your tasks.

The second is the criteria, which is how you specify which bot(s) you want your
task scheduled on. Criteria are expressed as "dimensions", which you pass with
`-d key val` or `--dimension=key val`. In fact, the `-d pool $pool` in the
command above is selecting based on the "pool" dimension. There are a lot of
possible dimensions; one useful one is "os", like `-d os Linux`. Examples of
other dimensions include:

* `-d os Mac10.13.6` to select a specific OS version
* `-d device_type "Pixel 3"` to select a specific Android device type
* `-d gpu 8086:1912` to select a specific GPU

The [swarming bot list] allows you to see all the dimensions and the values they
can take on.

If you need to pass additional arguments to the test, simply add
`-- $extra_args` to the end of the `swarming.py trigger` command line; anything
after the `--` will be passed directly to the test.
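
Putting the pool, the criteria, the hash, and the optional extra arguments
together, the trigger command can be assembled as in the sketch below.
`swarming_trigger_command` is a hypothetical helper that uses only the flags
shown above, and `--gtest_filter=Foo.*` is a placeholder test argument. The
command is printed, not executed.

```python
import shlex


def swarming_trigger_command(isolate_hash, pool="Chrome",
                             dimensions=None, extra_args=None):
    """Assemble a `swarming.py trigger` argument list (hypothetical helper)."""
    cmd = ["tools/swarming_client/swarming.py", "trigger",
           "-S", "https://chromium-swarm.appspot.com",
           "-I", "https://isolateserver.appspot.com",
           "-d", "pool", pool]
    # The pool is just another dimension; the rest narrow down the bots.
    for name, value in (dimensions or {}).items():
        cmd += ["-d", name, value]
    cmd += ["-s", isolate_hash]
    # Anything after `--` goes straight to the test.
    if extra_args:
        cmd += ["--"] + list(extra_args)
    return cmd


cmd = swarming_trigger_command(
    "e625130b712096e3908266252c8cd779d7f442f1",
    dimensions={"device_type": "Pixel 3"},
    extra_args=["--gtest_filter=Foo.*"])
print(shlex.join(cmd))
```

Building the command as a list avoids quoting mistakes with values that contain
spaces, such as `Pixel 3`; `shlex.join` (Python 3.8 or newer) adds the quotes
only when printing.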

When you invoke `swarming.py trigger`, it will emit two pieces of information: a
URL for the task it created, and a command you can run to collect the results of
that task. For example:

```
Triggered task: [email protected]/os=Linux_pool=Chrome/e625130b712096e3908266252c8cd779d7f442f1
To collect results, use:
  tools/swarming_client/swarming.py collect -S https://chromium-swarm.appspot.com 46fc393777163310
Or visit:
  https://chromium-swarm.appspot.com/user/task/46fc393777163310
```

The `collect` command given there will block until the task is complete, then
produce the task's results, or you can load that URL and watch the task's
progress.
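
When triggering from a script, the task ID can be recovered from that output
and turned back into a `collect` command. This is a sketch: both helpers are
hypothetical, and the parser assumes the task URL appears in the output exactly
in the form shown in the example above.

```python
import re

# Assumption: the task URL has the shape shown in the sample output.
TASK_URL = re.compile(
    r"https://chromium-swarm\.appspot\.com/user/task/([0-9a-f]+)")


def task_id_from_trigger_output(output):
    """Return the task ID found in `swarming.py trigger` output, or None."""
    match = TASK_URL.search(output)
    return match.group(1) if match else None


def collect_command(task_id):
    """Assemble the matching `swarming.py collect` argument list."""
    return ["tools/swarming_client/swarming.py", "collect",
            "-S", "https://chromium-swarm.appspot.com", task_id]


sample = (
    "To collect results, use:\n"
    "  tools/swarming_client/swarming.py collect"
    " -S https://chromium-swarm.appspot.com 46fc393777163310\n"
    "Or visit:\n"
    "  https://chromium-swarm.appspot.com/user/task/46fc393777163310\n"
)
task_id = task_id_from_trigger_output(sample)
print(task_id)
print(" ".join(collect_command(task_id)))
```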

## Other notes

If you are looking at a Swarming task page, be sure to check the bottom of the
page, which gives you commands to:

* Download the contents of the isolate the task used
* Reproduce the task's configuration locally
* Download all output results from the task locally

[borg]: https://ai.google/research/pubs/pub43438
[kubernetes]: https://kubernetes.io/
[swarming bot list]: https://chromium-swarm.appspot.com/botlist