# Clang Gardening

Chromium bundles its own pre-built version of [Clang](clang.md) and
[Rust](rust.md). This is done so that Chromium developers have access to the
latest and greatest developer tools provided by Clang and LLVM (ASan, CFI,
coverage, etc). In order to [update the compiler](updating_clang.md)
(roll clang), it has to be tested so that we can be confident that it works
in the configurations that Chromium cares about.

The Clang gardener is responsible for monitoring the health of the latest
versions of Clang + Rust and how they work with the latest version of
Chromium; raising any issues by filing bugs; addressing those issues or finding
someone to do so; and ultimately attempting to update the compiler by
performing [a Clang roll](updating_clang.md).

There are two main sources of information about the state of the build:

1. Buildbots on the [tip-of-tree clang
   waterfall](https://ci.chromium.org/p/chromium/g/chromium.clang/console)
   continuously build the latest version of Clang and use that to build and
   test Chromium in various build configurations. These provide the fastest
   signal about problems such as the compiler crashing, new warnings causing
   build failures, miscompiles causing test failures, etc. Unlike production
   buildbots, these build Clang with assertions enabled to detect as many
   problems as possible. (Clicking 'Log in' in the top right corner with a
   Google account will reveal a few more bots.)

1. Automatically generated [Clang roll
   CLs](https://chromium-review.googlesource.com/q/path:tools/clang/scripts/update.py)
   ("dry run CLs"). These are generated every few hours by a Cron job and
   attempt to package the latest version of Clang and Rust. That process can
   fail for many reasons, especially due to failures in the compilers' test
   suites. If the CLs stop being generated, that also needs to be addressed.

Issues should be filed in the [Chromium > Tools >
LLVM](https://g-issues.chromium.org/issues?q=status:open%20componentid:1457173)
bug tracker component, and marked as blockers of the tracking bug for the next
toolchain update. That bug is typically named "roll clang and rust again"
(see this [example](https://crbug.com/404285928)). The tracking bug should be
filed as a P1 Process bug, and blockers should be filed and treated as P1 bugs.

Here is a suggested set of steps to iterate over while gardening:

* If there is no bug for tracking the next toolchain update, file one.

* Go over the blockers of the toolchain update tracking bug. Close obsolete
  ones, try to fix or find someone to fix the remaining ones.

* Check the [tip-of-tree clang
  waterfall](https://ci.chromium.org/p/chromium/g/chromium.clang/console) and
  file bugs for any issues.

* Check the automatic [Clang roll
  CLs](https://chromium-review.googlesource.com/q/path:tools/clang/scripts/update.py).
  File a bug for any packaging issues. File a bug if the CLs stop being produced.

* When packaging succeeds on a roll CL, follow the instructions in [update the
  compiler](updating_clang.md) to push the packages to production and do a
  commit queue dry run. File a bug for any issues that come up.

* If the commit queue dry run was successful, review and land the CL.

The key to success is to detect as many problems as early as possible. Rather
than stopping to dig deeply into the first problem encountered, it's better to
do a broad sweep to find all the problems. That way they can be shared among
the team. Also, problems are often much easier to address when found early.

The other key to success is communication. The bug tracker is the main tool for
that, and the other is the team chat room. When it's not clear whether
something is an issue or not, how to resolve a problem, how something works,
etc., just ask away.

The gardener is also responsible for taking notes during the weekly Chrome
toolchain sync meeting.

[TOC]

## Clang packaging test failures

When packaging clang/LLVM on our various supported platforms (`upload_*_clang`
tryjobs), we run the entire LLVM test suite and block the build if any
test fails. The most common test failures we see are in Mac- and
Windows-specific tests, since upstream LLVM is mostly Linux-focused. There are
public bots that also run LLVM tests, mostly accessible from
https://lab.llvm.org/buildbot. There are also some Apple bots running at
http://green.lab.llvm.org/job/llvm.org/ which mirror test failures we see on
Mac. Reverting the culprit change upstream with a pointer to a public bot
showing the test failure is encouraged.

See LLVM's [Patch reversion
policy](https://llvm.org/docs/DeveloperPolicy.html#patch-reversion-policy).

## Disk out of space

If there are any issues with disk running out of space, file a go/bug-a-trooper
bug, for example https://crbug.com/1105134.

## Is it the compiler?

Chromium does not always build and pass tests in all configurations that
everyone cares about. Some configurations simply take too long to build
(ThinLTO) or test (dbg) on the CQ before committing. Also, some tests are
flaky. So, our console is often filled with red boxes, and the boxes don't
always need to be green to roll clang.

Oftentimes, if a bot is red with a test failure, it's not a bug in the compiler.
To check this, the easiest and best thing to do is to try to find a
corresponding builder that doesn't use ToT clang. For standard configurations,
start on the waterfall that corresponds to the OS of the red bot, and search
from there. If the failing bot is Google Chrome branded, go to the (Google
internal) [official builder
list](https://uberchromegw.corp.google.com/i/official.desktop.continuous/builders/)
and start searching from there.

If you are feeling charitable, you can try to see when the test failure was
introduced by looking at the history in the bot. One way to do this is to add
`?numbuilds=200` to the builder URL to see more history. If that isn't enough
history, you can manually binary search build numbers by editing the URL until
you find where the regression was introduced. If it's immediately clear what CL
introduced the regression (i.e. caused tests to fail reliably in the official
build configuration), you can simply load the change in Gerrit and revert it,
linking to the first failing build that implicates the change being reverted.

If the failure looks like a compiler bug, these are the common failures we see.
Each has its own section below describing what to do about it:

1. compiler crash
1. compiler warning change
1. compiler error
1. miscompile
1. linker error

## Compiler crash

This is probably the most common bug. The standard procedure is to do these
things:

1. Open the `gclient runhooks` stdout log from the first red build. Near the
   top of that log you can find the range of upstream llvm revisions. For
   example:

       From https://github.com/llvm/llvm-project
           f917356f9ce..292e898c16d master -> origin/master

1. File a crbug documenting the crash. Include the range, and any other bots
   displaying the same symptoms.
1. All clang crashes on the Chromium bots are automatically uploaded to
   Cloud Storage. On the failing build, click the "stdout" link of the
   "process clang crashes" step right after the red compile step. It will
   print something like

       processing heap_page-65b34d... compressing... uploading... done
           gs://chrome-clang-crash-reports/v1/2019/08/27/chromium.clang-ToTMac-20955-heap_page-65b34d.tgz
       removing heap_page-65b34d.sh
       removing heap_page-65b34d.cpp

   Use
   `gsutil.py cp gs://chrome-clang-crash-reports/v1/2019/08/27/chromium.clang-ToTMac-20955-heap_page-65b34d.tgz .`
   to copy it to your local machine. Untar with
   `tar xzf chromium.clang-ToTMac-20955-heap_page-65b34d.tgz` and change the
   included shell script to point to a locally-built clang. Remove the
   `-Xclang -plugin` flags. If you re-run the shell script, it should
   reproduce the crash.
1. Identify the revision that introduced the crash. First, look at the commit
   messages in the LLVM revision range to see if one modifies the code near the
   point of the crash. If so, try reverting it locally, rebuild, and run the
   reproducer to see if the crash goes away.

   If that doesn't work, use `git bisect`. Use this as a template for the bisect
   run script:
   ```shell
   #!/bin/bash
   cd "$(dirname "$0")"  # get into llvm build dir
   ninja -j900 clang || exit 125  # skip revisions that don't compile
   ./t-8f292b.sh || exit 1  # exit 0 if good, 1 if bad
   ```
1. File an upstream bug like http://llvm.org/PR43016. Usually the unminimized repro
   is too large for LLVM's bugzilla, so attach it to a (public) crbug and link
   to that from the LLVM bug. Then revert with a commit message like
   "Revert r368987, it caused PR43016."
1. If you want, make a reduced repro using CReduce. Clang contains a handy wrapper around
   CReduce that you can invoke like so:

       clang/utils/creduce-clang-crash.py --llvm-bin bin \
           angle_deqp_gtest-d421b0.sh angle_deqp_gtest-d421b0.cpp

   Attach the reproducer to the llvm bug you filed in the previous step. You can
   disable CReduce's renaming passes with the options
   `--remove-pass pass_clang rename-fun --remove-pass pass_clang rename-param
   --remove-pass pass_clang rename-var --remove-pass pass_clang rename-class
   --remove-pass pass_clang rename-cxx-method --remove-pass pass_clex
   rename-toks` which makes it easier for the author to reason about and to
   further reduce it manually.

If you need to do something the wrapper doesn't support,
follow the [official CReduce docs](https://embed.cs.utah.edu/creduce/using/)
for writing an interestingness test and use creduce directly.

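As a rough illustration of what a hand-written interestingness test can look
like, here is a minimal sketch. CReduce treats a variant as interesting when
the test exits with status 0. The function name `is_interesting`, the compiler
path, and the message being searched for are all hypothetical; a real test
would also pass the flags from the crash reproducer script.

```shell
#!/bin/bash
# Sketch of a CReduce interestingness test (hypothetical names throughout).
# Succeeds (exit 0) only if compiling SRC with CLANG still prints NEEDLE,
# for example the assertion message from the original crash.
is_interesting() {
  local clang="$1" src="$2" needle="$3"
  # Crash diagnostics go to stderr, so merge it into stdout before searching.
  "$clang" -c "$src" -o /dev/null 2>&1 | grep -qF -- "$needle"
}

# Example call from the script that creduce runs (path and message are assumptions):
#   is_interesting /path/to/llvm-build/bin/clang++ repro.cpp "UNREACHABLE executed"
```

Matching on a specific message rather than on "the compiler failed" helps keep
the reduction from drifting to an unrelated error.
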
## Compiler warning change

New Clang versions often find new bad code patterns to warn on. Chromium builds
with `-Werror`, so improvements to warnings often turn into build failures in
Chromium. Once you understand the code pattern Clang is complaining about, file
a bug to either fix or silence the new warning.

If this is a completely new warning, disable it by adding `-Wno-NEW-WARNING` to
[this list of disabled
warnings](https://cs.chromium.org/chromium/src/build/config/compiler/BUILD.gn?l=1479)
if `llvm_force_head_revision` is true. Here is [an
example](https://chromium-review.googlesource.com/1251622). This will keep the
ToT bots green while you decide what to do.

Sometimes, behavior changes and a pre-existing warning changes to warn on new
code. In this case, fixing Chromium may be the easiest and quickest fix. If
there are many sites, you may consider changing clang to put the new diagnostic
into a new warning group so you can handle it as a new warning as described
above.

If the warning is high value, then eventually our team or other contributors
will end up fixing the crbug and there is nothing more to do. If the warning
seems low value, pass that feedback along to the author of the new warning
upstream. It's unlikely that it should be on by default or enabled by `-Wall` if
users don't find it valuable. If the warning is particularly noisy and can't be
easily disabled without disabling other high value warnings, you should consider
reverting the change upstream and asking for more discussion.

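When triaging a warning change, it can help to see which warning flags are
behind the new errors and how many sites each one hits. Clang appends the flag
to each diagnostic, for example `[-Werror,-Wunused-variable]`. The sketch below
counts those flags in a saved build log; the helper name `count_werror_flags`
and the log file are hypothetical and not part of Chromium's tooling.

```shell
# Print each -W flag that was promoted to an error in a build log,
# most frequent first, as "FLAG COUNT".
# Usage: count_werror_flags build.log   (hypothetical helper)
count_werror_flags() {
  grep -oE '\[-Werror,-W[^],]+\]' "$1" |    # keep only the [-Werror,-Wfoo] suffixes
    sed -E 's/^\[-Werror,(.*)\]$/\1/' |     # reduce each one to -Wfoo
    sort | uniq -c | sort -rn |             # count occurrences, largest first
    awk '{ print $2, $1 }'
}
```

A flag with a handful of hits is usually a candidate for fixing Chromium
directly, while one with very many hits is more likely to need disabling first.
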
## Compiler error

This rarely happens, but sometimes clang becomes more strict and no longer
accepts code that it previously did. The standard procedure for a new warning
may apply, but it's more likely that the upstream Clang change should be
reverted, if the C++ code in question in Chromium looks valid.

## Miscompile

Miscompiles tend to result in crashes, so if you see a test with the CRASHED
status, this is probably what you want to do.

1. Bisect object files to find the object with the code that changed. LLVM
   contains `llvm/utils/rsp_bisect.py` which may be useful for bisecting object
   files using an rsp file.
1. Debug it with a traditional debugger.

## Linker error

`ld.lld`'s `--reproduce` flag makes LLD write a tar archive of all its inputs
and a file `response.txt` that contains the link command. This allows people to
work on linker bugs without having to have a Chromium build environment.

To use `ld.lld`'s `--reproduce` flag, follow these steps:

1. Locally, [build Chromium with a locally-built
   clang](https://chromium.googlesource.com/chromium/src.git/+/main/docs/clang.md#Using-a-custom-clang-binary).

1. After reproducing the link error, build just the failing target with
   ninja's `-v -d keeprsp` flags added:
   `ninja -C out/gn base_unittests -v -d keeprsp`.

1. Copy the link command that ninja prints, `cd out/gn`, paste it, and manually
   append `-Wl,--reproduce,repro.tar`. With `lld-link`, instead append
   `/reproduce:repro.tar`. (`ld.lld` is invoked through the `clang` driver, so
   it needs `-Wl` to pass the flag through to the linker. `lld-link` is called
   directly, so the flag needs no prefix.)

1. Zip up the tar file: `gzip repro.tar`. This will take a few minutes and
   produce a .tar.gz file that's 0.5-1 GB.

1. Upload the .tar.gz to Google Drive. If you're signed in with your @google
   address, you won't be able to make a world-shareable link to it, so upload
   it in a window where you're signed in with your @chromium account.

1. File an LLVM bug linking to the file. Example: http://llvm.org/PR43241

TODO: Describe object file bisection, identify obj with symbol that no longer
has the section.

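The "append the reproduce flag" step in the list above can be sketched as a
small helper. This is only an illustration: the function name
`add_reproduce_flag` is hypothetical, and detecting `lld-link` by looking for
that substring in the command is a heuristic assumption.

```shell
# Append the right spelling of the reproduce flag to a saved link command.
# lld-link is called directly, so it takes /reproduce:FILE; ld.lld is reached
# through the clang driver, so the flag has to be wrapped in -Wl,.
add_reproduce_flag() {
  local cmd="$1" tarball="${2:-repro.tar}"
  case "$cmd" in
    *lld-link*) printf '%s /reproduce:%s\n' "$cmd" "$tarball" ;;
    *)          printf '%s -Wl,--reproduce,%s\n' "$cmd" "$tarball" ;;
  esac
}
```

The function only prints the amended command, so it can be inspected before
being run from the output directory.
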
## ThinLTO Trouble

Sometimes, problems occur in ThinLTO builds that do not occur in non-LTO builds.
These steps can be used to debug such problems.

Notes:

- All steps assume they are run from the output directory (the same directory args.gn is in).

- Commands have been shortened for clarity. In particular, Chromium build commands are
  generally long, with many parts that you just copy-paste when debugging. These have
  largely been omitted.

- The commands below use "clang++", where in practice there would be some path prefix
  in front of this. Make sure you are invoking the right clang++. In particular, there
  may be one in the PATH which behaves very differently.

### Get the full command that is used for linking

To get the command that is used to link base_unittests:

```sh
$ rm base_unittests
$ ninja -n -d keeprsp -v base_unittests
```

This will print a command line. It will also write a file called `base_unittests.rsp`, which
contains additional parameters to be passed.

### Remove ThinLTO cache flags

ThinLTO uses a cache to avoid compilation in some cases. This can be confusing
when debugging, so make sure to remove the various cache flags like
`-Wl,--thinlto-cache-dir`.

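One way to drop the cache flags from a saved link command is sketched below.
It assumes that arguments are separated by single spaces with no quoted
whitespace, and that every cache-related flag starts with
`-Wl,--thinlto-cache-`. The function name `strip_thinlto_cache_flags` is
hypothetical.

```shell
# Read a link command on stdin and print it without -Wl,--thinlto-cache-* flags.
# Assumes space-separated arguments with no embedded or quoted whitespace.
strip_thinlto_cache_flags() {
  tr ' ' '\n' |                          # one argument per line
    grep -v -- '^-Wl,--thinlto-cache-' | # drop the cache flags
    paste -sd ' ' -                      # join the rest back into one line
}
```

For example, piping the command printed by ninja through this function yields
a command that is otherwise unchanged.
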
### Expand Thin Archives on Command Line

Expand thin archives mentioned in the command line to their individual object files.
The script `tools/clang/scripts/expand_thin_archives.py` can be used for this purpose.
For example:

```sh
$ ../../tools/clang/scripts/expand_thin_archives.py -p=-Wl, -- @base_unittests.rsp > base_unittests.expanded.rsp
```

The `-p` parameter here specifies the prefix for parameters to be passed to the linker.
If you are invoking the linker directly (as opposed to through clang++), the prefix should
be empty.

```sh
$ ../../tools/clang/scripts/expand_thin_archives.py -p='' -- @base_unittests.rsp > base_unittests.expanded.rsp
```

### Remove -Wl,--start-group and -Wl,--end-group

Edit the link command to use the expanded command line, and remove any mention of `-Wl,--start-group`
and `-Wl,--end-group` that surround the expanded command line. For example, if the original command was:

    clang++ -fuse-ld=lld -o ./base_unittests -Wl,--start-group @base_unittests.rsp -Wl,--end-group

the new command should be:

    clang++ -fuse-ld=lld -o ./base_unittests @base_unittests.expanded.rsp

The reason for this is that the `-start-lib` and `-end-lib` flags that expanding the command
line produces cannot be nested inside `--start-group` and `--end-group`.

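The edit described above can be sketched with `sed`. This assumes the group
flags directly surround a single `@NAME.rsp` argument, as in the example
command, and that the expanded file is named `NAME.expanded.rsp`. The function
name `use_expanded_rsp` is hypothetical.

```shell
# Read a link command on stdin and replace
#   -Wl,--start-group @NAME.rsp -Wl,--end-group
# with
#   @NAME.expanded.rsp
use_expanded_rsp() {
  sed -E 's/-Wl,--start-group +@([^ ]+)\.rsp +-Wl,--end-group/@\1.expanded.rsp/'
}
```

Commands that lay out the group differently are left untouched by this
pattern and need to be edited by hand.
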
### Producing ThinLTO Bitcode Files

In a ThinLTO build, what is normally the compile step that produces native object files
instead produces LLVM bitcode files. A simple example would be:

```sh
$ clang++ -c -flto=thin foo.cpp -o foo.o
```

In a Chromium build, these files reside under `obj/`, and you can generate them using ninja.
For example:

```sh
$ ninja obj/base/base/lock.o
```

These can be fed to `llvm-dis` to produce textual LLVM IR:

```sh
$ llvm-dis -o - obj/base/base/lock.o | less
```

When using split LTO unit (`-fsplit-lto-unit`, which is required for
some features, CFI among them), this may produce a message like:

    llvm-dis: error: Expected a single module

In that case, you can use `llvm-modextract`:

```sh
$ llvm-modextract -n 0 -o - obj/base/base/lock.o | llvm-dis -o - | less
```

### Saving Intermediate Bitcode

The ThinLTO linking process proceeds in a number of stages. The bitcode that is
generated during these stages can be saved by passing `-save-temps` to the linker:

```sh
$ clang++ -fuse-ld=lld -Wl,-save-temps -o ./base_unittests @base_unittests.expanded.rsp
```

This generates files such as:
- lock.o.0.preopt.bc
- lock.o.3.import.bc
- lock.o.5.precodegen.bc

in the directory where lock.o is (obj/base/base).

These can be fed to `llvm-dis` to produce textual LLVM IR. They show
how the code is transformed as it progresses through ThinLTO stages.
Of particular interest are:
- .3.import.bc, which shows the IR after definitions have been imported from
  other modules, but before optimizations. Running this through LLVM's `opt`
  tool with the right optimization level can often reproduce issues.
- .5.precodegen.bc, which shows the IR just before it is transformed to native
  code. Running this through LLVM's `llc` tool with the right optimization level
  can often reproduce issues.

The same `-save-temps` command also produces `base_unittests.resolution.txt`, which
shows symbol resolutions. These look like:

    -r=obj/base/test/run_all_base_unittests/run_all_base_unittests.o,main,plx

In this example, run_all_base_unittests.o contains a symbol named
main, with flags plx.

The possible flags are:
- p: prevailing: of symbols with this name, this one has been chosen.
- l: final definition in this linkage unit.
- r: redefined by the linker.
- x: visible to regular (that is, non-LTO) objects.

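To make resolution lines easier to scan, the flag letters listed above can be
spelled out. The following is a minimal sketch; the function name
`decode_resolution` is hypothetical, and it assumes the symbol name and flags
contain no commas, so the line is split from the right.

```shell
# Decode one "-r=FILE,SYMBOL,FLAGS" line from a resolution file.
decode_resolution() {
  local entry="${1#-r=}"          # drop the leading "-r="
  local flags="${entry##*,}"      # text after the last comma
  local rest="${entry%,*}"        # FILE,SYMBOL
  local symbol="${rest##*,}"
  local file="${rest%,*}"
  local out="" i flag desc
  for ((i = 0; i < ${#flags}; i++)); do
    flag="${flags:i:1}"
    case "$flag" in
      p) desc="prevailing" ;;
      l) desc="final definition in this linkage unit" ;;
      r) desc="redefined by the linker" ;;
      x) desc="visible to regular (non-LTO) objects" ;;
      *) desc="unknown flag '$flag'" ;;
    esac
    out="${out:+$out; }$desc"     # join descriptions with "; "
  done
  printf '%s %s: %s\n' "$file" "$symbol" "${out:-no flags}"
}
```

It can be applied to a whole file with a loop such as
`while read -r line; do decode_resolution "$line"; done < base_unittests.resolution.txt`.
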
### Code Generation for a Single Module

To speed up debugging, it may be helpful to limit code generation to a single
module if you know the name of the module (e.g. the module name is in a crash
dump).

`-Wl,--thinlto-single-module=foo` tells ThinLTO to only run
optimizations/codegen on files matching the pattern and skip linking. This is
helpful especially in combination with `-Wl,-save-temps`.

```sh
$ clang++ -fuse-ld=lld -Wl,--thinlto-single-module=obj/base/base/lock.o -o ./base_unittests @base_unittests.expanded.rsp
```

You should see

```sh
[ThinLTO] Selecting obj/base/base/lock.o to compile
```

being printed.

## Tips and tricks

Finding what object files differ between two directories:

```sh
$ diff -u <(cd out.good && find . -name "*.o" -exec sha1sum {} \; | sort -k2) \
          <(cd out.bad  && find . -name "*.o" -exec sha1sum {} \; | sort -k2)
```

Or with cmp:

```sh
$ find good -name "*.o" -exec bash -c 'cmp -s "$0" "${0/good/bad}" || echo "$0"' {} \;
```
				452	```