# Why sanitize when you can validate?

## Background

A sanitizer takes in a string in a language and puts out a *safe* version. Occasionally people ask for a function that, instead of returning a safe version of the input, just labels the input as *safe* or *unsafe*.

Herein I address why I think the latter is a bad idea for HTML specifically. Hopefully this will prompt a discussion, and I'm interested in why people want validators. Please let me know your thoughts on use cases and how they relate to the definition of *valid*.

## Defining "Valid"

The sanitizer promises that its output can be safely embedded in a larger document. It seems to me that any *valid* input should also have this property.

#### Valid means idempotent

One naïve way to define *valid* is thus:

> A valid input is any input such that `input.equals(sanitized(input))`.

This is sound, but not very useful. Intuitively, there must be many inputs that lack this property but are not unsafe. For example, maybe the sanitizer takes as an input

```html
For example
```

and returns

```html
For example
```

This difference seems unimportant.

#### Valid according to policy

Instead, we could try to define *valid* thus:

> An input is valid when the policy rejects no part of it.

This misses part of the picture. A string is safe because of the way browsers parse it, **not** the way the sanitizer parses it.

```html
```

contains a script tag when served to Internet Explorer, but contains no tags at all when served to other browsers. If the sanitizer interprets all comments as ignorable content, then the policy never sees the `