package HTML::StripScripts;
use strict;
use warnings FATAL => 'all';

use vars qw($VERSION);
$VERSION = '1.06';

=head1 NAME

HTML::StripScripts - Strip scripting constructs out of HTML

=head1 SYNOPSIS

  use HTML::StripScripts;

  my $hss = HTML::StripScripts->new({ Context => 'Inline' });

  $hss->input_start_document;

  $hss->input_start('<i>');
  $hss->input_text('hello, world!');
  $hss->input_end('</i>');

  $hss->input_end_document;

  print $hss->filtered_document;

=head1 DESCRIPTION

This module strips scripting constructs out of HTML, leaving as
much non-scripting markup in place as possible.  This allows web
applications to display HTML originating from an untrusted source
without introducing XSS (cross site scripting) vulnerabilities.

You will probably use L<HTML::StripScripts::Parser> rather than using
this module directly.

The process is based on whitelists of tags, attributes and attribute
values.  This approach is the most secure against disguised scripting
constructs hidden in malicious HTML documents.

As well as removing scripting constructs, this module ensures that
there is a matching end for each start tag, and that the tags are
properly nested.

Previously, in order to customise the output, you needed to subclass
C<HTML::StripScripts> and override methods.  Now, most customisation
can be done through the C<Rules> option provided to C<new()>. (See
examples/declaration/ and examples/tags/ for cases where subclassing is
necessary.)

The HTML document must be parsed into start tags, end tags and
text before it can be filtered by this module.  Use either
L<HTML::StripScripts::Parser> or L<HTML::StripScripts::Regex> instead
if you want to input an unparsed HTML document.

See examples/direct/ for an example of how to feed tokens directly to
HTML::StripScripts.

=head1 CONSTRUCTORS

=over

=item new ( CONFIG )

Creates a new C<HTML::StripScripts> filter object, bound to a
particular filtering policy.  If present, the CONFIG parameter
must be a hashref.  The following keys are recognized (unrecognized
keys will be silently ignored).

  $s = HTML::StripScripts->new({
      Context         => 'Document|Flow|Inline|NoTags',
      BanList         => [qw( br img )] | {br => '1', img => '1'},
      BanAllBut       => [qw(p div span)],
      AllowSrc        => 0|1,
      AllowHref       => 0|1,
      AllowRelURL     => 0|1,
      AllowMailto     => 0|1,
      EscapeFiltered  => 0|1,
      Rules           => { See below for details },
  });

=over

=item C<Context>

A string specifying the context in which the filtered document
will be used.  This influences the set of tags that will be
allowed.

If present, the C<Context> value must be one of:

=over

=item C<Document>

If C<Context> is C<Document> then the filter will allow a full
HTML document, including the C<html> tag and C<head> and C<body>
sections.

=item C<Flow>

If C<Context> is C<Flow> then most of the cosmetic tags that one
would expect to find in a document body are allowed, including
lists and tables but not including forms.

=item C<Inline>

If C<Context> is C<Inline> then only inline tags such as C<b>
and C<font> are allowed.

=item C<NoTags>

If C<Context> is C<NoTags> then no tags are allowed.

=back

The default C<Context> value is C<Flow>.

=item C<BanList>

If present, this option must be an arrayref or a hashref.  Any tag that
would normally be allowed (because it presents no XSS hazard) will be
blocked if the lowercase name of the tag is in this list.

For example, in a guestbook application where C<< <hr> >> tags are used to
separate posts, you may wish to prevent posts from including C<< <hr> >>
tags, even though C<< <hr> >> is not an XSS risk.

=item C<BanAllBut>

If present, this option must be a reference to an array holding a list of
lowercase tag names.  This has the effect of adding all but the listed
tags to the ban list, so that only those tags listed will be allowed.

=item C<AllowSrc>

By default, the filter won't allow constructs that cause the browser to
fetch things automatically, such as C<src> attributes in C<img> tags.
If this option is present and true then those constructs will be
allowed.

=item C<AllowHref>

By default, the filter won't allow constructs that cause the browser to
fetch things if the user clicks on something, such as the C<href>
attribute in C<a> tags.  Set this option to a true value to allow this
type of construct.

=item C<AllowRelURL>

By default, the filter won't allow relative URLs such as C<../foo.html>
in C<src> and C<href> attribute values.  Set this option to a true value
to allow them. C<AllowHref> and / or C<AllowSrc> also need to be set to
true for this to have any effect.

=item C<AllowMailto>

By default, C<mailto:> links are not allowed. If C<AllowMailto> is set to
a true value, then this construct will be allowed. This can be enabled
separately from C<AllowHref>.

=item C<EscapeFiltered>

By default, any filtered tags are outputted as C<< <!--filtered--> >>. If
C<EscapeFiltered> is set to a true value, then the filtered tags are
converted to HTML entities.

For instance:

  <br>  -->  &lt;br&gt;

=item C<Rules>

The C<Rules> option provides a very flexible way of customising the filter.

The focus is safety-first, so it is applied after all of the previous
validation.  This means that all malicious data should already have been
cleared before your rules are consulted.

Rules can be specified for tags and for attributes.  Any tag or attribute
not explicitly listed will be handled by the default C<*> rules.

The following is a synopsis of all of the options that you can use to
configure rules.  Below, an example is broken into sections and explained.

 Rules => {

    tag => 0 | 1 | sub { tag_callback }
           | {
                attr      => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
                '*'       => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
                required  => [qw(attrname attrname)],
                tag       => sub { tag_callback }
             },

    '*' => 0 | 1 | sub { tag_callback }
           | {
                attr      => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
                '*'       => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
                tag       => sub { tag_callback }
             }

 }

EXAMPLE:

    Rules => {

        ##########################
        ##### EXPLICIT RULES #####
        ##########################

        ## Allow <br> tags, reject <img> tags
        br          => 1,
        img         => 0,

        ## Send all <div> tags to a sub
        div         => sub { tag_callback },

        ## Allow <blockquote> tags, and allow the 'cite' attribute
        ## All other attributes are handled by the default C<*>
        blockquote  => {
            cite    => 1,
        },

        ## Allow <a> tags, and
        a           => {

            ## Allow the 'title' attribute
            title     => 1,

            ## Allow the 'href' attribute if it matches the regex
            href      =>   '^http://yourdomain.com'
       OR   href      => qr{^http://yourdomain.com},

            ## 'style' attributes are handled by a sub
            style     => sub { attr_callback },

            ## All other attributes are rejected
            '*'       => 0,

            ## Additionally, the <a> tag should be handled by this sub
            tag       => sub { tag_callback},

            ## If the <a> tag doesn't have these attributes, filter the tag
            required  => [qw(href title)],

        },

        ##########################
        ##### DEFAULT RULES #####
        ##########################

        ## The default '*' rule - accepts all the same options as above.
        ## If a tag or attribute is not mentioned above, then the default
        ## rule is applied:

        ## Reject all tags
        '*'         => 0,

        ## Allow all tags and all attributes
        '*'         => 1,

        ## Send all tags to the sub
        '*'         => sub { tag_callback },

        ## Allow all tags, reject all attributes
        '*'         => { '*'  => 0 },

        ## Allow all tags, and
        '*' => {

            ## Allow the 'title' attribute
            title   => 1,

            ## Allow the 'href' attribute if it matches the regex
            href    =>   '^http://yourdomain.com'
       OR   href    => qr{^http://yourdomain.com},

            ## 'style' attributes are handled by a sub
            style   => sub { attr_callback },

            ## All other attributes are rejected
            '*'     => 0,

            ## Additionally, all tags should be handled by this sub
            tag     => sub { tag_callback},

        },

=over

=item Tag Callbacks

    sub tag_callback {
        my ($filter,$element) = (@_);

        $element = {
            tag      => 'tag',
            content  => 'inner_html',
            attr     => {
                attr_name => 'attr_value',
            }
        };
        return 0 | 1;
    }

A tag callback accepts two parameters, the C<$filter> object and the
C<$element>.  It should return C<0> to completely ignore the tag and its
content (which includes any nested HTML tags), or C<1> to accept and
output the tag.

The C<$element> is a hash ref containing the keys:

=item C<tag>

This is the tagname in lowercase, eg C<a>, C<br>, C<img>.  If you set
the tag value to an empty string, then the tag will not be outputted, but
the tag contents will.

=item C<content>

This is the equivalent of DOM's innerHTML.  It contains the text content
and any HTML tags contained within this element.  You can change the
content or set it to an empty string so that it is not outputted.

=item C<attr>

C<attr> contains a hashref containing the attribute names and values.

=back

If, for instance, you wanted to replace C<< <b> >> tags with
C<< <span> >> tags, you could do this:

    sub b_callback {
        my ($filter,$element)   = @_;
        $element->{tag}         = 'span';
        $element->{attr}{style} = 'font-weight:bold';
        return 1;
    }

=item Attribute Callbacks

    sub attr_callback {
        my ( $filter, $tag, $attr_name, $attr_val ) = @_;
        return undef | '' | 'value';
    }

Attribute callbacks accept four parameters, the C<$filter> object, the
C<$tag> name, the C<$attr_name> and the C<$attr_val>.

An attribute callback should return either C<undef> to reject the
attribute, or the value to be used.  An empty string keeps the attribute,
but without a value.

=item C<BanList> vs C<BanAllBut> vs C<Rules>

It is not necessary to use C<BanList> or C<BanAllBut> - everything can be
done via C<Rules>, however it may be simpler to write:

    BanAllBut => [qw(p div span)]

The logic works as follows:

   * If BanAllBut exists, then ban everything but the tags in the list
   * Add to the ban list any elements in BanList
   * Any tags mentioned explicitly in Rules (eg a => 0, br => 1)
     are added or removed from the BanList
   * A default rule of { '*' => 0 } would ban all tags except
     those mentioned in Rules
   * A default rule of { '*' => 1 } would allow all tags except
     those disallowed in the ban list, or by explicit rules

=back

=cut

sub new {
    my ( $pkg, $cfg ) = @_;

    my $self = bless {}, ref $pkg || $pkg;
    $self->hss_init($cfg);
    return $self;
}

=back

=head1 METHODS

This class provides the following methods:

=over

=item hss_init ()

This method is called by new() and does the actual initialisation work
for the new HTML::StripScripts object.

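Because new() only blesses an empty hash and then delegates to hss_init(), a subclass that needs its own constructor can bless its own hash and call hss_init() itself. The following is a minimal sketch under that assumption; the package name C<My::InlineFilter> and the C<rejected_count> key are hypothetical and are not part of this distribution.

```perl
package My::InlineFilter;    # hypothetical subclass name
use strict;
use warnings;
use base 'HTML::StripScripts';

sub new {
    my ( $class, %extra ) = @_;

    # Subclass state lives under keys that do not start with "_hss",
    # since those are the only keys the base class reserves.
    my $self = bless { rejected_count => 0 }, $class;

    # hss_init() takes the same CONFIG hashref as new().
    $self->hss_init( { Context => 'Inline', %extra } );

    return $self;
}

package main;

# Inline context by default, with links additionally allowed.
my $filter = My::InlineFilter->new( AllowHref => 1 );
```

Keeping the default C<Context> first in the hashref lets a caller override it through the extra arguments, which may or may not be what a given application wants.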
=cut sub hss_init { my ( $self, $cfg ) = @_; $cfg ||= {}; $self->{_hssCfg} = $cfg; $self->{_hssContext} = $self->init_context_whitelist; $self->{_hssAttrib} = $self->init_attrib_whitelist; $self->{_hssAttVal} = $self->init_attval_whitelist; $self->{_hssStyle} = $self->init_style_whitelist; $self->{_hssDeInter} = $self->init_deinter_whitelist; $self->{_hssBanList} = $self->_hss_prepare_ban_list($cfg); $self->{_hssRules} = $self->_hss_prepare_rules($cfg); } =item input_start_document () This method initializes the filter, and must be called once before starting on each HTML document to be filtered. =cut sub input_start_document { my ( $self, $context ) = @_; $self->{_hssStack} = [ { NAME => '', CTX => $self->{_hssCfg}{Context} || 'Flow', CONTENT => '', } ]; $self->{_hssOutput} = ''; $self->output_start_document; } =item input_start ( TEXT ) Handles a start tag from the input document. TEXT must be the full text of the tag, including angle-brackets. =cut sub input_start { my ( $self, $text ) = @_; $self->_hss_accept_input_start($text) or $self->reject_start($text); } sub _hss_accept_input_start { my ( $self, $text ) = @_; return 0 unless $text =~ m|^<([a-zA-Z0-9]+)\b(.*)>$|m; my ( $tag, $attr ) = ( lc $1, $self->strip_nonprintable($2) ); return 0 if $self->{_hssSkipToEnd}; if ( $tag eq 'script' or $tag eq 'style' ) { $self->{_hssSkipToEnd} = $tag; return 0; } return 0 if $self->_hss_tag_is_banned($tag); my $allowed_attr = $self->{_hssAttrib}{$tag}; return 0 unless defined $allowed_attr; return 0 unless $self->_hss_get_to_valid_context($tag); my $default_filters = $self->{_hssRules}{'*'}; my $tag_filters = $self->{_hssRules}{$tag} || $default_filters; my %filtered_attr; while ( $attr =~ s#^\s*([\w\-]+)(?:\s*=\s*(?:([^"'>\s]+)|"([^"]*)"|'([^']*)'))?## ) { my $key = lc $1; my $val = ( defined $2 ? $self->unquoted_to_canonical_form($2) : defined $3 ? $self->quoted_to_canonical_form($3) : defined $4 ? 
$self->quoted_to_canonical_form($4) : '' ); my $value_class = $allowed_attr->{$key}; next unless defined $value_class; my $attval_handler = $self->{_hssAttVal}{$value_class}; next unless defined $attval_handler; my $attr_filter; if ($tag_filters) { $attr_filter = $self->_hss_get_attr_filter( $default_filters, $tag_filters, $key ); # filter == 0 next unless $attr_filter; } my $filtered_value = &{$attval_handler}( $self, $tag, $key, $val ); next unless defined $filtered_value; # send value to filter if sub if ( $tag_filters && ref $attr_filter ) { $filtered_value = $attr_filter->( $self, $tag, $key, $filtered_value ); next unless defined $filtered_value; } $filtered_attr{$key} = $filtered_value; } # Check required attributes if ( my $required = $tag_filters->{required} ) { foreach my $key (@$required) { return 0 unless defined $filtered_attr{$key} && length($filtered_attr{$key}); } } # Check for callback my $tag_callback = $tag_filters && $tag_filters->{tag} || $default_filters->{tag}; my $new_context = $self->{_hssContext}{ $self->{_hssStack}[0]{CTX} }{$tag}; my %stack_entry = ( NAME => $tag, ATTR => \%filtered_attr, CTX => $new_context, CALLBACK => $tag_callback, CONTENT => '', ); if ( $new_context eq 'EMPTY' ) { $self->output_stack_entry( \%stack_entry ); } else { unshift @{ $self->{_hssStack} }, \%stack_entry; } return 1; } =item input_end ( TEXT ) Handles an end tag from the input document. TEXT must be the full text of the end tag, including angle-brackets. 
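As a rough illustration of feeding tokens by hand, the sketch below passes a balanced pair of tags and then a stray end tag. It assumes the module is installed. Each TEXT argument carries its own angle brackets, and the unmatched end tag should be handed to reject_end() rather than emitted.

```perl
use strict;
use warnings;
use HTML::StripScripts;

my $hss = HTML::StripScripts->new( { Context => 'Inline' } );

$hss->input_start_document;

$hss->input_start('<b>');      # full start tag, brackets included
$hss->input_text('bold');
$hss->input_end('</b>');       # full end tag, brackets included

$hss->input_end('</i>');       # no open <i> on the stack: rejected

$hss->input_end_document;

print $hss->filtered_document, "\n";
```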
=cut

sub input_end {
    my ( $self, $text ) = @_;

    $self->_hss_accept_input_end($text) or $self->reject_end($text);
}

sub _hss_accept_input_end {
    my ( $self, $text ) = @_;

    # An end tag is '</', a tag name and '>'; $1 captures the tag name.
    return 0 unless $text =~ m#^</(\w+)>$#;
    my $tag = lc $1;

    if ( $self->{_hssSkipToEnd} ) {
        if ( $self->{_hssSkipToEnd} eq $tag ) {
            delete $self->{_hssSkipToEnd};
        }
        return 0;
    }

    # Ignore a close without an open
    return 0 unless grep { $_->{NAME} eq $tag } @{ $self->{_hssStack} };

    # Close open tags up to the matching open
    my @close = ();

    while ( scalar @{ $self->{_hssStack} } ) {
        my $entry = shift @{ $self->{_hssStack} };
        push @close, $entry;
        $self->output_stack_entry($entry);
        $entry->{CONTENT} = '';
        last if $entry->{NAME} eq $tag;
    }

    # Reopen any we closed early if all that were closed are
    # configured to be auto de-interleaved.
    unless ( grep { !$self->{_hssDeInter}{ $_->{NAME} } } @close ) {
        pop @close;
        unshift @{ $self->{_hssStack} }, @close;
    }

    return 1;
}

=item input_text ( TEXT )

Handles some non-tag text from the input document.

=cut

sub input_text {
    my ( $self, $text ) = @_;

    return if $self->{_hssSkipToEnd};

    $text = $self->strip_nonprintable($text);

    if ( $text =~ /^(\s*)$/ ) {
        $self->output_text($1);
        return;
    }

    unless ( $self->_hss_get_to_valid_context('CDATA') ) {
        $self->reject_text($text);
        return;
    }

    my $filtered = $self->filter_text( $self->text_to_canonical_form($text) );
    $self->output_text( $self->canonical_form_to_text($filtered) );
}

=item input_process ( TEXT )

Handles a processing instruction from the input document.

=cut

sub input_process {
    my ( $self, $text ) = @_;

    $self->reject_process($text);
}

=item input_comment ( TEXT )

Handles an HTML comment from the input document.

=cut

sub input_comment {
    my ( $self, $text ) = @_;

    $self->reject_comment($text);
}

=item input_declaration ( TEXT )

Handles a declaration from the input document.

=cut

sub input_declaration {
    my ( $self, $text ) = @_;

    $self->reject_declaration($text);
}

=item input_end_document ()

Call this method to signal the end of the input document.

=cut

sub input_end_document {
    my ($self) = @_;

    delete $self->{_hssSkipToEnd};

    while ( @{ $self->{_hssStack} } > 1 ) {
        $self->output_stack_entry( shift @{ $self->{_hssStack} } );
    }

    $self->output_end_document;

    my $last_entry = shift @{ $self->{_hssStack} };
    $self->{_hssOutput} = $last_entry->{CONTENT};
    delete $self->{_hssStack};
}

=item filtered_document ()

Returns the filtered document as a string.

=cut

sub filtered_document {
    my ($self) = @_;

    $self->{_hssOutput};
}

=back

=cut

=head1 SUBCLASSING

The only reason for subclassing this module now is to add to the
list of accepted tags, attributes and styles (See
L</WHITELIST INITIALIZATION METHODS>).  Everything else can be
achieved with L</Rules>.

The C<HTML::StripScripts> class is subclassable.  Filter objects are
plain hashes and C<HTML::StripScripts> reserves only hash keys that start
with C<_hss>.  The filter configuration can be set up by invoking the
hss_init() method, which takes the same arguments as new().

=head1 OUTPUT METHODS

The filter outputs a stream of start tags, end tags, text, comments,
declarations and processing instructions, via the following C<output_*>
methods.  Subclasses may override these to intercept the filter output.

The default implementations of the C<output_*> methods pass the
text on to the output() method.  The default implementation of the
output() method appends the text to a string, which can be fetched with
the filtered_document() method once processing is complete.

If the output() method or the individual C<output_*> methods are
overridden in a subclass, then filtered_document() will not work in
that subclass.

=over

=item output_start_document ()

This method gets called once at the start of each HTML document passed
through the filter.  The default implementation does nothing.

=cut

sub output_start_document { }

=item output_end_document ()

This method gets called once at the end of each HTML document passed
through the filter.  The default implementation does nothing.

=cut

*output_end_document = \&output_start_document;

=item output_start ( TEXT )

This method is used to output a filtered start tag.

=cut

sub output_start { $_[0]->output( $_[1] ) }

=item output_end ( TEXT )

This method is used to output a filtered end tag.

=cut

*output_end = \&output_start;

=item output_text ( TEXT )

This method is used to output some filtered non-tag text.

=cut

*output_text = \&output_start;

=item output_declaration ( TEXT )

This method is used to output a filtered declaration.

=cut

*output_declaration = \&output_start;

=item output_comment ( TEXT )

This method is used to output a filtered HTML comment.

=cut

*output_comment = \&output_start;

=item output_process ( TEXT )

This method is used to output a filtered processing instruction.

=cut

*output_process = \&output_start;

=item output ( TEXT )

This method is invoked by all of the default C<output_*> methods.  The
default implementation appends the text to the content of the innermost
open tag, which ultimately becomes the string that the
filtered_document() method will return.

=cut

sub output { $_[0]->{_hssStack}[0]{CONTENT} .= $_[1]; }

=item output_stack_entry ( ENTRY )

This method is invoked when a tag plus all text and nested HTML content
within the tag has been processed.  ENTRY is the stack entry (a hashref)
for that tag.  The method adds the tag plus its content to the content
for its parent tag.

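The way output() and output_stack_entry() cooperate can be pictured with a small standalone model. This is a simplified sketch and not the module's own code: the helper names C<open_tag>, C<add_text> and C<close_tag> are hypothetical, and attributes, callbacks and empty tags are left out. Text is appended to the entry at the front of the stack, and closing a tag folds that entry into its parent.

```perl
use strict;
use warnings;

# The root entry collects the finished document.
my @stack = ( { NAME => '', CONTENT => '' } );

# A start tag pushes a fresh entry onto the front of the stack.
sub open_tag { unshift @stack, { NAME => $_[0], CONTENT => '' } }

# Text always goes to the innermost open tag.
sub add_text { $stack[0]{CONTENT} .= $_[0] }

# An end tag removes the innermost entry and appends the tag,
# with its accumulated content, to the new front of the stack.
sub close_tag {
    my $entry = shift @stack;
    $stack[0]{CONTENT}
        .= "<$entry->{NAME}>$entry->{CONTENT}</$entry->{NAME}>";
}

open_tag('p');
add_text('a ');
open_tag('b');
add_text('c');
close_tag();    # folds <b> into <p>
close_tag();    # folds <p> into the root entry

print $stack[0]{CONTENT}, "\n";    # prints <p>a <b>c</b></p>
```

Buffering content per tag is what allows a tag callback to inspect or replace the whole inner content before anything reaches the parent.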
=cut

sub output_stack_entry {
    my ( $self, $tag ) = @_;

    my %entry;
    @entry{qw(tag attr content)} = @{$tag}{qw(NAME ATTR CONTENT)};

    if ( my $tag_callback = $tag->{CALLBACK} ) {
        $tag_callback->( $self, \%entry )
            or return;
    }

    my $tagname        = $entry{tag};
    my $filtered_attrs = $self->_hss_join_attribs( $entry{attr} );

    if ( $tag->{CTX} eq 'EMPTY' ) {
        $self->output_start("<$tagname$filtered_attrs />")
            if $entry{tag};
        return;
    }

    if ($tagname) {
        $self->output_start("<$tagname$filtered_attrs>");
    }

    if ( defined $entry{content} ) {
        $self->{_hssStack}[0]{CONTENT} .= $entry{content};
    }

    if ($tagname) {
        $self->output_end("</$tagname>");
    }
}

=back

=head1 REJECT METHODS

When the filter encounters something in the input document which it
cannot transform into an acceptable construct, it invokes one of
the following C<reject_*> methods to put something in the output
document to take the place of the unacceptable construct.

The TEXT parameter is the full text of the unacceptable construct.

The default implementations of these methods output an HTML comment
containing the text C<filtered>.  If L</EscapeFiltered> is set to true,
then the rejected text is HTML escaped instead.

Subclasses may override these methods, but should exercise caution.
The TEXT parameter is unfiltered input and may contain malicious
constructs.

=over

=item reject_start ( TEXT )

=item reject_end ( TEXT )

=item reject_text ( TEXT )

=item reject_declaration ( TEXT )

=item reject_comment ( TEXT )

=item reject_process ( TEXT )

=back

=cut

sub reject_start {
    $_[0]->{_hssCfg}{EscapeFiltered}
        ? $_[0]->output_text( $_[0]->escape_html_metachars( $_[1] ) )
        : $_[0]->output_comment('<!--filtered-->');
}
*reject_end         = \&reject_start;
*reject_text        = \&reject_start;
*reject_declaration = \&reject_start;
*reject_comment     = \&reject_start;
*reject_process     = \&reject_start;

=head1 WHITELIST INITIALIZATION METHODS

The filter refers to various whitelists to determine which constructs
are acceptable.  To modify these whitelists, subclasses can override
the following methods.

Each method is called once at object initialization time, and must return a reference to a nested data structure. These references are installed into the object, and used whenever the filter needs to refer to a whitelist. The default implementations of these methods can be invoked as class methods. See examples/tags/ and examples/declaration/ for examples of how to override these methods. =over =item init_context_whitelist () Returns a reference to the C whitelist, which determines which tags may appear at each point in the document, and which other tags may be nested within them. It is a hash, and the keys are context names, such as C and C. The values in the hash are hashrefs. The keys in these subhashes are lowercase tag names, and the values are context names, specifying the context that the tag provides to any other tags nested within it. The special context C as a value in a subhash indicates that nothing can be nested within that tag. =cut use vars qw(%_Context); BEGIN { my %pre_content = ( 'br' => 'EMPTY', 'span' => 'Inline', 'tt' => 'Inline', 'i' => 'Inline', 'b' => 'Inline', 'u' => 'Inline', 's' => 'Inline', 'strike' => 'Inline', 'em' => 'Inline', 'strong' => 'Inline', 'dfn' => 'Inline', 'code' => 'Inline', 'q' => 'Inline', 'samp' => 'Inline', 'kbd' => 'Inline', 'var' => 'Inline', 'cite' => 'Inline', 'abbr' => 'Inline', 'acronym' => 'Inline', 'ins' => 'Inline', 'del' => 'Inline', 'a' => 'Inline', 'CDATA' => 'CDATA', ); my %inline = ( %pre_content, 'img' => 'EMPTY', 'big' => 'Inline', 'small' => 'Inline', 'sub' => 'Inline', 'sup' => 'Inline', 'font' => 'Inline', 'nobr' => 'Inline', ); my %flow = ( %inline, 'ins' => 'Flow', 'del' => 'Flow', 'div' => 'Flow', 'p' => 'Inline', 'h1' => 'Inline', 'h2' => 'Inline', 'h3' => 'Inline', 'h4' => 'Inline', 'h5' => 'Inline', 'h6' => 'Inline', 'ul' => 'list', 'ol' => 'list', 'menu' => 'list', 'dir' => 'list', 'dl' => 'dt_dd', 'address' => 'Inline', 'hr' => 'EMPTY', 'pre' => 'pre.content', 'blockquote' => 'Flow', 'center' 
=> 'Flow', 'table' => 'table', ); my %table = ( 'caption' => 'Inline', 'thead' => 'tr_only', 'tfoot' => 'tr_only', 'tbody' => 'tr_only', 'colgroup' => 'colgroup', 'col' => 'EMPTY', 'tr' => 'th_td', ); my %head = ( 'title' => 'NoTags', ); %_Context = ( 'Document' => { 'html' => 'Html' }, 'Html' => { 'head' => 'Head', 'body' => 'Flow' }, 'Head' => \%head, 'Inline' => \%inline, 'Flow' => \%flow, 'NoTags' => { 'CDATA' => 'CDATA' }, 'pre.content' => \%pre_content, 'table' => \%table, 'list' => { 'li' => 'Flow' }, 'dt_dd' => { 'dt' => 'Inline', 'dd' => 'Flow' }, 'tr_only' => { 'tr' => 'th_td' }, 'colgroup' => { 'col' => 'EMPTY' }, 'th_td' => { 'th' => 'Flow', 'td' => 'Flow' }, ); } sub init_context_whitelist { return \%_Context; } =item init_attrib_whitelist () Returns a reference to the C whitelist, which determines which attributes each tag can have and the values that those attributes can take. It is a hash, and the keys are lowercase tag names. The values in the hash are hashrefs. The keys in these subhashes are lowercase attribute names, and the values are attribute value class names, which are short strings describing the type of values that the attribute can take, such as C or C. 
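A subclass that wants to permit one more attribute might override this method along the following lines. This is a sketch only: the package name C<My::RelFilter> is hypothetical, and permitting a C<rel> attribute of class C<word> on C<a> tags is an illustrative choice, not something the default whitelist does.

```perl
package My::RelFilter;    # hypothetical subclass name
use strict;
use warnings;
use base 'HTML::StripScripts';

sub init_attrib_whitelist {
    my ($self) = @_;

    # Copy the top-level hash so that the default whitelist, which is
    # shared by every filter object, is not modified.
    my %attrib = %{ $self->SUPER::init_attrib_whitelist };

    # Copy the subhash for <a> as well before adding to it.
    # 'word' is one of the existing attribute value class names.
    $attrib{a} = { %{ $attrib{a} }, rel => 'word' };

    return \%attrib;
}

1;
```

Copying at both levels matters because the default implementation returns a reference to a package-level hash; editing it in place would change the policy for every other filter in the same process.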
=cut use vars qw(%_Attrib); BEGIN { my %attr = ( 'style' => 'style' ); my %font_attr = ( %attr, 'size' => 'size', 'face' => 'wordlist', 'color' => 'color', ); my %insdel_attr = ( %attr, 'cite' => 'href', 'datetime' => 'text', ); my %texta_attr = ( %attr, 'align' => 'word', ); my %cellha_attr = ( 'align' => 'word', 'char' => 'word', 'charoff' => 'size', ); my %cellva_attr = ( 'valign' => 'word', ); my %cellhv_attr = ( %attr, %cellha_attr, %cellva_attr ); my %col_attr = ( %attr, %cellhv_attr, 'width' => 'size', 'span' => 'number', ); my %thtd_attr = ( %attr, 'abbr' => 'text', 'axis' => 'text', 'headers' => 'text', 'scope' => 'word', 'rowspan' => 'number', 'colspan' => 'number', %cellhv_attr, 'nowrap' => 'novalue', 'bgcolor' => 'color', 'width' => 'size', 'height' => 'size', 'bordercolor' => 'color', 'bordercolorlight' => 'color', 'bordercolordark' => 'color', ); %_Attrib = ( 'br' => { 'clear' => 'word' }, 'em' => \%attr, 'strong' => \%attr, 'dfn' => \%attr, 'code' => \%attr, 'samp' => \%attr, 'kbd' => \%attr, 'var' => \%attr, 'cite' => \%attr, 'abbr' => \%attr, 'acronym' => \%attr, 'q' => { %attr, 'cite' => 'href' }, 'blockquote' => { %attr, 'cite' => 'href' }, 'sub' => \%attr, 'sup' => \%attr, 'tt' => \%attr, 'i' => \%attr, 'b' => \%attr, 'big' => \%attr, 'small' => \%attr, 'u' => \%attr, 's' => \%attr, 'strike' => \%attr, 'font' => \%font_attr, 'table' => { %attr, 'frame' => 'word', 'rules' => 'word', %texta_attr, 'bgcolor' => 'color', 'background' => 'src', 'width' => 'size', 'height' => 'size', 'cellspacing' => 'size', 'cellpadding' => 'size', 'border' => 'size', 'bordercolor' => 'color', 'bordercolorlight' => 'color', 'bordercolordark' => 'color', 'summary' => 'text', }, 'caption' => { %attr, 'align' => 'word', }, 'colgroup' => \%col_attr, 'col' => \%col_attr, 'thead' => \%cellhv_attr, 'tfoot' => \%cellhv_attr, 'tbody' => \%cellhv_attr, 'tr' => { %attr, bgcolor => 'color', %cellhv_attr, }, 'th' => \%thtd_attr, 'td' => \%thtd_attr, 'ins' => \%insdel_attr, 'del' 
=> \%insdel_attr, 'a' => { %attr, href => 'href', title => 'text' }, 'h1' => \%texta_attr, 'h2' => \%texta_attr, 'h3' => \%texta_attr, 'h4' => \%texta_attr, 'h5' => \%texta_attr, 'h6' => \%texta_attr, 'p' => \%texta_attr, 'div' => \%texta_attr, 'span' => \%texta_attr, 'ul' => { %attr, 'type' => 'word', 'compact' => 'novalue', }, 'ol' => { %attr, 'type' => 'text', 'compact' => 'novalue', 'start' => 'number', }, 'li' => { %attr, 'type' => 'text', 'value' => 'number', }, 'dl' => { %attr, 'compact' => 'novalue' }, 'dt' => \%attr, 'dd' => \%attr, 'address' => \%attr, 'hr' => { %texta_attr, 'width' => 'size', 'size' => 'size', 'noshade' => 'novalue', }, 'pre' => { %attr, 'width' => 'size' }, 'center' => \%attr, 'nobr' => {}, 'img' => { 'src' => 'src', 'alt' => 'text', 'width' => 'size', 'height' => 'size', 'border' => 'size', 'hspace' => 'size', 'vspace' => 'size', 'align' => 'word', }, 'body' => { 'bgcolor' => 'color', 'background' => 'src', 'link' => 'color', 'vlink' => 'color', 'alink' => 'color', 'text' => 'color', }, 'head' => {}, 'title' => {}, 'html' => {}, ); } sub init_attrib_whitelist { return \%_Attrib; } =item init_attval_whitelist () Returns a reference to the C whitelist, which is a hash that maps attribute value class names from the C whitelist to coderefs to subs to validate (and optionally transform) a particular attribute value. The filter calls the attribute value validation subs with the following parameters: =over =item C A reference to the filter object. =item C The lowercase name of the tag in which the attribute appears. =item C The name of the attribute. =item C The attribute value found in the input document, in canonical form (see L). =back The validation sub can return undef to indicate that the attribute should be removed from the tag, or it can return the new value for the attribute, in canonical form. 
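A validation sub following this calling convention could look like the sketch below. The handler name C<attval_percent> and the percentage rule are hypothetical and are not part of the default whitelist. The sub ignores the filter object, so it can be exercised on its own.

```perl
use strict;
use warnings;

# Hypothetical value-class handler: accept whole-number percentages
# from 0% to 100% and remove anything else.
sub attval_percent {
    my ( $filter, $tag, $attr_name, $attr_value ) = @_;

    if ( $attr_value =~ /\A(\d{1,3})%\z/ and $1 <= 100 ) {
        return $attr_value;    # keep the value unchanged
    }

    return undef;              # remove the attribute from the tag
}

# The handler does not use the filter object, so undef is passed here.
for my $candidate ( '50%', '150%', 'javascript:alert(1)' ) {
    my $result = attval_percent( undef, 'td', 'width', $candidate );
    printf "%-22s => %s\n", $candidate,
        defined $result ? $result : '(removed)';
}
```

To put such a handler to use, a subclass would presumably return a copy of the default hash with an extra class name mapped to the sub, and refer to that class name from the attribute whitelist.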
=cut use vars qw(%_AttVal); BEGIN { %_AttVal = ( 'style' => \&_hss_attval_style, 'size' => \&_hss_attval_size, 'number' => \&_hss_attval_number, 'color' => \&_hss_attval_color, 'text' => \&_hss_attval_text, 'word' => \&_hss_attval_word, 'wordlist' => \&_hss_attval_wordlist, 'wordlistq' => \&_hss_attval_wordlistq, 'href' => \&_hss_attval_href, 'src' => \&_hss_attval_src, 'stylesrc' => \&_hss_attval_stylesrc, 'novalue' => \&_hss_attval_novalue, ); } sub init_attval_whitelist { return \%_AttVal; } =item init_style_whitelist () Returns a reference to the C