Everything2
Near Matches
Ignore Exact
Full Text
Everything2

edev: Tables and HTML Validation

created by call

(idea) by call (4.9 hr) (print)   ?   I like it! Thu Sep 11 2003 at 23:32:54

Tables and HTML validation

HTML in E2 writeups is filtered; only a subset of the available HTML tags are actually allowed. There are approximately two reasons for this:

  1. Malice: to avoid allowing untrustworthy users to spoil the E2 experience for the rest of us by using scripts, images, and external links (to, eg. goatse.cx)
  2. Functionality: Scripts or tables can quite easily break an E2 page, either by acccident or on purpose. For example, since layout is performed mostly with tables, a spare </tr> tag could wreak havoc with the layout of the rest of the page, particularly nodelets etc.

Tables would be fine if they could be validated to prove that they're well-formed and hence will not break tables outside the writeup. Tables in particular aren't allowed because:

  • Table validation code was believed to be non-trivial to implement.
  • It was believed that it'd be too expensive to implement in terms of CPU time on the web server

Also, another issue which was not discussed is that allowing tables in writeups might be "exploited" for forcing layout within writeups; eg. to split a writeup into columns, etc.

Having written some fairly simple code to validate table HTML, running some performance tests on it seems to show that the 'worst case' HTML (a nightmare of tables, taken from an entire pageload on Community2) indicates that the table validation takes about a fifth of the time the normal HTML validation takes.

Approach

The design of the table validation code emerges from a slightly different approach than is used for general HTML validation. The existing HTML validation code in ecore strips out disallowed tags to create valid HTML. Transforming the markup in such a way can be fairly expensive, and for tables could be quite complex to perform. Hence, a much simpler approach was adopted based on observations of writeup content.

  1. Most writeups (currently all of them) do not contain any tables whatsoever.
  2. Writeups which contain tables will mostly contain valid tables, and thus require no transformation to ensure valid and innocuous HTML is the result.

Thus the two cases for which performance should be optimised are the corresponding cases of no table tags at all, and tables which are valid and well-formed. Hence the main code merely scans the structure of the tables in the HTML and determines whether tables have valid structure or not. When a malformed table is discovered, the code adopts an entirely different approach (and one which may in the long run be more useful...) of rendering the table tags visible in the displayed HTML, and adding <div> tags with dashed outlines around the elements to aid debugging.

With this approach, the majority of the web server's activity will be in simply checking the validity of any table tags.

Here's the source to the htmlcode:

# Okay, in brief:
# fast 'cause it's optimised to the 'common' cases:
# Most writeups have no tables. Zoooom!
# Writeups that have tables will mostly have valid tables:
#   => Only a quick parse to validate.
# We 'enforce' the validity of tables by outputting debug info
#   for badly formed tables. This is UGLY so writeup authors will
#   fix 'em quick.
# In an HTMLcode, so compilation of this code is amortised.
# [screenHTML] should still be used, and can be used to control
#   attributes in the tags. Ideally this works on the output of
#   screenHTML, but only because the 'debug' output uses <div>s
#   with dashed outlines to help HTML writers find their oopsies.


# Should be reasonably fast: scans through the HTML using a m''g, which
# is about as fast as anything in perl can be. Stacks the tags (only
# looks at table tags) and checks the structural validity by 
# matching a two-level context descriptor (stack . tag) against
# an RE describing valid contexts. (again, perl and RE => faster than
# a bunch of ifs or whatever)
sub tableWellFormed ($) {
    my (@stack);
    for ($_[0] =~ m{<(/?table|/?tr|/?th|/?td)[\s>]}ig) {
        my $tag = lc $_;
        my $top = $stack[$#stack];

        if (substr($tag, 0, 1) eq '/') {
            # Closing tag. Pop from stack and check that they match.
            return (0, "$top closed with $tag")
              if pop @stack ne substr($tag, 1);
        } else {
            # Opening tag. Push, and check context is valid.
            push @stack, $tag;
            return (0, "$tag inside $top") 
                if (($top.$tag) !~ /^(table(tr)?|tr(td|th)|(td|th)(table))$/);
        }
    }
    return (0, "Unclosed table elements: " . join ", ", @stack)
        if ($#stack != -1);
    return 1;
}

sub debugTag ($) {
    my ($tag) = @_;
    my $htmltag = $tag;
    $htmltag =~ s/</amp;lt;/g; # should be encodeHTML, but of course
                            # I don't have that in my standalone testbench.
    $htmltag = "<strong><small>amp;lt;" . $htmltag . "amp;gt;</small></strong>";

    if (substr($tag, 0, 1) ne '/') {
        return $htmltag . "<div style=\"margin-left: 16px; border: dashed 1px grey\">";
    } else {
        return "</div>". $htmltag;
    }
}

sub debugTable ($$) {
    my ($error, $html) = @_;
    $html =~ s{<((/?)(table|tr|td|th)((\s[^>]*)|))>}{debugTag $1}ige;
    return "<p><strong>Table formatting error: $error</strong></p>".$html;
}

my ($text) = @_;
my ($valid, $error) = tableWellFormed($text);
$text = debugTable ($error, $text) if ! $valid;

$text;

A similar approach might yield speed improvements to htmlScreen too. It should be possible, for instance, to construct a single regular expression to determine if some HTML consists entirely of valid tags and attributes (and this regular expression can be constructed automatically, of course). However, since there's a large base of existing writeups that may have invalid tags and attributes that nobody has noticed (since the existing htmlScreen provides no real feedback), it's unlikely that this will be a particularly palatable idea.

As a last word, caching of computed results in ecore is something that could be done better, and more consistently; but we all know this by now, and it's late in the evening so I'm not going to tell you things you already know.

Y'all can try it out over at

http://community2.org?node=test+screenTable

or

http://kahani.org?node=test+screenTable

In particular, I want y'all to try and break it, defeat the validation to get unpleasant HTML through the validator. And remember that in the final version it will be used in conjunction with htmlScreen... ;)


printable version
chaos

Community2 goatse.cx edev: CSS tables TABLE tags in Everything2
edev: EDEV FAQ foundations HTML E2 HTML tags My farmworking 20th century immigrant forebears had no responsibility for the plight of your pre-20th century slave forebears
table E2 HTML tags : Chapter 6 edev
Y'know, if you log in, you can write something here, or contact authors directly on the site. Create a New User if you don't already have an account.
  Epicenter
Login
Password

password reminder
register

Everything2 Help

Cool Staff Picks
What you are reading:
Ad resistance training
Septinsular Republic
Slacking as Civil Disobedience
SR-71 Blackbird
Alex Chilton
What Everything does better than anything else
Simone Weil
music theory
Cookery
Someone please kill me
catting weird things to /dev/audio
Whole wheat flour
karoshi
New Writeups
Arienette
I Wish You Were Here(personal)
Arienette
Helping a loved one with depression(personal)
Scientist
futureme.org(thing)
cryforhelp
Why I love Everything2(idea)
Jet-Poop
Uncle Sam(person)
nailbiter
direct marketing(thing)
Ouzo
Existential Dilemmas(personal)
shaogo
Robert Mondavi(person)
Ouzo
Goodwill Hunting, Thrift Store(ies)(log)
Pandeism Fish
How conatus compels divine ketosis through a radical kenosis(essay)
cryforhelp
Major dictionaries of the world(review)
Glowing Fish
The Uncanny X-Men and the New Teen Titans(thing)
WolfKeeper
Launch loop(idea)
TendoKing
Katana(person)
Wuukiee
Highly ornamental cultivars of brambles still have as many thorns as their wild counterparts(idea)
E2 is a by-product of the existence of The Everything Development Company