Everything2

Microsoft HTML de-bastardization

created by Jetifi

(thing) by Jetifi, Mon Aug 06 2001 at 15:49:51

15 September 2002: see the writeup below.

I made this for work. It's an HTML/JavaScript 'thing' designed to take all the nasty stuff out of Microsoft HTML, as generated by Microsoft Word, among other things.

It's tested to work on Internet Explorer 5.5. The assumption is that if you have Word, you will have the aforementioned browser.

All you have to do is create a file with the contents below, save it as an HTML file, open it, and follow the simple instructions on the screen. It will delete <DIV> and <SPAN> tags, and leave in (I'm afraid) the XML-style mess created by indexes, tables of contents, and other stuff. It should be fairly easy to customize (or correct :-) if you know what you're doing. I have absolutely no idea what will happen to diagrams created in Word using the Microsoft Draw-alike.

Because I haven't figured out how to make regexps span arbitrary numbers of line-breaks, you'll still have some "style" attributes in the HTML unless you check "Remove line-breaks". But if you do check it, you won't be able to edit the results without further regexp-fu. Sigh...
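
For what it's worth, in current JavaScript engines a negated character class such as [^>] also matches newline characters, so a pattern of that shape can span line breaks without flattening the text first. A minimal sketch, assuming a modern engine (IE 5.5's behaviour is not verified here); the helper name stripTagAttributes is made up for illustration and is not part of the script below:

```javascript
// Hypothetical helper, not part of the original script.
// [^>] is a negated class, so it matches \n and \r too: the match runs
// from "<p" to the next ">" no matter how many line breaks lie between.
// \b keeps the tag name exact, so "p" does not also match <pre>.
function stripTagAttributes(html, tag) {
    var re = new RegExp("<" + tag + "\\b[^>]*>", "gi");
    return html.replace(re, "<" + tag + ">");
}

var sample = "<p class=MsoNormal\nstyle='margin:0cm;\nfont-size:12pt'>Hello</p>";
var cleaned = stripTagAttributes(sample, "p");
console.log(cleaned);
```

This still breaks if an attribute value itself contains a ">" character, so treat it as a rough tool rather than a parser.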

If there are any stupid bugs, please tell me. If you have any constructive criticism (i.e. other than 'your code SUCKS, fool!' (I know)), please tell me.

Enjoy. the headache


<html><head><title>DeMSHTML</title><script>

var mshtml="";

function doParse() {

    var regexpTagsToDelete = new Array( "div", "span", "!", "o:", "/o:" );
    var normalTagsToDelete = new Array( "</div>", "</span>" );
    var tagsToDeMunge = new Array ( "p","b","i", "br")

    //this is the bit that causes
    //your browser to grind to a halt.
    mshtml = document.forms['jsRep']['html'].value

    if( document.forms['jsRep']['checkLine'].checked == true ) {
        var re= /\n|\r/gi

        mshtml = mshtml.replace( re, " " )
    }

    execTagRegExp( tagsToDeMunge, false )

    execTagRegExp( regexpTagsToDelete, true )

    for( var i = 0; i != normalTagsToDelete.length; i++ ) {
        deleteStr( normalTagsToDelete[i] )
    }

    document.forms['jsRep']['html'].value = mshtml

}

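function execTagRegExp( tagsToFind, deleteTag ) {
    for( var i = 0; i != tagsToFind.length; i++ ) {
        // When deleting, match by prefix on purpose ("o:" has to catch every o: tag).
        // When de-munging, \b keeps "b" from also matching br or body tags,
        // and "i" from also matching img tags.
        var boundary = deleteTag ? "" : "\\b";
        var re = new RegExp( "<" + tagsToFind[i] + boundary + "[^>]*>", "gi" );
        if( deleteTag ) {
            mshtml = mshtml.replace( re, "" );
        } else {
            mshtml = mshtml.replace( re, "<" + tagsToFind[i] + ">" );
        }
    }
}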

function deleteStr( strToDel ) {
    var lastIndex = 0;
    var nextIndex = 0;
    var strToReturn="";
    var lenStrToDel = strToDel.length;

    while( (nextIndex = mshtml.indexOf( strToDel, lastIndex ) ) != -1 ) {

        strToReturn += mshtml.substring( lastIndex, nextIndex )

        lastIndex = nextIndex + lenStrToDel;

    }

    strToReturn += mshtml.substring( lastIndex, mshtml.length );

    mshtml = strToReturn;

}


</script></head><body>

<p>Enter your text here...</p>

<form name='jsRep'>

<textarea name="html" rows="20" style="width:100%"></textarea>

<p>...and then click this button:</p>

<p><input name='goButton' value="demunge" type="button" onclick="doParse()"/></p>

<p><input name='checkLine' value="remove line-breaks" type="checkbox" /> Remove line-breaks</p>

</form></body></html>

For bigger documents, it is best to use Mozilla (tested with 0.9.3 and 1.1). IE5.x just crashes with that much input, whereas Mozilla just slows down horribly. For still bigger documents, use sed or something.
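
One possible "or something" is to run the same replacements outside the browser. The following is only a sketch, assuming a command-line JavaScript runtime such as Node.js; the names demunge, prefixesToDelete and tagsToSimplify are invented here, and the tag lists simply mirror the ones in the script above:

```javascript
// Hypothetical standalone version of the same clean-up, written as a
// pure function so it does not depend on a form or a global variable.
function demunge(html, removeLineBreaks) {
    var prefixesToDelete = ["div", "span", "!", "o:", "/o:", "/div", "/span"];
    var tagsToSimplify = ["p", "b", "i", "br"];
    var i, re;

    if (removeLineBreaks) {
        html = html.replace(/\r\n|\r|\n/g, " ");
    }
    // Strip the attributes from the tags that are kept.
    for (i = 0; i < tagsToSimplify.length; i++) {
        re = new RegExp("<" + tagsToSimplify[i] + "\\b[^>]*>", "gi");
        html = html.replace(re, "<" + tagsToSimplify[i] + ">");
    }
    // Delete every tag whose name starts with one of the prefixes.
    for (i = 0; i < prefixesToDelete.length; i++) {
        re = new RegExp("<" + prefixesToDelete[i] + "[^>]*>", "gi");
        html = html.replace(re, "");
    }
    return html;
}

var input = "<div class=Section1><p class=MsoNormal style='x'>" +
            "<span lang=EN-GB>Hi<o:p></o:p></span></p><br clear=all></div>";
var result = demunge(input, false);
console.log(result);
```

Reading the input from a file and writing the result back out is left to whatever the runtime provides.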

Also, this doesn't convert all the special characters like smart quotes and stuff.
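
A rough sketch of what such a conversion could look like, assuming the pasted text already contains the characters themselves (not entities such as &rsquo;) and that plain ASCII stand-ins are acceptable. The function name and the choice of replacements are assumptions, not part of the script above:

```javascript
// Hypothetical helper: swaps common "smart" punctuation for ASCII.
function replaceSmartPunctuation(text) {
    return text
        .replace(/[\u2018\u2019]/g, "'")   // curly single quotes
        .replace(/[\u201C\u201D]/g, '"')   // curly double quotes
        .replace(/[\u2013\u2014]/g, "-")   // en and em dashes
        .replace(/\u2026/g, "...");        // horizontal ellipsis
}

var smartSample = "\u201CIt\u2019s done\u201D \u2013 really\u2026";
var converted = replaceSmartPunctuation(smartSample);
console.log(converted);
```

Other Word leftovers (bullets, non-breaking spaces, symbol-font characters) would each need their own entry.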


(idea) by pms, Sun Sep 15 2002 at 8:22:21

The best way to de-bastardize Microsoft HTML (MS-HTML) or any crappy HTML is to use the wonderful open source, W3C approved program HTML Tidy.

http://www.w3.org/People/Raggett/tidy/
http://tidy.sourceforge.net/

Tidy can now perform wonders on HTML saved from Microsoft Word 2000! Word bulks out HTML files with stuff for round-tripping presentation between HTML and Word. If you are more concerned about using HTML on the Web, check out Tidy's "Word-2000" config option! Of course Tidy does a good job on Word 97 files as well!

To use it from the command line, just add --word-2000 yes to the options.

