Examining balanceTags()

Prompted by WP Patch Day, I felt compelled to try to patch a bug with WordPress.

I chose one that has already caused me numerous headaches: bug 0000053, entitled “WordPress deletes some text when HTML tags incorrectly nested”.

In a nutshell… In the Admin section’s “Options” menu, under the “Writing” submenu, a checkbox indicates whether “WordPress should correct invalidly nested XHTML automatically”. This setting is active by default.

The thing is, not only does it not always correct invalidly nested XHTML, it can actually delete some of the text in the post. As the function operates on the post as the post is inserted into the database, that text is gone.

Background… I first tripped over a variety of formatting problems when I started using WordPress in earnest to post about my first set of plugins. At the time I devised various examples exploiting instances of WP eating text or producing invalid (or at least less-than-optimal) XHTML. I quickly discovered that the function balanceTags(), located in the WP file wp-includes/functions-formatting.php, was the culprit, being the sole function controlled by the above-mentioned checkbox setting. [Note: Some of the examples I have on the formatting bugs page are related to other aspects of WP, e.g. wpautop() improperly inserting paragraph <p></p> tags.]

Analysis… I pondered over balanceTags() for a good while to get a handle on what it was doing and why it was doing it. I believe I have fixed the problems. However, I have reservations about putting my changes forth as a patch just yet: because this function modifies user text permanently (unlike a plugin filtering text on its way to being displayed), I want the fix thoroughly tested first.

Before I get to the fixes, let me list the problems associated with balanceTags(). To test these out yourself (easily done on a temporary basis that won’t affect your site), simply switch on the “WordPress should correct invalidly nested XHTML automatically” option if it isn’t on already, and compose a new post. When you try one of the samples below (or one of your own), do a “Save and Continue Editing”; the post won’t be published, but you’ll see how your text was modified by balanceTags(). When you are done testing, turn the setting back off if you originally had it off.


Problem #1: User text gets irretrievably eaten

When a close tag is encountered while intervening tags are still open, balanceTags() closes those tags but then eats the user text that follows. The number of characters eaten equals the combined length of that close tag and the close tags inserted for the unbalanced tags.

<blockquote>xxx<b>yyy<i>zzzz</blockquote> <a href="http://www.example.com">0123456789ABCDEFGHIJLKM</a> you should see 0123456789ABCDEFGHIJLKM as the linked text.

becomes:

<blockquote>xxx<b>yyy<i>zzzz</i></b></blockquote> <a href="http://www.example.com">KM</a> you should see 0123456789ABCDEFGHIJLKM as the linked text.

Notice that the characters 0 through L (21 characters) have been deleted, which equals strlen('</i></b></blockquote>').

Another example:

<ul> <li>xxx <li>yyy <li>zzz </ul> <a href="http://www.example.com">0123456789ABCDEFGHIJLKM</a> you should see 0123456789ABCDEFGHIJLKM as the linked text.

becomes:

<ul> <li>xxx <li>yyy <li>zzz </li></li></li></ul> <a href="http://www.example.com">LKM</a> you should see 0123456789ABCDEFGHIJLKM as the linked text.

This time 20 characters were eaten, which equals strlen('</li></li></li></ul>').
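The arithmetic in both examples can be checked with a short script. This is a Python sketch rather than WordPress's PHP, and the helper name eaten_count is hypothetical. It only compares the number of characters missing from the link text with the length of the close-tag string cited above.

```python
LINK_TEXT = "0123456789ABCDEFGHIJLKM"  # link text used in both examples


def eaten_count(original: str, surviving: str) -> int:
    """Return how many leading characters of `original` are missing from `surviving`."""
    if not original.endswith(surviving):
        raise ValueError("surviving text is not a suffix of the original")
    return len(original) - len(surviving)


# Example 1: only "KM" survived; compare with the length of </i></b></blockquote>
first = (eaten_count(LINK_TEXT, "KM"), len("</i></b></blockquote>"))
# Example 2: only "LKM" survived; compare with the length of </li></li></li></ul>
second = (eaten_count(LINK_TEXT, "LKM"), len("</li></li></li></ul>"))

print(first, second)
```

In each pair the two numbers agree, which matches the observation that the amount of lost text tracks the length of the emitted close tags.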


Problem #2: Insertion of nonsensical </> “tag”

When an HTML comment (which could be <!--more--> or <!--nextpage-->) appears between the open and close tags of an element, balanceTags() inserts </> in a misguided attempt to balance the comment as if it were a tag.

<b>text <!--comment--> </b>

becomes:

<b>text <!--comment--> </></b>
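One plausible way an empty </> can arise is a tag pattern that accepts an empty tag name, so a comment is recorded as a tag called "". This is offered as an assumption for illustration, not as a reading of the actual balanceTags() source. The sketch is in Python rather than PHP, and both patterns are hypothetical.

```python
import re

# Assumed pattern: the tag name may be empty (\w* rather than \w+).
LOOSE_TAG = re.compile(r"<(/?\w*)\s*([^>]*)>")
# Stricter variant: the tag name needs at least one word character.
STRICT_TAG = re.compile(r"<(/?\w+)\s*([^>]*)>")

sample = "<b>text <!--comment--> </b>"

# The loose pattern reports the comment as a tag with an empty name,
# which a balancer would later try to close as "</>".
loose_names = [m.group(1) for m in LOOSE_TAG.finditer(sample)]
# The strict pattern skips the comment entirely.
strict_names = [m.group(1) for m in STRICT_TAG.finditer(sample)]

print(loose_names, strict_names)
```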


Problem #3: Tags not getting balanced in certain situations

When an HTML comment is the last “tag” in a post, any remaining unbalanced tags will NOT get balanced.

<b>text <!--comment--> texttexttext

Will remain that way.
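To confirm that a saved post still has open tags, a small checker that ignores comments can help. The following is a minimal Python sketch, not part of WordPress. The name unclosed_tags is hypothetical, and it assumes simple markup with no ">" inside attribute values.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<(/?)(\w+)([^>]*)>")


def unclosed_tags(text: str) -> list:
    """Return the names of tags still open at the end of `text`, comments ignored."""
    stack = []
    for m in TAG.finditer(COMMENT.sub("", text)):
        name = m.group(2).lower()
        if m.group(3).rstrip().endswith("/"):
            continue  # self-closed tag such as <br />
        if m.group(1) != "/":
            stack.append(name)
        elif name in stack:
            # Pop back to (and including) the matching open tag.
            while stack.pop() != name:
                pass
    return stack


print(unclosed_tags("<b>text <!--comment--> texttexttext"))
```

Running it on the sample above reports that the b tag is still open, which is what balanceTags() leaves behind in this case.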


Problem #4: Improper balancing of a tag immediately nested within itself

When a tag appears immediately inside a tag of the same name, it usually means the user wrote a series of sibling tags without closing each one…

<p>xxx<p>yyy<p>zzz

becomes:

<p>xxx<p>yyy<p>zzz</p></p></p>

Desired result, to my mind, should be this:

<p>xxx</p><p>yyy</p><p>zzz</p>

For another example, see the second example under the section where I talk about user text getting eaten. In the <ul>, you can see how the <li>’s are all closed as a bunch in the end just before </ul>, so you get <ul><li><li><li></li></li></li></ul> instead of the desired <ul><li></li><li></li><li></li></ul>.
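The desired behavior can be sketched as follows. This is a minimal Python illustration of the output described above, not the fix itself and not WordPress code. The function name close_repeated_siblings and the NO_SELF_NEST set are assumptions, and comments and attribute values containing ">" are not handled.

```python
import re

# Assumption: tags that should not sit directly inside a tag of the same name.
NO_SELF_NEST = {"p", "li"}
TAG = re.compile(r"<(/?)(\w+)([^>]*)>")


def close_repeated_siblings(text: str) -> str:
    """Close an open <p> or <li> before another tag of the same name opens."""
    out = []
    stack = []  # names of the currently open tags
    pos = 0
    for m in TAG.finditer(text):
        out.append(text[pos:m.start()])
        pos = m.end()
        name = m.group(2).lower()
        if m.group(3).rstrip().endswith("/"):
            out.append(m.group(0))  # self-closed tag, pass through untouched
        elif m.group(1) == "/":
            if name in stack:
                # Close anything still open inside the tag being closed.
                while stack:
                    top = stack.pop()
                    if top == name:
                        break
                    out.append(f"</{top}>")
            out.append(m.group(0))
        else:
            if stack and stack[-1] == name and name in NO_SELF_NEST:
                out.append(f"</{name}>")  # close the previous sibling first
                stack.pop()
            out.append(m.group(0))
            stack.append(name)
    out.append(text[pos:])
    while stack:  # close whatever is still open at the end
        out.append(f"</{stack.pop()}>")
    return "".join(out)


print(close_repeated_siblings("<p>xxx<p>yyy<p>zzz"))
print(close_repeated_siblings("<ul><li>xxx<li>yyy<li>zzz</ul>"))
```

The key design choice is checking only the top of the stack: a new <li> closes the previous <li> only when nothing else has been opened in between, so a nested <ul> inside an <li> is left alone.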


Problem #5: Insertion of close tag for single-entity tag <input />

If the single-entity HTML tag <input /> is used, it gets matched with a closing </input>, which is invalid.

<form id="sendmsg" action="" method="post"> <input name="name" id="name" type="text" /> <input type="submit" name="submit" id="submit" value="Send" /> <input type="reset" name="reset" value="Reset" /> </form>

becomes:

<form id="sendmsg" action="" method="post"> <input name="name" id="name" type="text" /> <input type="submit" name="submit" id="submit" value="Send" /> <input type="reset" name="reset" value="Reset" /> </input></input></input></form>


Problem #6: Unclosed single-entity tags do not get closed

Tags such as <br>, <hr>, and <img> do not get closed in the event the user forgot to close them, despite those tags being expressly identified in balanceTags().

Examples:

<hr> <img src="/images/picture.gif"><br>

All of these tags would be left unclosed.
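For Problems #5 and #6, the behavior one would hope for is that single-entity tags are written in self-closed form and never given a separate close tag. The following Python sketch illustrates that idea only. It is not the balanceTags() implementation, the name self_close_void_tags and the tag list are assumptions, and attribute values containing ">" are not handled.

```python
import re

# Tags named in Problems #5 and #6; a fuller list would be an assumption.
VOID_TAGS = ("br", "hr", "img", "input")
VOID_OPEN = re.compile(
    r"<(%s)\b([^>]*?)\s*/?>" % "|".join(VOID_TAGS), re.IGNORECASE
)


def self_close_void_tags(text: str) -> str:
    """Rewrite <br>, <hr>, <img ...> and <input ...> in self-closed form."""
    return VOID_OPEN.sub(lambda m: f"<{m.group(1)}{m.group(2)} />", text)


print(self_close_void_tags('<hr> <img src="/images/picture.gif"><br>'))
print(self_close_void_tags('<input name="name" id="name" type="text" />'))
```

Tags that are already self-closed pass through unchanged, so the function can be applied repeatedly without altering its own output.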


In my next post, I explain my fixes that address all of the above problems.