Categories
WordPress

Patch: Fixing balanceTags()

Continued from my previous post in which I examined, and provided examples for, various problems related to the WordPress core function balanceTags(), here’s an explanation of the changes I made to the function.

First, here is a copy of the original balanceTags() function.

Here is a copy of the modified balanceTags() function.

Here is a diff of the changes to the file wp-includes/functions-formatting.php diff of the changes to the file wp-includes/functions-formatting.php (modified from and then compared against the latest version as obtained from CVS for WP 1.3).

Here’s my justification for the changes I made. The first of my changes takes place early, in the while() where it iterates through all the functions in the text. The remaining fixes are in the function related to finding an open tag. These fixes address all issues I am aware of with the balanceTags() function (except for situations I mention at the very end).


1.) The third line of code of the outermost while() was originally:

$l = strlen($tagqueue) + strlen($regex[0]);

I modified it to be:

$l = strlen($regex[0]);

The inclusion of strlen($tagqueue) in determining the value of $l is perplexing to me. $l is used later in the function to determine how much of the text being processed ($text) should be skipped. As far as I can see, it should only skip the tag it just read in, not also the size the queue of tags it had previously read in or created to close unbalanced tags, since text before and after the tag are properly taken into account with the offending term.


2.) In the section handling open tags, I changed:

// Push if not img or br or hr if($tag != 'br' && $tag != 'img' && $tag != 'hr') { $stacksize = array_push ($tagstack, $tag); }

To this:

// Push if not img or br or hr or input or '' if($tag != 'br' && $tag != 'img' && $tag != 'hr' && $tag != 'input' && $tag != '') { // If the top of the stack is the same as the tag we want to push, close previous tag if (($stacksize > 0) && ($tag != 'div') && ($tagstack[$stacksize - 1] == $tag)) { $tagqueue = '</' . array_pop ($tagstack) . '>'; $stacksize--; } $stacksize = array_push ($tagstack, $tag); }

This set of changes is taking into account three fixes at once. The inclusion of && $tag != 'input' in the outer if() prevents balanceTags() from trying to close <input /> tags (problem #5 from my previous post). The inclusion of && $tag != '' in the outer if() prevents balanceTags() from trying to close an HTML comment occurring in an unbalanced section of the text with </> (problem #2 from my previous post).


UPDATE: On the wp-hackers mailing lists, in an exchange with Jamie Talbot, he recommended introspecting the tag to determine if the tag is a single-entity tag. I concurred that that would be a preferable way of handling the situation — cleaner and more robust than explicitly listing the valid single-entity tags. The fix would be to change this:

if($tag != 'br' && $tag != 'img' && $tag != 'hr' && $tag != 'input' && $tag != '') {

to

if((substr($regex[2],-1) != '/') && ($tag != '')) {

However, it occurs to me that this could introduce some invalid HTML in that if a user forgot to close a single-entity tag, since this change no longer recognizes those tags, balanceTags() would close the tag. (i.e. <img src="file.gif"> becomes <img src="file.gif"></img>). This brought to light a “problem” with the old balanceTags(): despite identifying single-entity tags (and not closing them when they appear), it does not close them if they weren’t closed. So… I updated my new balanceTags() with the code that will recognize any properly-formed single-entity tags, but merged a modification of the old balanceTags() to recognize the XHTML single-entity tags and properly close them if they aren’t closed. The aforementioned line changed above now gets changed to:

// If self-closing or '', don't do anything. if((substr($regex[2],-1) == '/') || ($tag == '')) { } // ElseIf it's a known single-entity tag but it doesn't close itself, do so elseif ($tag == 'br' || $tag == 'img' || $tag == 'hr' || $tag == 'input') { $regex[2] .= '/'; } else {

The third change, the new inner if(), addresses problem #4 from my previous post. The gist of the fix is that if we encounter an open tag and the exact same tag is already sitting at the top of the $tagstack, then close the tag on the stack and then push this new tag. This means <p>xxx<p>yyy<p>zzz will get closed like <p>xxx</p><p>yyy</p><p>zzz</p> instead of <p>xxx<p>yyy<p>zzz</p></p></p>. I make an exception for <div>, since there can reasonably be an intentional string of consecutive unclosed <div> tags. I couldn’t reason that being the case for any other tag, but I could be wrong.


3.) At the end of the section that handles open tags, just after this line at the end of the else {}:

$tag = '<'.$tag.$attributes.'>';

I added these lines:

//If already queuing a close tag, then put this tag on, too if ($tagqueue) { $tagqueue .= $tag; $tag = ''; }

If a tag has bbeen queued, then the current tag should be added to the queue so that they can both be added to $newtext.


4.) Sue me, I changed a line of code (appearing in two places) to minimize it:

$newtext = $newtext . $tagqueue;

Which is the first line of the outermost while() and again near the end of the function. I prefer it as:

$newtext .= $tagqueue;

(In a similar fashion I turned:

$newtext = $newtext . '</' . $x . '>'; // Add remaining tags to close

into:

$newtext .= '</' . $x . '>'; // Add remaining tags to close

)

5.) Sue me again, I got rid of the second line from the below:

// Attributes // $attributes = $regex[2]; $attributes = $regex[2];


I should note, however, that there are more complex scenrios in which it becomes increasingly more difficult for balanceTags() to handle. The simplest of them would be alternating open tags…

<b>xxx<i>yyy<b>zzz<i>aaa

which the updated balanceTags() would fix as:

<b>xxx<i>yyy<b>zzz<i>aaa</i></b></i></b>

As you can see, at worst the function balances the tags, though in this instance probably not as we’d desire:

<b>xxx<i>yyy</i></b><b>zzz<i>aaa</i></b>

I suppose I could give it more thought and properly balance deeply nested same-tags and all intervening unbalanced tags. I just know that there are hairier situations to get into.


Also, making the function tag-aware would put it further onto the road of being a true XHTML fixer-upper.

4 replies on “Patch: Fixing balanceTags()”

No problem. FYI, the patch will be applied to the latest WordPress code, version 1.3, currently only available via CVS as it is still in development. And the patch hasn’t actually been applied there just yet. You could always replace your current balanceTags() function with the one I modified, if you were interested in taking advantage of the changes. (The code hasn’t been “approved” yet so use it at your own risk…)

Comments are closed.