ch@tter (aka story time)
Cleaning up HTML fragments
Some of our products, such as OnEvent, use a WYSIWYG editor to let users enter descriptions of their events, products, and so on. The editor stores an HTML fragment, which can be pulled onto a frontend page so that all the formatting appears as it did in the editor.

What if I just want part of that HTML fragment? This is useful if I want to list multiple events or items on a page and display only a short bit of each description. The problem I discovered is that splitting the HTML fragment at an arbitrary length may split a tag in half or leave a tag unclosed. This leads to bad HTML on the page and makes it very hard to predict what a given browser will actually display. I also found that the styles from the unclosed tags in the included fragment were bleeding into the rest of the page and really screwing things up.

One solution is to use PHP's strip_tags function to remove all HTML tags, but this seemed like overkill, since it throws away all of the formatting along with the broken tags. My first impulse was to use regular expressions to clean up the split fragment. This actually worked just fine, since the fragments coming from the WYSIWYG editor follow very strict formatting rules. But what if the fragment was not perfectly formatted? In a worst-case scenario, the code I had written could close a tag that was already closed and break the page.

After a little research it became obvious that accounting for every possible valid piece of HTML is just not an option with regular expressions. Further research led me to the DOM extension for PHP, which among its many uses can parse an HTML string into an editable object. No point in reinventing the wheel, so after a little more research I found that this little bit of code solved all my problems:
    $dom = new DOMDocument();
    $dom->loadHTML($str); // convert your string into an editable document object
    // convert the document object back into a string, minus the wrapper tags
    $ret = preg_replace('#</?html[^>]*>|<!DOCTYPE[^>]*>|</?body[^>]*>#', '', $dom->saveHTML());
Since loadHTML converts your string into a full HTML document, it adds the doctype, html, and body tags. I couldn't find any way built into the DOM extension to stop this, so the call to preg_replace strips them off.
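For what it's worth, here is one possible way to avoid the preg_replace step altogether. This is only a sketch under a few assumptions: it relies on PHP 5.3.6 or later, where saveHTML accepts a node argument, and it assumes the fragment is plain ASCII or entity-encoded, since loadHTML does not treat input as UTF-8 by default. The function name clean_fragment is made up for illustration and is not part of the DOM extension.

```php
<?php
// Hypothetical helper (not from the original post): repair an HTML fragment
// and return it without the doctype/html/body wrapper that loadHTML adds.
// Assumes PHP >= 5.3.6 and ASCII or entity-encoded input.
function clean_fragment($html)
{
    // loadHTML complains about empty input, so bail out early.
    if (trim($html) === '') {
        return '';
    }

    // Collect parser warnings internally; a truncated fragment is expected
    // to be malformed, so the warnings are not interesting here.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The parser puts the fragment's content inside <body>.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> on its own, so the wrapper tags
    // never make it into the output.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Two tags left open by a cut; the parser closes both.
echo clean_fragment('<p>Join us for <b>live music'), "\n";
```

The trade-off is a few more lines in exchange for not having to pattern-match the serialized document, which should also keep the function from touching any literal text in the fragment that happens to look like a wrapper tag.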
Voila, I now have one cleanly formatted HTML fragment ready to be inserted into a page.
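To tie this back to the original problem, here is a rough sketch of the whole trip: cut a description at a fixed length, then repair it. The function name excerpt_html and the sample description are invented for illustration. The extra preg_replace that drops a half-cut tag is my own assumption about how to handle a cut that lands inside a tag, not something the DOM extension requires. The sketch also assumes single-byte text, so it does not protect against cutting an entity or a multi-byte character in half.

```php
<?php
// Hypothetical example (not from the original post): truncate an HTML
// description, then let the DOM extension close whatever the cut left open.
function excerpt_html($html, $length)
{
    // Naive cut: may end inside a tag or leave tags unclosed.
    $cut = substr($html, 0, $length);

    // Assumption: if the cut landed inside a tag, drop the dangling piece
    // (a trailing "<..." with no ">") so the parser never sees a partial tag.
    $cut = preg_replace('/<[^>]*$/', '', $cut);

    // loadHTML complains about empty input, so bail out early.
    if (trim($cut) === '') {
        return '';
    }

    // Keep warnings about the malformed fragment out of the page output.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($cut);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Strip the doctype, html, and body wrappers added by loadHTML.
    $pattern = '#</?html[^>]*>|<!DOCTYPE[^>]*>|</?body[^>]*>#i';
    return trim(preg_replace($pattern, '', $dom->saveHTML()));
}

$description = '<p>Doors open at <strong>7pm</strong> sharp.</p>';

// Raw cut: ends in the middle of the closing </strong> tag.
echo substr($description, 0, 31), "\n";

// Repaired cut: the open <strong> and <p> tags are closed again.
echo excerpt_html($description, 31), "\n";
```

Note that the length here counts markup as well as visible text, so a heavily formatted description will show fewer readable characters than a plain one for the same limit.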
--Ben Matics
Posted by Ben Matics on June 11, 2009 at 05:34 pm EST
