php - First getElementsByTagName() returns all elements in HTML (Strange behaviour)
I am using PHP to parse the HTML provided to me by WordPress.
This is a post's HTML as returned to me by WordPress:
<p>test</p> <p> <img class="alignnone size-thumbnail wp-image-39" src="img.png"/> </p> <p>ok.</p>
This is my parsing function (with debugging left in):
function get_parsed_blog_post() {
    $html = ob_wp_content(false);
    print_r(htmlspecialchars($html));
    echo '<hr/><hr/><hr/>';

    $parse = new DOMDocument();
    $parse->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($parse);
    $ps = $xpath->query('//p');

    foreach ($ps as $p) {
        $imgs = $p->getElementsByTagName('img');
        print($imgs->length);
        echo '<br/>';
        if ($imgs->length > 0) {
            // Mark the paragraph and strip the classes WordPress added to its images
            $p->setAttribute('class', 'image-content');
            foreach ($imgs as $img) {
                $img->removeAttribute('class');
            }
        }
    }

    $htmlfinal = $parse->saveHTML();
    print_r(htmlspecialchars($htmlfinal));
    echo '<hr/><hr/><hr/>';
    return $htmlfinal;
}
The purpose of the code is to remove the classes WordPress adds to <img>s, and to give any <p> that contains an image a class of image-content.
And it returns:
1 1 0 <p class="image-content">test <p class="image-content"> <img src="img.png"> </p> <p>ok.</p></p>
Somehow, it has wrapped the first occurrence of <p> around the entire parsed post, causing the first <p> to have the image-content class incorrectly applied. Why is this happening? How do I stop it?
Method 1
Since you are using your own code, I have made some changes to make it work.
If you print out each $p, you will be able to see that the first element contains all of the HTML. The simplest solution is to add a blank <p> before your HTML and skip it in the foreach.
function get_parsed_blog_post() {
    $page_content_html = ob_wp_content(false);
    // Prepend an empty <p> so that it, not the first real paragraph, ends up wrapping everything
    $html = "<p></p>" . $page_content_html;
    print_r(htmlspecialchars($html));
    echo '<hr/><hr/><hr/>';

    $parse = new DOMDocument();
    $parse->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($parse);
    $ps = $xpath->query('//p');

    $i = 0;
    foreach ($ps as $p) {
        // Skip the first <p>: it is the blank paragraph added above
        if ($i != 0) {
            $imgs = $p->getElementsByTagName('img');
            print($imgs->length);
            echo '<br/>';
            if ($imgs->length > 0) {
                $p->setAttribute('class', 'image-content');
                foreach ($imgs as $img) {
                    $img->removeAttribute('class');
                }
            }
        }
        $i++;
    }

    $htmlfinal = $parse->saveHTML();
    print_r(htmlspecialchars($htmlfinal));
    echo '<hr/><hr/><hr/>';
    return $htmlfinal;
}
Total execution time in seconds: 0.00034999847412109
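To check the claim above that the first $p ends up containing the rest of the HTML, one option is to print where each <p> sits in the parsed tree. The following is only a minimal sketch: describe_paragraphs() and the $sample string are hypothetical names introduced for illustration, not part of the original post. It relies on DOMNode::getNodePath() to report each paragraph's position, once with the two libxml flags and once without them, so the two trees can be compared.

```php
<?php
// Hypothetical helper (not from the original post): returns the node path
// of every <p> in the parsed document, so nesting becomes visible.
function describe_paragraphs(string $html, int $options = 0): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html, $options);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paths = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        $paths[] = $p->getNodePath();
    }
    return $paths;
}

// Assumed sample input, modelled on the post in the question
$sample = '<p>test</p> <p> <img src="img.png"/> </p> <p>ok.</p>';

// With the flags used in the question
print_r(describe_paragraphs($sample, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD));

// With the default options (implied <html> and <body> are added)
print_r(describe_paragraphs($sample));
```

If the paths printed for the first call show later paragraphs underneath the first one, that is the wrapping described in the question.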
Method 2
The problem is caused by LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD (which makes the first <p> the parent of everything else, too), but you can remove the document tags without these flags. So you can do it like this:
function get_parsed_blog_post() {
    $page_content_html = ob_wp_content(false);

    $doc = new DOMDocument();
    $doc->loadHTML($page_content_html);

    foreach ($doc->getElementsByTagName('p') as $paragraph) {
        $imgs = $paragraph->getElementsByTagName('img');
        if ($imgs->length > 0) {
            $paragraph->setAttribute('class', 'image-content');
            foreach ($imgs as $img) {
                $img->removeAttribute('class');
            }
        }
    }

    /* Removing the doctype, html and body tags */

    // Removing the doctype
    $doc->removeChild($doc->doctype);

    // Removing the html tag (replace <html> with its first child, <body>)
    $doc->replaceChild($doc->firstChild->firstChild, $doc->firstChild);

    // Removing the body tag (move its children into a fragment, then swap it in)
    $body = $doc->getElementsByTagName("body")->item(0);
    $fragment = $doc->createDocumentFragment();
    while ($body->childNodes->length > 0) {
        $fragment->appendChild($body->childNodes->item(0));
    }
    $body->parentNode->replaceChild($fragment, $body);

    $htmlfinal = $doc->saveHTML();
    print_r(htmlspecialchars($htmlfinal));
    echo '<hr/><hr/><hr/>';
    return $htmlfinal;
}
Total execution time in seconds: 0.00026822090148926
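As a possible variation on Method 2, the wrapper tags do not have to be removed from the tree at all: DOMDocument::saveHTML() accepts a node argument, so each child of <body> can be serialized on its own and the results concatenated. The sketch below assumes the post content is passed in as a string; mark_image_paragraphs() and $sample are hypothetical names used only for illustration, and the function is a swapped-in alternative, not the code from the answer.

```php
<?php
// Hypothetical variant of Method 2 (not from the original post):
// parse with default options, then serialize only the children of <body>.
function mark_image_paragraphs(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('p') as $paragraph) {
        $imgs = $paragraph->getElementsByTagName('img');
        if ($imgs->length > 0) {
            $paragraph->setAttribute('class', 'image-content');
            foreach ($imgs as $img) {
                $img->removeAttribute('class');
            }
        }
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> separately, leaving out the
    // doctype, <html> and <body> wrappers themselves
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Assumed sample input, modelled on the post in the question
$sample = '<p>test</p> <p> <img class="alignnone size-thumbnail wp-image-39" src="img.png"/> </p> <p>ok.</p>';
echo mark_image_paragraphs($sample);
```

Because the document keeps its implied <html> and <body> while it is being edited, the paragraphs stay siblings, and the tree never has to be rearranged before output.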