php - First getElementsByTagName() returns all elements in HTML (Strange behaviour)

May 15, 2010

I am using PHP to parse HTML provided to me by WordPress.

This is a post's HTML as returned to me by WordPress:

<p>test</p>  <p>     <img class="alignnone size-thumbnail wp-image-39" src="img.png"/> </p>  <p>ok.</p>

This is my parsing function (with debugging left in):

function get_parsed_blog_post()
{
    $html = ob_wp_content(false);

    print_r(htmlspecialchars($html));
    echo '<hr/><hr/><hr/>';

    $parse = new DOMDocument();
    $parse->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXPath($parse);
    $ps = $xpath->query('//p');

    foreach ($ps as $p)
    {
        $imgs = $p->getElementsByTagName('img');

        print($imgs->length);
        echo '<br/>';

        if ($imgs->length > 0)
        {
            $p->setAttribute('class', 'image-content');

            foreach ($imgs as $img)
            {
                $img->removeAttribute('class');
            }
        }
    }

    $htmlfinal = $parse->saveHTML();

    print_r(htmlspecialchars($htmlfinal));
    echo '<hr/><hr/><hr/>';

    return $htmlfinal;
}

The purpose of the code is to remove the classes WordPress adds to <img>s, and to give any <p> that contains an image a class of image-content.

And it returns:

1 1 0 <p class="image-content">test <p class="image-content">     <img src="img.png"> </p> <p>ok.</p></p>

Somehow, it has wrapped the first occurrence of <p> around the entire parsed post, causing the first <p> to have the image-content class incorrectly applied. Why is this happening? How do I stop it?

Method 1

Using your code, I have made a few changes to get it working.

If you print out each $p, you will see that the first element contains all of the HTML. The simplest solution is to add a blank <p> before your HTML and skip it in the foreach.

function get_parsed_blog_post()
{
    $page_content_html = ob_wp_content(false);
    // Prepend an empty <p> so that it, not the first real paragraph, becomes the wrapper
    $html = "<p></p>" . $page_content_html;
    print_r(htmlspecialchars($html));
    echo '<hr/><hr/><hr/>';

    $parse = new DOMDocument();
    $parse->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXPath($parse);
    $ps = $xpath->query('//p');
    $i = 0;
    foreach ($ps as $p)
    {
        // Skip the first <p>: it is the blank wrapper added above
        if ($i != 0) {
            $imgs = $p->getElementsByTagName('img');

            print($imgs->length);
            echo '<br/>';

            if ($imgs->length > 0)
            {
                $p->setAttribute('class', 'image-content');

                foreach ($imgs as $img)
                {
                    $img->removeAttribute('class');
                }
            }
        }
        $i++;
    }

    $htmlfinal = $parse->saveHTML();

    print_r(htmlspecialchars($htmlfinal));
    echo '<hr/><hr/><hr/>';

    return $htmlfinal;
}

total execution time in seconds: 0.00034999847412109
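The observation above, that the first <p> ends up containing the rest of the markup, can be checked by serializing each matched node on its own. This is only a minimal sketch: dump_paragraphs is a hypothetical helper name, the sample markup is a shortened version of the post from the question, and the exact nesting you see may depend on the libxml version your PHP is built against.

```php
<?php
// Hypothetical debugging helper (not part of the original code): returns the
// serialized markup of every <p> the parser produced, in document order.
function dump_paragraphs(string $html): array
{
    $doc = new DOMDocument();
    // Same flags as in the question, so the same tree is built.
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $out = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        // saveHTML($node) serializes just that node and its descendants.
        $out[] = $doc->saveHTML($p);
    }
    return $out;
}

$sample = '<p>test</p> <p><img class="alignnone" src="img.png"/></p> <p>ok.</p>';
foreach (dump_paragraphs($sample) as $i => $markup) {
    echo $i, ': ', htmlspecialchars($markup), "\n";
}
```

If entry 0 includes the other paragraphs, the first <p> has been treated as the document root, which matches the behaviour described in the question.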

Method 2

The problem is caused by LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD (which makes the first <p> a parent of everything else too), but you can remove the document tags without these flags. So, you can do it like this:

function get_parsed_blog_post()
{
    $page_content_html = ob_wp_content(false);

    $doc = new DOMDocument();
    $doc->loadHTML($page_content_html);

    foreach ($doc->getElementsByTagName('p') as $paragraph)
    {
        $imgs = $paragraph->getElementsByTagName('img');
        if ($imgs->length > 0)
        {
            $paragraph->setAttribute('class', 'image-content');

            foreach ($imgs as $img)
            {
                $img->removeAttribute('class');
            }
        }
    }

    /* Removing doctype, html and body tags */

    // Removing doctype
    $doc->removeChild($doc->doctype);

    // Removing html tag
    $doc->replaceChild($doc->firstChild->firstChild, $doc->firstChild);

    // Removing body tag
    $html = $doc->getElementsByTagName("body")->item(0);
    $fragment = $doc->createDocumentFragment();
    while ($html->childNodes->length > 0) {
        $fragment->appendChild($html->childNodes->item(0));
    }
    $html->parentNode->replaceChild($fragment, $html);

    $htmlfinal = $doc->saveHTML();

    print_r(htmlspecialchars($htmlfinal));
    echo '<hr/><hr/><hr/>';

    return $htmlfinal;
}

total execution time in seconds: 0.00026822090148926
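Another way to drop the wrapper tags, sketched here as an assumption rather than as part of either method above: let loadHTML() add its implied <html> and <body>, then serialize only the children of <body> with saveHTML($node). The function name add_image_content_class and its string parameter are hypothetical; in the original code the markup comes from ob_wp_content(false).

```php
<?php
// Hypothetical variant (an assumption, not the answer's code): parse without
// the LIBXML_HTML_* flags, then output only what sits inside <body>.
function add_image_content_class(string $html): string
{
    if (trim($html) === '') {
        return ''; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Collect parser warnings quietly instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('p') as $p) {
        $imgs = $p->getElementsByTagName('img');
        if ($imgs->length > 0) {
            $p->setAttribute('class', 'image-content');
            foreach ($imgs as $img) {
                $img->removeAttribute('class');
            }
        }
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> on its own, so no doctype, <html>
    // or <body> tags end up in the result.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo add_image_content_class(
    '<p>test</p><p><img class="alignnone" src="img.png"/></p><p>ok.</p>'
);
```

This avoids rearranging the document tree after parsing; the trade-off is that the output is assembled from several saveHTML() calls rather than one.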
