SimpleHTMLDOM PHP Scraping Library | PHP - Hypertext Preprocessor

Simple HTML DOM Parser repository on SourceForge.

Notes

Loading HTML from a file creates the initial DOM object. The source can be a URL or a path on the local file system:


$request_url = './html_files_to_be_edited/news.html';
$html = file_get_html($request_url);

Three ways to create html dom Object:


// Create a DOM object from a string  
$html = str_get_html('<html><body>Hello!</body></html>');  

// Create a DOM object from a URL  
$html = file_get_html('http://www.google.com/');  

// Create a DOM object from an HTML file
$html = file_get_html('test.htm');

Four ways to create an object-oriented html dom object:


// Create a DOM object 
$html = new simple_html_dom(); 

// Load HTML from a string 
$html->load('<html><body>Hello!</body></html>'); 

// Load HTML from a URL  
$html->load_file('http://www.google.com/'); 

// Load HTML from an HTML file
$html->load_file('test.htm');

find() Method

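Use the find() method to query the DOM object. Called with only a selector, it returns a collection (an array of every element that matches the selector). Called with a selector and an index, it returns a single element. The example below uses find() with index 0 to fetch the first element whose id is "news" and replace its inner HTML:


$html->find('#news', 0)->innertext = 'My new text!';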

Attribute filters supported in find() selectors:

Filter              Description
[attribute]         Matches elements that have the specified attribute.
[!attribute]        Matches elements that don't have the specified attribute.
[attribute=value]   Matches elements that have the specified attribute with a certain value.
[attribute!=value]  Matches elements that don't have the specified attribute with a certain value.
[attribute^=value]  Matches elements that have the specified attribute and it starts with a certain value.
[attribute$=value]  Matches elements that have the specified attribute and it ends with a certain value.
[attribute*=value]  Matches elements that have the specified attribute and it contains a certain value.
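
The prefix, suffix, substring, and negation filters are easiest to understand side by side. The following is a minimal sketch, assuming the library file simple_html_dom.php sits next to the script; the markup and the variable names are made up for illustration.

```php
<?php
// Assumption: simple_html_dom.php (the library file) is in the same directory.
require_once 'simple_html_dom.php';

// Hypothetical markup, invented for this illustration.
$html = str_get_html(
    '<a href="https://example.com/report.pdf" title="Report">Report</a>' .
    '<a href="/about.html">About</a>' .
    '<a href="https://example.org/">Home</a>'
);

// [attribute^=value]: anchors whose href starts with "https"
$external = $html->find('a[href^=https]');

// [attribute$=value]: anchors whose href ends with "pdf"
$pdfs = $html->find('a[href$=pdf]');

// [attribute*=value]: anchors whose href contains "example"
$examples = $html->find('a[href*=example]');

// [!attribute]: anchors that have no title attribute
$untitled = $html->find('a[!title]');

// Each result is an array of element objects, so it can be looped over.
foreach ($pdfs as $a) {
    echo $a->href, "\n";
}
```

Each call returns a plain array, so count() is a quick way to check how many elements a filter matched.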

Ten ways to find html elements using find()


// Find all anchors; returns an array of element objects
$ret = $html->find('a');

// Find the (N)th anchor (zero-based); returns an element object, or null if not found
$ret = $html->find('a', 0);

// Find the last anchor; returns an element object, or null if not found
$ret = $html->find('a', -1);

// Find all <div> elements that have an id attribute
$ret = $html->find('div[id]');

// Find all <div> elements whose id attribute is "foo"
$ret = $html->find('div[id=foo]');

// Find all elements with id=foo
$ret = $html->find('#foo');

// Find all elements with class=foo
$ret = $html->find('.foo');

// Find all elements that have an id attribute
$ret = $html->find('*[id]');

// Find all anchors and images  
$ret = $html->find('a, img');  

// Find all anchors and images with the "title" attribute 
$ret = $html->find('a[title], img[title]');

Access html Element Attributes

Get, Set, and Remove Attributes


// Get an attribute (a non-value attribute such as checked or selected returns true or false)
$value = $e->href;

// Set an attribute (for a non-value attribute such as checked or selected, set its value to true or false)
$e->href = 'my link';

// Remove an attribute by setting its value to null
$e->href = null;

// Determine whether an attribute exists
if (isset($e->href))
    echo 'href exists!';

Magic Attributes

Attribute Name  Usage
$e->tag         Read or write the tag name of element.
$e->outertext   Read or write the outer HTML text of element.
$e->innertext   Read or write the inner HTML text of element.
$e->plaintext   Read or write the plain text of element.

// Example
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);

echo $e->tag; // Prints: div

echo $e->outertext; // Prints: <div>foo <b>bar</b></div>

echo $e->innertext; // Prints: foo <b>bar</b>

echo $e->plaintext; // Prints: foo bar

Descendant Selectors

Here are some ways to find elements using descendant selectors


// Find all <li> in <ul>   
$es = $html->find('ul li');  

// Find <div> elements nested inside two levels of <div> ancestors
$es = $html->find('div div div');

// Find all <td> in <table> elements with class=hello
$es = $html->find('table.hello td');

// Find all <td> elements with attribute align=center in <table> elements
$es = $html->find('table td[align=center]');


save() Method

The save() method dumps the DOM tree, including any changes you have made, back into HTML. Pass a file path to write the result to that file:


$html->save($request_url);
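
To make the load, modify, save cycle concrete, here is a minimal sketch. It assumes simple_html_dom.php is available next to the script; the markup and the temporary file path are hypothetical and exist only for this illustration.

```php
<?php
// Assumption: simple_html_dom.php (the library file) is in the same directory.
require_once 'simple_html_dom.php';

// Hypothetical markup standing in for a real news page.
$html = str_get_html('<div id="news">Old text</div>');

// Modify the first element with id "news".
$html->find('#news', 0)->innertext = 'My new text!';

// Hypothetical output location: a temporary file created for this example.
$path = tempnam(sys_get_temp_dir(), 'shd');

// save() writes the markup to the given path and also returns it as a string.
$markup = $html->save($path);

// Read the file back to confirm the change was written.
echo file_get_contents($path);
```

Calling save() without an argument only returns the markup, which is convenient when the result is sent somewhere other than a file.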

White Space

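By default the parser strips carriage returns and line feeds from the source. To preserve this white space in your saved HTML, pass false as the $stripRN argument (the last argument in the call below):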


$html = file_get_html($request_url, NULL, NULL, NULL, NULL, NULL, NULL, NULL, false);

Tips

Extract Contents From html


// Extract contents from HTML  
echo $html->plaintext; 

Wrap an Element


// Wrap an element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';

Remove an Element by Setting Its Outertext to an Empty String


// Remove an element by setting its outertext to an empty string
$e->outertext = '';

Append an Element


// Append an element
$e->outertext = $e->outertext . '<div>foo</div>';

Insert an Element


// Insert an element before the current one
$e->outertext = '<div>foo</div>' . $e->outertext;

Examples

Find all img Elements and Print src Values


$html = file_get_html('http://www.google.com/');
foreach($html->find('img') as $element)
    echo $element->src . '<br />';

Find all a Elements and Print href Values


$html = file_get_html('http://www.google.com/');
foreach($html->find('a') as $element)
    echo $element->href . '<br />';

Modify html Elements

The code below modifies a div's contents and adds a class to another div:


$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');
$html->find('div[id=hello]', 0)->innertext = 'foo';   
$html->find('div', 1)->class = 'bar';   
echo $html;

And the above code's output is:


<div id="hello">foo</div><div id="world" class="bar">World</div>

Extract Text Content, Leave the html Tags Behind


echo file_get_html('http://www.google.com/')->plaintext;


dom Tree Traversal

php.net html dom reference.


// Example 
echo $html->find("#div1", 0)->children(1)->children(1)->children(2)->id; 

// or  
echo $html->getElementById("div1")->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');

Traversing the dom Tree

Method                               Description
mixed $e->children ( [int $index] )  Returns the Nth child object if index is set, otherwise returns an array of children.
element $e->parent ()                Returns the parent of element.
element $e->first_child ()           Returns the first child of element, or null if not found.
element $e->last_child ()            Returns the last child of element, or null if not found.
element $e->next_sibling ()          Returns the next sibling of element, or null if not found.
element $e->prev_sibling ()          Returns the previous sibling of element, or null if not found.
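
The traversal methods listed above can be combined to walk a small tree. This is a minimal sketch, assuming simple_html_dom.php sits next to the script; the list markup and variable names are invented for illustration.

```php
<?php
// Assumption: simple_html_dom.php (the library file) is in the same directory.
require_once 'simple_html_dom.php';

// Hypothetical markup, invented for this illustration.
$html = str_get_html('<ul id="menu"><li>One</li><li>Two</li><li>Three</li></ul>');
$ul = $html->find('#menu', 0);

// Step through the children of the list.
$first  = $ul->first_child();      // first <li>
$second = $first->next_sibling();  // the <li> after it
$last   = $ul->last_child();       // last <li>

echo $first->plaintext, "\n";
echo $second->plaintext, "\n";
echo $last->plaintext, "\n";

// parent() climbs back up to the <ul>.
echo $second->parent()->id, "\n";

// children() without an index returns every child as an array.
echo count($ul->children()), "\n";

// Past the last child there is no next sibling, so null comes back.
var_dump($last->next_sibling() === null);
```

Because the sibling and child methods return null when nothing is found, it is worth checking the result before chaining another call onto it.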

Dump Contents of dom Object

Quick dom Object Content Dump


// Dump the internal DOM tree back into a string
$str = (string) $html;

// Print it!
echo $html;

Object-Oriented dom Object Content Dump


// Dump the internal DOM tree back into a string
$str = $html->save();

// Dump the internal DOM tree back into a file
$html->save('result.htm');

Customized Parsing

You can customize how the document is output by registering a callback function, which is invoked for each element when the DOM is dumped:


// Write a function with parameter "$element" 
function my_callback($element) {         
    // Hide all <b> tags          
    if ($element->tag=='b')                 
        $element->outertext = ''; 
}  

// Register the callback function by its function name
$html->set_callback('my_callback'); 

// Callback function will be invoked while dumping 
echo $html;

