An extremely fast and efficient web scraper that parses megabytes of HTML in the blink of an eye.
- Very fast parsing and lookup
- Parses broken HTML
- jQuery-like style of DOM traversal
- Low memory usage
- Can handle big HTML documents (I have tested up to 20 MB, but the only real limit is the amount of RAM you have)
- Doesn't require cURL to be installed
- Automatically handles redirects (301, 302, 303)
- Caches response for multiple processing tasks
- PHP 5+
- No dependencies
Just include_once 'hquery.php'; in your project and start using hQuery.
Alternatively, run composer require duzun/hquery, or npm install hquery.php and then require_once 'node_modules/hquery.php/hquery.php';.
// Either use Composer, or include this file directly:
include_once '/path/to/libs/hquery.php';
// Optionally use namespaces (PHP >= 5.3.0 only)
use duzun\hQuery;
// Set the cache path - must be a writable folder
hQuery::$cache_path = "/path/to/cache";
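If hQuery was installed with Composer, the setup looks about the same; here is a minimal sketch, assuming the default vendor/ directory and Composer's autoloader:
// Composer setup (assumes `composer require duzun/hquery` was run in the project root)
require_once 'vendor/autoload.php';
use duzun\hQuery;
// Same cache configuration as above
hQuery::$cache_path = "/path/to/cache";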
hQuery::fromFile( string $filename, boolean $use_include_path = false, resource $context = NULL )
// Local
$doc = hQuery::fromFile('/path/to/filesystem/doc.html');
// Remote
$doc = hQuery::fromFile('https://example.com/');
Here $context is created with stream_context_create(). For an example of using $context to make an HTTP request through a proxy, see #26.
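For illustration only, a $context that tunnels the request through an HTTP proxy could be built like this (the proxy address below is a made-up placeholder):
// Hypothetical proxy address - adjust to your environment
$context = stream_context_create([
    'http' => [
        'proxy'           => 'tcp://127.0.0.1:8080',
        'request_fulluri' => true,
    ],
]);
$doc = hQuery::fromFile('https://example.com/', false, $context);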
hQuery::fromHTML( string $html, string $url = NULL )
$doc = hQuery::fromHTML('<html><head><title>Sample HTML Doc</title><body>Contents...</body></html>');
// Set base_url in case the document was loaded from a local source.
// Note: base_url is used to resolve relative URLs into absolute ones
$doc->base_url = 'http://desired-host.net/path';
hQuery::fromUrl( string $url, array $headers = NULL, array|string $body = NULL, array $options = NULL )
use duzun\hQuery; // Optional (PHP 5.3+)
// GET the document
$doc = hQuery::fromUrl('http://example.com/someDoc.html', ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']);
var_dump($doc->headers); // See response headers
var_dump(hQuery::$last_http_result); // See response details of last request
// with POST
$doc = hQuery::fromUrl(
'http://example.com/someDoc.html', // url
['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'], // headers
['username' => 'Me', 'fullname' => 'Just Me'], // request body - could be a string as well
['method' => 'POST', 'timeout' => 7, 'redirect' => 7, 'decode' => 'gzip'] // options
);
For building advanced requests (POST, custom parameters etc.) see hQuery::http_wr(), though I recommend using a specialized library for making requests and hQuery::fromHTML($html, $url=NULL) for processing the results. See Guzzle, for example.
Another option is to use stream_context_create() to create a $context, then call hQuery::fromFile($url, false, $context).
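For instance, a rough sketch of that combination, assuming Guzzle is installed (composer require guzzlehttp/guzzle) and using the same placeholder URL as above:
use GuzzleHttp\Client;
use duzun\hQuery;

// Fetch the document with Guzzle...
$url = 'http://example.com/someDoc.html';
$client = new Client();
$response = $client->request('GET', $url);

// ...and let hQuery parse it; pass the URL so relative links can be resolved
$doc = hQuery::fromHTML((string) $response->getBody(), $url);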
hQuery::find( string $sel, array|string $attr = NULL, hQuery_Node $ctx = NULL )
// Find all banners (images inside anchors)
$banners = $doc->find('a > img:parent');
// Extract links and images
$links = array();
$images = array();
$titles = array();
// If the result of find() is not empty
// $banners is a collection of elements (hQuery_Element)
if ( $banners ) {
    // Iterate over the result
    foreach ($banners as $pos => $a) {
        $links[$pos]  = $a->attr('href'); // get absolute URL from the href attribute
        $titles[$pos] = trim($a->text()); // strip all HTML tags and leave just text
        // Filter the result
        if ( !$a->hasClass('logo') ) {
            $img = $a->find('img')[0]; // ArrayAccess
            if ( $img ) $images[$pos] = $img->src; // short for $img->attr('src')
        }
    }
    // If at least one element has the class .home
    if ( $banners->hasClass('home') ) {
        echo 'There is a .home button!', PHP_EOL;
        // ArrayAccess for elements and properties.
        if ( $banners[0]['href'] == '/' ) {
            echo 'And it is the first one!';
        }
    }
}
// Read charset of the original document (internally it is converted to UTF-8)
$charset = $doc->charset;
// Get the size of the document ( strlen($html) )
$size = $doc->size;
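A small usage sketch tying these together (the <title> lookup is just an illustration, assuming the document has one):
// Print the title together with the metadata read above
$title = $doc->find('title');
$title = $title ? $title[0] : NULL;      // ArrayAccess, as in the example above
echo $title ? trim($title->text()) : '(no title)', PHP_EOL;
echo $charset, ', ', $size, ' bytes', PHP_EOL;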
On DUzun.Me
A lot of people ask for the source of my Live Demo page. Here it is:
view-source:https://duzun.me/playground/hquery
#TODO
- Unit test everything
- Document everything
- Cookie support (implemented in memory for redirects)
- Use HTTPlug internally
- Add more selectors
- Improve selectors to be able to select by attributes
I love Open Source. Whenever possible I share cool things with the world (check out my NPM and GitHub profiles).
If you like what I'm doing and want to encourage me, please consider: