hQuery.php

An extremely fast and efficient web scraper that parses megabytes of HTML in a blink of an eye.

API Documentation

Features

Very fast parsing and lookup
Parses broken HTML
jQuery-like style of DOM traversal
Low memory usage
Can handle big HTML documents (I have tested up to 20 MB, but the limit is the amount of RAM you have)
Doesn't require cURL to be installed
Automatically handles redirects (301, 302, 303)
Caches response for multiple processing tasks
PHP 5+
No dependencies

Install

Just include_once 'hquery.php'; in your project and start using hQuery.

Alternatively, install it with Composer: composer require duzun/hquery

or with npm: npm install hquery.php, then require_once 'node_modules/hquery.php/hquery.php';.

Usage

Basic setup:

// Either use Composer, or include this file:
include_once '/path/to/libs/hquery.php';

// Optionally use namespaces (PHP >= 5.3.0 only)
use duzun\hQuery;

// Set the cache path - must be a writable folder
hQuery::$cache_path = "/path/to/cache";
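
Because the cache path must be a writable folder, it can help to check that before assigning it. The following is a minimal sketch using only PHP built-ins; the helper name ensureCacheDir() and the folder under the system temp directory are assumptions for illustration, not part of hQuery.

```php
<?php
// Hypothetical helper (not part of hQuery): make sure a cache folder
// exists and is writable, creating it if needed.
function ensureCacheDir($path)
{
    // mkdir() may race with another process, so re-check is_dir() afterwards.
    if (!is_dir($path) && !mkdir($path, 0775, true) && !is_dir($path)) {
        throw new RuntimeException("Cannot create cache folder: $path");
    }
    if (!is_writable($path)) {
        throw new RuntimeException("Cache folder is not writable: $path");
    }
    return $path;
}

// Placeholder location under the system temp directory.
$cachePath = ensureCacheDir(sys_get_temp_dir() . '/hquery-cache');
```

The checked path can then be assigned as shown above: hQuery::$cache_path = $cachePath;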

Load HTML from a file

hQuery::fromFile( string `$filename`, boolean `$use_include_path` = false, resource `$context` = NULL )

// Local
$doc = hQuery::fromFile('/path/to/filesystem/doc.html');

// Remote
$doc = hQuery::fromFile('https://example.com/');

Where $context is created with stream_context_create().

For an example of using $context to make an HTTP request through a proxy, see #26.
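
As a rough illustration of the $context parameter, the sketch below builds a stream context that sends requests through a proxy. It uses only PHP's standard HTTP context options; the helper name makeProxyContext(), the proxy address and the timeout are placeholder assumptions, and this is not necessarily the approach shown in #26.

```php
<?php
// Hypothetical helper: build a stream context that routes HTTP(S)
// requests through a proxy. All values are placeholders.
function makeProxyContext($proxy, $timeout = 7)
{
    return stream_context_create(array(
        // The 'http' options are also used by the https:// wrapper.
        'http' => array(
            'proxy'           => $proxy,   // e.g. 'tcp://127.0.0.1:8080'
            'request_fulluri' => true,     // send the full URI in the request line
            'timeout'         => $timeout, // read timeout in seconds
        ),
    ));
}

$context = makeProxyContext('tcp://127.0.0.1:8080');
```

The resulting context is passed as the third argument: hQuery::fromFile('https://example.com/', false, $context);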

Load HTML from a string

hQuery::fromHTML( string `$html`, string `$url` = NULL )

$doc = hQuery::fromHTML('<html><head><title>Sample HTML Doc</title><body>Contents...</body></html>');

// Set base_url, in case the document is loaded from local source.
// Note: The base_url is used to resolve relative URLs into absolute ones
$doc->base_url = 'http://desired-host.net/path';
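
To make the role of a base URL concrete, here is a simplified sketch of resolving a relative URL against a base. It is not hQuery's implementation: resolveUrl() is a hypothetical helper that only handles absolute, protocol-relative, root-relative and plain path-relative references, and it ignores dot segments ("../"), query-only and fragment-only references.

```php
<?php
// Hypothetical, simplified resolver (not hQuery's own code).
function resolveUrl($base, $rel)
{
    // Already absolute: it has its own scheme.
    if (parse_url($rel, PHP_URL_SCHEME) !== null) {
        return $rel;
    }

    $b = parse_url($base);
    $origin = $b['scheme'] . '://' . $b['host']
            . (isset($b['port']) ? ':' . $b['port'] : '');

    // Protocol-relative: reuse the scheme of the base.
    if (substr($rel, 0, 2) === '//') {
        return $b['scheme'] . ':' . $rel;
    }
    // Root-relative: replace the whole path of the base.
    if ($rel !== '' && $rel[0] === '/') {
        return $origin . $rel;
    }
    // Path-relative: replace everything after the last slash of the base path.
    $path = isset($b['path']) ? $b['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $rel;
}

$absolute = resolveUrl('http://desired-host.net/path', 'img/logo.png');
```

With the base used above, 'img/logo.png' resolves next to 'path' rather than under it, because the base path does not end with a slash.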

Load a remote HTML document

hQuery::fromUrl( string `$url`, array `$headers` = NULL, array|string `$body` = NULL, array `$options` = NULL )

use duzun\hQuery; // Optional (PHP 5.3+)

// GET the document
$doc = hQuery::fromUrl('http://example.com/someDoc.html', ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']);

var_dump($doc->headers); // See response headers
var_dump(hQuery::$last_http_result); // See response details of last request

// with POST
$doc = hQuery::fromUrl(
    'http://example.com/someDoc.html', // url
    ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'], // headers
    ['username' => 'Me', 'fullname' => 'Just Me'], // request body - could be a string as well
    ['method' => 'POST', 'timeout' => 7, 'redirect' => 7, 'decode' => 'gzip'] // options
);

For building advanced requests (POST, parameters, etc.) see hQuery::http_wr(), though I recommend using a specialized library for making requests and hQuery::fromHTML($html, $url=NULL) for processing the results. See Guzzle, for example.

Another option is to use stream_context_create() to create a $context, then call hQuery::fromFile($url, false, $context).
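
A minimal sketch of the stream-context option for a POST request follows. It relies only on PHP built-ins; the helper name makePostContext() is an assumption for illustration, and the field and header values are sample data.

```php
<?php
// Hypothetical helper: build a stream context for a form-encoded POST.
function makePostContext(array $fields, array $headers = array(), $timeout = 7)
{
    $headers['Content-Type'] = 'application/x-www-form-urlencoded';

    // Turn the name => value map into "Name: value" header lines.
    $lines = array();
    foreach ($headers as $name => $value) {
        $lines[] = "$name: $value";
    }

    return stream_context_create(array(
        'http' => array(
            'method'  => 'POST',
            'header'  => implode("\r\n", $lines),
            'content' => http_build_query($fields), // URL-encoded request body
            'timeout' => $timeout,                  // read timeout in seconds
        ),
    ));
}

$context = makePostContext(
    array('username' => 'Me', 'fullname' => 'Just Me'),
    array('Accept' => 'text/html')
);
```

The context is then used as described above: hQuery::fromFile($url, false, $context);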

Processing the results

hQuery::find( string `$sel`, array|string `$attr` = NULL, hQuery_Node `$ctx` = NULL )

// Find all banners (images inside anchors)
$banners = $doc->find('a > img:parent');

// Extract links and images
$links  = array();
$images = array();
$titles = array();

// If the result of find() is not empty
// $banners is a collection of elements (hQuery_Element)
if ( $banners ) {
    
    // Iterate over the result
    foreach($banners as $pos => $a) {
        $links[$pos] = $a->attr('href'); // get absolute URL from href property
        $titles[$pos] = trim($a->text()); // strip all HTML tags and leave just text

        // Filter the result
        if ( !$a->hasClass('logo') ) {
            $img = $a->find('img')[0]; // ArrayAccess
            if ( $img ) $images[$pos] = $img->src; // short for $img->attr('src')
        }
    }

    // If at least one element has the class .home
    if ( $banners->hasClass('home') ) {
        echo 'There is .home button!', PHP_EOL;

        // ArrayAccess for elements and properties.
        if ( $banners[0]['href'] == '/' ) {
            echo 'And it is the first one!';
        }
    }
}

// Read charset of the original document (internally it is converted to UTF-8)
$charset = $doc->charset;

// Get the size of the document ( strlen($html) )
$size = $doc->size;

Live Demo

On DUzun.Me

A lot of people ask for sources of my Live Demo page. Here we go:

view-source:https://duzun.me/playground/hquery

TODO

Unit test everything
Document everything
~~Cookie support~~ (implemented in memory for redirects)
Use HTTPlug internally
Add more selectors
Improve selectors to be able to select by attributes

💖 Support my projects

I love Open Source. Whenever possible I share cool things with the world (check out NPM and GitHub).

If you like what I'm doing and want to encourage me, please consider:

Starring and sharing the projects you like (and use)
Sending me some Bitcoin at this address: bitcoin:3MVaNQocuyRUzUNsTbmzQC8rPUQMC9qafa (or using the QR below)
