Indexing Web with Head-r

Head-r is a free Perl program that recursively follows links found on (HTML) Web pages hosted on an HTTP server, and issues HEAD requests for the links of interest to the user.

The intended use for this program is to create URI lists for later selective mirroring of file-hosting sites.

Synopsis

head-r [-v|--verbose] [-j|--bzip2|-z|--gzip]
    [--include-re=RE] [--exclude-re=RE]
    [--depth=N] [--info-re=RE] [--descend-re=RE]
    [-i|--input=FILE]... [-o|--output=FILE]
    [-P|--no-proxy] [-U|--user-agent=USER-AGENT]
    [-w|--wait=DELAY]
    [--] [URI]...

Basic usage

Arguably, the most important Head-r options are --info-re= and --descend-re=, which determine (by means of regular expressions) which URIs will be considered for mere HEAD requests, and which ones Head-r will try to get more URIs from.

Simplistic, no-recursion example

For the following example, we’ll use . – a regular expression that matches any non-empty string – to allow Head-r to make HEAD requests to both of the URIs given.

$ head-r --info-re=. \
      -- http://example.org/ http://example.net/ 
http://example.org/	1381334900	1	1270	200
http://example.net/	1381334903	1	1270	200

The fields are delimited with ASCII HT (also known as TAB) codes, and are as follows:

  1. URI;
  2. timestamp (in seconds since the system-dependent epoch; see also Unix time);
  3. recursion depth used when considering this URI;
  4. the length of the response in octets (as per the Content-Length: HTTP reply header);
  5. HTTP status code of the reply.
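
The field layout above maps naturally onto a small record type. The following Python sketch is not part of Head-r; the Record and parse_line names are hypothetical, and it assumes that a line carries either the URI alone (as happens for URIs that Head-r merely lists) or all five fields.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    """One line of Head-r output (hypothetical helper type)."""
    uri: str
    timestamp: Optional[int] = None   # seconds since the epoch
    depth: Optional[int] = None       # recursion depth for this URI
    length: Optional[int] = None      # Content-Length, in octets
    status: Optional[int] = None      # HTTP status code


def _to_int(field: str) -> Optional[int]:
    # An empty field is treated as "unknown" (an assumption of this sketch).
    return int(field) if field else None


def parse_line(line: str) -> Record:
    """Split a TAB-delimited line into a Record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 1:
        return Record(fields[0])
    if len(fields) != 5:
        # Do not guess which field is missing.
        raise ValueError(f"expected 1 or 5 fields, got {len(fields)}")
    uri, *numbers = fields
    return Record(uri, *(_to_int(f) for f in numbers))


rec = parse_line("http://example.org/\t1381334900\t1\t1270\t200")
print(rec.uri, rec.length, rec.status)
```

Lines with any other number of fields are rejected rather than interpreted, since the sketch has no way to tell which field is absent.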

Recurse once example

For the following example, we’ll also enable actual recursion (still at a maximum depth of 1), by using the --descend-re=/\$ option.

$ head-r --info-re=. --descend-re=/\$ \
      -- http://example.org/ http://example.net/ 
http://example.org/	1381337824	1	1270	200
http://www.iana.org/domains/example	1381337829	0	200
http://example.net/	1381337830	1	1270	200

As can be seen, at http://example.org/ Head-r found another URI to consider: http://www.iana.org/domains/example, which it followed and issued a HEAD request for.

It’s easy to check that http://example.net/ also references the same URI. However, as Head-r remembers the URIs it processes (along with the recursion depth at that point), no further request was issued.
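
The effect of this bookkeeping can be illustrated with a toy model. The sketch below is written in Python rather than Perl, is not taken from the Head-r source, and uses a hypothetical in-memory link table (LINKS) in place of real HTTP traffic. It only shows why a URI referenced from two pages is requested once; how Head-r actually uses the stored depth is not modeled.

```python
# Hypothetical link table standing in for the pages of the example above.
LINKS = {
    "http://example.org/": ["http://www.iana.org/domains/example"],
    "http://example.net/": ["http://www.iana.org/domains/example"],
}


def crawl(start_uris, depth=1):
    """Return the URIs "requested", in order, visiting each URI only once."""
    seen = {}          # URI -> recursion depth left when it was first processed
    requested = []

    def visit(uri, depth_left):
        if uri in seen:
            return                     # already processed: no second request
        seen[uri] = depth_left
        requested.append(uri)          # stands in for the HEAD request
        if depth_left > 0:
            for link in LINKS.get(uri, []):
                visit(link, depth_left - 1)

    for uri in start_uris:
        visit(uri, depth)
    return requested


print(crawl(["http://example.org/", "http://example.net/"]))
```

The shared URI appears once in the result, between the two starting URIs, which mirrors the order of the output shown above.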

Limiting HEAD requests

Consider now that the resource we’re to recurse through references URIs that are of no interest to us. For the following example, we’ll use a more selective regular expression than the . we used above.

$ head-r --{info,descend}-re=wikipedia\\.org/wiki/ \
      -- http://en.wikipedia.org/wiki/Main_Page 
http://en.wikipedia.org/wiki/Main_Page	1381339589	1	61499	200
. . .
http://en.wikipedia.org/w/api.php?action=rsd
http://creativecommons.org/licenses/by-sa/3.0/
. . .
http://meta.wikimedia.org/
http://en.wikipedia.org/wiki/Wikipedia	1381339589	0	609859	200
http://en.wikipedia.org/wiki/Free_content	1381339589	0	124407	200
. . .

(Please note that we’ve just used the Bash {,} expansion to pass the same regular expression to both --info-re= and --descend-re=. Be sure to adjust to the command line interpreter actually in use.)

In the output above, a number of URIs came without any of the usual information. These URIs were found by Head-r, but as they matched neither the “info” (--info-re=) nor the “descend” (--descend-re=) regular expression specified, no action was taken on them. The URIs are still output, however, in case we decide to adjust the regular expressions themselves.
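
The way the two expressions select an action can be modeled in a few lines. This is a sketch of the behavior described above, not Head-r's code: the actions function is hypothetical, recursion depth limits are ignored, and Python's re module stands in for Perl regular expressions, which differ in some details.

```python
import re


def actions(uri, info_re=None, descend_re=None):
    """Return the set of actions the two options would select for a URI."""
    selected = set()
    if info_re is not None and re.search(info_re, uri):
        selected.add("head")       # issue a HEAD request
    if descend_re is not None and re.search(descend_re, uri):
        selected.add("descend")    # fetch the page and look for more URIs
    return selected


pattern = r"wikipedia\.org/wiki/"
for uri in ("http://en.wikipedia.org/wiki/Free_content",
            "http://meta.wikimedia.org/"):
    print(uri, sorted(actions(uri, pattern, pattern)))
```

A URI for which the returned set is empty corresponds to a line that carries the URI alone in the output above.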

Skipping unwanted URIs altogether

The --include-re= and --exclude-re= regular expressions are considered before all the other ones, and currently have the following semantics:

  1. the inclusion regular expression is applied first; the URI will be considered if it matches;
  2. unless decided at the step above, the exclusion regular expression is then applied; the URI will not be considered if it matches;
  3. unless decided by the rules above, the URI will be considered.

If none of these options are given, any URI will be considered by Head-r.
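
The rules above can be sketched as a single function. The is_considered name is hypothetical, and Python's re module is used here only as a stand-in for Perl regular expressions; the code illustrates the documented semantics rather than reproducing Head-r's implementation.

```python
import re


def is_considered(uri, include_re=None, exclude_re=None):
    """Apply the three rules listed above, in order."""
    if include_re is not None and re.search(include_re, uri):
        return True      # rule 1: an inclusion match decides in favor
    if exclude_re is not None and re.search(exclude_re, uri):
        return False     # rule 2: otherwise an exclusion match rejects
    return True          # rule 3: nothing matched, so the URI is considered


include, exclude = r"wikipedia\.org/wiki/", r"."
print(is_considered("http://en.wikipedia.org/wiki/Wikipedia", include, exclude))
print(is_considered("http://creativecommons.org/licenses/by-sa/3.0/", include, exclude))
```

Combining a selective inclusion expression with an exclusion expression of . therefore keeps only the URIs that match the inclusion expression.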

The following example exploits these options to further limit the output of Head-r for the case above.

$ head-r --{include,descend}-re=wikipedia\\.org/wiki/ \
      --{info,exclude}-re=. \
      -- http://en.wikipedia.org/wiki/Main_Page 
http://en.wikipedia.org/wiki/Main_Page	1381341336	1	61499	200
http://en.wikipedia.org/wiki/Wikipedia	1381341337	0	609859	200
http://en.wikipedia.org/wiki/Free_content	1381341337	0	124407	200
http://en.wikipedia.org/wiki/Encyclopedia	1381341337	0	151164	200
http://en.wikipedia.org/wiki/Wikipedia:Introduction	1381341337	0	50687	200
. . .

Saving state between sessions

Head-r is capable of reading its own output, both to avoid issuing duplicate HEAD requests and to discover the URIs of the resources to recurse into.

Restoring what was saved

Let us revisit one of our previous examples, which we’ll now alter to only issue a HEAD request to a couple of pages:

$ head-r --output=state.a \
      --info-re='/(Free_content|Wikipedia)$' \
      --descend-re=wikipedia\\.org/wiki/ \
      -- http://en.wikipedia.org/wiki/Main_Page 
$ grep -E \\s < state.a 
http://en.wikipedia.org/wiki/Main_Page	1381417546	1	61499	200
http://en.wikipedia.org/wiki/Wikipedia	1381417546	0	609859	200
http://en.wikipedia.org/wiki/Free_content	1381417546	0	124407	200
$ 

Now, why not include a few more pages, such as all the pages whose names start with F?

$ head-r \
      --input=state.a --output=state.b \
      --info-re=/wiki/F \
      --descend-re=wikipedia\\.org/wiki/ 
$ grep -E \\s < state.b 
http://en.wikipedia.org/wiki/File:Diary_of_a_Nobody_first.jpg	1381417906	0	34344	200
http://en.wikipedia.org/wiki/File:Progradungula_otwayensis_cropped.png	1381417906	0	30604	200
http://en.wikipedia.org/wiki/File:AW_TW_PS.jpg	1381417907	0	33297	200
http://en.wikipedia.org/wiki/Fran%C3%A7ois_Englert	1381417907	0	87860	200
http://en.wikipedia.org/wiki/File:Washington_Monument_Dusk_Jan_2006.jpg	1381417907	0	83137	200
http://en.wikipedia.org/wiki/File:Walt_Disney_Concert_Hall,_LA,_CA,_jjron_22.03.2012.jpg	1381417907	0	67225	200
http://en.wikipedia.org/wiki/Frank_Gehry	1381417907	0	152838	200
$ 

Note that while our --info-re= has obviously covered http://en.wikipedia.org/wiki/Free_content, no HEAD request was made to the page, as our --input=state.a file already had the relevant information.

Also, as all the URIs we wanted Head-r to consider were already listed in state.a, it was unnecessary to specify any URIs on the command line. When the URIs come from both command line arguments and --input= files, those coming from the command line are considered first.

Compression

As recursing through large Web sites may result in large output lists, Head-r provides support for compression of output data.

The --bzip2 (-j) and --gzip (-z) options select the compression method to use for the output, whether it goes to the file specified with --output= or to standard output. Head-r will, however, exit with an error if compression is enabled and the output goes to a terminal device.

Head-r transparently decompresses the files given as inputs (--input=), thanks to the IO::Uncompress::AnyUncompress library.
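
A program that post-processes Head-r output may want similar transparency. The Python sketch below is one possible approach and is unrelated to how IO::Uncompress::AnyUncompress works internally: the hypothetical open_state function inspects the leading bytes of the file and picks a standard-library opener accordingly. UTF-8 text is assumed.

```python
import bz2
import gzip


def open_state(path):
    """Open a state file as text, whether plain, gzip- or bzip2-compressed."""
    with open(path, "rb") as probe:
        magic = probe.read(3)
    if magic[:2] == b"\x1f\x8b":        # gzip magic number
        return gzip.open(path, "rt", encoding="utf-8")
    if magic == b"BZh":                 # bzip2 magic number
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


# Usage (the file name is only an example):
#     with open_state("state.a") as state:
#         for line in state:
#             ...
```

Detecting the format from the content rather than from the file name means the caller need not know which of the compression options was in effect when the file was written.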

Adjusting HTTP client behavior

There are two options which influence the behavior of the HTTP client used by Head-r: --wait= (-w) and --user-agent= (-U).

The --wait= option specifies the amount of time, in seconds, to wait between two consecutive HTTP requests. The default is about 2.7 seconds.

The --user-agent= option specifies the value for the User-Agent: header to use in HTTP requests, and may come in handy should the target server block access based on this header’s data. The default is composed of the string HEAD-R-Bot/, Head-r’s own version, and the identity of the libwww-perl library used. For example: HEAD-R-Bot/0.1 libwww-perl/6.05.
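
To make the roles of the two settings concrete, here is a rough Python analogue built on urllib.request. It is not how Head-r (which uses libwww-perl) is implemented; the head_request and head_all names are hypothetical, error handling and /robots.txt processing are omitted, and the opener and sleep parameters exist only so the loop can be exercised without network access.

```python
import time
import urllib.request

USER_AGENT = "HEAD-R-Bot/0.1 libwww-perl/6.05"   # the example value quoted above
WAIT = 2.7                                        # seconds, the documented default


def head_request(uri, user_agent=USER_AGENT):
    """Build (but do not send) a HEAD request carrying the given User-Agent."""
    return urllib.request.Request(uri, method="HEAD",
                                  headers={"User-Agent": user_agent})


def head_all(uris, wait=WAIT, opener=urllib.request.urlopen, sleep=time.sleep):
    """Send HEAD requests one by one, pausing `wait` seconds between them."""
    results = []
    for index, uri in enumerate(uris):
        if index > 0:
            sleep(wait)                 # delay between consecutive requests
        with opener(head_request(uri)) as reply:
            results.append((uri, reply.status,
                            reply.headers.get("Content-Length")))
    return results
```

A script of one's own should of course send its own User-Agent string rather than the Head-r example value used as the default here.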

Bugs

Please consider reporting any bugs in the Head-r software not listed below via the CPAN RT, https://rt.cpan.org/Public/Dist/Display.html?Name=head-r. The bugs in this documentation should be reported to the respective Wikibooks Talk page – or you may actually fix them yourself!

As with any other automatic retrieval tool, Head-r can be abused to cause excessive load on third-party servers. The user is advised to consider the network environment when using the tool, especially when lowering the --wait= setting or raising the maximum recursion --depth= beyond reasonable values.

There’s currently no way to disable the /robots.txt file processing.

The code only tries to retrieve URIs from content marked with text/html media type, even though it seems as if the support for application/xhtml+xml (and perhaps several other XML-based types, such as SVG) could be implemented rather easily.

The resource to retrieve URIs from is first loaded into memory, while it should be possible to process it on-the-fly.

The handling of recursion depths retrieved from --input= files may be somewhat unintuitive, and out of the user’s control. (Although it’s still possible to edit such files using third-party tools, such as AWK.)
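
As an illustration of such third-party editing, the Python sketch below overwrites the recursion depth field of every record before the file is fed back through --input=. The set_depth name is hypothetical, and the sketch says nothing about how Head-r interprets the new value; it merely rewrites the third TAB-delimited field and leaves lines that carry a URI alone untouched.

```python
def set_depth(lines, depth):
    """Yield state-file lines with the third field replaced by `depth`."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            fields[2] = str(depth)     # field 3 is the recursion depth
        yield "\t".join(fields) + "\n"


sample = [
    "http://en.wikipedia.org/wiki/Main_Page\t1381417546\t1\t61499\t200\n",
    "http://meta.wikimedia.org/\n",
]
print("".join(set_depth(sample, 2)), end="")
```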

The code implements a trivial work-around for the long-standing Net::HTTP bug #29468.

Availability

The latest stable version of the code is available from CPAN. Check, for instance, the respective Metacpan page at https://metacpan.org/release/head-r.

The latest development version can be obtained from a Git repository, for example:

$ git clone -- \
      http://am-1.org/~ivan/archives/git/head-r-2013.git/ head-r 

A Gitweb interface is available at http://am-1.org/~ivan/archives/git/gitweb.cgi?p=head-r-2013.git.

Author

Head-r is written by Ivan Shmakov.

Head-r is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This documentation is a free collaborative project going on at Wikibooks, and is available under the Creative Commons Attribution/Share-Alike License (CC BY-SA) version 3.0.