linbot

How you install and use Linbot depends on your system's setup. If you have Python on your system, the recommended way of using Linbot is to run the modules through your Python interpreter. Python 1.5 is required and may be downloaded freely at http://www.python.org/

If you do not have Python and are running a Linux 2.x ELF (Intel) system you may use the included 'frozen' executable. This executable includes a built-in Python interpreter and all the modules needed to run Linbot.

Note that, due to the nature of Python and frozen executables, it is not possible to run the frozen version of Linbot on a system that has a version of Python earlier than 1.5 installed. This will be fixed in a later version of Python.


Installing Linbot

Installation is relatively easy.

  1. Unpack the gzipped tar archive into a directory. Recommended directories are /usr/lib/linbot or ~/linbot. Be sure to add this directory to your PYTHONPATH environment variable:

    $ tar zxvf linbot-0.8.tar.gz -C /usr/lib/linbot
    $ PYTHONPATH="/usr/lib/linbot:$PYTHONPATH"
    $ export PYTHONPATH
    
  2. Add a symbolic link to <main> somewhere in your PATH, where <main> is:

    • "linbot.py" if you have Python on your system
    • "linbot" if you are using the frozen Linux executable

    Note: Starting with 0.8, there will be no frozen Linux executable for Linbot until all major distributions have switched to libc6.

    $ ln -s /usr/lib/linbot/linbot.py /usr/local/bin/linbot

    or

    $ ln -s /usr/lib/linbot/linbot /usr/local/bin/linbot

  3. Edit the config.py file to your liking. Most of the defaults are safe, and the important items can be overridden with command-line flags. You may want to keep a copy of the original config.py just in case. The config.py options are documented within the file.


Running Linbot

It is simple to run Linbot.

Executing Linbot without any command-line arguments will cause it to print a simple synopsis of its usage and then quit:

$ linbot
linbot [-x regex]... [-y regex]... [-b][-a][-o dir][-w sec] url [location]...

Before running Linbot on a site, you need to do a little preparation.

One thing that Linbot needs is a directory in which to publish its reports. It is recommended that you choose a directory that is empty. Note that this directory must exist and be writable by Linbot.

$ mkdir /usr/local/httpd/htdocs/linbot

The report can be viewed using most Web browsers. Browsers that support frames should open the "index.html" file first. Browsers without frames support, or with frames disabled, can open the "navbar.html" file instead. Note that these are the default filenames for Linbot and may be changed via the config file.

Second, decide beforehand which structures on your site should be considered "internal" and which should be considered "external". Linbot defines internal and external links as follows:

An internal link points to a part of your site that you control and that should be checked, along with the links it points to. Basically, an internal link is one that, if broken, you have the power to fix.

An external link is one that your site points to but that you have no jurisdiction over. It can also be a link that you may have the power to change but that need not be checked for broken links, such as CGI scripts or pages generated by an automated tool (such as Linbot or any program that converts a document of one format to HTML).

Your base url is the url that is the top level of your web site. Commonly referred to as the "home page", it is the url that points to all other pages either directly or indirectly. A base url can be on one server but may point to pages that are on another server but should still be considered internal. An example would be a main server www.someplaceonthenet.com in which there may be links to an alternate or load balancing server called www2.someplaceonthenet.com. In this example www2.someplaceonthenet.com would host internal links even though your "home page" may be http://www.someplaceonthenet.com

That said, you should now have a basic idea of what you do and do not want Linbot to check. Don't be surprised if you don't get it exactly right the first time. Also consider using a robots.txt file: Linbot honors this protocol, as do other web robots that may run across your site. The protocol is useful for telling robots that some parts of your site, such as CGI scripts, internal documents, or server stats, should not be explored. See http://info.webcrawler.com/mak/projects/robots/exclusion-admin.htm for an explanation of the robots.txt protocol. Currently Linbot identifies itself as User-Agent: Linbot.

You can allow Linbot to search a directory but restrict other bots, for example, like this:

   User-agent: *
   Disallow: /

   User-agent: Linbot
   Allow: /
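
If you want to see how rules like these are interpreted before pointing a robot at your site, you can evaluate them offline. The following is a minimal sketch in modern Python 3 using the standard urllib.robotparser module. It is not Linbot's own code (Linbot targets Python 1.5), the agent name "OtherBot" is made up for illustration, and it never contacts a server.

```python
# Sketch only: evaluates robots.txt rules with the Python 3 standard library.
# This is not how Linbot itself is implemented.
from urllib.robotparser import RobotFileParser

# The same rules as in the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Linbot
Allow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; nothing is downloaded.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    site = "http://www.someplaceonthenet.com/"
    # "OtherBot" is a hypothetical robot name used for comparison.
    for agent in ("Linbot", "OtherBot"):
        print(agent, allowed(agent, site))
```

Under these rules the check should succeed for Linbot and fail for any robot that has no record of its own.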

Okay, you've heard enough and you want to run the darn thing. The simplest way to run Linbot is:

$ linbot http://www.someplaceonthenet.com/

This will first read the robots.txt file at www.someplaceonthenet.com and then proceed to examine every link pointed to on that site, except links denied by robots.txt, if that file exists.
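
To give a feel for what "examining every link" on a page involves, here is a rough sketch in modern Python 3 of one step: pulling the links out of an HTML page and resolving them against the page's own url. It is an illustration only, not Linbot's implementation. The names LinkCollector and extract_links are invented for this example, and it looks only at <a href="..."> tags, which is an assumption about which links matter.

```python
# Sketch only: collect the links on one page (Python 3 standard library).
# LinkCollector and extract_links are hypothetical names, not part of Linbot.
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Gather the targets of <a href="..."> tags as absolute urls."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative links are resolved against the page they appear on.
                self.links.append(urljoin(self.page_url, value))


def extract_links(page_url, html):
    """Return the absolute urls linked from the given HTML text."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

A link checker would then test each returned url and, for internal pages, repeat the process on the page it points to.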

The exact usage for Linbot is explained below:


SYNOPSIS

linbot [-x regex]... [-y regex][-b][-a][-o dir][-w sec] url [location[:port]]...

-x regex

Use this option to tell Linbot to consider any url matching <regex> to be external. This option can be used multiple times.

-y regex

Like the -x switch, except that this option causes Linbot not to check the link at all, whereas -x checks the link but not its children.

-b

Base URLs only. Tells Linbot to consider any url that does not start with the base url to be external. For example, if you run 'linbot -b http://www.someplaceonthenet.com/~someuser/' then http://www.someplaceonthenet.com/~someuser/misc/index.html will be considered internal whereas http://www.someplaceonthenet.com/ would be considered external.

-a

Avoid external links. Normally, if Linbot is examining an HTML page and finds a link that points to an external document, it will not examine the external document, but it will check that the document exists, since you may not want to point to broken links whether internal or external. Sometimes this default behavior is not desirable. If the -a option is given, Linbot will not check for the existence of external links.

-o dir

Output directory. Used to specify the directory where Linbot will write its report files. The default is the current directory or the directory specified in config.py.

-w sec

Wait sec seconds. Usually, Linbot will process a URL and immediately move on to the next one. However, on some heavily loaded systems, it may be preferable to have Linbot wait a while between requests. This option should be set to any non-negative number (in seconds).

url

The base url. Linbot checks this link first, then all the links it points to on down the "tree".

location

This specifies that urls pointing to <location> are to be considered internal. This can be useful, for example, if the base url is on one server but points to "internal" documents on another server. location is the name of that server, for example www2.someplaceonthenet.com. This can also be used if you have an intranet where some urls point to http://www.someplaceonthenet.com whereas others point to just 'www'. This option may be used more than once, but must follow the base url.

The switches (and other options) can be changed in the config.py file. It is recommended that you look at (and edit) this file.
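
To make the interplay of -x, -y, -b and location easier to picture, the rules described above can be sketched as a small function. This is a guess at the logic written in modern Python 3, not Linbot's actual code. The function name classify and its parameters are invented here. It also makes three assumptions this document does not settle: that regexes are matched anywhere in the url, that -y is tested before -x, and that -b takes precedence over any location arguments.

```python
# Sketch only: one possible reading of the option descriptions above.
# classify() and its parameter names are hypothetical, not part of Linbot.
import re
from urllib.parse import urlsplit


def classify(url, base_url, external=(), skip=(), base_only=False, locations=()):
    """Return 'skip', 'external' or 'internal' for url.

    external  -- regexes given with -x
    skip      -- regexes given with -y
    base_only -- True when -b is given
    locations -- extra server names listed after the base url
    """
    # -y regex: the link is not checked at all.
    if any(re.search(pattern, url) for pattern in skip):
        return "skip"
    # -x regex: the link is checked, but its children are not.
    if any(re.search(pattern, url) for pattern in external):
        return "external"
    # -b: only urls that start with the base url are internal.
    # Assumption: -b wins over any location arguments.
    if base_only:
        return "internal" if url.startswith(base_url) else "external"
    # Otherwise a url is internal when it lives on the base url's server
    # or on one of the extra locations.
    internal_hosts = {urlsplit(base_url).netloc.lower()}
    internal_hosts.update(location.lower() for location in locations)
    host = urlsplit(url).netloc.lower()
    return "internal" if host in internal_hosts else "external"
```

With the -b example above, classify("http://www.someplaceonthenet.com/", "http://www.someplaceonthenet.com/~someuser/", base_only=True) returns "external", while a url under /~someuser/ returns "internal".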


Examples

Here are some examples of running Linbot.

$ linbot -x /linbot \
  http://manson.ddns.org/ starship.skyport.net

$ linbot -o /stats/altavista/ \
  http://altavista.digital.com/

$ linbot -o ~/Lang/Python/linbot \
  -b http://manson.ddns.org/~marduk/ manson

Running Periodically

Linbot may be safely run periodically or on off-peak hours using cron or at. It may be safely run unattended. You may want to redirect Linbot's output to a null device, log file or have it emailed to an account. Consult your operating system manuals on how this can be done on your system.
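
If you would rather not rely on shell redirection, a small wrapper script can capture the output for you. The following is a minimal sketch in modern Python 3, independent of Linbot itself. The function name run_and_log is invented for this example, and any log file path you pass to it is your own choice.

```python
# Sketch only: run a command unattended and append its output to a log file.
# run_and_log is a hypothetical helper, not something shipped with Linbot.
import subprocess
from datetime import datetime


def run_and_log(command, log_path):
    """Run command, append its output to log_path, return the exit status."""
    with open(log_path, "a") as log:
        log.write("--- run started %s ---\n" % datetime.now().isoformat())
        # Flush our header so it appears before the command's own output.
        log.flush()
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

A script started by cron or at could then call, for example, run_and_log(["linbot", "-o", "/usr/local/httpd/htdocs/linbot", "http://www.someplaceonthenet.com/"], "/var/log/linbot.log"), where the log path is only a placeholder.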


Questions/Bug Reports

If you have any questions about Linbot or would like to report a bug, send electronic mail to the mailing list. Check the archives first to make sure that the bug has not already been reported. To assist in tracking down bugs, please include a URL where the problem can be found, an HTML file where the error occurs, or a (small) tar file of a site where the error occurs. Suggestions for improvements are also welcome. Do not send email to marduk directly concerning bug reports!
