
How you install and use Linbot depends on how your system is set up. If you
have Python on your system, the recommended way of using Linbot is to run the
modules through your Python interpreter. Python 1.5 is required and may be
downloaded freely at
http://www.python.org/
If you do not have Python and are running a Linux 2.x ELF (Intel) system, you
may use the included 'frozen' executable. This executable includes a built-in
Python interpreter and all the modules needed to run Linbot.
Note that, due to the nature of Python and frozen executables, it is not
possible to run the frozen version of Linbot on a system that has a version
of Python earlier than 1.5 installed. This will be fixed in a later version
of Python.
Installing Linbot
Installation is relatively easy.
Unpack the gzipped tar archive into a directory. Recommended directories are
/usr/lib/linbot or ~/linbot. Be sure to add this directory to
your PYTHONPATH environment variable:
$ tar zxvf linbot-0.8.tar.gz -C /usr/lib/linbot
$ PYTHONPATH="/usr/lib/linbot:$PYTHONPATH"
$ export PYTHONPATH
Add a symbolic link to <main> somewhere in your PATH, where <main> is:
- "linbot.py" if you have Python on your system
- "linbot" if you are using the frozen Linux executable
Note: Starting with 0.8, there will be no frozen Linux executable for
Linbot until all major distributions have switched to libc6.
$ ln -s /usr/lib/linbot/linbot.py /usr/local/bin/linbot
or
$ ln -s /usr/lib/linbot/linbot /usr/local/bin/linbot
Edit the config.py file to suit your needs. Most of the defaults are
safe, and the important items can be overridden with command-line flags.
You may want to keep a copy of the original config.py just in case.
The config.py options are documented within the file.
Running Linbot
It is simple to run Linbot.
Executing Linbot without any command-line arguments will cause it to
give a simple synopsis of its usage and then quit:
$ linbot
linbot [-x regex]... [-y regex]... [-b][-a][-o dir][-w sec] url [location]...
Before running Linbot on a site, you need to do a little preparation.
First, Linbot needs a directory in which to publish its reports. It is
recommended that you choose a directory that is empty. Note that this
directory must exist and be writable by Linbot.
$ mkdir /usr/local/httpd/htdocs/linbot
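If you want to confirm those two requirements before a run, a small check
along the following lines may help. This is only a sketch and not part of
Linbot; the function names report_dir_ready and report_dir_empty are made up
for illustration.

```python
import os


def report_dir_ready(path):
    """Return True if path is an existing directory we can write into."""
    # W_OK lets us create files; X_OK lets us enter the directory.
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)


def report_dir_empty(path):
    """Return True if the directory has no entries (recommended, not required)."""
    return not os.listdir(path)
```

For example, report_dir_ready("/usr/local/httpd/htdocs/linbot") should be
true after the mkdir above, provided the user running Linbot owns the
directory or otherwise has write permission.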
The report can be viewed using most Web browsers. Browsers using frames
technology should initially open the "index.html" file. Browsers not using
frames or with frames disabled can initially open the "navbar.html" file. Note
that these are the default filenames for Linbot and may be changed via the
config file.
Second, decide beforehand which structures on your site should be
considered "internal" and which should be considered "external".
Linbot defines internal and external links as follows:
An internal link is a part of your site that you have control of and
should be checked, as well as the links that it points to. Basically
an internal link is one that, if broken, you have the power to fix.
An external link is one that your site points to, but you have no
jurisdiction over. It can also be a link that you may have power to
change, but need not be checked for broken links, such as CGI scripts
or pages that were generated by an automated tool (such as Linbot or
any program that converts a document of one format to HTML).
Your base url is the url that is the top level of your web site.
Commonly referred to as the "home page", it is the url that points to
all other pages either directly or indirectly. A base url can be on
one server but may point to pages that are on another server but
should still be considered internal. An example would be a main
server www.someplaceonthenet.com in which there may be links to an
alternate or load balancing server called www2.someplaceonthenet.com.
In this example www2.someplaceonthenet.com would host internal links
even though your "home page" may be http://www.someplaceonthenet.com.
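The host-based part of this distinction can be sketched in a few lines of
Python. This is an illustration of the idea only, not Linbot's own code: the
function name is_internal_host is hypothetical, and the sketch assumes that
hosts are compared by name alone, ignoring any port.

```python
from urllib.parse import urlsplit


def is_internal_host(url, base_url, extra_locations=()):
    """Return True if url lives on the base url's host or on an extra location."""
    host = (urlsplit(url).hostname or "").lower()
    # The base url's own host is always internal.
    internal_hosts = {(urlsplit(base_url).hostname or "").lower()}
    # Additional servers the user has declared internal, e.g. a www2 mirror.
    internal_hosts.update(location.lower() for location in extra_locations)
    return host in internal_hosts
```

With the base url http://www.someplaceonthenet.com/, a page on
www2.someplaceonthenet.com is treated as internal only when that server name
is passed in extra_locations.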
That said, you should have a basic idea of what you do and do not want
Linbot to check. Don't be surprised if you don't get it exactly right
the first time. Also, consider using the robots.txt file/protocol:
Linbot honors it, as do other web robots that may run across your
site. This protocol is useful to indicate to robots that some parts
of your site, such as CGI scripts, internal documents, or server
stats, should not be explored. The robots.txt protocol is
explained at
http://info.webcrawler.com/mak/projects/robots/exclusion-admin.htm
Currently Linbot identifies itself as User-Agent: Linbot.
For example, you can allow Linbot to explore your site but keep other
robots out like this:
User-agent: *
Disallow: /

User-agent: Linbot
Allow: /
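If you would like to see how rules like these are interpreted before
publishing them, Python's standard urllib.robotparser module can evaluate
them locally. Note that this is Python's parser, not Linbot's, so treat the
result as a sanity check rather than a guarantee of Linbot's behavior. The
helper name allowed and the agent name SomeOtherBot are made up for the
example.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt example above.
RULES = """\
User-agent: *
Disallow: /

User-agent: Linbot
Allow: /
"""


def allowed(user_agent, url, rules=RULES):
    """Return True if the rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)
```

Here allowed("Linbot", "http://www.someplaceonthenet.com/index.html") is
true, while the same call with "SomeOtherBot" is false.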
Okay, you've heard enough and you want to run the darn thing. The
simplest way to run Linbot is:
$ linbot http://www.someplaceonthenet.com/
This will first read the robots.txt file at www.someplaceonthenet.com
and then proceed to examine every link pointed to on that site, except
links denied by robots.txt, if that file exists.
The exact usage for Linbot is explained below:
SYNOPSIS
linbot [-x regex]... [-y regex]... [-b][-a][-o dir][-w sec] url [location[:port]]...
-x regex
    Tells Linbot to consider any url matching <regex> to be external.
    This option can be used multiple times.

-y regex
    Like the -x switch, except that Linbot will not check the link at
    all, whereas with -x it checks the link but not its children.

-b
    Base URLs only. Tells Linbot to consider any url that does not
    start with the base url to be external. For example, if you run
    'linbot -b http://www.someplaceonthenet.com/~someuser/' then
    http://www.someplaceonthenet.com/~someuser/misc/index.html
    will be considered internal whereas
    http://www.someplaceonthenet.com/ would be considered external.

-a
    Avoid external links. Normally, if Linbot is examining an HTML
    page and it finds a link that points to an external document,
    Linbot will not examine the external document. It will, however,
    check that the document exists, since you may not want to point
    to broken links whether internal or external. Sometimes this
    default behavior is not desirable. If the -a option is chosen,
    Linbot will not check for the existence of external links.

-o dir
    Output directory. Specifies the directory where Linbot will write
    its report files. The default is the current directory or as
    specified in config.py.

-w sec
    Wait sec seconds. Usually, Linbot will process a URL and
    immediately move on to the next one. However, on some loaded
    systems, it may be more desirable to have Linbot wait a while
    between requests. This option should be set to any non-negative
    number (in seconds).

url
    The base url. Linbot checks this link first, then all the links
    it points to on down the "tree".

location
    Specifies that urls pointed to at <location> are to be considered
    internal. This can be useful, for example, if the base url is on
    one server but points to "internal" documents on another server.
    location is the name of that server, for example
    www2.someplaceonthenet.com. This can also be used, for example,
    if you have an intranet where some urls may point to
    http://www.someplaceonthenet.com whereas some urls may point to
    just 'www'. This option may be used more than once, but must
    follow the base url.
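To make the interplay of -x, -y and -b easier to picture, here is a rough
Python sketch of how a url might be sorted into "skip", "external" or
"internal". It is not taken from Linbot's source. Several details are
assumptions made for the example: the function name classify, the use of
re.search (a match anywhere in the url), -y taking precedence over -x, and
the fallback of comparing host names when -b is not given. Extra locations
are left out to keep the sketch short.

```python
import re
from urllib.parse import urlsplit


def classify(url, base_url, external=(), skip=(), base_only=False):
    """Return 'skip', 'external' or 'internal' for url.

    external  -- patterns in the spirit of -x (checked, children not followed)
    skip      -- patterns in the spirit of -y (not checked at all)
    base_only -- mimics -b (only urls under base_url are internal)
    """
    if any(re.search(pattern, url) for pattern in skip):
        return "skip"
    if any(re.search(pattern, url) for pattern in external):
        return "external"
    if base_only:
        return "internal" if url.startswith(base_url) else "external"
    # Assumption: without -b, a url on the base url's host is internal.
    same_host = urlsplit(url).hostname == urlsplit(base_url).hostname
    return "internal" if same_host else "external"
```

Using the -b example above, classify() with base_only=True reports
http://www.someplaceonthenet.com/~someuser/misc/index.html as internal and
http://www.someplaceonthenet.com/ as external.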
The switches (and other options) can be changed in the config.py
file. It is recommended that you look at (and edit) this file.
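The effect of the -w option described above can be pictured as a pause
between consecutive requests. The following sketch shows the general idea
only; check_all and its parameters are hypothetical names, and Linbot's own
loop may well be organized differently.

```python
import time


def check_all(urls, check, wait=0.0, sleep=time.sleep):
    """Call check(url) for each url, pausing wait seconds between requests."""
    if wait < 0:
        raise ValueError("wait must be a non-negative number of seconds")
    results = []
    for index, url in enumerate(urls):
        # No pause before the first request, only between requests.
        if index > 0 and wait > 0:
            sleep(wait)
        results.append((url, check(url)))
    return results
```

Passing sleep as a parameter makes it possible to try the loop without
really waiting, for instance by handing in a list's append method to record
the pauses.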
Examples
Here are some examples of running Linbot.
$ linbot http://manson.ddns.org/ \
-x /linbot starship.skyport.net
$ linbot -o /stats/altavista/ \
http://altavista.digital.com/
$ linbot -o ~/Lang/Python/linbot \
-b http://manson.ddns.org/~marduk/ manson
Running Periodically
Linbot may be safely run periodically or during off-peak hours using
cron or at. It may be safely run unattended. You may want to redirect
Linbot's output to a null device or a log file, or have it emailed to
an account. Consult your operating system manuals on how this can be
done on your system.
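One way to get the output into a log file is a small wrapper script that cron
or at can call. The sketch below is an illustration under a few assumptions:
linbot is on the PATH, only the -o and -w flags from the synopsis are used,
and the helper names build_command and run_and_log are made up for the
example.

```python
import subprocess


def build_command(base_url, output_dir=None, wait=None, locations=()):
    """Assemble a linbot argument list from flags shown in the synopsis."""
    command = ["linbot"]
    if output_dir is not None:
        command += ["-o", output_dir]
    if wait is not None:
        command += ["-w", str(wait)]
    command.append(base_url)
    # Extra internal locations must follow the base url.
    command.extend(locations)
    return command


def run_and_log(command, log_path):
    """Run command, appending stdout and stderr to log_path; return exit status."""
    with open(log_path, "a") as log:
        completed = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return completed.returncode
```

A nightly job could then call run_and_log(build_command(...), log_path) with
your own base url, report directory and log file location.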
Questions/Bug Reports
If you have any questions about Linbot or would like to report a bug,
send electronic mail to the mailing list.
You should also check the archives to make sure that the
bug was not already reported. In order to assist in tracking down bugs,
please include either a URL where the problem can be found, an HTML
file where the error occurs or a (small) tar file of a site where the
error occurs. Suggestions for improvements are also welcome. Do not
send email to marduk directly concerning bug reports!