5.5. Use Cases

To give you a few examples of how to use the parser in real-life situations, we will describe a number of use cases. We start by showing how to pluck files located on your own system and then continue with examples of how to pluck files on remote hosts.

In all the examples we will use /home/pilot/.plucker/ as our Plucker Home (and Plucker Directory). In this directory we have two subdirectories, one called HTML where we store all our different description files and one called DB used for storing the resulting documents.

5.5.1. Pluck local targets

The parser can handle any HTML or text document you have on your desktop system, whether it is a simple text file or a document served by a local web server running on your desktop.

5.5.1.1. Creating an E-book

Handling E-books on your Palm is one of the things that Plucker does well. To create such an E-book you first have to get the book in either text or (preferably) HTML format -- Project Gutenberg (http://www.promo.net/pg/) is a good place to find old classics and other free books. Some books can be found in the Open E-book (OEB) format, which is close enough to HTML to be usable by the parser.

In this example we will convert Lewis Carroll's Alice's Adventures in Wonderland using a copy in OEB format that we got from http://www.jeffkirvin.com/writingonyourpalm/recommends.htm. After unpacking the file in the HTML subdirectory we have one large OEB file called alices_adventures_in_wonderland.htm and the procedure to convert this file into a Plucker document is very simple:

% Spider.py -v --no-urlinfo -H plucker:/HTML/alices_adventures_in_wonderland.htm \
  -N "Alice in Wonderland" -f DB/Wonderland 

Working for pluckerdir /home/pilot/.plucker
Processing plucker:/HTML/alices_adventures_in_wonderland.htm.
           0 collected, 0 still to do
  Retrieved ok

Writing out collected data...
Writing db 'Alice in Wonderland' to file /home/pilot/.plucker/DB/Wonderland.pdb
Converted plucker:/HTML/alices_adventures_in_wonderland.htm
Converted plucker:/~parts~/plucker%3a%2fHTML%2fali.....onderland.htm/1
Converted plucker:/~parts~/plucker%3a%2fHTML%2fali.....onderland.htm/2
Converted plucker:/~parts~/plucker%3a%2fHTML%2fali.....onderland.htm/3
Converted plucker:/~parts~/plucker%3a%2fHTML%2fali.....onderland.htm/4
Wrote 1 <= plucker:/~special~/index
Wrote 2 <= plucker:/HTML/alices_adventures_in_wonderland.htm
Wrote 11 <= plucker:/~parts~/plucker%3a%2fHTML%2fali.....onderland.htm/1
Wrote 12 <= plucker:/~parts~/plucker%3a%2fHTML%2fali.....onderland.htm/2
Wrote 13 <= plucker:/~parts~/plucker%3a%2fHTML%2fali.....onderland.htm/3
Wrote 14 <= plucker:/~parts~/plucker%3a%2fHTML%2fali.....onderland.htm/4
Done!

We give the document a different name than the file itself (-N) and exclude the URL info (--no-urlinfo), since we don't need that in an E-book. From the output above you can also see that the document is split into several parts; for internal reasons, each text document must be kept below 32 kB in size.

The document can be found in the DB directory.
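
The exact splitting logic is internal to the parser and is not shown in this section. As a rough illustration only, the following Python sketch shows one way a long text could be cut into parts that each stay below 32 kB. The function name split_into_parts, the limit of 32 * 1024 bytes, and the choice to break on blank lines are all assumptions made for this example, not Plucker's actual behaviour.

```python
MAX_PART_BYTES = 32 * 1024  # assumed limit; the text only says "below 32 kB"


def split_into_parts(text, limit=MAX_PART_BYTES):
    """Split text into parts whose UTF-8 encoding is smaller than `limit`.

    Hypothetical helper: parts are broken on blank lines (paragraphs).
    A single paragraph that is too large is cut into fixed-size slices.
    """
    if limit < 5:
        raise ValueError("limit must be at least 5 bytes")
    step = (limit - 1) // 4  # UTF-8 uses at most 4 bytes per character
    parts = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = current + "\n\n" + paragraph if current else paragraph
        if len(candidate.encode("utf-8")) < limit:
            current = candidate  # still fits, keep collecting
            continue
        if current:
            parts.append(current)  # close the part collected so far
        while len(paragraph.encode("utf-8")) >= limit:
            parts.append(paragraph[:step])
            paragraph = paragraph[step:]
        current = paragraph
    if current:
        parts.append(current)
    return parts


# Stand-in for a book of roughly 100 kB made of 1000-character paragraphs.
book = "\n\n".join(["x" * 1000] * 100)
parts = split_into_parts(book)
print(len(parts), [len(p.encode("utf-8")) for p in parts])
```

Breaking on paragraph boundaries keeps each part readable on its own; a real implementation would also have to deal with markup that spans a boundary.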

5.5.1.2. An Admin Guide

The previous example used only a single file; this example shows that it is just as simple when the book is divided into several documents. We will use The Linux System Administrators' Guide by Lars Wirzenius and Joanna Oja as our test object.

After unpacking the files in a separate directory (we will use /tmp/sag/ for this example) we find several HTML documents and a number of GIF images. We look at the images to find out what bit depth we have to use and how large they are, so we know whether they will be scaled down. In this case all images are black and white, so we can use the default bit depth of 1. A few of the images are quite large, so the parser will scale them down to 150x250. If we really want these images in full size we can change the HTML document so that it links to the image instead of including it, i.e. instead of:

<img src="overview-kernel.gif">

we would use:

<a href="overview-kernel.gif" BPP=1 MAXWIDTH=700 MAXHEIGHT=700>overview-kernel.gif</a>

Then we can tap on the link whenever we want to view the image. This only works when we are able to edit the document ourselves; to support it in a more transparent way, the parser should be able to do this automatically in the future.
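
Until the parser does this itself, the manual edit described above could be scripted as a pre-processing step. The following Python sketch is one possible way to do so; it is not part of Plucker. The function name images_to_links and the regular expression are assumptions for this example, and the pattern only handles img tags with a double-quoted src attribute.

```python
import re

# Matches <img ...> tags that carry a double-quoted src attribute.
IMG_TAG = re.compile(r'<img\s+[^>]*?\bsrc="([^"]+)"[^>]*>', re.IGNORECASE)


def images_to_links(html, bpp=1, maxwidth=700, maxheight=700):
    """Replace inline images with links that carry the image options.

    Hypothetical helper: the BPP, MAXWIDTH and MAXHEIGHT attributes are
    written exactly as in the hand-edited example in the text.
    """
    def to_link(match):
        src = match.group(1)
        return (f'<a href="{src}" BPP={bpp} MAXWIDTH={maxwidth} '
                f'MAXHEIGHT={maxheight}>{src}</a>')

    return IMG_TAG.sub(to_link, html)


converted = images_to_links('<p><img src="overview-kernel.gif"></p>')
print(converted)
```

A regular expression is enough for simple, regular markup like this; documents with unusual quoting would need a real HTML parser.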

Since we are not interested in any external documents, we will use the --stayonhost option and a high maximum depth (-M5). Then we don't have to worry about exactly how deep we should follow links or which external links we should filter out using exclusion lists. Now we are ready to build the document:

% Spider.py -v --stayonhost -M5 -H file:/tmp/sag/index.html -N "Linux Admin Guide" -f DB/SAG

Working for pluckerdir /home/pilot/.plucker
Processing file:/tmp/sag/index.html.
           0 collected, 0 still to do
  Retrieved ok

Processing file:/tmp/sag/backup-timeline.gif.
           73 collected, 0 still to do
  Retrieved ok

Writing out collected data...
Writing db 'Linux Admin Guide' to file /home/pilot/.plucker/DB/SAG.pdb
Converted file:/tmp/sag/book1.html

Converted file:/tmp/sag/x89.html
Wrote 1 <= plucker:/~special~/index
Wrote 2 <= file:/tmp/sag/index.html
Wrote 3 <= plucker:/~special~/pluckerlinks
Wrote 11 <= file:/tmp/sag/backup-timeline.gif

Wrote 83 <= mailto:gregh@sunsite.unc.edu
Wrote 87 <= plucker:/~special~/links1
Done!

Install the document you find in /home/pilot/.plucker/DB and you have instant access to The Linux System Administrators' Guide.

5.5.2. Pluck remote targets

E-books and manuals are all very well, but often we want fresh articles that are updated every day. Plucker can handle that, too.

5.5.2.1. Daily News

To get the latest news from Wired we will set up a special section in the configuration file, so that we only have to run:

% Spider.py -s wired

...every morning to get the latest version of Wired for later perusal.

The handheld-friendly version of Wired is located at http://www.wired.com/news_drop/palmpilot/ and we want to pluck it to a depth of 3 levels. We also know that it only uses black and white images. This gives us the following section in the configuration file (/home/pilot/.pluckerrc):

[wired]
bpp = 1
home_maxdepth = 3
home_url = http://www.wired.com/news_drop/palmpilot/
db_file = DB/Wired
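
To check a section like this before running the parser, it can be read with Python's standard configparser module. This is only a sketch under the assumption that the configuration file is plain INI syntax that configparser accepts; it says nothing about how Spider.py itself reads the file. The helper name load_section is made up for this example.

```python
import configparser

# The section from the text, as it would appear in /home/pilot/.pluckerrc.
WIRED_SECTION = """\
[wired]
bpp = 1
home_maxdepth = 3
home_url = http://www.wired.com/news_drop/palmpilot/
db_file = DB/Wired
"""


def load_section(text, name):
    """Return the options of one section as a plain dict of strings."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser[name])


settings = load_section(WIRED_SECTION, "wired")
for key, value in settings.items():
    print(f"{key} = {value}")
```

All values come back as strings, so a numeric option such as home_maxdepth has to be converted with int() before use.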

5.5.2.2. Comics

We need some fun, too, so let's download a few strips of some well-known comics. To simplify things we will use a tool called netcomics to get the comics and then use a local description file to build the document. How to install netcomics is beyond the scope of this tutorial, but it is a Perl script and might work on any platform that has Perl support (pre-built packages exist for Linux users). After you have installed netcomics, create a small shell script called netcomics.sh to be used by the parser:

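#!/bin/sh

# Download the strips into /tmp/Comics/
netcomics -D -d /tmp/Comics/ -c "ch dilbert dilbertcl uf"

# Give the date-specific files fixed names for the description file.
# Stop if the directory is missing so that nothing is renamed elsewhere.
( cd /tmp/Comics || exit 1
mv Dilbert-*.gif Dilbert.gif
mv Dilbert_Classics-*.gif Dilbert_Classics.gif
mv Calvin_and_Hobbes-*.gif Calvin_and_Hobbes.gif
mv User_Friendly-*.gif User_Friendly.gif )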

On OS/2 and Windows the script will look like the following. On OS/2 it should be named netcomics.cmd, whereas on Windows it should be named netcomics.bat:

perl netcomics.pl -D -d \temp\Comics\ -c "ch dilbert dilbertcl uf"

cd \temp\Comics
move Dilbert-*.gif Dilbert.gif
move Dilbert_Classics-*.gif Dilbert_Classics.gif
move Calvin_and_Hobbes-*.gif Calvin_and_Hobbes.gif
move User_Friendly-*.gif User_Friendly.gif

This script will download Calvin & Hobbes, Dilbert, Dilbert Classic and UserFriendly to a separate directory (/tmp/Comics/) and rename the date-specific files to fixed names that can be used in the local description file:

<html>
<body>

<h1>Comics Home Page</h1>

<p><a href="file:/tmp/Comics/Dilbert.gif">Dilbert</a></p>
<p><a href="file:/tmp/Comics/Dilbert_Classics.gif">Dilbert Classic</a></p>
<p><a href="file:/tmp/Comics/Calvin_and_Hobbes.gif">Calvin &amp; Hobbes</a></p>
<p><a href="file:/tmp/Comics/User_Friendly.gif">UserFriendly</a></p>

</body>
</html>
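
The renaming step in the scripts above fails if more than one dated file matches a pattern, for example when a strip from an earlier day is still in the directory. The following Python sketch shows a portable alternative that works the same way on Unix, OS/2 and Windows. It is an illustration, not part of Plucker or netcomics: the function name rename_latest is made up, the dated file names in the demonstration are invented, and the code assumes that the date suffix sorts chronologically.

```python
import glob
import os
import tempfile

COMIC_NAMES = ["Dilbert", "Dilbert_Classics", "Calvin_and_Hobbes", "User_Friendly"]


def rename_latest(directory, names=COMIC_NAMES):
    """Rename the newest '<name>-*.gif' in `directory` to '<name>.gif'.

    Assumes the part after the hyphen sorts chronologically.
    Older dated files are left in place.
    """
    renamed = []
    for name in names:
        matches = sorted(glob.glob(os.path.join(directory, name + "-*.gif")))
        if not matches:
            continue  # this comic was not downloaded
        target = os.path.join(directory, name + ".gif")
        os.replace(matches[-1], target)  # overwrites an existing target
        renamed.append(name + ".gif")
    return renamed


# Demonstration in a temporary directory with invented file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for filename in ("Dilbert-20000101.gif", "Dilbert-20000102.gif",
                     "User_Friendly-20000102.gif"):
        with open(os.path.join(demo_dir, filename), "w") as handle:
            handle.write(filename)  # record the original name as content
    renamed = rename_latest(demo_dir)
    remaining = sorted(os.listdir(demo_dir))
    with open(os.path.join(demo_dir, "Dilbert.gif")) as handle:
        dilbert_source = handle.read()

print(renamed, remaining, dilbert_source)
```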

To simplify things even further we will also add a new section for the comics:

[comics]
bpp = 4
home_url = plucker:/HTML/comics.html
maxwidth = 600
maxheight = 200
db_file = DB/Comics
before_command = "netcomics.sh"

NOTE: On OS/2 or Windows you can use before_command to set the name of your batch file.

As you can see, we have added the shell script as a command that should be run before the description file is parsed. Every day (except on Sunday, when the strips are too large for these options -- we will show a solution to that later in the section) we now only have to run:

% Spider.py -v -s comics

Executing 'before_command': "netcomics.sh"
Working for pluckerdir /home/pilot/.plucker
Processing file:/home/pilot/.plucker/HTML/comics.html.
           0 collected, 0 still to do
  Retrieved ok
Processing file:/tmp/Comics/Dilbert.gif.
           1 collected, 3 still to do
  Retrieved ok
Processing file:/tmp/Comics/Dilbert_Classics.gif.
           2 collected, 2 still to do
  Retrieved ok
Processing file:/tmp/Comics/Calvin_and_Hobbes.gif.
           3 collected, 1 still to do
  Retrieved ok
Processing file:/tmp/Comics/User_Friendly.gif.
           4 collected, 0 still to do
  Retrieved ok

Writing out collected data...
Writing db 'Comics' to file /home/pilot/.plucker/DB/Comics.pdb
Converted file:/home/pilot/.plucker/HTML/comics.html
Wrote 1 <= plucker:/~special~/index
Wrote 2 <= file:/home/pilot/.plucker/HTML/comics.html
Wrote 3 <= plucker:/~special~/pluckerlinks
Wrote 11 <= file:/tmp/Comics/Calvin_and_Hobbes.gif
Wrote 12 <= file:/tmp/Comics/Dilbert.gif
Wrote 13 <= file:/tmp/Comics/Dilbert_Classics.gif
Wrote 14 <= file:/tmp/Comics/User_Friendly.gif
Wrote 15 <= plucker:/~special~/links1
Done!

To handle Sundays as well, we add yet another section to the configuration file:

[sunday]
bpp = 2
maxwidth = 550
maxheight = 400
db_file = DB/SundayComics

By using a lower bit depth for the images we are now able to include larger versions of the comics. Each Sunday we would run:

% Spider.py -s comics -s sunday

...and since the parser applies the sections in the given order, the changed values in sunday will override the ones in comics.
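
The override rule can be pictured as merging the sections one after another, with later values replacing earlier ones. The following Python sketch illustrates that rule with the two sections from this example; it is not the parser's own code, and the function name merge_sections is an assumption made for the illustration.

```python
# The two sections from the configuration file, as plain dictionaries.
CONFIG = {
    "comics": {
        "bpp": "4",
        "home_url": "plucker:/HTML/comics.html",
        "maxwidth": "600",
        "maxheight": "200",
        "db_file": "DB/Comics",
        "before_command": "netcomics.sh",
    },
    "sunday": {
        "bpp": "2",
        "maxwidth": "550",
        "maxheight": "400",
        "db_file": "DB/SundayComics",
    },
}


def merge_sections(config, names):
    """Apply sections in the given order; later sections win."""
    merged = {}
    for name in names:
        merged.update(config[name])
    return merged


effective = merge_sections(CONFIG, ["comics", "sunday"])
for key, value in effective.items():
    print(f"{key} = {value}")
```

Options that sunday does not mention, such as home_url and before_command, keep the values from comics, which is why the sunday section can stay so short.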