Indexed Text Retrieval
This article originally appeared in Doctor Dobb's Journal, November 1995.
With the advent of inexpensive storage devices (disk storage currently
costs about $0.50 per megabyte), and the advent of large bodies of
free (or inexpensive) data, such as telephone directories, Internet
documents, FAQs (Frequently Asked Questions), and others, it has become
increasingly necessary and desirable to get this data into local storage,
and index it for fast and convenient access.
While the topic of text retrieval can span many articles, I've chosen
to focus on one area -- telephone number databases. The ideas
presented here can be expanded to include other types of text databases
and formats quite easily, and, where appropriate, will be mentioned.
Phone Database Requirements
My requirements for the telephone number database are simple: it must
be able to support fairly large amounts of data (549 Megabytes are required
for all of Canada, for example), it must be able to index
by phone number in under one second, and by text keys in a matter of
several seconds. I also wanted the search criteria for the database to
be as flexible as possible, ie: "find all hospitals in Toronto", or
"find all the Smith's on Sunset Boulevard", etc. Disk space was not
a consideration, due to the relatively low price per megabyte.
My first requirement, where I need to find all information associated with
a telephone number in under
one second, stems from the fact that I subscribe to the telephone company's
'caller ID' service. For a nominal monthly fee, they deliver the number
of the caller between the first and second ring. I want to be able to
get the associated information quickly, before the second ring.
This requirement was actually quite simple to solve. In a classical
database, the telephone number would serve as the key into the database,
and would perhaps be hashed, and then looked up. In my database, the
telephone number represents a pathname, which I can open and do a binary
search on. Unix file systems are not terribly efficient at searching
through thousands of filenames to find the file that you want to open,
so I implemented what's called a 'digit tree' file structure.
Digit Tree
In this type of structure, there is a root directory that contains the
first digit of the phone number (first digit of the area code, "6" for
"613", etc). Under each of those directories are further subdirectories
that are named after the subsequent digits in the phone number. For
example, the telephone number "6135141212" would be stored in a data
file called:
/databaseRoot/6/1/3/5/1/4
where the last "4" is not a directory, but rather a file. This file
contains all of the 613514 numbers, sorted by the last four digits of
the phone number (station code). For example, the "4" file could
contain (sample only):
1212 Weather Number, 24 hours / day, Ottawa Valley
1213 Liz and Brian's Pizza House, 2666 Temp Street, Ottawa
1443 Pay Phone, 55 Queensway Offramp
The advantages of this particular database organization are numerous.
First of all, the file system has to open only a few
levels of directory (a reasonably quick operation), and search through
just one small file. My experience with the telephone directory for
Ontario, Canada indicates that office codes (the "514" in the above example)
are typically about half full -- so approximately 5000 records. Secondly,
a fair amount of data compression is built into this system. Since
the individual files are stored in a hierarchical structure, the area code
and office code information are implied by the file pathname itself,
and do not have to be repeated as data within the file.
The other advantage is that the 613514 entry is still a small,
flat ASCII text file
-- I can edit it with my favourite editor, without having to go into a
specialized (and substantially more limited) 'database editing utility'
just to change an entry.
Of course, if you wanted to squeeze some data space out of this scheme,
the station code, instead of being represented in 4 character ASCII, followed
by a space, could be squeezed into just 14 bits (say 2 bytes for ease,
saving 3 bytes).
So far, not rocket science. The implementation of getting your
existing telephone database into that particular organization is left as an
exercise to the reader -- I wrote a small C program in a few hours to do it,
which took several hours to run.
In fact, strictly speaking, you don't need the database organized that way
to be able to use most of the other ideas in this article.
Search by Text
Now, to address the second requirement. In the scheme described so far,
the entire database is split out by area code and office code, over a
large number of subdirectories and files (I have 8221 files in my
database at the time of writing!) Obviously, it is not practical to open
each and every file looking for a record that matches a bunch of keys.
Another problem that I have is that not all telephone records have the
same data, nor in the same order. As I accumulated telephone directory
information, I merged from many different databases (Canadian White
Pages CD-ROM, Ottawa/Hull Pay Phone database, private
database of acquaintances, etc), all with their own different formats.
(The example above for 613514 is typical).
This precluded using the traditional database-design trick of creating
a lastname index, a firstname index, a street name index, etc.
In fact, the situation I was stuck with was that I had only two pieces of
data that made any sense across all of the different formats -- a telephone
directory number (DN), and some descriptive text. Since the descriptive
text is not 'tagged' in any way (ie: firstname, lastname, address), it
all has to be considered during a keyword search. This therefore means
that each and every word in the database has to be indexed.
This is the concept of Indexed Text Retrieval (ITR).
Indexed Text Retrieval (ITR)
By indexing each and every word, I now have the ability to search based
upon the logical AND/OR combination of any given words that I choose.
For example, to find all of the hospitals in Ottawa, I just type:
itr hospital ottawa
where "itr" is the name of the ITR program described in this
article, and "hospital" and "ottawa" are just two text keywords. The ITR
program will print out all phone number records that match the two keywords.
(Notice that I've made my version of ITR case insensitive).
Implementation of ITR requires two separate programs -- an indexing
program, and a lookup program. Source code is provided for both.
The process of indexing each and every word involves opening each and every
input file, reading each line (station code and text) and parsing out the
words. In my implementation, I skip anything that is not "A" through "Z";
it automatically becomes a word separator. Therefore, for each word,
I have the following information: file name (in this case, that's the
area code and office code), position (in this case, the station code), and
the word itself. I don't care about the word's relative position on the
line -- I just care that it is associated with a given DN (the implication is
that when I specify the search keys, I don't have to specify them in any
particular order; "ottawa hospital" is equivalent to "hospital ottawa").
This index is large, and managing this amount of data is in itself a
challenge.
The Alpha Tree
However, having had good success with the digit tree approach described
above, I decided to experiment with an alpha tree. Using this scheme,
to find the keys for 'ottawa', I would simply open the file "o/t/t/a/w/a.db"
and find all of the keys. The keys are directly translatable into
the telephone database record.
In order to preserve disk space (after all, I don't have infinite
disk space!), some decisions had to be made about the extent of
the alpha tree. The first decision made was that the maximum depth
that I would need is five levels. Depending upon the data that you
wish to index, this restriction may be drastically altered to suite
your requirements.
To generate the index, I implemented a "bucket sort" algorithm. During
analysis, I would take each DN and word, and generate a ten byte entry,
consisting of six bytes from the word (null padded), and four bytes
representing the DN. Then, I would append this entry to the file with
the same name as the first character of the word. For example, given the
sample database above, the word 'weather' would be stripped to just five
characters ("weath"), and, along with the key (6135141212) would be appended
to a file called "w.1". The next word, "Number", would also be stripped to
five characters ("numbe"), and along with the key (same key as 'Weather',
6135141212), would be appended to a file called "n.1".
Once all words in all files were processed, each
".1" file would be opened in turn, and analyzed. If there was a
sufficient number of records in the file to warrant splitting it up (this
is a tunable parameter in the source), then a subdirectory would be
created, and the process would repeat, except using the next character
of the word as the basis for the filename. For example, the "w.1" file would
be analyzed, and the word "weath" would be written to "w/e.1". This
process continues until either five levels of directory structure have
been generated, or the ".1" file is small enough that it is no longer
necessary to split it. As a final step, when all of the files are
split apart as necessary, the partial word is stripped out of the files,
as it is no longer required,
making each record contain only the four byte (telephone number) key.
The final result of the indexer is an alpha-tree structure on disk that
has, as files, indicies representing phone numbers. This looks like a
classical 'hash'
function, except that the length of the hash is flexible, depending upon
the number of entries. On my disk, I have the following files (extract):
| Filename | Size (Bytes) | Records |
|---|
| w/e/k.db | 36 | 9 |
| w/e/l/m.db | 28 | 7 |
| w/e/l/e.db | 32 | 8 |
| w/e/l/b.db | 1060 | 265 |
| w/e/l/c.db | 2580 | 645 |
| w/e/l/d/a.db | 36 | 9 |
This excerpt shows that there aren't as many words that start with "WEK"
as there are that start with "WELC".
An interesting side note is the fact that as the number of levels increases,
the theoretical number of files increases. The following table summarizes
the theoretical maximum number of files, and the number actually observed
on the database that I have.
| Levels | Maximum | Actual | Percentage |
|---|
| 1 | 26 | 26 | 100.00 |
| 2 | 676 | 462 | 63.02 |
| 3 | 17,576 | 5,730 | 32.60 |
| 4 | 456,976 | 23,949 | 5.24 |
| 5 | 11,881,376 | 32,314 | < 0.01 |
Consuming 141.2 Mbytes of index for the 549 MB database.
Retrieval Software
Let's look at the retriever.
The retriever takes the keywords passed on the command line, and attempts
to find database records that match.
In the case of specifying only one keyword to search for, the retriever
opens the associated keyword file (for example, if the
keyword was "welcome", the retriever would open 'w/e/l/c.db').
Then, for each entry in the keyword file, the retriever opens the corresponding
database file, performing secondary matching on the
database records as they go by. Any records that fully match are printed.
The reason that a secondary match is required is because the ".db" files
are unique only as far as their filenames go. For example, the w/e/l/c.db
file contains keys for all words that START with "welc", such as "welcome",
"welch", etc.
In the case of two or more keywords, the retriever opens all of the specified
keyword files, and attempts to find keys that are the same in all of these
keyword files.
(Since the key files are sorted by key, this is a simple task). Once keys
are found that are the same in all files, the actual telephone database file
is opened, and a secondary match is performed to check that the keywords are
all present in the database entry.
The current retriever as listed provides only an "AND" function -- all of
the keywords listed on the command line must match in order for the record
to be printed. It is a relatively simple matter to add an "OR" capability,
and partial key matching.
Extending the Concept
How can we extend this to usefully index arbitrary bodies of text?
How can we extend the search capabilities to be more powerful?
Let's revisit the indexer first. Our indexer generated alpha-tree
entries that contained the index. The index consisted of a complete
telephone number. We interpreted the telephone number in
two parts; together, the area code and office code was the "filename" portion,
and the station code (last 4 digits) was the "record position" within the
specified filename.
By revising the indexer to generate 8 byte indicies, divided
into 4 bytes for the file name (as a file ID number, DOCNUM), and 4 bytes for
the word position within that file, arbitrary text can be indexed.
In fact, our sample telephone database indexer is a subset of this
approach.
(You can't actually store a full telephone number into 4
bytes, we cheated a little. We created a table of area code and
office code pairs, and stored a six digit index into this table, instead
of the actual area code and office code. In my database, I have 8221 area
code / office code pairs, which means that the index ranges from
0000010000 through to 0082219999, well within the range of 4 bytes. Some
string manipulation is then performed on the decimal ASCII representation of
the 4 bytes, and the area/office code pair and the station ID are split
out, or merged in, depending upon the operation being performed).
The interpretation of the 4 byte file position is left up to the designer.
For certain types of database searches, the 4 bytes might represent the
byte offset of the word from the start of the file. For other types, it
may represent the relative word number in the file. In our telephone
database example, it represented the record number.
A truly flexible approach to ITR could be organized as shown:
INDEXER
=======
+-----------+
+-----------+|
+-----------+||
\ +-----------+||| +---------+
documents | | Front-end |||+ | |
to be +-+---->| |---------------->| Indexer |----> alpha-tree
indexed | | | analyzer |+ Doc#, | |
/ | +-----------+ position, +---------+
| /|\ & word
| |
| |
| |
\|/ |
---- ----
/ \ / \
\----/ \----/
| | | |
| |---->| |
| | | |
\----/ \----/
Filename Doc# to
to Doc# filetype
database database
(DOCNUM) (DOCTYPE)
RETRIEVER
=========
Keywords +-----------+
| +-----------+|
\|/ +-----------+||
+---------+ +-----------+|||
| generic | | Back-end |||+
alpha-tree ------>| |---------->| specific |------->matching entries
| matcher | Position | matcher |+
+---------+ +-----------+
| /|\
| D |
| O |
| C |
| N |
| U ---- ----
| M / \ / \
| \----/ \----/
| | | | |
+-------->| |---->| |
| | | |
\----/ \----/
Filename Doc# to
to Doc# filetype
database database
(DOCNUM) (DOCTYPE)
In this scheme, any documents being added to the ITR database are assigned
a document number in the DOCNUM database. This allows an association
to be made between
the filename (which can be arbitrarily long and complex, e.g. hypertext links)
and a 4 byte document number to be established.
Next, another database is established that allows the assocation between the
document number and a document type (DOCTYPE). The document type is used by
the
front-end analysis scheme to choose an appropriate analyzer, and by the
back-end matching scheme to choose a corresponding matcher.
To index a file, once the DOCNUM and DOCTYPE database entries have been
created, every word within the file is processed by the appropriate
front-end analyzer, and the document number, word position, and word itself
are sent to the indexer. The indexer creates/updates the alpha-tree
based upon the word, adding the document number and word position at the
appropriate place.
To retrieve a document, a generic matcher is passed the list of keywords that
are being searched for, and finds them in the alpha tree. Every time the
generic matcher finds a match (based upon the alpha-tree information),
it calls the specific matcher for that file type (from the DOCTYPE database)
to verify the match. If the match is verified, the appropriate information
is printed. Again, depending upon the exact requirements, there may be no
need for a back-end matcher, if, for example, the alpha-tree information
is complete, rather than partial (the telephone database used a partial tree).
The DOCTYPE database and analyzer/matcher are where a large part of the
flexibility of the system comes into play. The analyzer/matcher operate
as a pair, so it is entirely up to them to interpret the word position
that is returned from the generic matcher. As an example,
the DOCTYPE could indicate a postscript file, or a spreadsheet file,
or a word processor file, or ...
This would allow the indexing of arbitrary documents.
|