Improving USENET News Performance
This article originally appeared in Doctor Dobb's Journal, May 1996
Let me start by teasing existing system administrators who have a
newsfeed coming into their site. How would you like to be able to run
"expire" in a few seconds instead of the usual several hours? Would you
like to see your disk have almost no activity while a newsfeed is coming
in, or news is being expired, instead of the usual "this PC is about to
explode" frenzied disk thrashing?
What is news?
Most everyone today is familiar with Usenet news. Just the mention of
newsgroups like "comp.os.qnx", or "alt.fan.dan-quayle", is enough to
recognize what I mean. People variously call it "News", or "Internet
News", or "Usenet news", or whatever it happens to look like on the
system that they log in to. We'll stick with the original name, "Usenet
news".
We'll look at how Usenet news works on existing Unix systems. Then,
I'll describe how this arrangement can be improved in speed and
efficiency by at least an order of magnitude, and then I'll explain the
steps required to do this under the QNX operating system.
How does news work?
People from around the world post messages, (called "articles", in news
jargon) to the various newsgroups. Their news systems then distribute
these articles to neighboring machines. Neighboring machines distribute
the articles to their neighbors, and so on, until the article has
propagated all the way around the world. Machines check the incoming
articles to see if they already have a copy, and quietly delete the
article if they do. Amazingly, this all works!
We won't focus on the details of how the machines get the articles,
suffice it to say that there is a program on the user's system that is
responsible for getting articles from other machines, and storing them.
It is how this program DOES that operation that is the main focus of
this discussion.
Let's look at a typical system. This system accepts articles for about
4000 newsgroups (this is by no means a huge system, either!). As each
article comes in, the article's "header" portion is scanned, and the
news software determines where (ie: into which newsgroup(s)) that
article should be stored.
A long time ago, when there wasn't all that much news traffic, it seemed
like a good idea to just store one article per file. The newsgroup
names got converted into pathnames, and everything was simple.
For example, if I had an article for the newsgroup "comp.os.qnx" that
just came in, I would pick the next article number for that newsgroup
(say it was 1143), and store the new article in a file called
/usr/spool/news/comp/os/qnx/1143. (The /usr/spool/news part is just the
name of the directory where all of the incoming news articles live --
it's up to each site to determine which directory that will be, but
/usr/spool/news is very common.)
The next article that came in for that newsgroup would go into
/usr/spool/news/comp/os/qnx/1144, and so on.
So why is this a problem?
There are a number of reasons why this isn't an ideal solution.
Articles can arrive in any order -- we're not always going to get all of
the articles for comp.os.qnx, then all of the articles for
comp.os.rsx11, then comp.os.svr4, etc. This means that as the articles
arrive, the news storage program is creating files in an ad-hoc manner,
all over the disk. Not only that, it is creating from a few thousand to
several hundred thousand files per day, depending upon the size of the
feed! (This works out to tens of files per second). This means that the
poor disk is getting quite a workout.
Given the volume of news these days, the disk would fill up quickly.
So, most news systems have an "expiry policy". This is described in a
text file that associates newsgroups with retension periods. For
example, how long do we retain articles in "comp.os.qnx", before
deleting them? This usually depends upon two major factors -- how much
disk space you have, and which newsgroups you value. "news.answers",
which contains the FAQs (Frequently Asked Questions) for most newsgroups
is a valuable source of information -- I keep mine around for a month.
Articles in the newsgroup "alt.barney.dinosaur.die.die.die", while
amusing, don't get to stay around for more than 4 days.
If the disk is to never overflow (experienced sys admins will chuckle
knowingly here), the volume of news coming in must equal the volume of
news being expired. On a typical news system, the "expiry" processing
is done by a batch file that runs periodically, say once or twice a day.
With current implementations, the expiry processing takes on the order
of a few hours. Sometimes, the expiry processing will take so long that
it appears that the machine is doing nothing but expiring!
Therefore, not only are there many files being created in random places
on the disk each second, there are also a roughly equal number of files
being deleted, from different random places each second.
This is suboptimal.
To make matters worse, the news systems in common use end up copying
each article many times over before placing it in its final resting
place, and then, when it comes time to expire the article, an additional
read of the article takes place. (See diagram 1, "Conventional News
Flow").
How can this possibly be made better?
Once the files are in their appropriate places, it is hoped that a large
number of users will read a large percentage of the articles that have
been stored.
Sadly, even in a large company, the vast majority of news articles are
never read. They simply expire.
Rather than using a general purpose filesystem to (suboptimally) manage
a news system, why not use our knowledge of how these files are going to
be created, read, and then deleted, to create a filesystem uniquely
tailored to handling this information?
The one major drawback to coming up with a "new" filing strategy is that
there is currently a HUGE body of free software available on the
Internet that is intimately aware of the current organization of news
articles (ie: the /usr/spool/news structure just described). It would
certainly not be a popular idea to tell thousands of developers that
there is a new way of doing news, and would they please change all of
their programs!
I'm going to describe the operation of a program called KNews (kay-
nooz), that takes advantage of some of the features of the QNX operating
system to more efficiently handle news articles.
Under KNews, there are a host of cooperating programs, of which we'll
only describe the two main ones. One program gets the articles from
other machines and stores them, and another program implements a
"virtual filesystem" (which I'll describe shortly).
Getting the articles
The first program, the "newsfeeder", knows how to get articles from
other machines. There are actually two main variants of the newsfeeder,
one that knows how to get articles from NNTP-based hosts, and another
that knows how to get articles from UUCP/rnews feeds.
When the newsfeeder gets an article, it looks at the header. In
conjunction with the local expiration policy file, the newsfeeder
program determines right then and there when this article will expire.
The article is then appended to a file that has other articles already
in it that expire at that same time. When we put the article into the
file, we make a note of the position within the file where we start
writing the article, and the length of the article.
Let's assume we were examining article #677, posted to "ott.forsale"
that had a posting date of Jan 15th, 1996, at 10:34 GMT. Our expiration
period for that particular newsgroup is 4 days. So, we know that the
article will expire Jan 19th, 1996, at 10:34 GMT. Therefore, we store
the article at the end of a file called "960119.10" (we ignore the
":34", as we have defined our expiry granularity to be 1 hour). We make
a note of the position within the file that we stored the article at,
the length of the article, and the article number:
ott.forsale, art#677, exp 1996 01 19 10:00, pos 36554, len 495
By the time that we had received a day's worth of news, we would have
created some number of files, each containing collections of articles.
The number of files created is a function of the expiration policy and
the expiration granularity. For example, with a maximum expiration of
10 days, and an expiration granularity of 6 hours, we would create 40
files (10 * 24 / 6). This is between 3 and 4 orders of magnitude fewer
files than conventional means!
Retrieving articles
Ok, so now all of the articles are in a few big files (we'll call these
files "expiry-ordered heap (EOH) files"), stored in a directory. (By
convention, this directory is /usr/spool/knews).
What we really want to do is have the operating system present a
/usr/spool/news directory structure to all applications. This way, none
of the applications have to be modified to work with KNews.
Really what this means is mapping files under /usr/spool/news into
portions of our EOH files. (See diagram 2, "mapping /usr/spool/news to
EOH files in /usr/spool/knews").
The newsfeeder program already provides all of the information that we
need; it tells us the newsgroup name, the article number, the EOH file
name, the position within that file, and the length of the article.
Under most operating systems, creating a virtual filesystem that does
this kind of mapping is a major undertaking. Under the QNX operating
system, it is a few hundred lines of code. This article focuses on just
this task.
Taking over /usr/spool/news
We want our fake /usr/spool/news directory structure to walk, talk, and
act just as if it was a real disk-based filesystem.
Our virtual filesystem for news (VFsys.knews) should export the illusion
of two main object types; subdirectories and files. It should support
read-only attempts to access these objects; we don't need the additional
complexity associated with writing for our example.
We should be able to:
$ cd /usr/spool/news/comp/os/qnx
$ ls
1134 1136 1137 1142
$ cat 1134
<contents of article 1134 show up>
$ cd ../vms
$ pwd
/usr/spool/news/comp/os/vms
Since it is a read-only filesystem, we also require some method of
telling VFsys.knews about new articles, by giving it the article number,
expiry date, offset, and length.
A Little Background
Let me digress just a little before we consider the meat of the problem.
QNX is a microkernel, message-passing operating system, with most
services implemented via a message. For example, when an application
("client") opens a file, the application's C library open code places
the arguments passed to the open() call into a message, and sends this
message to the process that is managing the file system, Fsys
("server").
Fsys decides if the open should succeed or not (based on permissions,
existance of the file, etc), and returns a status. Some time later, the
client might request a read of 200 bytes. Again, the client's library
composes a message which it then sends out to the server, requesting
that a read be performed. The server will go out to disk, and return
the data to the client. (See reference at end of article for more
information about message passing performance).
Fsys isn't the only program that is allowed to receive and process open
and read calls, though. Under QNX there is no "sacred" attribute
associated with programs like "Fsys", versus user written programs like
"hello world".
We can do a really neat trick under QNX that we can't do under other
operating systems easily. Under QNX, I can write a program that assumes
responsibility for a portion of the pathname space. This functionality
is a subset of what is generally termed a "resource manager" under QNX.
What does a resource manager have to do?
Let's look at what has to happen in order to successfully take over
/usr/spool/news.
We have to:
- tell someone that we are now responsible for a portion of the pathname space (we register ourselves),
- handle requests from clients, (open, read, close, etc)
- handle other messages from our cooperating news feeder program (new article has arrived, expire, etc).
To implement step 1, the function resmgrInit in resmgr.c calls the
qnx_prefix_attach call to associate our program with a particular
pathname prefix -- in this case "/usr/spool/news". This then causes any
process opening files and directories in this portion of the pathname
space to have its messages sent to our process, instead of the process
managing the actual disk drive.
To implement steps 2 and 3, the function serve in resmgr.c waits for
messages from clients (the Receive function call). This is a blocking
call -- the program does not continue past the Receive call until a
message has been received. Once we get a message, we call the
appropriate routine from the jump table. We can receive a whole range
of messages, each of them corresponding to either some C-function call
that a client can execute, or some of our special "news feeder"
messages.
Most notably, in terms of "regular" messages, we receive messages
corresponding to the open, read, write, readdir, and stat calls.
To implement our virtual filesystem, we just have to handle these calls.
Let's look at these calls in turn. The open call specifies which
particular file is being opened. Because we can get requests from
multiple clients (ie: three different terminal sessions can all do an
"ls" at the same time of some portion of /usr/spool/news), we have to
track processing based upon who is sending the message. So, in the open
handler you will notice that we associate an "OCB" (Open Control Block)
with the particular request, via the "map_ocb" function call. This OCB
holds the state information associated with a particular client's use of
the virtual filesystem.
There are really two types of requests that we can handle: file requests
and directory requests.
File requests deal with the following functions (resmgr.c procedures in
brackets):
open a file (io_open)
read that file (io_read)
close that file (io_close)
and directory requests are similar:
open a directory (io_handle)
read entries out of that directory (io_readdir)
close that directory (io_close)
Once we have associated the context with the open call, we then expect
that we will receive a read request, followed by a close request. In
the close request is where we dissociate the context from the request.
The real work is done in the open (in the case of files) or the read (in
the case of directories) functions.
For a file, the open call has to verify that the file actually exists,
then, it has to initialize the context for that file.
To determine if the file exists, we look at some internal data
structures that VFsys.knews maintains (in database.c). This data is a
network of directories and files, similar to what a filesystem manager
would actually maintain on disk. (See diagram 3, "Internal Database
Organization"). By following the chain of directories (shown as
circles), we eventually get to a file entry (shown as a rectangle). The
file entry tells us the article number, EOH filename, position, and
length.
We then open the EOH file, seek to the starting position for the article
within the EOH file, and store the file descriptor into the context
block.
Then, whenever the client performs a read, our server just reads bytes
from the file descriptor, and transfers them back to the client.
For directories, we have a similar flow. The directory path is
evaluated against the internal database structure, and checked for
validity. If invalid, an error status is returned to the client. If
valid, an "ok" status is returned to the client, and it is expected that
the client will send the server some "io_readdir" messages to get the
directory entries. We again make use of the context block, but this
time we drive the processing by a finite state machine. The context
block's state variable, ocb -> state, can have one of the following
state values: OCBState_Initial, OCBState_GetFiles,
OCBState_GetDirectories, or OCBState_Done.
OCBState_Initial generates the virtual directory entries "." and ".."
(that each directory must have), and then transits to either the
OCBState_GetFiles or OCBState_GetDirectories state, depending upon what
is contained in the internal database structure.
OCBState_GetFiles returns directory entries for the files, one by one
from the internal database structure, and OCBState_GetDirectories does
the same thing for subdirectories.
That's the basics of handling a virtual filesystem under QNX 4. You
will notice that I have other functions in there as well, like an
"io_stat" handler (for the client's "stat" function call), and
"io_chdir" etc. The operation of these is analogous to the operation of
the simple open/read/close and handle/readdir/close messages.
Apart from the standard "resource manager" messages, there are also a
number of messages that deal specifically with VFsys.knews itself -- we
have messages that tell VFsys.knews about a new article that's just been
made available, or tell VFsys.knews to expire articles, etc.
Let's look at the handling of the expiry message. When a client sends
the expiry message, the Receive call in procedure "serve" in "resmgr.c"
gets the message, and the newsmsgsJT jump table ends up calling the
procedure newsExpire. The code for newsExpire is in resmgr.c, and
basically calls journalExpire and databaseExpire, which do the actual
work. "databaseExpire" runs through all of the internal database
elements, and finds all of the ones that are expired, and removes them.
(We optimized things here for speed by having the database entries
sorted by expiration. On my system, a 486/33, with about 10k articles
being expired, this took under 1 second. You can run expire once per
minute if you wanted to, although a practical limit would be once per
hour. I run mine every 3 hours.)
|