NOTE: I wrote the spamometer in 1997, used it for quite a while and received filters from a number of people. I never made much effort to publicize the effort or to turn it into what I really imagined. Since that time, there have been many spam fighting tools released. In particular, the Spam Assassin is very close to the spamometer in its approach. I now use the Spam Assassin (as well as a few other bits and pieces). So the below is mainly interesting (to me in any case) for historical reasons.

Contents:

1. Spam detection philosophy
    1.1. Header space vs body space
2. The spamometer
    2.1. Getting the Spamometer
    2.2. Description
    2.3. Invocation
    2.4. Spamometer function files
    2.5. Diagnostics
    2.6. Exit status
    2.7. Debugging hint
    2.8. Unpacking
    2.9. Manifest
3. The Library Of Spam
    3.1. Contribute via procmail
    3.2. But why?
    3.3. Acknowledgment
4. Help me
5. Future


1. Spam detection philosophy

The problem of detecting spam mail is usually addressed by looking for certain properties of mail messages and rejecting the message as spam if one or more of those properties is true.

The logic and implementation of this is usually very straightforward. Even with very limited logic (e.g., this mail header is present OR this header contains this pattern), one can (currently) do quite a good job of spotting spam.

This approach is (currently) quite effective, but these methods have some drawbacks.

There is room for improvement here. For example, I recently wrote a filter which looks at the first few lines of a mail message for lines that end with exclamation marks. If there are many of these lines, especially if the lines contain many !'s and no lower case letters, I conclude the message is probably spam. This function is independent of mail header and sending domain. It never needs to be updated once one is happy with how upset it gets at seeing !'s.
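
As a rough illustration of the exclamation-mark idea, here is a minimal perl sketch. The name bang_score, the 10-line window and the weights are assumptions made up for this example; this is not the code of the distributed too_many_bangs_test.

```perl
use strict;
use warnings;

# Hypothetical helper: score the first $window lines of a message body.
# A line ending in '!' counts once; it counts extra if it has no lower
# case letters, and extra again if it holds 3 or more '!'s.
sub bang_score {
    my ($lines, $window) = @_;
    $window = 10 unless defined $window;
    my $score = 0;
    my $seen  = 0;
    for my $line (@$lines) {
        last if $seen++ >= $window;
        next unless $line =~ /!\s*$/;        # line ends with a '!'
        $score += 1;
        $score += 1 if $line !~ /[a-z]/;     # shouting: no lower case at all
        my $bangs = () = $line =~ /!/g;      # count the '!'s on the line
        $score += 1 if $bangs >= 3;
    }
    return $score;
}

my @body = ("MAKE MONEY FAST!!!", "Call now!", "Regards, Bob");
print bang_score(\@body), "\n";
```

How upset the filter gets is then just a matter of choosing the score above which a message is treated as probable spam.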

My driving philosophy is that one should be accumulating evidence, rather than doing black or white Boolean processing, and that this evidence will mainly need to be based on the message body. Of course headers are important too, especially now when a lot of spam can be detected by simply examining headers (see below).

I'm looking to implement something that, in its internal processing, carries on a dialogue like this:

"Ah, this mail contains that Pegasus mail header, so it's probably spam. But then again, it doesn't mention the word "money" and it has no '$' signs in the first paragraph, so perhaps it's not spam. And look, there's a "In-Reply-To:" header, which also makes it unlikely to be spam. But look at all those exclamation marks at the end of lines 9, 10 and 11, that's got to be spam. Plus, there's a line that says "Send mail to XXX to remove yourself from this list" and a line that says "This is not a MLM scheme". But wait, those things are part of an included mail, not part of the main message, so this is probably someone sending you a copy of a piece of spam mail, and anyway, who ever heard of spam that includes other people's mail - this isn't spam."

From this successive accumulation of evidence, the program decides one way or another.

1.1. Header space vs body space

Another reason I believe we will need to move towards a probabilistic approach that mainly considers the body of the mail is that spammers are continually getting better at making their headers look authentic. At the same time, the content of most spam doesn't change much. If a spammer gets an account with a new ISP and uses it to send spam, there is no way to block that spam by simply looking at mail headers (because these will be innocuous). There are hundreds of domains from which spam is known to come. But the spam that comes from these domains cannot be detected the first time it is seen if we continue to look only at headers and domain names. Detecting the spam the first time is important, because the speed with which spammers can (and do) move means that next time the spam may well be coming from somewhere else.

To do this properly, we need to be looking at message bodies. The war should be fought there, not in message headers. The spammers will get better and better at making sure the right headers are present and that suspicious headers are not. The spammers have far less control over the content of the messages they send out. Fighting over headers is a battle that, I believe, spam detectors will ultimately lose. But trying to identify spam based on message content is a battle that can be won, perhaps quite easily.

Header processing can also benefit from a probabilistic approach. For example, right now there may be no spammers who insert a fake "In-Reply-To" into their outgoing spam. For this reason, a function that finds such a header and returns $IS_NOT_SPAM (the spamometer's indicator that a message is definitely not spam) will be completely reliable. But suppose some spammers realize that inserting this header will increase their probability of getting mail through (they may read it on this web page, for all I care). The correct response to this is to simply lower the probability with which one declares a message not to be spam if that header is seen. Thus we may instead return 0.1 (return values are probabilities that the message is spam) to indicate that there's a 90% chance the message is not spam. This approach allows the spamometer to be adapted to the current state of the spamming art. This is not possible with a simplistic Boolean approach. The Boolean approach has to either declare the message to be spam or not based on the mere presence or absence of the header. There is no room for fuzziness or the accumulation of evidence.
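
A header function along these lines might look like the sketch below. It follows the return-value semantics of section 2.4, where a number between 0.0 and 1.0 is the probability that the message is spam, so a 90% chance of not-spam is written as 0.1. The function name, the stand-in value for $NO_OPINION and the use of a hash reference for the headers are all assumptions of this example.

```perl
use strict;
use warnings;

# Stand-in for the spamometer's predefined value; the real one is
# supplied by the spamometer itself and may differ.
our $NO_OPINION = 'NO_OPINION';

# Assumed figure: probability that a message carrying In-Reply-To is
# spam. Raise it as spammers start forging the header.
our $IN_REPLY_TO_SPAM_PROB = 0.1;

# Hypothetical header function. The headers arrive as a hash reference
# here for simplicity; see section 2.4 for the real calling interface.
sub in_reply_to_test {
    my ($headers, $verbose) = @_;
    return $NO_OPINION unless exists $headers->{'in-reply-to:'};
    print "in_reply_to_test: In-Reply-To header present\n" if $verbose;
    return $IN_REPLY_TO_SPAM_PROB;
}
```

Adapting to a change in spammer behaviour then means editing one number rather than rewriting the test.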

As header processing becomes less reliable (because spammers inevitably get better at creating apparently correct headers), the probabilities coming from spamometer header functions will be gradually reduced. We will probably reach the point when the most we can say is that there's a small chance a message is spam based on headers alone. As I mentioned, if we looked at the content, it might be the same in 2 years' time as it is today.

For these reasons, I wrote the spamometer. It's not complete, but it provides a framework for what I know is possible.



2. The spamometer

2.1. Getting the Spamometer

The spamometer is available via ftp at ftp://jon.es/pub/spamometer.tar.gz.

The spamometer is available to you under the terms of the GNU General Public License.

2.2. Description

The spamometer is a perl script (you'll need perl 5) that attempts to guess whether a mail message (read from STDIN) is spam. Most people will want to call it from something like procmail.

It allows you to do very sophisticated mail processing. In fact, you should be able to do anything you want. The .spamometerrc file included with this distribution shows just a few examples. I'll write more soon; this was a one-night hack. These include functions that look for mail messages that

This is not meant to be a replacement for procmail in any way. You can do some things easily that are difficult or impossible in procmail (without an external helper program, such as this one). In particular, the spamometer knows nothing about delivering mail!

The spamometer knows nothing at all about what constitutes spam. All of this is decided by user-supplied spamometer functions which the spamometer calls according to user-defined priorities. These functions are of 3 types:

  1. Functions that get called when only the header of the mail message has been read.

  2. Functions that get called after each line of the body is read. These functions also have the full header at their disposal.

  3. Functions that get called only when the entire mail body is read. These functions are passed a file whose contents is the body of the message. These functions also have the full header at their disposal.

Users supply files of these functions on the command line. This allows people to run the spamometer with large collections of functions for spam detection. Typically, once this thing gets further along, users will not need to be writing their own spamometer functions.

2.3. Invocation

Invocation is

spamometer [-c] [-i] [-v] [function-files] < file

(or through procmail, see the file INSTALL).

-i (ignore) means do not read the $HOME/.spamometerrc file

-c (continue-processing) stops the spamometer from exiting when a spamometer function returns $IS_SPAM or $IS_NOT_SPAM. This is useful for running all the registered spamometer functions on a message (with -v) to see what they would each produce.

-v (verbose) means be verbose. You'll see why the spamometer considers the mail to be spam, plus you'll get a header message that includes the Message-ID: from the mail.

Additional arguments will be taken as spamometer function files. These will be read with perl's "do" statement. For what these files may contain, see Spamometer Function Files below.

2.4. Spamometer function files

The spamometer function files (of which $HOME/.spamometerrc is the default) contain perl functions that look at mail headers and bodies and try to figure out if the mail is spam. These functions can interact with one another, use common subroutines, maintain state with normal perl variables, etc.

The $HOME/.spamometerrc will always be the first function file included (using perl's "do" statement). Avoid this with the -i option.

The idea of the design is for the spamometer to deal with getting the mail, calling the perl functions (in an appropriate order) in the spamometer function file, and dealing with their results. You get to simply concentrate on the function files. You can include spamometer functions from anyone you please. You should be able to do almost anything you want. You shouldn't need to modify the spamometer to add new tests for spam.

In order to have one of your functions run at the appropriate time, you must write it, place it in a file that you pass to the spamometer, and register the function. Registration looks like this:

  register("subject_re_test",     $HEADER_FUNC,    priority);
  register("too_many_bangs_test", $BODY_FUNC,      priority);
  register("uppercase_test",      $FULL_BODY_FUNC, priority);

The first arg is the (string) name of the function you want to register. Next is the spamometer function type (see above). There are only 3; use the predefined variables to indicate which yours is.

Lastly, give a priority. Priorities run from 0 (or $HIGHEST_PRIORITY) up to 100 ($LOWEST_PRIORITY). You may also use $DEFAULT_PRIORITY to get something in between. The priority is used to determine the order in which the functions (of the same type) will be called.

The highest priority $HEADER_FUNC spamometer functions are called following the reading of the header. **If no $HEADER_FUNC functions are registered, the header is simply skipped and will not be available to any other functions.** This can easily be avoided by defining a $HEADER_FUNC function that simply returns $IS_NOT_SPAM.

Then, lower priority header functions are invoked. Then, assuming the message has not yet been classified as spam or non-spam, the body is read line by line. After each line, all $BODY_FUNC functions are called, from highest to lowest priority.

Finally, when the entire message body has been read, $FULL_BODY_FUNC functions are called, also from highest to lowest priority.

It might be useful to think of priorities as expected return times: the lower the number, the sooner the function is called. If a spamometer function can do its work very quickly (e.g., by simply looking for a regexp match in one header line), give it a low priority number, close to $HIGHEST_PRIORITY. If a function does a grep through 800 domain names, giving it a higher number will ensure that simpler and faster tests (if any) are run first.
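
The ordering rule can be pictured with the toy registry below. It is only a sketch of the idea, not the spamometer's own code: the stand-in values for the predefined variables, the call_order helper and the domain_grep_test name are assumptions made for this example.

```perl
use strict;
use warnings;

# Stand-ins for the spamometer's predefined values (assumed here).
our ($HEADER_FUNC, $BODY_FUNC, $FULL_BODY_FUNC) = ('header', 'body', 'full_body');
our ($HIGHEST_PRIORITY, $LOWEST_PRIORITY) = (0, 100);

my %registry;   # function type => list of [name, priority]

# Toy register(): remembers the function name under its type.
sub register {
    my ($name, $type, $priority) = @_;
    push @{ $registry{$type} }, [$name, $priority];
}

# Names of one type in calling order: priority 0 first, 100 last.
sub call_order {
    my ($type) = @_;
    return map  { $_->[0] }
           sort { $a->[1] <=> $b->[1] }
           @{ $registry{$type} || [] };
}

register("domain_grep_test",    $HEADER_FUNC, 80);                 # slow, run late
register("subject_re_test",     $HEADER_FUNC, $HIGHEST_PRIORITY);  # cheap, run first
register("too_many_bangs_test", $BODY_FUNC,   50);

print join(", ", call_order($HEADER_FUNC)), "\n";
```

Functions of different types never compete: the priority only orders functions within one type.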

Spamometer functions are all expected to return one of the following values:

  $IS_SPAM
  $IS_NOT_SPAM
  $NO_OPINION
  $I_GIVE_UP      ($BODY_FUNC functions only)
  (0.0 ... 1.0)   (i.e., a real value greater than 0.0 and less than 1.0)

These have the following semantics:

  $IS_SPAM      = The message is spam with probability 1.0. Exit.
  $IS_NOT_SPAM  = The message is not spam with probability 1.0. Exit.
  $NO_OPINION   = I don't know anything (yet).
  $I_GIVE_UP    = I don't know, and I give up. Please don't call me again.
  (0.0 ... 1.0) = I estimate this as the probability this message is spam.

The last option is currently ignored (see the section on Future below).

In the case of $BODY_FUNC functions, a return value of $I_GIVE_UP will cause the spamometer to stop calling that function on subsequent body lines. This allows for spamometer functions that attempt to identify spam by looking at the start of the body of a message. If they cannot identify the message as spam, they return $I_GIVE_UP to eliminate themselves. This is a big advantage, since the entire body of the mail message need not be read. Typically, such functions will either make a quick decision that a mail is or is not spam, or give up. For an example, see the too_many_bangs_test in the distributed .spamometerrc file.
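
A body function that looks only at the start of a message and then bows out could be sketched as follows. The function early_dollars_test, the 10-line window, the threshold of 3 lines and the toy driver run_body_func are all hypothetical, and the predefined return values are replaced by stand-in strings so the example runs on its own. Headers are left out of the argument list to keep it short.

```perl
use strict;
use warnings;

# Stand-ins for the spamometer's predefined return values (assumed).
our ($IS_SPAM, $NO_OPINION, $I_GIVE_UP) = ('IS_SPAM', 'NO_OPINION', 'I_GIVE_UP');

our $dollar_lines = 0;   # state kept between calls, as ordinary perl data

# Hypothetical $BODY_FUNC: counts lines containing '$' in the first 10
# body lines and declares spam once 3 such lines have been seen.
sub early_dollars_test {
    my ($line, $nlines, $nchars) = @_;
    return $I_GIVE_UP if $nlines > 10;        # past the window: bow out
    $dollar_lines++ if $line =~ /\$/;
    return $dollar_lines >= 3 ? $IS_SPAM : $NO_OPINION;
}

# Toy driver: feed body lines until the function decides or gives up.
sub run_body_func {
    my ($func, @lines) = @_;
    my ($nlines, $nchars) = (0, 0);
    for my $line (@lines) {
        $nlines++;
        $nchars += length $line;
        my $result = $func->($line, $nlines, $nchars);
        return $result if $result ne $NO_OPINION;
    }
    return $NO_OPINION;
}

print run_body_func(\&early_dollars_test,
                    'Earn $500 a day', 'Only $20 to join', 'Send $5 now'), "\n";
```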

The calling interface of these 3 function types is as follows:

    $HEADER_FUNC:    (%headers, $verbose)
    $BODY_FUNC:      ($line, $nlines, $nchars, %headers, $verbose)
    $FULL_BODY_FUNC: ($file, $nlines, $nchars, %headers, $verbose)

In all cases, $verbose indicates whether the user specified -v (in which case messages can be printed (to STDOUT) indicating reasons for rejection as spam, or otherwise). $line is the just-read body line. $nlines and $nchars are the number of lines and chars read so far. In the case of the $FULL_BODY_FUNC functions, this will be the total number of lines and chars in the file. These are passed in the hope that they will save on recomputation (or independent computation by several spamometer functions). $file contains the name of the file with the body of the mail message in it. This file needs to be opened, and you should remember to close it (there may be many functions reading this file after yours).
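
The open-then-close discipline for $FULL_BODY_FUNC functions might look like this sketch. The function name uppercase_ratio_test, the 20-letter minimum, the 0.5 ratio and the returned 0.8 are invented for illustration, and the headers are left out of the argument list for brevity.

```perl
use strict;
use warnings;

our $NO_OPINION = 'NO_OPINION';   # stand-in for the spamometer's value

# Hypothetical $FULL_BODY_FUNC: estimates the spam probability from the
# share of letters in the body that are upper case.
sub uppercase_ratio_test {
    my ($file, $nlines, $nchars) = @_;
    open(my $fh, '<', $file) or return $NO_OPINION;
    my ($upper, $letters) = (0, 0);
    while (my $line = <$fh>) {
        $upper   += ($line =~ tr/A-Z//);      # tr returns the match count
        $letters += ($line =~ tr/A-Za-z//);
    }
    close($fh);                               # others may read the file after us
    return $NO_OPINION if $letters < 20;      # too little text to judge
    my $ratio = $upper / $letters;
    return $ratio > 0.5 ? 0.8 : $NO_OPINION;  # 0.8 is an invented figure
}
```

Using a lexical filehandle means the file is closed even if the function returns early by mistake, but the explicit close keeps the intent obvious.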

The %headers variable is an associative array that holds the header of the mail message. This contains keys such as $headers{'subject:'} and $headers{'to:'}. Note that the : is left in the key and the key is always lower case. The colon is left in to allow you to look at both $headers{'from:'} and $headers{'from'}.

If the spamometer finds continuation lines in the header, these are simply concatenated (with the newline removed). The extra whitespace is left intact.
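
To make the %headers layout concrete, here is a sketch of a parser that produces the same shape: lower case keys with the colon kept, and continuation lines joined on with their whitespace intact. It is not the spamometer's own parser. In particular, trimming the space after the colon, letting a repeated header overwrite the earlier one, and storing the mbox "From " line under the key 'from' are assumptions of this example.

```perl
use strict;
use warnings;

# Hypothetical parser: takes header lines, returns a %headers-style hash.
sub parse_headers {
    my (@lines) = @_;
    my (%headers, $key);
    for my $line (@lines) {
        $line =~ s/\r?\n$//;                   # drop the newline
        last if $line eq '';                   # blank line ends the header
        if ($line =~ /^\s/ && defined $key) {
            $headers{$key} .= $line;           # continuation: keep whitespace
        } elsif ($line =~ /^([^\s:]+:)\s*(.*)$/) {
            $key = lc $1;                      # e.g. 'subject:'
            $headers{$key} = $2;
        } elsif ($line =~ /^From (.*)$/) {
            $key = 'from';                     # mbox "From " line, no colon
            $headers{$key} = $1;
        }
    }
    return %headers;
}

my %headers = parse_headers("Subject: Hello\n", "  world\n", "\n");
print "$headers{'subject:'}\n";
```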

2.5. Diagnostics

Normal output (produced from -v) appears on STDOUT. Error output appears on STDERR.

By default, the spamometer produces no output. The exit status (see below) of the process indicates spam or non-spam.

2.6. Exit status

The spamometer exits with status

0 if it believes stdin contains spam
1 if it believes stdin does not contain spam
2 if you get the usage wrong
3 if there is a problem with a function file

This usage is convenient for procmail. If the spamometer cannot determine things one way or another, it exits with the non-spam value.
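
If you call the spamometer from a perl script rather than from procmail, the exit status can be decoded as in the sketch below. The helper describe_status and its wording are made up for this example; only the four status values come from the list above.

```perl
use strict;
use warnings;

# Exit statuses as listed above.
my %MEANING = (
    0 => 'spam',
    1 => 'not spam',
    2 => 'usage error',
    3 => 'problem with a function file',
);

# Hypothetical helper: turn perl's $? (a wait status) into a description.
sub describe_status {
    my ($wait_status) = @_;
    return 'could not run spamometer' if $wait_status == -1;
    return 'killed by a signal'       if $wait_status & 127;
    my $code = $wait_status >> 8;              # the exit status proper
    return exists $MEANING{$code} ? $MEANING{$code}
                                  : "unexpected exit status $code";
}

# Usage, assuming spamometer is on your PATH and mail.txt holds one message:
#   system('spamometer < mail.txt');
#   print describe_status($?), "\n";
```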

2.7. Debugging hint

If you're seeing a message from the spamometer that looks like

spamometer: Could not do filename.

this is a sign that the file filename contains perl that is incorrect. Try running just that file with perl filename. This will show you the error. When you've fixed it, you'll see instead a message like this

Undefined subroutine &main::register called at filename line 20.

which means that everything is now fine. Go ahead and try the spamometer again.

2.8. Unpacking

Try something like this:

gunzip -c spamometer.tar.gz | tar xfv -

This will create a directory for you, called something like spamometer-0.1. If you see some sort of error, drop me a line.

2.9. Manifest

You should see these files/dirs after unpacking and changing directory into the directory everything got unpacked into:

README.html           you know what.
INSTALL               brief instructions on you know what.
spamometer            the perl script.
.spamometerrc         a simple sample ~/.spamometerrc for you.
.spamometerrc-jds     some functions from J. Daniel Smith
.spamometerrc-terry   some that I use (some from above).
.spamometerrc-extras  some extras.
sample-mails          some mails you can feed to the spamometer (spamometer -v < file works best).



3. The Library Of Spam

Yes, it's true, I want your spam!

Yes, that's right folks, I want your spam. I'm going to collect spam for a while. I'm going to write some spamometer functions to recognize this stuff based on the message bodies (though I do want your headers too).

Also, if you currently have files of collected spam lying around, and you don't mind parting with them, please send them on too. You can send me gzipped attachments, or contact me and I'll tell you where you can ftp your collection to.

3.1. Contribute via procmail

If you use procmail to get rid of spam, please consider replacing your rules like this:
  :0
  * spam conditions
  /dev/null

with this:

  :0
  * spam conditions
  ! library-of-spam@cliffs.ucsd.edu

3.2. But why?

One of my plans is to feed the messages (perhaps bodies, perhaps headers, perhaps both) to a neural net and/or classification programs & do supervised learning. The evidence that comes from these can then be used by the spamometer. Even if a neural net were only 50% accurate, it would be useful when combined with evidence from other sources. Plus, I want to tune some simpler spam recognizers that operate on mail bodies (looking for !'s, $'s and words like earn, money, mlm and so on).

But to do all this, I need your spam (baby). All of it.

3.3. Acknowledgment

I'll happily make all the spam I receive available to the world. As if we weren't awash with it in the first place.

Additionally, I will personally chisel the names appearing in From: headers of all contributors to the Library Of Spam upon the stone tablet that will be placed above the main portal*.

* congressional funds permitting.



4. Help me

There are many ways you could improve this code.

I really have no time for this, and I'm a fairly bad perl programmer. Many nice spamometer functions can be written with no need to understand or change the spamometer code itself. Existing code for detecting spam can be incorporated. This stuff could use a web page, an info file, a manpage. Someone who knows Bayesian stats could do the heart of the thing. If you want to do any of this, please jump in. My fingers are numb.



5. Future

The spamometer is basically a piece of scaffolding to support the program I really wanted to write.

The piece that is missing is treatment of probabilities other than 0.0 and 1.0.

Someone with experience in Bayesian inference (or similar) might be able to help me out here. The basic idea is to accumulate evidence that a piece of mail is spam (or otherwise). So, for example, if you know that roughly half the mail you receive with a subject line that is all uppercase and which ends with a ! is spam, you could write a tiny $HEADER_FUNC that returned 0.5 if this were the case. The evidence would be incorporated with evidence from other sources to determine (or, more likely, guess educatedly) if the mail were spam. This evidence might be combined with the presence of 3 or more $ signs in the first 10 lines, plus the high probability that mail coming from a domain that looked like "@.*sales.*\.com" was spam. These sorts of things would be offset by low probabilities when your functions encountered reassuring things in the mail (for example, the presence of a subject line that started with "Re:", the presence of an "In-Reply-To" header, or mail that comes from a .edu domain).
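
One possible way to do the combining is sketched below. It is not part of the spamometer. It adds up log-odds in the naive Bayes manner, which assumes that the pieces of evidence are independent and that spam and non-spam are equally likely beforehand; both are simplifications, and the name combine_evidence is made up for this example.

```perl
use strict;
use warnings;

# Combine per-function spam probabilities into one estimate.
# Assumes independent evidence and a prior of 0.5.
sub combine_evidence {
    my (@probs) = @_;
    my $log_odds = 0;
    for my $p (@probs) {
        next unless $p > 0 && $p < 1;          # skip anything outside (0, 1)
        $log_odds += log($p / (1 - $p));       # 0.5 contributes nothing
    }
    return 1 / (1 + exp(-$log_odds));
}

# An all-caps subject (0.5), many $ signs (0.8), a "sales" domain (0.7)
# and a reassuring In-Reply-To header (0.1):
printf "%.3f\n", combine_evidence(0.5, 0.8, 0.7, 0.1);
```

Under this scheme a function that returns 0.5 leaves the estimate unchanged, while values above and below 0.5 pull the estimate towards spam and non-spam respectively.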

As the arms race of spammer vs spamee evolves, a program such as the spamometer can be easily modified to reflect the current state of the art. For example, where perhaps it was once a reliable indicator that mail was not spam if it contained a "Comments: Authenticated Sender is .*+@" header, this can now be taken as a red flag. The spamometer (if what I want to get done ever gets implemented by someone who knows how) can be nicely adjusted by simply altering probabilities or adding simple functions.


Terry Jones (terry <AT> jon.es)