From LedHed's Wiki

UPGRADING DSPAM


Find the section below for the version you are currently running, then work upward through every section above it, applying the steps in each section in order.


Upgrading from 3.8


1. Ensure MySQL is using the new database schema. The following clauses should be executed for upgrading pre-3.9.0 DSPAM MySQL schema to the 3.9.0 schema:

ALTER TABLE `dspam_signature_data`
 CHANGE `uid` `uid` INT UNSIGNED NOT NULL,
 CHANGE `data` `data` LONGBLOB NOT NULL,
 CHANGE `length` `length` INT UNSIGNED NOT NULL;
ALTER TABLE `dspam_stats`
 CHANGE `uid` `uid` INT UNSIGNED NOT NULL,
 CHANGE `spam_learned` `spam_learned` BIGINT UNSIGNED NOT NULL,
 CHANGE `innocent_learned` `innocent_learned` BIGINT UNSIGNED NOT NULL,
 CHANGE `spam_misclassified` `spam_misclassified` BIGINT UNSIGNED NOT NULL,
 CHANGE `innocent_misclassified` `innocent_misclassified` BIGINT UNSIGNED NOT NULL,
 CHANGE `spam_corpusfed` `spam_corpusfed` BIGINT UNSIGNED NOT NULL,
 CHANGE `innocent_corpusfed` `innocent_corpusfed` BIGINT UNSIGNED NOT NULL,
 CHANGE `spam_classified` `spam_classified` BIGINT UNSIGNED NOT NULL,
 CHANGE `innocent_classified` `innocent_classified` BIGINT UNSIGNED NOT NULL;
ALTER TABLE `dspam_token_data`
 CHANGE `uid` `uid` INT UNSIGNED NOT NULL,
 CHANGE `spam_hits` `spam_hits` BIGINT UNSIGNED NOT NULL,
 CHANGE `innocent_hits` `innocent_hits` BIGINT UNSIGNED NOT NULL;


If you are using preference extension with DSPAM, then you should execute the following clause for upgrading pre-3.9.0 DSPAM preference MySQL schema to the 3.9.0 schema:

ALTER TABLE `dspam_preferences` 
 CHANGE `uid` `uid` INT UNSIGNED NOT NULL;


If you are using virtual users (with AUTO_INCREMENT) in DSPAM, then you should execute the following clause for upgrading pre-3.9.0 DSPAM virtual uids MySQL schema to the 3.9.0 schema:

ALTER TABLE `dspam_virtual_uids`
 CHANGE `uid` `uid` INT UNSIGNED NOT NULL AUTO_INCREMENT;


If you are using virtual user aliases (aka: DSPAM in relay mode) in DSPAM, then you should execute the following clause for upgrading pre-3.9.0 DSPAM virtual uids MySQL schema to the 3.9.0 schema:

ALTER TABLE `dspam_virtual_uids`
 CHANGE `uid` `uid` INT UNSIGNED NOT NULL;


If you need to speed up the MySQL purging script and can afford to use more disk space for the DSPAM MySQL data, then consider executing the following clause for adding three additional indices:

ALTER TABLE `dspam_token_data`
 ADD INDEX(`spam_hits`),
 ADD INDEX(`innocent_hits`),
 ADD INDEX(`last_hit`); 


2. Ensure PostgreSQL is using the new database schema. The following clauses should be executed for upgrading pre-3.9.0 DSPAM PostgreSQL schema to the 3.9.0 schema:

ALTER TABLE dspam_preferences ALTER COLUMN uid TYPE integer;
ALTER TABLE dspam_signature_data ALTER COLUMN uid TYPE integer;
ALTER TABLE dspam_stats ALTER COLUMN uid TYPE integer;
ALTER TABLE dspam_token_data ALTER COLUMN uid TYPE integer;
DROP INDEX IF EXISTS id_token_data_sumhits;


If you are using virtual users in DSPAM, then you should execute the following clause for upgrading pre-3.9.0 DSPAM virtual uids to the 3.9.0 schema:

ALTER TABLE dspam_virtual_uids ALTER COLUMN uid TYPE integer;
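
One possible way to apply the PostgreSQL clauses from step 2 is as a single transaction, so that a failure part-way through leaves the old schema untouched (PostgreSQL can roll back ALTER TABLE). The sketch below only writes the clauses to a temporary file and prints a psql command for review; it does not run anything against a database. The database name "dspam" is an assumption; --single-transaction, --set and --file are standard psql options. If you use virtual users, the dspam_virtual_uids clause above would need to be added to the file as well.

```shell
# Sketch: collect the 3.9.0 PostgreSQL schema changes in one file and
# show how they could be applied atomically.
# DB_NAME is an assumption; substitute your own database name.
DB_NAME=${DB_NAME:-dspam}
SQL_FILE=$(mktemp)

cat > "$SQL_FILE" <<'EOF'
ALTER TABLE dspam_preferences ALTER COLUMN uid TYPE integer;
ALTER TABLE dspam_signature_data ALTER COLUMN uid TYPE integer;
ALTER TABLE dspam_stats ALTER COLUMN uid TYPE integer;
ALTER TABLE dspam_token_data ALTER COLUMN uid TYPE integer;
DROP INDEX IF EXISTS id_token_data_sumhits;
EOF

# Printed, not executed: review the command, then run it by hand.
PSQL_CMD="psql --single-transaction --set ON_ERROR_STOP=1 --file $SQL_FILE $DB_NAME"
echo "$PSQL_CMD"
```

Taking a backup of the database before running the printed command is a sensible precaution.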


Upgrading From 3.6


1. Add 'Tokenizer' setting to dspam.conf. The 'Tokenizer' setting in 3.8.0 replaces tokenizer definitions in the "Feature" clause of previous version configurations. See src/dspam.conf (after make) for more information about this setting.


2. Check calls to dspam_logrotate. Earlier versions of 3.6 did not require a leading "-l" flag when specifying the log file. This flag is now required.


3. Ensure 3.6.0 misaligned hash databases are converted. Version 3.6.0 failed to align hash databases to 8-byte boundaries. If you are upgrading from v3.6.0 and are using the hash_drv storage driver, you should run cssconvert to upgrade your .css files to a fully aligned format.


4. Invert "SupressWebStats" setting in dspam.conf. SupressWebStats has been renamed to simply WebStats, and the setting is inverted. Be sure to update this in dspam.conf.


5. Add "ProcessorURLContext" setting in dspam.conf. ProcessorURLContext has been added to toggle whether URL-specific tokens are created in the tokenizer process. The "on" value matches the default behavior of previous versions of DSPAM.
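
The inversion in step 4 is easy to get backwards, so a small filter can help. The following is a minimal sketch, assuming the setting is written as "SupressWebStats on" or "SupressWebStats off" on a line of its own; the function name invert_webstats is hypothetical and not part of DSPAM. It reads a configuration on standard input and writes the converted text to standard output, leaving all other lines alone.

```shell
# Sketch: rewrite a pre-3.8 "SupressWebStats" line as the inverted
# "WebStats" setting. Reads a config on stdin, writes it to stdout.
invert_webstats() {
    sed -e 's/^SupressWebStats[[:space:]]\{1,\}on[[:space:]]*$/WebStats off/' \
        -e 's/^SupressWebStats[[:space:]]\{1,\}off[[:space:]]*$/WebStats on/'
}

# Example: a two-line config fragment.
# Prints "WebStats off" followed by "Notifications off".
printf 'SupressWebStats on\nNotifications off\n' | invert_webstats
```

Review the output before replacing your real dspam.conf with it.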


Upgrading From 3.4


Follow all of the steps above, and the following steps:

1. Add "ProcessorBias" setting to dspam.conf. ProcessorBias has been added to dspam.conf and must be specified. Since ProcessorBias is the default behavior for previous versions of DSPAM, you will need to add "ProcessorBias on" to dspam.conf. If you have specifically disabled bias, or are using a technique such as Markovian discrimination, you may leave this feature off.


2. Ensure references to SBLQueue are changed to RABLQueue. Older versions of DSPAM used the SBLQueue setting to write files for a DSPAM SBL setup. This has been renamed to RABLQueue. Please change this in dspam.conf if you are writing to a SBL/RABL installation.


3. Add "TestConditionalTraining" setting to dspam.conf. TestConditionalTraining has been added to dspam.conf and must be specified to be enabled. Since TestConditionalTraining is the default behavior in DSPAM, it is strongly recommended that you add "TestConditionalTraining on" to dspam.conf.


4. Ensure PostgreSQL installations have a lookup_tokens function. PostgreSQL systems running v8.0+ must create the function lookup_tokens added to pgsql_objects.sql. The driver now checks your version and uses this function to improve performance on 8.0+.


5. Ensure you are specifying the correct storage driver. hash_drv is now the new default storage driver. hash_drv has no dependencies and is extremely fast/efficient. If you're not familiar with it, you should check out the readme. If you were previously using SQLite, you will now need to specify it as the storage driver: --with-storage-driver=sqlite_drv


NOTE:
Berkeley DB drivers (libdb3_drv, libdb4_drv) are deprecated and have been removed from the build. You will need to select an alternative storage driver in order to upgrade.
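
The dspam.conf changes in steps 1 to 3 can be checked mechanically. This is a rough sketch only: the function name check_conf_34 and the sample path /var/spool/sbl are hypothetical, and it assumes each setting appears at the start of a line followed by whitespace. It prints one line per finding and nothing if the file looks up to date.

```shell
# Sketch: report which of the 3.4-era changes above still need
# attention in a given dspam.conf.
check_conf_34() {
    conf=$1
    for setting in ProcessorBias TestConditionalTraining; do
        # A setting counts as present only when it is not commented out.
        grep -Eq "^[[:space:]]*$setting[[:space:]]" "$conf" ||
            echo "missing: $setting"
    done
    if grep -Eq '^[[:space:]]*SBLQueue[[:space:]]' "$conf"; then
        echo "rename: SBLQueue -> RABLQueue"
    fi
}

# Example against a throwaway file (the queue path is made up).
tmp_conf=$(mktemp)
printf 'ProcessorBias on\nSBLQueue /var/spool/sbl\n' > "$tmp_conf"
check_conf_34 "$tmp_conf"
rm -f "$tmp_conf"
```

For the sample file, the function reports the missing TestConditionalTraining setting and the SBLQueue rename.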


FRESH INSTALLATION



PREREQUISITES

DSPAM can use one of many different backends to store its information, and you will need to decide on one and install the appropriate software before you can build DSPAM. The following storage backends are presently available:

   Driver       Requirements
  -------------------------------------------------------------------------
 T mysql_drv:   MySQL client libraries      (and a server to connect to) 
 T pgsql_drv:   PostgreSQL client libraries (and a server to connect to)
   sqlite_drv:  SQLite v2.7.7 or above 
   sqlite3_drv: SQLite v3.x
*T hash_drv:    None (Self-Contained Hash-Based Driver)

  Legend:
   * Default storage driver
   T Thread-safe (Required for running DSPAM in server daemon mode)


In general, MySQL is one of the faster solutions with a smaller storage footprint, and is well suited for both small and large-scale implementations.


The hash driver (inspired by Bill Yerazunis' CRM Sparse Spectra algorithm) is by far the fastest solution. It has no dependencies, is compact, and supports an auto-extend feature that grows the file as needed. It does, however, lack some features (such as merged groups support) and uses a lot of memory to mmap() users.


Documentation for any additional setup of your selected storage driver can be found in the doc/ directory. You'll need to follow any steps outlined in the storage driver documentation before continuing.


You can download MySQL from http://www.mysql.com.

You can download PostgreSQL from http://www.postgresql.org.

You can download SQLite from http://www.sqlite.org.


CONFIGURATION


DSPAM uses autoconf, so configuration is fairly standardized with other UNIX-based software:

./configure [options]

DSPAM supports the configuration options below. Generally, the default configuration is more than acceptable, so it's a good idea not to tweak too many settings unless you know what you are doing.


PATH SWITCHES
--prefix=DIR

Specify an alternative root prefix for installation. The default is /usr/local. This does not affect the location of dspam.conf (which defaults to /usr/local/etc). Use --sysconfdir= for this.


--sysconfdir=DIR

Specify an alternative home for the dspam.conf file. The default is prefix/etc.


--with-dspam-home=DIR

Specify an alternative DSPAM home for installation. This can alternatively be changed in dspam.conf, but is convenient to do on the configure line. The default is $prefix/var/dspam, or /usr/local/var/dspam.


--with-logdir=DIR

Specify an alternative log directory. The default is $dspam_home/log. Do not set this to /var/log unless DSPAM will have permissions to write to the directory.


FILESYSTEM SCALE

The default filesystem scale is "small-scale", and writes each user to its own directory in the top-level DSPAM home data directory. The following two switches allow the scale to be changed to be more suitable for larger installations.


--enable-large-scale

Switch for large-scale implementation. User data will be stored as $HOME/data/u/s/user instead of $HOME/data/user


--enable-domain-scale

Switch for domain-scale implementation. When used, DSPAM expects username@domain to be passed in as the user id and user data will be stored as $HOME/data/domain.com/user and $HOME/opt-in/domain/user.dspam instead of $HOME/data/user
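
To make the three layouts concrete, here is a minimal sketch that maps a user to a data directory for each scale. It is an illustration of the paths described above, not DSPAM's own code: the function name user_data_dir is hypothetical, and it assumes the large-scale levels are the first and second characters of the username (usernames shorter than two characters are not handled).

```shell
# Sketch of the three directory layouts described above.
# Usage: user_data_dir SCALE DSPAM_HOME USER
user_data_dir() {
    scale=$1 home=$2 user=$3
    case $scale in
        large)
            # First and second characters of the username become
            # two extra directory levels: data/u/s/user
            first=$(printf '%s' "$user" | cut -c1)
            second=$(printf '%s' "$user" | cut -c2)
            echo "$home/data/$first/$second/$user"
            ;;
        domain)
            # user@domain is split at the "@": data/domain/user
            echo "$home/data/${user#*@}/${user%%@*}"
            ;;
        *)
            # Default small-scale layout: data/user
            echo "$home/data/$user"
            ;;
    esac
}

user_data_dir small  /usr/local/var/dspam user            # /usr/local/var/dspam/data/user
user_data_dir large  /usr/local/var/dspam user            # /usr/local/var/dspam/data/u/s/user
user_data_dir domain /usr/local/var/dspam bob@domain.com  # /usr/local/var/dspam/data/domain.com/bob
```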


INTEGRATION SWITCHES

--with-storage-driver=DRIVER[,DRIVER2[...,DRIVERN]]

Specify your storage driver selection(s). A storage driver is a driver written specifically for DSPAM to store tokens, signature data, and perform other proprietary operations. The default driver is hash_drv. The following drivers have been provided:

mysql_drv:   MySQL Drivers 
pgsql_drv:   PostgreSQL Drivers
sqlite_drv:  SQLite v2.x Drivers 
sqlite3_drv: SQLite v3.x Drivers
hash_drv:    Self-Contained Hash Database


If you are a packager, or wish to have multiple drivers built for any reason, you may specify multiple drivers by separating them with commas. This will cause the storage driver specified in dspam.conf to be dynamically loaded at runtime rather than statically linked. If you wish to build only one driver, but dynamically, then specify it twice as in:

--with-storage-driver=mysql_drv,mysql_drv


If you will be compiling DSPAM to operate as a server daemon or to deliver via SMTP/LMTP, you will need to use a thread-safe driver (outlined in the chart earlier in this document). You may also need to use some of the driver-specific configure flags (discussed in the DRIVER SPECIFIC CONFIGURATION OPTIONS section below).


--disable-trusted-user-security

Administrators who wish to disable trusted user security may do so by using this configure flag. This will cause DSPAM to treat each user as if they were "trusted" which could allow them to potentially execute arbitrary commands on the server via DSPAM. Because of this, administrators should only use this option on either a closed server, or configure their DSPAM binary to be executable only by users who can be trusted. This option SHOULD NOT be used as a solution to your MTA dropping privileges prior to calling DSPAM. Instead, see the TRUSTED SECURITY section of this document.


--enable-homedir

When enabled, instead of checking for $HOME/$USER/opt-in/$USER[.dspam|.nodspam], DSPAM will check for a .dspam|.nodspam file in the user's home directory. DSPAM will also store each user's data in ~/.dspam when this option is enabled. Because of this, DSPAM will automatically install and run setuid root so that it can read each user's home directory.


NOTE:
This function is incompatible with most implementations of the Web UI, since it requires access to read each user's home directory. Therefore, only use this option if you will not be using the Web UI or plan on doing something asinine like running it as root.


--enable-daemon

Builds DSPAM with support for daemon mode, and builds associated dspamc thin client. Pthreads is required to build for daemon mode and the storage driver used must be thread-safe.


DRIVER SPECIFIC CONFIGURE SWITCHES

Some storage drivers have their own custom configuration switches:


mysql_drv:

--with-mysql-includes=DIR

Specify a path to the MySQL includes


--with-mysql-libraries=DIR

Specify a path to the MySQL libraries (Currently links to -lmysqlclient, also -lcrypto on some systems)


--enable-virtual-users

Tells DSPAM to create virtual user ids. Use this if your users don't actually exist on the system (e.g. in /etc/passwd if using a password file)


--enable-preferences-extension

MySQL supports the preferences extension, which stores user preferences in mysql instead of flat files (the built-in method)


--disable-mysql4-initialization

If you are compiling libdspam for use with a third party application, and the third party application makes its own calls to libmysqlclient, you should use this option to disable libdspam's initialization and cleanup of libmysqlclient, and allow the application to manage this. This option suppresses libdspam's calls to mysql_server_init and mysql_server_end.


NOTE:
Please see the file doc/mysql_drv.txt for more information about configuring the mysql_drv storage driver.


pgsql_drv:

--with-pgsql-includes=DIR

Specify a path to the PgSQL includes


--with-pgsql-libraries=DIR

Specify a path to the PgSQL libraries (Currently links to -lpq, and netlibs on some systems)


--enable-virtual-users

Tells DSPAM to create virtual user ids. Use this if your users don't actually exist on the system (e.g. in /etc/passwd if using a password file)


--enable-preferences-extension

Postgres supports the preferences extension, which stores user preferences in pgsql instead of flat files (the built-in method)


NOTE:
Please see the file doc/pgsql_drv.txt for more information about configuring the pgsql_drv storage driver.


sqlite_drv: sqlite3_drv:


--with-sqlite-includes=DIR

Specify a path to the SQLite includes


--with-sqlite-libraries=DIR

Specify a path to the SQLite libraries


DEBUGGING SWITCHES

--enable-debug

Turns on support for debugging output. This option allows you to turn on debugging messages for all or some users by editing dspam.conf or setting --debug on the commandline. Enabling debug in configure only compiles in debugging support; it must still be activated using one of the methods described above. Debugging support itself doesn't use many additional resources, so it should be safe to leave enabled on non-enterprise class systems.


--enable-verbose-debug

Turns on extremely verbose debugging output. --enable-debug is implied. Never use this on production builds!


NOTE:
When verbose debug is compiled in, DSPAM performs many additional mathematical calculations regardless of whether or not it's been activated. You shouldn't use --enable-verbose-debug for production builds unless you have serious issues you can't resolve.


FEATURE ACTIVATION

--enable-clamav

Enables support for Clam Antivirus. DSPAM can interface directly with clamd to perform virus scanning and can be configured to react in different ways to viruses. See dspam.conf for more information.
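
Putting several of the switches above together, a daemon-mode build against MySQL might be configured roughly as follows. This is only a sketch: it prints the command for review rather than running it, every switch is one documented above, and the two MySQL paths are assumptions that will differ between systems.

```shell
# Sketch: keep the configure switches in one place and print the
# resulting command instead of running it.
# The MySQL include and library paths are assumptions.
CONFIGURE_ARGS="--prefix=/usr/local \
--sysconfdir=/usr/local/etc \
--with-storage-driver=mysql_drv \
--enable-daemon \
--enable-virtual-users \
--with-mysql-includes=/usr/include/mysql \
--with-mysql-libraries=/usr/lib/mysql"

echo "./configure $CONFIGURE_ARGS"
```

mysql_drv is used here because daemon mode requires a thread-safe storage driver.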


ADDITIONAL CONFIGURATION OPTIONS

The remainder of configuration options are located in dspam.conf, which is installed in sysconfdir (default: /usr/local/etc) upon a make install. It is generally a good idea to review dspam.conf and make any changes necessary prior to using DSPAM.


BUILDING AND INSTALLING


After you have run configure with the correct options, build and install DSPAM by performing:

make && make install


NOTE:
If you are a developer wanting to link to the core engine of dspam, libdspam will be built during this process. Please see the example.c file for examples of how to link to and use libdspam. Static and dynamic libraries are built in the .libs directory. Needed headers will be installed in $prefix/include/dspam.


PERMISSIONS


In the typical UNIX environment, you'll need to worry about the following permissions:


The CGI User: This is the user your web server (most likely Apache) is running as. This is commonly 'nobody' or 'web'. You can find this in Apache's httpd.conf by searching for 'User'. The CGI user will need the ability to access the following components of DSPAM:

  • Ability to execute the dspam binary
  • Ability to read and write to dspam_home/data/
  • Trusted user permissions in dspam.conf ("Trust [username]")
  • The execution 'Group' used must match the group dspam is running as (this is typically 'mail', 'dspam', or similar).


The MTA User: This is the user your mail server software is running as when it executes DSPAM. This is usually daemon, mail, exim, etc. This is typically different from the user the MTA runs and polices itself as, to avoid security problems. Consult your MTA's documentation for more info. The MTA user will require:

  • The ability to execute the dspam binary
  • Trusted user permissions in dspam.conf ("Trust [username]")


Systems Administrators: In order to perform administrative functions, systems administrators will require:

  • The ability to execute dspam-related binaries
  • Trusted user permissions in dspam.conf ("Trust [username]")


NOTE:
If the MTA is communicating with DSPAM via LMTP (explained later), then execution permissions are not necessary.


NOTE about FreeBSD:
FreeBSD's default MTA user is 'mailnull'. FreeBSD's default delivery agent also changes its uid, so in order to call it, dspam must be installed as setuid root to work on the commandline properly. This is done automatically on install.


Understanding Trusted User Security

DSPAM has tighter security for untrusted users on the system to prevent them from touching other users' data or passing arbitrary commands to the delivery agent DSPAM calls. "Trusted User Security" is a simple system whereby any unsafe functions are not available to a user calling dspam unless they are within dspam.conf's trusted user list.


Local non-privileged users should be able to use DSPAM without any problems while remaining untrusted, as long as they behave. For example, an untrusted user cannot set their DSPAM username to any name other than their username. Untrusted users are also limited to the delivery options set by the system administrator, and cannot redirect how DSPAM delivers mail.


A list of trusted users is maintained in dspam.conf. This file should include a list of trusted users who should be allowed to set the dspam user, passthru parameters, and other information that would be potentially dangerous for a malicious user to be able to set. You'll need to ensure that your CGI user, MTA user, and system administrators are on the list.
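
A quick way to confirm that the CGI user, MTA user, and administrators are all listed is to search dspam.conf for their "Trust" lines. The sketch below assumes one "Trust username" entry per line, as in the examples above; the function name is_trusted and the sample usernames are hypothetical, and usernames containing regular-expression characters would need escaping.

```shell
# Sketch: succeed if USER has an active "Trust USER" line in the given
# dspam.conf. Commented-out lines do not count.
is_trusted() {
    conf=$1 user=$2
    grep -Eq "^[[:space:]]*Trust[[:space:]]+$user[[:space:]]*\$" "$conf"
}

# Example against a throwaway file with made-up users.
tmp_conf=$(mktemp)
printf 'Trust root\nTrust mail\n#Trust nobody\n' > "$tmp_conf"
for u in root mail nobody; do
    if is_trusted "$tmp_conf" "$u"; then
        echo "$u: trusted"
    else
        echo "$u: NOT trusted"
    fi
done
rm -f "$tmp_conf"
```

In the sample file, 'nobody' is reported as not trusted because its line is commented out.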


MAIL SERVER INTEGRATION


As previously mentioned, there are three popular ways to implement DSPAM:


As a delivery proxy

The default approach integrates DSPAM directly with the mail server and filters spam as mail comes in. Please see the appropriate instructions in doc/ pertaining to your MTA.


As a POP3 proxy

This alternative approach implements a POP3 proxy where users connect to the proxy to check their email, and email is filtered when being downloaded. The POP3 proxy is a much easier approach, as it requires much less integration work with the mail server (and is ideal for implementing DSPAM on Exchange, etcetera). Please see the file doc/pop3filter.txt.


As an SMTP Relay

DSPAM can be configured as an SMTP relay, a.k.a. an appliance. You can set it up to sit in front of your real mail server and then point your MX records at it. DSPAM will then pass along the good mail to your real SMTP server. See doc/relay.txt for more information. The example provided uses Postfix and MySQL.


Trusted users and the MTA

If you are using an MTA that changes its userid to match the destination user before calling DSPAM, you won't be able to provide pass-thru arguments to DSPAM (these are the commandline arguments that DSPAM in turn passes to the local delivery agent, in such a configuration). You will need to pre-configure the "default" pass-thru arguments in DSPAM. This can be done by declaring an untrusted delivery agent in dspam.conf. When DSPAM is called by an untrusted user, it will automatically force their DSPAM user id and the passthru delivery agent arguments specified in dspam.conf.

This information will override any passthru commandline parameters specified by the user. For example:

UntrustedDeliveryAgent       "/bin/mail -d $u"

The variable $u informs DSPAM that you would like the destination username to be used in the position $u is specified, so when DSPAM calls your LDA for user 'bob', it will call it with:

/bin/mail -d bob
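
The substitution can be pictured with a few lines of shell. This is only an illustration of the $u expansion described above, not how DSPAM implements it; the function name expand_agent is hypothetical, and usernames containing "/" or "&" would need escaping for sed.

```shell
# Sketch of the $u substitution: replace every literal "$u" in the
# agent template with the destination username.
expand_agent() {
    template=$1 user=$2
    printf '%s\n' "$template" | sed "s/[\$]u/$user/g"
}

expand_agent '/bin/mail -d $u' bob    # prints: /bin/mail -d bob
```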


ALIASES

There are essentially two different ways a user might train DSPAM. The first is by using the Web UI, which allows them to retrain via the "History" tab. This works quite well, as users must visit the Web UI occasionally to review their quarantine anyway (and reverse any false positives). We'll discuss this shortly in section 1.1.8.


The more common approach to training, discussed here, is to allow users to simply forward their spam to an email address where DSPAM can analyze and learn it. DSPAM uses a signature-based system, where a serial number of sorts is appended to each email processed by DSPAM. DSPAM reads this serial number when the user forwards (or bounces) a message to what is called their "spam email address". The serial number points to temporary information stored on the server (for 14 days by default) containing all of the information necessary for DSPAM to relearn the message. This is necessary in order to relearn the *exact* message DSPAM originally processed.


NOTE:
If you are using an IMAP based system, Web-based email, or other form of email management where the original messages are stored on the server in pristine format, you can turn this signature feature off by setting "TrainPristine on" in dspam.conf. DSPAM will then use the message itself that you provide it to train, which MUST be identical to the original message in order to retrain properly.


Because DSPAM learns each user's specific email behavior, it's necessary to identify the user in order to program their specific filtering database. This can be done in one of three ways:


The Simple Way

If you are using the MySQL or PgSQL storage drivers, the original numeric user id can be embedded in the signature, requiring only one central spam alias to be necessary for the entire system. To configure this, uncomment the appropriate UIDInSignature option in dspam.conf:

# MySQLUIDInSignature    on
# PgSQLUIDInSignature    on  


Now all you'll need is a single system-wide alias, and DSPAM will train the appropriate user when it sees the signature. An example of an alias might look like:

spam:"|/usr/local/bin/dspam --user root --class=spam --source=error"


Similarly, you may also wish to have a false-positive alias for users who prefer to tag spam rather than quarantine it:

notspam:"|/usr/local/bin/dspam --user root --class=innocent --source=error"


NOTE:
The 'root' user represents any active dspam user. It is necessary to supply a username on the commandline or DSPAM will bail on an error; however, the user will be changed internally once the signature is read.


The Kind-of-Simple Way

If you're not using one of the above storage drivers, the next easiest way to configure aliases is to have DSPAM parse the 'To:' header of the message and use a catch-all subdomain to direct all mail into DSPAM for retraining. You can then instruct your users to email addresses like 'spam-bob@relearn.domain.tld'. The ParseToHeaders option (available in dspam.conf) will parse the To: header of forwarded messages and set the username to either 'bob' or 'bob@domain.tld', depending on how it is configured. DSPAM can also set the training mode to either "learn spam" or "learn notspam" depending on whether the user specified a spam- or notspam- address in the To: header.


This is ideal if you don't want to set up a separate alias for each user on your system (The Hard Way). If you're fortunate enough to have a mail server that can perform regular expression matching, you can set up your system without a subdomain, and just use addresses like spam-bob@domain.tld. For the rest of us, it will be necessary to set up a subdomain catch-all directly into DSPAM. For example:

@relearn.domain.tld	"|/usr/local/bin/dspam"


Don't forget to set the appropriate ParseToHeaders and related options in dspam.conf as well. More specific instructions can be found in dspam.conf itself. In most cases, the following will suffice:

ParseToHeaders on
ChangeUserOnParse user
ChangeModeOnParse on
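
To show what these options amount to, the sketch below derives a training class and a username from a retraining address. It illustrates the idea only and is not DSPAM's parser: the function name parse_retrain_address and the sample address are assumptions, and it mimics the "ChangeUserOnParse user" case, where only the part before the "@" is kept.

```shell
# Sketch: derive the training class and the DSPAM user from a
# retraining address with a spam- or notspam- prefix.
# Prints "CLASS USER", or fails for an address with neither prefix.
parse_retrain_address() {
    local_part=${1%%@*}
    case $local_part in
        notspam-*) echo "innocent ${local_part#notspam-}" ;;
        spam-*)    echo "spam ${local_part#spam-}" ;;
        *)         return 1 ;;
    esac
}

parse_retrain_address spam-bob@relearn.domain.tld       # prints: spam bob
parse_retrain_address notspam-bob@relearn.domain.tld    # prints: innocent bob
```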


The Old Way (A.K.A. The Hard Way)

If neither of the easy ways is possible, you're stuck with doing it the hard way. This means you'll need a separate spam alias (and notspam alias, if users are tagging mail) for each user. To do this, you will need to create an email address for each user, so that DSPAM can analyze and learn for that specific user. For example:

spam-bob: "|/usr/local/bin/dspam --user bob --class=spam --source=error"

You will end up having one alias per mail user on the system, two if you do not use DSPAM's CGI quarantine (an additional one using notspam-). Be sure the aliases are unique and each username matches the name after the --user flag. A tool has been provided called dspam_genaliases. This tool will read the /etc/passwd file and write out a dspam aliases file that can be included in your master aliases table.
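
If you would rather see what such an aliases file looks like before using dspam_genaliases, the following sketch prints a spam and a notspam alias for each username given on the command line. It is not a replacement for dspam_genaliases; the function name gen_aliases and the sample usernames are hypothetical, and the binary path is taken from the examples above.

```shell
# Sketch: print a spam and a notspam alias for each username given.
# DSPAM_BIN defaults to the path used in the examples above.
DSPAM_BIN=${DSPAM_BIN:-/usr/local/bin/dspam}

gen_aliases() {
    for user in "$@"; do
        printf 'spam-%s: "|%s --user %s --class=spam --source=error"\n' \
            "$user" "$DSPAM_BIN" "$user"
        printf 'notspam-%s: "|%s --user %s --class=innocent --source=error"\n' \
            "$user" "$DSPAM_BIN" "$user"
    done
}

gen_aliases bob alice
```

The output can be redirected to a file and included in your master aliases table after review.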


To report spam, the user should be instructed to forward each spam to spam-user@yourhost.


It doesn't really matter what you name these aliases, so long as the flags being passed to dspam are correct for each user. It might be a good idea to create an alias custom to your network, so that spammers don't forward spam into it. For example, notspam-yourcompany-bob or something.


NOTE about Security:

You might be wondering if a user can forward a spam to another user's address, or whether a spammer can forward a spam to another user's notspam address. The answer is "no". The key to all mail-based retraining is the signature embedded in each email. The signature is stored with each user's own user id, and so not only does the incoming message have to bear a valid signature, but it also has to be stored on the system with the correct user id. This prevents any kind of alias abuse.


NIGHTLY MAINTENANCE AND HOUSEKEEPING CRONS



Non-SQL Based Nightly Purge

If you are NOT running a SQL-based solution, then you should configure dspam_clean to run under cron nightly. This clean tool will read all signature databases and purge signatures that are older than 14 days (configurable), purge abandoned tokens, and remove unimportant tokens. Without this tool, old signatures will continue to pile up. Be sure the user running cleanup has full read/write permissions on the DSPAM data files.

0 0 * * * /usr/local/bin/dspam_clean [options]

See the dspam_clean description for more information


SQL-Based Nightly Purge

SQL-Based solutions include a nightly SQL script to perform the same basic tasks as dspam_clean, and it does it much faster and with more finesse. You can find instructions about each driver's purge functions in the driver's README (doc/[driver].txt) for performing nightly maintenance. Most SQL drivers will include a purge script in the src/tools.[driver] directory. For example:

0 0 * * * mysql --user=[user] --pass=[pass] [db] < /path/to/purge-4.1.sql


Log Rotation

The system log and user logs can fill up fairly quickly, when all that's really needed to generate graphs are the last two to three weeks of data. You can configure a nightly log cleanup using dspam_logrotate:

0 0 * * * dspam_logrotate -a 30 -d /usr/local/var/dspam/data


NOTIFICATIONS


DSPAM is capable of sending three different notifications to users:

  • A "First Run" message sent to each user when they receive their first message through DSPAM.
  • A "First Spam" message sent to each user when they receive their first spam
  • A "Quarantine Full" message sent to each user when their quarantine box is > 2MB in size.

These notifications can be activated by copying the txt/ directory from the distribution into DSPAM's home (by default /usr/local/var/dspam). You will want to modify these templates prior to installing them to reflect the correct email addresses and URLs (look for 'configureme' and 'yourdomain').
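
Before copying the templates into place, it can be worth checking that no placeholder text has been left behind. The sketch below lists files under a directory that still contain either placeholder string; the function name find_placeholders is hypothetical, and it relies on the recursive -r option, which GNU and BSD grep provide but POSIX does not require.

```shell
# Sketch: list template files that still contain the placeholder
# strings mentioned above. Prints nothing (and returns non-zero)
# when every file has been edited.
find_placeholders() {
    grep -rlE 'configureme|yourdomain' "$1"
}

# Usage, run from the top of the distribution:
#   find_placeholders txt
```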


NOTE:
The quarantine warning is reset when the user clicks 'Delete All', but is not reset if they use "Delete Selected". If the user doesn't wish to receive reminders, they should use the "Delete Selected" function instead of "Delete All".

You'll need to also set "Notifications" to "on" in dspam.conf.


THE WEB UI


The Web UI (CGI client) can be run from any executable location on a web server, and detects its user's identity from the REMOTE_USER environment variable. This means you'll need to use HTTP password authentication to access the CGI (any type of authentication will work, so long as Apache supports the module). This is also convenient in that you can set up authentication using almost any existing system you have. The only catch is that you'll need the usernames to match the actual DSPAM usernames used on the system. A copy of the shadow password file will suffice for most common installs.


The accompanying files in the webui/ folder should be copied into your document root and cgi-bin, as specified.


NOTE:
Some authentication mechanisms are case insensitive and will authenticate the user regardless of the case they type it in. DSPAM, on the other hand, is case sensitive and the case of the username used will need to match the case on the system. If you suffer from this authentication problem, and are certain all of your users' usernames are in lowercase, you can add the following line of code to the CGI right after the call to &ReadParse...

$ENV{'REMOTE_USER'} = lc($ENV{'REMOTE_USER'});

The CGI will need to function in the same group as the dspam agent in order to work with the files in dspam_home. The best way to do this is to create a separate virtualhost specifically for the CGI and assign it to run in the MTA group using Apache's suexec. If you are using procmail, additional configuration may also be necessary (see below).


NOTE:
Apache users do NOT take on the identity of the groups specified in /etc/group so you will need to specifically assign the group in httpd.conf.


NOTE about Procmail:
Because the DSPAM Web UI is a CGI script, DSPAM will not retain its setuid privileges when called. If you are running procmail, this will become a problem as procmail requires root privileges to deliver. The easiest hack around this is to create a procmail.dspam binary and make it setuid root, then make it executable only by the mail group (or whatever group DSPAM and the CGI run in).


The DSPAM Web UI has a minimal configuration inside the configure.pl script. You'll want to check and make sure all of the settings are correct. In most cases, the only settings that will be necessary to change are the large-scale or domain-scale flags.


BEFORE PROCEEDING

Check and make sure (Again) that the CGI user from Apache's httpd.conf is added as a trusted user in dspam.conf.


Default Preferences

Now would be a good time to set the system's default preferences. This can be done using the dspam_admin tool. For example:

dspam_admin ch pref default trainingMode TEFT
dspam_admin ch pref default spamAction quarantine
dspam_admin ch pref default spamSubject "[SPAM]"
dspam_admin ch pref default enableWhitelist on
dspam_admin ch pref default showFactors off

The default preferences are used for any users who have not yet set their own preferences. You can also control which preferences the user may override by changing the "AllowOverride" settings in dspam.conf.


By default, the parameters specified on the commandline will be used (if any). If, however, preferences are found for the particular user, those preferences will override the commandline.


GD Graphing Library

If you plan on leaving DSPAM's logging function enabled, and would like to produce pretty graphs for your users, the graph.cgi script requires the following be installed on your machine:


GD Graphics Library (http://www.boutell.com/gd/), compiled with PNG support

The following Perl modules (http://www.perl.com/CPAN/modules/by-module/GD/):

  • GD
  • GD-Graph3d
  • GDGraph
  • GDTextUtil
  • CGI

Typically this can be accomplished on the commandline:

perl -MCPAN -e 'install GD::Graph3d'


Configuring Administrators

Once you've configured the Web UI, you'll want to edit the 'admins' file to contain a list of users who are permitted to use the administration suite.


Opt-In/Out

If you would like your users to be able to opt in/out of DSPAM filtering, add the correct option to the nav_preferences.html template, depending on your configuration (for example, if you have an opt-in system, you'll want to add the opt-in option).

NOTE:
This currently only works with the preferences extension, and not drop files.

<INPUT TYPE=CHECKBOX NAME=optIn $C_OPTIN$>
Opt into DSPAM filtering
<INPUT TYPE=CHECKBOX NAME=optOut $C_OPTOUT$>
Opt out of DSPAM filtering


TESTING


If you've installed from an RPM, there's a good chance that the packager went to the trouble of testing already. If you're building from sources, however, you'll need to find a way to ensure your configuration isn't broken.


Most software packages are supplied with a test suite to determine if the software is functioning properly. Since DSPAM's correct function relies primarily on having the correct permissions and mail server configuration, a test script fails to provide the level of testing required for such a package. The following exercise has been provided to test dspam's correct functioning on your system. This exercise does not test the Web UI, but only the core dspam agent.


Before running the test, you should have completed section 1.1's instructions for compiling and installing dspam as well as configured your mail server to support dspam.


1. Create a new user account on your system

It is important that this be a new account to prevent any unrelated email from being delivered during testing. Be sure to configure a spam alias for the test account.


2. Send a short email

Send a short email (10 words or less) to the account, and pick it up using your favorite mail client.


3. Run dspam_stats

dspam_stats [username]

You should see a value of 1 for "TN" (true negatives, i.e. innocent messages correctly identified) as shown below:

 dspam-test            0 TP       1 TN       0 FN       0 FP

If you receive an error such as "unable to open /usr/local/var/dspam... for reading", then the dspam agent is not configured correctly. The problem could exist in either your mail server configuration or one or more of the permissions on the directory or agent. Check your configuration and permissions, and repeat this step until the correct results are experienced.
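If you want to check the counters from a script rather than by eye, the line can be split into its fields. The following is a minimal Python sketch, assuming the output has the layout shown above (a username followed by number/label pairs); the function name parse_stats_line is hypothetical and not part of DSPAM.

```python
import re

# Matches pairs such as "0 TP" or "1 TN" in a dspam_stats output line.
_COUNTER = re.compile(r"(\d+)\s+(TP|TN|FN|FP)\b")


def parse_stats_line(line):
    """Split one dspam_stats line into (username, {counter: value}).

    Assumes the layout shown in this guide; other DSPAM versions may
    print additional columns, which this sketch simply ignores.
    """
    fields = line.split(None, 1)
    if len(fields) != 2:
        raise ValueError("not a dspam_stats line: %r" % line)
    username, rest = fields
    counters = {name: int(value) for value, name in _COUNTER.findall(rest)}
    return username, counters


user, counters = parse_stats_line(
    " dspam-test            0 TP       1 TN       0 FN       0 FP"
)
print(user, counters)
```

After step 3 the TN counter should be 1 and the other three counters 0.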


4. Run dspam_dump

dspam_dump [username]

This will get a complete list of tokens and their statistics. Each token should have an I: (innocent) hit count of 1. The tokens will be represented as 64-bit values, for example:

3126549390380922317              S:    0  I:    1  LH: Mon Aug  4 11:40:12 2003
13884833415944681423             S:    0  I:    1  LH: Mon Aug  4 11:40:12 2003
14519792632472852948             S:    0  I:    1  LH: Mon Aug  4 11:40:12 2003
8851970219880318167              S:    0  I:    1  LH: Mon Aug  4 11:40:12 2003

To view statistics for a particular token, run:

dspam_dump [username] [token]

Where token is the plain-text token value. For example:

% dspam_dump bill FREE
7717766825815048192  S: 00265  I: 00068  P: 0.7358


5. Forward the test message

Forward the test message to the spam alias you've created for the test account. Provide enough time for the message to have processed.


6. Run dspam_stats again

dspam_stats [username]

Now, the value for TN should be zero and the value for FN (false negatives) should be 1 as shown below:

dspam-test            0 TP       0 TN       1 FN       0 FP

If this is not the case, check the group permissions of the dspam agent as well as the permissions your MTA uses when piping to aliases.


7. Run dspam_dump [username] again

dspam_dump [username]

Make sure that EVERY token now has an I: of zero and an S: of 1:

3126549390380922317              S:    1  I:    0  LH: Mon Aug  4 11:44:29 2003
13884833415944681423             S:    1  I:    0  LH: Mon Aug  4 11:44:29 2003
14519792632472852948             S:    1  I:    0  LH: Mon Aug  4 11:44:29 2003
8851970219880318167              S:    1  I:    0  LH: Mon Aug  4 11:44:29 2003

If you have some tokens that do not have an S: of 1 or an I: of 0, the dspam signature was not found on the email, and this could be due to a lot of things.
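With more than a handful of tokens, checking every line by hand is tedious. The sketch below, written in Python, scans dspam_dump output in the format shown above and lists any token whose counts are not exactly S: 1 and I: 0. The function name tokens_not_retrained is hypothetical, and the regular expression assumes the column layout of the sample output.

```python
import re

# Token ID, then "S: <n>", then "I: <n>", as in the dspam_dump sample above.
_DUMP_LINE = re.compile(r"^\s*(\d+)\s+S:\s*(\d+)\s+I:\s*(\d+)")


def tokens_not_retrained(dump_text):
    """Return the token IDs whose counts are not exactly S: 1, I: 0."""
    bad = []
    for line in dump_text.splitlines():
        match = _DUMP_LINE.match(line)
        if match is None:
            continue  # skip blank lines or anything that is not a token row
        token = match.group(1)
        spam_hits = int(match.group(2))
        innocent_hits = int(match.group(3))
        if (spam_hits, innocent_hits) != (1, 0):
            bad.append(token)
    return bad


sample = """\
3126549390380922317              S:    1  I:    0  LH: Mon Aug  4 11:44:29 2003
13884833415944681423             S:    0  I:    1  LH: Mon Aug  4 11:40:12 2003
"""
print(tokens_not_retrained(sample))
```

An empty list means every token was retrained as spam.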


TROUBLESHOOTING


Problem: No files are being created in the user directory

Solution: Check the permissions of the user directory. It must be writable both by the user the dspam agent runs as and by the CGI user.



Problem: False positives are never being delivered

Solution: Your CGI most likely doesn't have the privileges required by the LDA to deliver the messages. Make sure the CGI user is in the correct group. Also consider setting the dspam agent to setuid or setgid with the correct permissions.



Problem: My database is getting huge!

Solution: DSPAM's default training mode is TEFT. On top of this, the purging defaults are very lax. You might consider switching to TOE (Train-on-Error) mode training if you require a minimal database. If you are willing to sacrifice accuracy for disk space, disabling the 'chain' tokenizer in dspam.conf will prevent the use of multi-word (chained) tokens, which will also cut your database size considerably. You may also consider more frequent calls to dspam_clean -p to purge neutral data, which comprises a majority of most databases.


For more help, please see the DSPAM FAQ at http://dspam.sourceforge.net.


DSPAM TOOLS


A few useful tools have been provided to make DSPAM management a bit easier. These tools include:

dspam_admin

A tool used to perform specific administrative functions. These functions are usually included as part of an extensions package (such as the preferences extension). Available functions are listed in the tool's usage output.


dspam_train

Used to train and test a corpus of ham and spam (in maildir format).

Syntax: dspam_train [username] [spam_dir] [nonspam_dir]

where username is the username of the user to apply the training to, and the two dirs represent directories containing messages in individual files (e.g. maildir/corpus format). dspam_train can be used on an existing user's database, to further improve accuracy, or to train from scratch. It also provides a solid test jig for testing the efficiency and accuracy of a test corpus against the filter.

NOTE:
dspam_train will automatically balance training of the corpus to ensure both spam and nonspam are trained based on the ratio of spam/nonspam. This means that if you have twice as much spam as nonspam, two spam messages will be trained for every nonspam message.
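To illustrate what ratio-based balancing means, here is a small Python sketch that interleaves two message lists in proportion to their sizes. It is only an illustration of the idea; the actual ordering used by dspam_train may differ, and the function name balanced_order is hypothetical.

```python
def balanced_order(spam, nonspam):
    """Interleave two message lists in proportion to their sizes.

    Illustrative only: with twice as much spam as nonspam, the result
    contains two spam entries for every nonspam entry.
    """
    order = []
    s = n = 0
    while s < len(spam) or n < len(nonspam):
        # Spam is "behind" when the fraction of spam consumed so far is
        # no larger than the fraction of nonspam consumed so far.
        spam_behind = s * len(nonspam) <= n * len(spam)
        if s < len(spam) and (spam_behind or n >= len(nonspam)):
            order.append(("spam", spam[s]))
            s += 1
        else:
            order.append(("nonspam", nonspam[n]))
            n += 1
    return order


for kind, message in balanced_order(["s1", "s2", "s3", "s4"], ["n1", "n2"]):
    print(kind, message)
```

Both corpora are exhausted at roughly the same time, so neither class dominates the early part of training.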


dspam_dump

Dumps a DSPAM dictionary. This can be used to view the entire contents of a user's dictionary, or used in combination with grep to view a subset of data.

Syntax: dspam_dump [username] [token]

where username is the DSPAM user's username. If a token is specified, statistics only for that token will be printed.


dspam_clean

Performs nightly housecleaning by deleting old or useless data from user data. dspam_clean performs the following operations:

1. Using the -s flag, dspam_clean will continue to perform stale signature purging. If an age is specified, for example -s14, the age defined as the default will be overridden. Specifying an age of 0 will delete all signatures for the users processed.

2. Using the -p flag, dspam_clean will delete all tokens from a user's database whose probability is between 0.35 and 0.65 (fairly neutral, useless tokens) that fall beyond the default age. If an age is specified, for example -p30, the age defined as the default will be overridden. It is a good idea to use this type of clean with an age of 0 on users after a lot of corpus training.

3. Using the -u flag, dspam_clean will delete all unused tokens from a user's database. There are four different types of unused tokens:

- Tokens which have not been used for a long time
- Tokens which have a total hit count below 5
- Tokens which have only one spam hit
- Tokens which have only one innocent hit

Ages may be overridden by specifying a format such as -u30,15,10,10 where each number represents the respective age. Specifying an age of zero will delete all unused tokens in the category. Defaults are set in dspam.conf.

Optionally, usernames may be specified to override the default behavior of processing all users.
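The -u age list and the neutral probability band can be made concrete with a short Python sketch. It maps the four comma-separated ages to the four unused-token categories in the order listed above, and tests whether a probability falls in the 0.35 to 0.65 band. The category labels and function names are hypothetical, and treating the band as inclusive at both ends is an assumption.

```python
# Labels for the four unused-token categories, in the order listed above.
# These names are made up for illustration and do not appear in DSPAM.
UNUSED_CATEGORIES = (
    "long_unused",
    "low_total_hits",
    "single_spam_hit",
    "single_innocent_hit",
)


def parse_unused_ages(flag):
    """Turn a flag such as '-u30,15,10,10' into {category: age}.

    A bare '-u' returns an empty dict, meaning the defaults from
    dspam.conf apply.
    """
    if not flag.startswith("-u"):
        raise ValueError("expected a -u flag, got %r" % flag)
    body = flag[2:]
    if not body:
        return {}
    ages = [int(part) for part in body.split(",")]
    if len(ages) != len(UNUSED_CATEGORIES):
        raise ValueError("expected four comma-separated ages")
    return dict(zip(UNUSED_CATEGORIES, ages))


def is_neutral(probability, low=0.35, high=0.65):
    """True if a token probability lies in the band purged by -p.

    Inclusive bounds are an assumption made for this sketch.
    """
    return low <= probability <= high


print(parse_unused_ages("-u30,15,10,10"))
print(is_neutral(0.5), is_neutral(0.7358))
```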


Examples:

Process all users on the system using all clean operations:

dspam_clean -s -p15 -u90,30,15,15


Delete all of user 'dick' and 'jane's signatures:

dspam_clean -s0 dick jane


Perform a post-corpus training clean on user 'spot':

dspam_clean -p0 -u0,0,0,0 spot


Run dspam_clean with all default options, all clean modes enabled, on all users on the system:

dspam_clean -s -p -u

NOTE:
You may wish to only run certain cleaning modes depending on the type of storage driver you are using. For example, the MySQL storage driver includes a script which performs signature and unused token operations, leaving only probability operations as useful. If you are using a SQL-based storage driver, it is strongly recommended that you use the maintenance scripts wherever possible for optimum efficiency.


dspam_stats

Displays the spam statistics for one or all users on the system.

Syntax: dspam_stats [username]

If no username is provided, all users will be displayed. Displays TP (true positives), TN (true negatives), FN (false negatives), and FP (false positives).


dspam_genaliases

Reads the /etc/passwd file and outputs a dspam aliases table which can be included in the master aliases table. You may try Art Sackett's generate_dspam_aliases tool at http://www.artsackett.com/freebies/generate_dspam_aliases/ if you need some better functionality. This will eventually be merged in as a replacement for the existing tool.


dspam_merge

Merges multiple users' dictionaries together into one user's dictionary (does not affect the users being merged). This can be used to create a seeded dictionary for a new user, or to copy a single user's dictionary to a new file. This is great for building global dictionaries, but consumes a lot of time and disk space.


AGENT COMMANDLINE ARGUMENTS



Specifying a User

The DSPAM agent (dspam) recognizes the following commandline arguments:

--user [user1 user2 ... userN]

Specifies the destination user(s) of the incoming message. DSPAM then processes the message once for each user individually. If the message is to be delivered, the $u (or %u) parameters of the arguments string will be interpolated for the current user being processed.
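The per-user interpolation can be pictured with the following Python sketch, which expands $u and %u in an argument string once for each destination user. The argument string in the example is a made-up placeholder, not a recommended delivery agent setting, and the function names are hypothetical.

```python
def interpolate_user(args, user):
    """Replace $u and %u in a delivery argument string with the user."""
    return args.replace("$u", user).replace("%u", user)


def per_user_commands(args, users):
    """Expand the argument string once per destination user."""
    return [interpolate_user(args, user) for user in users]


# "deliver -d %u" is a placeholder argument string used for illustration.
for command in per_user_commands("deliver -d %u", ["bill", "jane"]):
    print(command)
```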


Classification

--class=[spam|innocent]

Tells DSPAM that the message being presented has already been classified by the user. This flag should be used when a misclassification has occurred, when the user is corpus-feeding a message, or when an inoculation is being presented. This flag must be used in conjunction with the --source flag. Providing no classification invokes DSPAM's standard operating procedure, which is to determine the message's nature on its own.


Source

--source=[error|corpus|inoculation]

Wherever --class is used, the source of the user-provided classification must also be provided. The source is very important and dramatically affects DSPAM's training behavior:


error:
The message being presented was a message previously misclassified by DSPAM. When 'error' is provided as a source, DSPAM requires that the DSPAM signature be present in the message, and will use the signature to recall the original training metadata. If the signature is not present, the message will be rejected. In this source mode, DSPAM will also decrement each token's previous classification's count as well as the user totals.

You should use error only when DSPAM has made an error in classifying the message, and should present the modified version of the message with the DSPAM signature when doing so.


corpus:
The message being presented is from a mail corpus, and should be trained as a new message, rather than re-trained based on a signature. The message's full headers and body will be analyzed and the correct classification will be incremented, without its opposite being decremented.

You should use corpus only when feeding messages in from corpus, not for correcting errors.


inoculation:
The message being presented is in pristine form, and should be trained as an inoculation. Inoculations are a more intense mode of training designed to cause DSPAM to train the user's metadata repeatedly on previously unknown tokens, in an attempt to vaccinate the user from future messages similar to the one being presented.

You should use inoculation only on honeypots and the like.
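A front-end script that calls the agent for retraining can enforce the rule that --class always travels with --source. The Python sketch below only builds the argument list from the flags described above and does not run anything; the function name build_training_args is hypothetical, and the message itself would still have to be supplied to the agent separately.

```python
VALID_CLASSES = {"spam", "innocent"}
VALID_SOURCES = {"error", "corpus", "inoculation"}


def build_training_args(user, classification, source=None):
    """Build a dspam argument list for a user-classified message.

    Raises ValueError if --class would be used without a valid --source.
    """
    if classification not in VALID_CLASSES:
        raise ValueError("unknown class: %r" % classification)
    if source is None:
        raise ValueError("--class must be accompanied by --source")
    if source not in VALID_SOURCES:
        raise ValueError("unknown source: %r" % source)
    return [
        "dspam",
        "--user", user,
        "--class=" + classification,
        "--source=" + source,
    ]


# Report a missed spam for user 'bill' (the signature must be in the message).
print(build_training_args("bill", "spam", "error"))
```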


Delivery

--deliver=[innocent,spam]

Tells DSPAM to deliver the message if its result falls within the criteria specified. For example, --deliver=innocent will cause DSPAM to only deliver the message if it classifies as innocent. Providing --deliver=innocent,spam will cause DSPAM to deliver the message regardless of its classification. This flag provides a significant amount of flexibility for nonstandard implementations, for example where false positives may not be delivered but spam is.
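The delivery rule amounts to a set membership test, which this Python sketch spells out. It models only the decision described above, not DSPAM's internal code; the function name should_deliver is hypothetical.

```python
def should_deliver(result, deliver):
    """Decide delivery from a classification and a --deliver value.

    result  -- "innocent" or "spam", the classification of the message
    deliver -- the value given to --deliver, e.g. "innocent,spam"
    """
    allowed = {part.strip() for part in deliver.split(",") if part.strip()}
    return result in allowed


print(should_deliver("spam", "innocent"))
print(should_deliver("spam", "innocent,spam"))
```

With --deliver=innocent a message classified as spam is held back, while --deliver=innocent,spam lets everything through.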


--stdout

If the message is indeed deemed "deliverable" by the --deliver flag, this flag will cause DSPAM to deliver the message to stdout, rather than the configured delivery agent.


--process

Tells DSPAM to process the message. This is the default behavior, and the flag is implied unless --classify is used, but it is a good idea to specify it explicitly to avoid ambiguity.


--classify

Tells DSPAM only to classify the message, and not make any writes to the user's metadata or attempt to deliver/quarantine the message.


NOTE:
The output of the classification is specific to the user, not including the output of any groups they might be affiliated with, so it is entirely possible that the message would be caught as spam by the group, even if it didn't appear in the classification. If you want to get the classification for the GROUP, use the group name as the user instead of an individual.


Signatures

--signature=[signature]

For some implementations, the admin may wish to pass the signature in via commandline instead of allowing DSPAM to find it on its own. This is especially useful when front-ending the agent with other tools. Using this option will set the active signature and will also forego reading of stdin.


Training Modes

--mode=[toe|tum|teft|notrain|unlearn]

Configures the training mode to be used for this process:


TEFT

Train-Everything. Trains on all messages processed. This is a very thorough training approach and should be considered the standard training approach for most users. TEFT may, however, prove too volatile on installations with extremely high per-user traffic, or prove not very scalable on systems with extremely large user-bases. In the event that TEFT is proving ineffective, one of the other modes is recommended.

NOTE:
Until a user reaches 100 innocent messages in their metadata, train-on-error will also be teft-based, even if otherwise specified on the commandline.


TOE

Train-on-Error. Trains only on a classification error, once the user's metadata has matured to 2500 innocent messages. This training mode is much less resource intensive, as only occasional metadata writes are necessary. It is also far less volatile than the TEFT mode of training. One drawback, however, is that TOE only learns when DSPAM has made a mistake - which means the data is sometimes too static, and unable to "ease into" a different type of behavior.


TUM

Train-until-Mature. This training mode is a hybrid between the other two training modes and provides a great balance between volatility and static metadata. TuM will train on a per-token basis only tokens which have had fewer than 50 "hits" on them, unless an error is being retrained in which case all tokens are trained. This training mode provides a solid core of stable tokens to keep accuracy consistent, but also allows for dynamic adaptation to any new types of email behavior a user might be experiencing. It is a balance of resources as well, as only less-than-mature tokens are written to the database. NOTE: You should corpus train before using tum.
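The per-token rule of TuM can be sketched in a few lines of Python. The sketch assumes that a token's "hits" is a single count per token (for instance spam hits plus innocent hits, which is an assumption), and the names MATURITY_HITS and tokens_to_train are hypothetical.

```python
MATURITY_HITS = 50  # tokens with fewer hits than this are still trained


def tokens_to_train(token_hits, retraining_error=False):
    """Select the tokens TuM would write to the database.

    token_hits       -- {token: hit count so far}
    retraining_error -- True when a misclassification is being corrected,
                        in which case every token is trained
    """
    if retraining_error:
        return set(token_hits)
    return {token for token, hits in token_hits.items() if hits < MATURITY_HITS}


hits = {"free": 333, "meeting": 12, "viagra": 49, "hello": 50}
print(sorted(tokens_to_train(hits)))
print(sorted(tokens_to_train(hits, retraining_error=True)))
```

Mature tokens such as "free" are left untouched during normal processing, which is where the reduction in database writes comes from.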


NOTRAIN

No training. Do not train the user's data, and do not keep totals. This should only be used in cases where you want to process mail for a particular user (based on a group, for example), but don't want the user to accumulate any learning data.


UNLEARN

Unlearn original training. Use this if you wish to unlearn a previously learned message. Be sure to specify --source=error and --class to whatever the original classification the message was learned under. If not using TrainPristine, this will require the original signature from training.


RECOMMENDATIONS:
In general, it is recommended that users begin with TEFT. If a user is experiencing a spam ratio between 75% and 85%, they may benefit from Train-until-Mature mode. If a user is experiencing over 90% spam, then Train-on-Error mode should make a noticeable improvement in accuracy. It eventually boils down to what works best for your users. There is no reason a system could not be configured (with a script) to analyze a user's *.stats file and determine the best training mode for that user.
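Such a script could apply the thresholds above directly. The Python sketch below takes a user's spam and innocent message counts (how these are read from the stats file is left out, since the file layout is not described here) and returns a mode name. Falling back to TEFT for ratios the text does not cover, such as 85% to 90%, is an assumption, and the function name recommend_mode is hypothetical.

```python
def recommend_mode(spam_count, innocent_count):
    """Suggest a training mode from a user's message counts.

    Follows the rule of thumb in this section: over 90% spam suggests
    toe, 75% to 85% suggests tum, anything else stays on teft.
    """
    total = spam_count + innocent_count
    if total == 0:
        return "teft"  # no history yet, start with the recommended default
    ratio = spam_count / total
    if ratio > 0.90:
        return "toe"
    if 0.75 <= ratio <= 0.85:
        return "tum"
    return "teft"


print(recommend_mode(80, 20))
print(recommend_mode(95, 5))
print(recommend_mode(50, 50))
```

The returned value matches the lowercase names accepted by --mode.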


Features

--feature=[no,wh,tb=N]

Specifies the features that should be activated for this filter instance. The following features may be used individually or combined using a comma as a delimiter:


no:

Bayesian Noise Reduction (BNR). Bayesian Noise Reduction kicks in at 2500 innocent messages and provides an advanced progressive noise logic to reduce Bayesian Noise (wordlist attacks) in spams. See http://bnr.nuclearelephant.com for more information. BNR is not for everyone, and so users should try it out after they've trained to see if it helps improve accuracy.


tb=N:

Sets the training loop buffering level. Training loop buffering is the amount of statistical sedation performed to water down statistics and avoid false positives during the user's training loop. The training buffer sets the buffer sensitivity, and should be a number between 0 (no buffering whatsoever) and 10 (heavy buffering). The default is 5, half of what previous versions of DSPAM used. To avoid dulling down statistics at all during the training loop, set this to 0. This feature should be disabled if you're not paranoid about false positives, as it does increase the number of spam misses significantly during training.


wh:

Automatic whitelisting. DSPAM will keep track of the entire "From:" line for each message received per user, and automatically whitelist messages from senders with more than 10 innocent messages and zero spams. Once the user reports a spam from the sender, automatic whitelisting will automatically be deactivated for that sender. Since DSPAM uses the entire "From:" line, and not just the sender's email address, automatic whitelisting is a very safe approach to improving accuracy during initial training.
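The whitelisting rule can be modelled as two counters per "From:" line. The following Python sketch illustrates the rule as described above (more than 10 innocent messages and zero spams); it is not DSPAM's implementation, and the class name AutoWhitelist is hypothetical.

```python
class AutoWhitelist:
    """Track per-sender counts keyed on the entire From: line."""

    THRESHOLD = 10  # a sender needs more than this many innocent messages

    def __init__(self):
        self.innocent = {}
        self.spam = {}

    def record(self, from_line, is_spam):
        """Count one message received from this From: line."""
        counts = self.spam if is_spam else self.innocent
        counts[from_line] = counts.get(from_line, 0) + 1

    def is_whitelisted(self, from_line):
        """True once the sender has enough innocent mail and no spam."""
        return (
            self.innocent.get(from_line, 0) > self.THRESHOLD
            and self.spam.get(from_line, 0) == 0
        )


wl = AutoWhitelist()
sender = "Bill <bill@example.com>"
for _ in range(11):
    wl.record(sender, is_spam=False)
print(wl.is_whitelisted(sender))
wl.record(sender, is_spam=True)
print(wl.is_whitelisted(sender))
```

Because the key is the full "From:" line, a message with the same address but a different display name is counted as a different sender.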


NOTE:
None of the present features are necessary when the source is "error", because the original training data is used from the signature to retrain, instantiating whatever features (such as whitelisting) were active at the time of the initial classification. Since BNR is only necessary when a message is being classified, the --feature flag can be safely omitted from error source calls.


Daemon Mode

--daemon

Puts DSPAM in daemon mode; i.e. DSPAM acts like a server when started with this parameter. See section 2.3 for more information about daemon mode.