Choosing which messages to archive

From Messaging Server Technical Reference Wiki

The choice of which messages to archive is a critical one for sites, especially when the archiving is for compliance purposes. This choice has three components: (1) choosing whose or which types of messages to archive, (2) choosing in what form and at what stage(s) of processing and transiting the MTA the messages should be captured for archiving, and (3) choosing whether Message Store IMAP APPEND operations (moving a message to a folder) should cause archiving. Of these three questions, the first (whose or which types of messages to archive) is usually well specified. The third question (whether to archive due to Message Store operations) also tends to be straightforward to decide. However, the second question may require additional consideration.

Between initial message submission and eventual final delivery into a mailbox, while transiting the MTA, messages undergo various transformations, some trivial and some potentially dramatic. Such transformations can include: addition of Received: header lines, addition of other header lines (for instance, missing-but-required header lines such as Date:, spam filtering header lines, or mailing list header lines), transformations ("address reversal") of addresses in header lines, alias or list expansion changing the currently active set of envelope recipients, "split up" of a multi-recipient message into different copies for different subsets of recipients, addition of "disclaimer" text, changes in Content-transfer-encoding, document conversion processing, conversion to a different charset (CHARSET-CONVERSION), etc. Which of these transformations have already been applied to the archived copy depends on the stage at which the message is captured.

Three possible approaches for selecting which messages transiting the MTA are eligible for archiving are:

  1. Flow-based: Those messages passing through certain channels (such as channels delivering to the Message Store, or channels sending out to the Internet) should be archived.
  2. User-based: Those messages sent to or from certain users (perhaps all users; perhaps all users in certain domains; perhaps only some distinguished subset of users) should be archived.
  3. Content-based: Those messages containing certain content should be archived.
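
As a rough illustration of how the three selection approaches differ, the following Python sketch models each one as a predicate over a message at one point in transit. This is a conceptual model only, not MTA configuration: the `Transit` class, the set contents, and the `X-Archive-Required` label header are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from email.message import EmailMessage
from typing import List

# Illustrative values only; none of these come from a real configuration.
ARCHIVED_CHANNELS = {"ims-ms", "tcp_local"}
ARCHIVED_USERS = {"alice@example.com"}
ARCHIVE_LABEL = "X-Archive-Required"  # hypothetical label header


@dataclass
class Transit:
    """Hypothetical snapshot of one message being enqueued to a channel."""
    destination_channel: str
    envelope_from: str
    envelope_to: List[str]
    message: EmailMessage


def flow_based(t: Transit) -> bool:
    # Archive anything passing through one of the selected channels.
    return t.destination_channel in ARCHIVED_CHANNELS


def user_based(t: Transit) -> bool:
    # Archive anything sent to or from one of the selected users.
    addresses = [t.envelope_from, *t.envelope_to]
    return any(a.lower() in ARCHIVED_USERS for a in addresses)


def content_based(t: Transit) -> bool:
    # Archive anything that has already been labelled as needing archiving.
    return ARCHIVE_LABEL in t.message
```

Note that the content-based predicate assumes some earlier step has already labelled the message; the predicate itself only checks for the label.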

The three approaches above correspond, respectively, to the following techniques:

  1. For flow-based archiving, it would be typical to trigger archiving via channel *spamfilter* options (if using an archiving callout approach) or via channel Sieve filters (using a "capture" action in a channel Sieve filter located via a sourcefilter or destinationfilter, as relevant). Choice of the "correct" channels on which to trigger archiving is critical.
  2. For user-based archiving, it would be typical to trigger archiving via some user-level (or, new in MS 8.0, domain-level) LDAP attribute; see Capture triggered via LDAP attribute. Use of a class-of-service may be helpful in setting such an attribute on all users (or on large subsets of users). Note that when an LDAP attribute named by ldap_capture or (new in MS 8.0) ldap_domain_attr_capture is used, capture will occur at whatever channel stage a user alias is expanded (capturing messages to the user), as well as whenever address reversal occurs (capturing messages from the user). Since address reversal in particular normally occurs during every message enqueue, deployments involving multiple channel "hops" or multiple relay hosts may find multiple "copies" of messages (one "copy" per channel "hop") getting captured for archiving. Thus an alternative to such global use of an LDAP attribute is to use instead a Sieve filter "capture" action, perhaps consulting a Sieve external list (which may consist of consulting a user-level LDAP attribute). This technique of using a channel-specific Sieve filter that consults a Sieve external list allows more precisely timed (limited to specific channels) archiving that is still based on (provisioned via) LDAP attribute settings; see for instance Example Sieve external lists with properties.
  3. For content-based archiving, it is critical to detect and label which messages contain the sort of content that needs archiving. If users and user e-mail agents can be relied upon to label such content ab initio, when messages are first generated, that is one solution for labelling. Very simple, easy-to-detect content criteria may be codable into a Sieve script, for instance detecting certain MIME Content-type: labelling. More complex content detection, especially in cases of concerns about uncooperative users attempting to evade archiving requirements, may require special third-party scanning-and-detection software, a la spam/virus filter software. As usual, the preferred approach for integrating such third-party packages is via the MTA's spamfilter plug-in facility; if the third-party package does not support such callout use, then the second-best choice is to deploy the package "on the side" of the MTA using the usual aliasdetourhost/alternate conversion channel approach. In any case, once the messages are labelled in whatever way is chosen, the actual trigger for archiving can use a Sieve filter based "capture" triggered by presence of the relevant label.
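
The duplicate-capture concern described under user-based archiving can be made concrete with a small Python sketch. It is a simplified model, not a description of the MTA's internals: it assumes one capture per enqueue for a globally applied trigger (following the statement above that address reversal normally occurs during every enqueue), and the function name and channel path are hypothetical.

```python
from typing import Iterable, List, Optional


def archive_copies(enqueue_path: List[str],
                   capture_on: Optional[Iterable[str]] = None) -> List[str]:
    """Return the channels at which a copy of one message would be captured.

    enqueue_path: channels the message is enqueued to, in order (one per hop).
    capture_on:   None models a global trigger that fires at every enqueue;
                  otherwise, the set of channels whose filter performs capture.
    """
    if capture_on is None:
        # Global trigger: one captured copy per hop.
        return list(enqueue_path)
    allowed = set(capture_on)
    # Channel-specific trigger: capture only at the selected channels.
    return [channel for channel in enqueue_path if channel in allowed]


# Illustrative path: a message crossing three channels before delivery.
path = ["conversion", "reprocess", "ims-ms"]

global_copies = archive_copies(path)               # three copies, one per hop
scoped_copies = archive_copies(path, {"ims-ms"})   # a single copy at delivery
```

Under these assumptions, limiting capture to one well-chosen channel yields one archived copy per message, while the set of users being archived can still be provisioned through LDAP.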

See also: