Character set conversion



If the MTA's initial probe of the CHARSET-CONVERSION mapping table (which determines whether any character set conversion or message reformatting needs to be performed) finds that the message is to be reformatted, the MTA proceeds to check each part of the message. It locates any text parts and uses their character set parameters to generate the second probe. The MTA performs this second probe only after it has checked and found that conversions may be needed. The input string for the second probe looks like this:


IN-CHAN=in-channel;OUT-CHAN=out-channel;IN-CHARSET=in-char-set

The in-channel and out-channel are the same as before, and the in-char-set is the name of the character set associated with the particular part in question. (Note that the include_conversiontag MTA option, regardless of setting, has no effect on the form of this second probe of the CHARSET-CONVERSION mapping table.) If no match occurs for this second probe, no character set conversion is performed (although message reformatting, e.g., changes to MIME structure, may be performed in accordance with the keyword matched on the first probe). If a match does occur it should typically produce a string of the form:


OUT-CHARSET=out-char-set

Here the out-char-set specifies the name of the character set to which the in-char-set should be converted. Note that both of these character sets must be defined in the character set definition table, charsets.txt, located in the MTA table directory. No conversion will be done if the character sets are not properly defined in this file. This is not usually a problem since this file defines several hundred character sets; most of the character sets in use today are defined in this file. See the description of the imsimta chbuild utility for further information on the charsets.txt file.
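The second probe and its result can be pictured with a small Python sketch. This is only an illustration of the lookup logic described above, not the MTA's implementation: the channel names, the in-memory TABLE, the DEFINED_CHARSETS set (standing in for charsets.txt), and the use of fnmatch-style wildcards are all assumptions made for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for CHARSET-CONVERSION entries: (pattern, template).
TABLE = [
    ("IN-CHAN=tcp_local;OUT-CHAN=*;IN-CHARSET=ISO-8859-1", "OUT-CHARSET=UTF-8"),
]

# Hypothetical stand-in for the character sets defined in charsets.txt.
DEFINED_CHARSETS = {"US-ASCII", "ISO-8859-1", "UTF-8"}


def second_probe(in_chan, out_chan, in_charset):
    """Build the probe string used for one text part."""
    return f"IN-CHAN={in_chan};OUT-CHAN={out_chan};IN-CHARSET={in_charset}"


def lookup_out_charset(in_chan, out_chan, in_charset):
    """Return the charset to convert to, or None if no conversion applies."""
    probe = second_probe(in_chan, out_chan, in_charset)
    for pattern, template in TABLE:
        # fnmatchcase only approximates mapping-table wildcard matching.
        if fnmatchcase(probe, pattern) and template.startswith("OUT-CHARSET="):
            out_charset = template[len("OUT-CHARSET="):]
            # Both charsets must be defined, otherwise nothing is converted.
            if in_charset in DEFINED_CHARSETS and out_charset in DEFINED_CHARSETS:
                return out_charset
            return None
    return None  # no match: no character set conversion


print(lookup_out_charset("tcp_local", "ims-ms", "ISO-8859-1"))
```

A part labelled with a charset that matches no entry, or one missing from the defined set, simply yields None in this sketch, mirroring the "no conversion" outcome.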

If all the conditions are met, the MTA will proceed to build the character set mapping and do the conversion. The converted message part will be relabelled with the name of the character set to which it was converted. Encoded-words in message headers (text encoded according to the rules of RFC 2047) will also have the specified charset conversion applied.
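To make the effect on encoded-words concrete, here is a rough Python sketch that re-encodes RFC 2047 encoded-words from one charset to another using the standard email.header module. It is an analogy only: the function name convert_encoded_words is hypothetical, Python's codec names stand in for MTA charset names, and the MTA's own choice of output encoding may differ from what Python picks.

```python
from email.header import Header, decode_header


def convert_encoded_words(header_value, in_charset, out_charset):
    """Re-encode encoded-words labelled in_charset so they use out_charset."""
    out = Header()
    for chunk, charset in decode_header(header_value):
        if isinstance(chunk, str):
            # Header contained no encoded-words at all.
            out.append(chunk)
        elif charset and charset.lower() == in_charset.lower():
            # Decode from the input charset, relabel and re-encode in the output one.
            out.append(chunk.decode(charset), out_charset)
        else:
            # Leave words in other charsets (or plain ASCII runs) as they were.
            out.append(chunk, charset)
    return out.encode()


converted = convert_encoded_words("=?iso-8859-1?q?caf=E9?=", "iso-8859-1", "utf-8")
print(converted)
```

The converted word carries the new charset label, which corresponds to the relabelling described above for message parts.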

In addition, the following other types of output requests are supported.

When working on text parts of messages, one may also specify an encoding in which the MTA should output that part:


OUT-ENCODING=encoding-name

Here encoding-name must be the name of an encoding supported by the MTA, namely one of (as of this writing):

  • NONE,
  • 8BIT,
  • 7BIT,
  • ATOB,
  • BASE32,
  • BASE64,
  • BASE85,
  • BINARY,
  • BINARY-8BIT,
  • BINHEX,
  • BTOA,
  • COMPRESSED-BASE64,
  • COMPRESSED-BINARY,
  • COMPRESSED-UUCODE,
  • COMPRESSED-UUDECODE,
  • COMPRESSED-UUENCODE,
  • DEFLATE-BASE64,
  • DEFLATE-BINARY,
  • DEFLATE-UUCODE,
  • DEFLATE-UUDECODE,
  • DEFLATE-UUENCODE,
  • HEXADECIMAL,
  • OLD-BASE64,
  • PATHWORKS,
  • QUOTED-PRINTABLE,
  • UUCODE,
  • UUDECODE,
  • UUENCODE,
  • X-ATOB,
  • X-BASE32,
  • X-BASE85,
  • X-BINHEX,
  • X-BTOA,
  • X-C-DATA,
  • X-COMPRESSED-UUCODE,
  • X-COMPRESSED-UUDECODE,
  • X-COMPRESSED-UUENCODE,
  • X-DEFLATE-UUCODE,
  • X-DEFLATE-UUDECODE,
  • X-DEFLATE-UUENCODE,
  • X-HEXADECIMAL,
  • X-OLD-BASE64,
  • X-PATHWORKS,
  • X-UUCODE,
  • X-UUDECODE,
  • X-UUENCODE.

Both an output charset and an output encoding may be specified, by separating the clauses with a comma.

For encoded-words in message header lines (material encoded using the RFC 2047 encoding rules), the OUT-ENCODING must be one of QUOTED-PRINTABLE, HEXADECIMAL, X-HEXADECIMAL, or BASE64; attempting to set any other output encoding will result in the "unknown" encoding being used.
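As an illustration of the comma-separated clause form and the header restriction, the following Python sketch parses a template string and works out which encoding would apply to encoded-words. The function names are hypothetical, and representing the fallback as the literal string "unknown" is an assumption based on the wording above.

```python
# Encodings the text above lists as valid for RFC 2047 encoded-words.
HEADER_ENCODINGS = {"QUOTED-PRINTABLE", "HEXADECIMAL", "X-HEXADECIMAL", "BASE64"}


def parse_template(template):
    """Split a template such as 'OUT-CHARSET=x,OUT-ENCODING=y' into a dict."""
    clauses = {}
    for clause in template.split(","):
        name, _, value = clause.strip().partition("=")
        clauses[name.upper()] = value
    return clauses


def header_encoding(clauses):
    """Return the encoding that would apply to encoded-words in headers."""
    requested = clauses.get("OUT-ENCODING")
    if requested is None:
        return None  # no output encoding was requested
    if requested.upper() in HEADER_ENCODINGS:
        return requested
    return "unknown"  # any other encoding falls back to "unknown"


clauses = parse_template("OUT-CHARSET=UTF-8,OUT-ENCODING=BASE64")
print(clauses, header_encoding(clauses))
```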

There are also several additional options that can be applied for conversion of the charset in message headers. Specifying


OUT-CHARSET=out-charset,RELABEL-ONLY=1 

in the template (right hand side) of a mapping entry means that the MTA will simply use the specified charset name out-charset wherever the in-charset name had appeared. This is intended for cases where the original charset label was wrong and the goal is simply to override it with the correct label, without performing any actual charset conversion.
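The difference between converting and relabelling can be sketched in a few lines of Python. This is a conceptual illustration only: apply_out_charset is a hypothetical helper, and Python codec names are used in place of MTA charset names.

```python
def apply_out_charset(data, in_charset, out_charset, relabel_only=False):
    """Return (bytes, label) after applying OUT-CHARSET to one text part."""
    if relabel_only:
        # RELABEL-ONLY=1: bytes are untouched, only the label changes.
        return data, out_charset
    # Normal case: transcode the bytes and relabel the part.
    return data.decode(in_charset).encode(out_charset), out_charset


# UTF-8 data that was wrongly labelled as ISO-8859-1: fix the label only.
mislabelled = "café".encode("utf-8")
print(apply_out_charset(mislabelled, "iso-8859-1", "utf-8", relabel_only=True))

# Genuine ISO-8859-1 data: convert the bytes to UTF-8.
latin1 = "café".encode("iso-8859-1")
print(apply_out_charset(latin1, "iso-8859-1", "utf-8"))
```

Converting the mislabelled data instead of relabelling it would corrupt the text, which is the situation RELABEL-ONLY is meant to avoid.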

Specifying


IN-CHARSET=* 

in the template (right hand side) of a mapping entry requests that the MTA sniff the data to try to determine which character set was actually used. Currently, the only useful determination of this kind that the MTA can make is among US-ASCII, EUC-JP, SHIFT-JIS, and ISO-2022-JP.
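For a sense of what distinguishing these four character sets can involve, here is a simple heuristic sketch in Python. It is not the MTA's detection algorithm, which is not described here; the function name and the approach (7-bit check, escape-sequence check, then trial decoding) are assumptions chosen for illustration.

```python
def sniff_japanese_charset(data):
    """Guess among US-ASCII, ISO-2022-JP, EUC-JP and SHIFT-JIS; None if unsure."""
    if all(byte < 0x80 for byte in data):
        # ISO-2022-JP is 7-bit and switches character sets with ESC sequences.
        return "ISO-2022-JP" if b"\x1b" in data else "US-ASCII"
    # 8-bit data: try a strict decode with each candidate codec in turn.
    for label, codec in (("EUC-JP", "euc_jp"), ("SHIFT-JIS", "shift_jis")):
        try:
            data.decode(codec)
            return label
        except UnicodeDecodeError:
            continue
    return None


text = "こんにちは"
for codec in ("ascii", "iso2022_jp", "euc_jp", "shift_jis"):
    sample = b"hello" if codec == "ascii" else text.encode(codec)
    print(codec, "->", sniff_japanese_charset(sample))
```

Some byte sequences are valid in both EUC-JP and Shift-JIS, so the order of the trial decodes acts as a tie-break in this sketch; a production detector would need something more careful.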

Specifying


OUT-LANGUAGE=lang-tag

in the template (right hand side) of a mapping entry tells the MTA to set the specified language tag as the value of the Content-language: header line. Specifying


OUT-LANGUAGE=*lang-tag

tells the MTA to insert the specified language tag with the charset name inside encoded-words on header lines, if no explicit language tag was already present in the encoded-words.
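The encoded-word form with a language tag is the charset*language syntax defined in RFC 2231. The following Python sketch shows the intended effect of adding a tag only where none is present; the regular expression and the add_language helper are illustrative assumptions, not the MTA's code.

```python
import re

# Matches =?charset[*lang]?B-or-Q?text?= as used in RFC 2047 / RFC 2231.
ENCODED_WORD = re.compile(r"=\?([^?*\s]+)(\*[^?\s]+)?\?([BbQq])\?([^?\s]*)\?=")


def add_language(header_value, lang_tag):
    """Insert lang_tag into encoded-words that carry no language tag yet."""
    def repl(match):
        charset, lang, encoding, text = match.groups()
        if lang:
            # An explicit language tag is already present: leave it alone.
            return match.group(0)
        return f"=?{charset}*{lang_tag}?{encoding}?{text}?="
    return ENCODED_WORD.sub(repl, header_value)


print(add_language("Subject: =?utf-8?B?Y2Fmw6k=?=", "fr"))
print(add_language("Subject: =?utf-8*en?Q?hello?=", "fr"))
```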


See also: