                 =====================================

                               TUTORIAL

                  HANDLING EUROPEAN CHARACTER SETS IN
                    USENET NEWS AND ELECTRONIC MAIL

                            Gisle Hannemyr
                             Oslonett A.S.
                           gisle@oslonett.no

                              Version 2.0
                              1994 May 22

                 =====================================


Introduction
============

This file contains a brief tutorial on the use of European character
sets in Usenet news and electronic mail, and on why this is sometimes a
problem.  It also describes the mechanisms provided by MIME
(Multipurpose Internet Mail Extensions) to support the transfer of
European character sets.

The tutorial is usually supplied as part of "mimelite" (a small program
library written in C to support MIME).  The latest version will always
be available by anonymous ftp from oslonett.no.  The file to get is:

    oslonett.no:Software/MsDos/Kommunikasjon/Offline/mimeltXX.zip

where the XX in the filename is replaced by a two digit version number.

If you don't have access to ftp, write to the author (gisle@oslonett.no)
and I'll try to send you the mimelite archive as electronic mail.


Background
==========

In the beginning, there was US-ASCII. US-ASCII defined a standard
binding of numeric codes to graphical representations of characters.
The US-ASCII system used the codes from 33 to 126 (inclusive) for its
graphical symbols.  This makes room for 94 graphical symbols, which
may comfortably be encoded by 7 bits.  Later, the US-ASCII
representation was made into an international standard, which was given
the name ISO-646-IRV (IRV stands for "International Reference Version").

ISO also has a scheme for registering character sets.  Provisions for
this are detailed in ISO 2022 ("Code Extension Techniques") and ISO
2375 ("Procedures for registration of Escape Sequences").  This
scheme has given US-ASCII registration number 6, so another synonym
for the 94 character US-ASCII character set is ISO-IR-006 (IR stands
for "International Register" [of coded character sets to be used with
escape sequences]).

Unfortunately, 94 graphical symbols are too few for all the weird and
wonderful characters used in miscellaneous European languages.  There
were several attempts to court the European market by giving us products
with our own characters.  IBM and Microsoft, in particular, made a lot
of effort, and created in the process a whole maze of twisty little
encodings, all different.  These are known as "codepages" (CPs).  A
number of other computer manufacturers also created their own
encodings.

The one thing that was constant in this process was the encoding of
the 94 graphical characters that were defined in US-ASCII, but the
encodings between 128 and 255 were a mess.

Enter the ISO (International Organization for Standardization).  ISO,
with good help from the international community, created the ISO 8859
series.
This is a series of standards defining mappings between 8-bit
character codes and graphical symbols.  Each part of the series is
designed to serve the needs of a particular geographic area.

The first part of this series (ISO 8859/1, also known as "ISO Latin
alphabet no. 1", or simply "ISO Latin 1") quickly gained wide
acceptance.  ISO 8859/1 extends US-ASCII by providing additional
characters for the following major languages of the western world:
Danish, Dutch, English, Faroese, Finnish, French, German, Icelandic,
Irish, Italian, Norwegian, Portuguese, Spanish and Swedish.

ISO 8859/1 was adopted in Scandinavia, Western Europe, and also among
several major US manufacturers (DEC, SCO, Sun, HP).  It was
subsequently adopted by X/Open as the preferred character set for Unix
workstations, and it also appears to be the preferred character set in
X11 and the default character set used by Microsoft Windows
applications, plus a number of others.  It has also become the default
character set used when transmitting messages containing European
characters on Usenet.

While Microsoft's _Windows_ defaults to ISO Latin 1, Microsoft's
_MS-DOS_ does not.  There is a DOS codepage (CP 850 multilingual) that
contains all the graphical symbols of ISO Latin 1, but with a different
encoding.  As every ISO 8859/1 character can be mapped to a character
in Code Page 850, CP 850 is the code page that should be used on an
MS-DOS machine to be able to display the full ISO 8859/1 set.
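
The recoding between CP 850 and ISO 8859/1 can be illustrated with a
minimal sketch.  It is written in Python rather than C and relies on
Python's built-in "cp850" and "latin-1" codecs; the function names are
made up for this illustration and are not part of the mimelite library.
Symbols that exist only in CP 850 (such as the box-drawing characters)
have no ISO 8859/1 equivalent, and the sketch assumes it is acceptable
to replace them with "?".

```python
def cp850_to_latin1(data: bytes) -> bytes:
    """Recode bytes from DOS code page 850 to ISO 8859/1."""
    # CP 850 holds symbols that ISO 8859/1 lacks (for example the
    # box-drawing characters); errors="replace" turns those into "?".
    return data.decode("cp850").encode("latin-1", errors="replace")


def latin1_to_cp850(data: bytes) -> bytes:
    """Recode bytes from ISO 8859/1 to DOS code page 850."""
    # Codes 128-159 are control codes in ISO 8859/1 and have no CP 850
    # counterpart; they are also replaced by "?".
    return data.decode("latin-1").encode("cp850", errors="replace")


# The Norwegian word "blaabaersyltetoey" spelled with its proper
# letters, first as CP 850 bytes and then recoded to ISO 8859/1.
dos_text = b"bl\x86b\x91rsyltet\x9by"
iso_text = cp850_to_latin1(dos_text)
```

The same letters end up with different codes in the two encodings,
which is why an unconverted message looks garbled on the other side.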

Consequently, in order to successfully transmit messages between an
MS-DOS system and Usenet/Internet, some conversion of character
encodings is required to prevent the message from becoming garbled.
When transmitting mail between Apple Macintosh and Usenet, similar
problems arise.  One solution to this problem which is rapidly gaining
acceptance is MIME (RFC 1521).  So let us take a brief look at the
mechanisms provided by MIME to manage foreign character sets.  (You
don't actually need to understand MIME to use the mimelite library, but
it helps.)


MIME
====

MIME (RFC 1521) provides a number of mechanisms for specifying and
describing the format of Internet message bodies.  The full
capabilities of MIME are beyond those of the mimelite library.  The
mimelite library covers a subset of MIME, namely the bits that support
encoding and transfer of European character sets.

MIME specifies three header fields of particular interest to us:

1) The MIME-Version header field.

   This tells us that the message contains headers compliant with
   RFC 1521.

2) The Content-Type header field.

   This describes the data contained in the body part and has the
   following syntax:

      type "/" subtype *[ ";" parameter ]

   We are not going to do a full MIME implementation, so we shall
   just handle a single type (TEXT) and a single subtype (PLAIN).
   Everything else is UNKNOWN.

   The thing we are interested in is the "parameter".  We shall look
   for a parameter named "charset", and its value.  RFC 1521 defines
   the following values:

       US-ASCII
       ISO-8859-x  (where x refers to a specific part of the
                    ISO-8859 set of standards).

   So far, the only ISO-8859 value handled by the mimelite library is
   ISO-8859-1 (Western European alphabets).
   

3) The Content-Transfer-Encoding header field.

   This field tells us how the body is encoded.  We consider the
   following values for this field:

      QUOTED-PRINTABLE
      8BIT
      7BIT
      BASE64
     [BINARY]

   BINARY, 8BIT and 7BIT indicate that the body is _not_ encoded.
   QUOTED-PRINTABLE tells us that the body is encoded according
   to the scheme described in section 5.1 of RFC 1521, and BASE64
   is described in sec. 5.2 in the same document.

   The encoding in square brackets is not yet supported by the
   mimelite library.
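
To make the Content-Type handling concrete, here is a minimal sketch in
Python of how the field value might be picked apart.  The function
names are hypothetical and this is not how mimelite itself is written.
The sketch assumes a simple field value: it does not cope with comments
or with quoted parameter values that contain a semicolon.

```python
def parse_content_type(value):
    """Split 'type/subtype; name=value; ...' into its three parts."""
    main, *params = value.split(";")
    mtype, _, subtype = main.strip().partition("/")
    parameters = {}
    for item in params:
        name, sep, val = item.partition("=")
        if sep:  # skip fragments that are not name=value pairs
            parameters[name.strip().lower()] = val.strip().strip('"')
    return mtype.strip().upper(), subtype.strip().upper(), parameters


def charset_of(value):
    """Return the charset for TEXT/PLAIN, or "UNKNOWN" otherwise."""
    mtype, subtype, parameters = parse_content_type(value)
    if (mtype, subtype) != ("TEXT", "PLAIN"):
        return "UNKNOWN"
    # RFC 1521 makes US-ASCII the default when no charset is given.
    return parameters.get("charset", "US-ASCII").upper()
```

Type, subtype and parameter names are matched without regard to case,
so "text/plain; charset=iso-8859-1" and
"TEXT/PLAIN; charset=ISO-8859-1" give the same result.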

All of the above addresses the encoding of 8 bit data in the message
body, as described in RFC 1521.  An accompanying document, RFC 1522,
describes the encoding of 8 bit data in message headers.  This scheme
is, in the author's most humble opinion, a kludge and is best avoided.
The mimelite library will decode RFC 1522 encoded message headers, but
I don't want to complicate this tutorial by describing the scheme in
detail.  Interested parties with a strong stomach should read RFC 1522
themselves.


Importing MIME-encoded messages 
-------------------------------

When importing messages, first find out what encoding is used in the
message.  (Mail and news are treated alike here, because there is no
need to distinguish between them.)  The following strategy can be
used:

- If the message has a MIME header, parse the MIME header fields to
  determine the actual encoding used.

- If not, check whether the message has an RFC-1049 style Content-Type
  header, and try to make sense of this.

- If the message has neither a MIME header nor an RFC-1049 style
  header, you may assume (according to RFC-822 and RFC-1036):

     Content-Type: TEXT/PLAIN; charset=US-ASCII
     Content-Transfer-Encoding: 7bit

  However, I recommend that you assume the following instead:

     Content-Type: TEXT/PLAIN; charset=ISO-8859-1
     Content-Transfer-Encoding: 8BIT

  The reason the latter assumption is better is that it is quite
  common for messages to contain ISO-8859-1 highbit characters without
  a MIME header (especially news).  Assuming 8BIT encoding will ensure
  that such messages are processed properly.

You then need to decode the message according to the information
(actual or assumed) found in these headers.
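
The import strategy can be sketched as follows.  This is an
illustration in Python, not the mimelite API: the function name
decode_body and the idea of passing the headers as a dictionary keyed
by lower-cased field name are assumptions made for the example.  The
RFC-1049 step is left out to keep the sketch short, and a charset that
Python does not know will raise LookupError.

```python
import base64
import quopri
from email.message import Message


def decode_body(headers, body):
    """Decode a raw message body (bytes) to text.

    headers maps lower-cased header field names to their values.
    """
    if "mime-version" in headers:
        # Let the standard library pick the charset parameter out of
        # the Content-Type value; US-ASCII is the RFC 1521 default.
        probe = Message()
        if "content-type" in headers:
            probe["Content-Type"] = headers["content-type"]
        charset = probe.get_content_charset("us-ascii")
        encoding = headers.get("content-transfer-encoding", "7bit")
    else:
        # No MIME header: assume ISO-8859-1 and 8BIT, as recommended
        # in the text above.
        charset, encoding = "iso-8859-1", "8bit"

    encoding = encoding.strip().lower()
    if encoding == "quoted-printable":
        body = quopri.decodestring(body)
    elif encoding == "base64":
        body = base64.b64decode(body)
    # 7bit, 8bit and binary bodies are passed through unchanged.
    return body.decode(charset, errors="replace")
```

For example, a body of b"bl=E5b=E6r" labelled QUOTED-PRINTABLE with
charset ISO-8859-1 and an unlabelled body of b"bl\xe5b\xe6r" both come
out as the same five-letter word.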


Exporting messages using MIME
-----------------------------

When exporting messages, please note that while it is common to assume
8-bit clean transport for news, only 7 bits can be used for mail in most
segments of the Net (but see note at the end).

The differences between what we can rely on for transport mean that
news and mail may need to be treated differently.

NEWS:
   Check whether the user has actually used any characters outside the
   range of US-ASCII in the body of text.

   - If no, the message will remain unchanged.

   - If yes, translate the character codes from whatever convention your
     system uses to ISO 8859/1.

   Because the text is not encoded, there is no requirement to add the
   MIME headers to the message.  Some think that such labeling is
   useful, others think it is a waste of bandwidth.

   If you want to add MIME headers, use the following for news messages
   with only 7 bit characters:

      MIME-Version: 1.0
      Content-Type: TEXT/PLAIN; charset=US-ASCII
      Content-Transfer-Encoding: 7BIT

   and the following for news messages that contain characters with the
   highbit set:

      MIME-Version: 1.0
      Content-Type: TEXT/PLAIN; charset=ISO-8859-1
      Content-Transfer-Encoding: 8BIT
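
The news case might be sketched like this in Python.  The function
name is hypothetical.  Since Python strings are Unicode, "whatever
convention your system uses" is Unicode here, and the translation to
ISO 8859/1 is a plain encode step.  The sketch assumes you have chosen
to add the optional MIME headers.

```python
def news_export(body):
    """Return (MIME header lines, body bytes) for a news article."""
    try:
        data = body.encode("ascii")
        charset, encoding = "US-ASCII", "7BIT"
    except UnicodeEncodeError:
        # The body has characters outside US-ASCII: translate it to
        # ISO 8859/1 and send it unencoded.  A character that is not
        # in ISO 8859/1 either makes this line raise
        # UnicodeEncodeError.
        data = body.encode("latin-1")
        charset, encoding = "ISO-8859-1", "8BIT"
    headers = [
        "MIME-Version: 1.0",
        "Content-Type: TEXT/PLAIN; charset=" + charset,
        "Content-Transfer-Encoding: " + encoding,
    ]
    return headers, data
```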


MAIL:
   Check whether the user has actually used any characters outside the
   range of US-ASCII in the body of text.

   - If no, the message will remain unchanged.

     It is optional whether you want to add the following three lines
     to the header:

        MIME-Version: 1.0
        Content-Type: TEXT/PLAIN; charset=US-ASCII
        Content-Transfer-Encoding: 7bit

   - If yes, translate the characters in the body of the message from
     whatever your system uses to ISO 8859/1, and then apply a
     content-transfer-encoding (according to RFC 1521) that can safely
     encode 8 bit data so it can be carried by 7 bit transport.  The
     preferred encoding for mail with text containing European
     characters is QUOTED-PRINTABLE.  If this is used, the following
     three lines must be added to the header:

        MIME-Version: 1.0
        Content-Type: TEXT/PLAIN; charset=ISO-8859-1
        Content-Transfer-Encoding: QUOTED-PRINTABLE


(NOTE: 8 bit transport for SMTP mail is becoming more and more
 widespread.  If you are confident that your mail transport is 8 bit
 clean, you may choose to drop the encoding and ship email messages
 using 8BIT encoding.)
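
The mail case, including the choice described in the note, might be
sketched like this in Python.  The function name and the
eight_bit_clean flag are made up for the illustration.  The
quoted-printable step uses the quopri module from Python's standard
library rather than anything from mimelite.

```python
import quopri


def mail_export(body, eight_bit_clean=False):
    """Return (MIME header lines, body bytes) for a mail message."""
    if body.isascii():
        # Nothing outside US-ASCII: the body is sent unchanged.
        charset, encoding = "US-ASCII", "7BIT"
        data = body.encode("ascii")
    elif eight_bit_clean:
        # Only safe if the whole mail path is known to pass 8 bit data.
        charset, encoding = "ISO-8859-1", "8BIT"
        data = body.encode("latin-1")
    else:
        # Translate to ISO 8859/1, then wrap the 8 bit data in
        # quoted-printable so that 7 bit transport can carry it.
        charset, encoding = "ISO-8859-1", "QUOTED-PRINTABLE"
        data = quopri.encodestring(body.encode("latin-1"))
    headers = [
        "MIME-Version: 1.0",
        "Content-Type: TEXT/PLAIN; charset=" + charset,
        "Content-Transfer-Encoding: " + encoding,
    ]
    return headers, data
```

With the default setting, every byte of the returned body is below
128, whatever European characters the original text contained.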



Software packages
=================

Below is a list of MIME utilities and character recoding packages
(including the mimelite library).

Some of the recoders contain tables for hundreds of character sets
which you may want to look up in case you need to add new recoding
tables to the mimelite library.

In the lists below, XXX is used to denote version number.


MIME utilities
--------------

metamail : Very complete implementation of MIME.
           Nathaniel S. Borenstein <nsb@thumper.bellcore.com>
           ftp.funet.fi:pub/unix/mail/metamail/mmXXX.tar.Z

mime     : Patch/hook to encode/decode MIME for HellDiver.
           Kosta Kostis (kosta@blues.sub.de)
           ftp.uni-erlangen.de:pub/doc/ISO/charsets/mimebXXX.zip
           ftp.uni-erlangen.de:pub/doc/ISO/charsets/mimesXXX.zip

mimelite : Lightweight ANSI C library to add MIME to other programs.
           Gisle Hannemyr (gisle@oslonett.no)
           oslonett.no:Software/MsDos/Kommunikasjon/Offline/mimeltXX.zip

mpack    : Utilities for encoding and decoding binary files in MIME.
           John Gardiner Myers <jgm+@CMU.EDU> 
           ftp.andrew.cmu.edu:pub/mpack/*


Recode utilities
----------------

recode   : GNU recode.
           Francois Pinard (pinard@iro.umontreal.ca)
           ftp.uni-oldenburg.de:pub/gnu/recode/*

chset    : RFC1345-based character set recoder.
           Keld Simonsen (keld@dkuug.dk)
           dkuug.dk:pub/chsetXXX.tar.Z

pep      : General purpose text filter; handles tabs, eoln, ANSI
           escapes, as well as character set recoding.
           Gisle Hannemyr (gisle@oslonett.no)
           oslonett.no:Software/Src/C/pepXXX.zip

transtab : Character Encoding Converter Generator Package.
           Kosta Kostis (kosta@blues.sub.de)
           ftp.uni-erlangen.de:pub/doc/ISO/charsets/transXXX.tar.gz


========================================================================
..EOF

