
FILE: german.words
VERSION: DEC-SRC-92-Apr-05

EDITOR

    Jorge Stolfi <stolfi@src.dec.com>
    DEC Systems Research Center
  
AUTHORS OF ORIGINAL WORDLISTS

    Stefan Kutsch <sk@krabat.marco.de>
    Werner Icking <icking@gmdzi.gmd.de>

DESCRIPTION

    The file german.words is a list of over 170,000 German words and
    proper nouns, compiled from several public domain wordlists.

    The file has one word per line, and is sorted with sort(1)
    in plain ASCII collating sequence.
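
    As a rough check of the ordering described above, the following
    Python sketch tests whether a sequence of words is in plain ASCII
    (byte-value) order, which is what sort(1) produces in the C locale.
    The function name is_ascii_sorted is hypothetical and is not part
    of this distribution.

```python
def is_ascii_sorted(words):
    """Return True if the words are in non-decreasing byte-value order."""
    # Encoding as ASCII raises UnicodeEncodeError on any non-ASCII
    # character, which the file is not expected to contain.
    encoded = [w.encode("ascii") for w in words]
    # Compare each word with its successor; equal neighbours are allowed.
    return all(a <= b for a, b in zip(encoded, encoded[1:]))


# In ASCII order every uppercase letter precedes every lowercase letter,
# so capitalized nouns sort ahead of lowercase words.
print(is_ascii_sorted(["Zug", "ab", "zu"]))
print(is_ascii_sorted(["ab", "Zug"]))
```

    Note that this differs from dictionary order: all capitalized
    words come before all lowercase words.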

    The file is supposed to include verb forms, declined nouns, words
    derived by standard prefixes and suffixes ("be", "ge", "un",
    etc.), and compound words.  However, the list is still highly
    incomplete and inconsistent.

    Nouns are capitalized, as required by German orthography.
    An umlaut is denoted by a double quote (") after the modified vowel
    (A/O/U/a/o/u).  The sharp s (Eszett) is denoted by the
    combination 'sS'.  Besides the letters [a-zA-Z], the file uses only
    hyphen, double quote, apostrophe, and newline.
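
    The notation above can be converted to ordinary German spelling
    mechanically.  The Python sketch below is one possible way to do
    it; the name decode_word and the mapping table are assumptions
    made for illustration, not part of this distribution.  It also
    assumes that 'sS' never occurs in the file for any other reason.

```python
import re

# Assumed mapping from the "vowel + double quote" notation to umlauts.
_UMLAUTS = {
    'a"': "ä", 'o"': "ö", 'u"': "ü",
    'A"': "Ä", 'O"': "Ö", 'U"': "Ü",
}


def decode_word(word):
    """Convert the ASCII notation used in german.words to Unicode."""
    # Replace each vowel followed by a double quote with its umlaut.
    word = re.sub(r'[aouAOU]"', lambda m: _UMLAUTS[m.group(0)], word)
    # Replace the 'sS' combination with the sharp s.
    return word.replace("sS", "ß")


print(decode_word('Bu"cher'))
print(decode_word("StrasSe"))
```

    Words without either notation pass through unchanged.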

AUXILIARY LISTS

    In the same directory as german.words you will find the following file:

    german.trash

        A list of 2817 words from the original wordlists that were
        intentionally excluded from german.words.  

        The list includes abbreviations, acronyms, computer slang,
        obvious typos and misspellings, apparently foreign words, and
        several words that looked suspicious to me.

ORIGINAL LISTS 

    The original wordlists from which those files were compiled are listed
    below.  They were obtained by anonymous FTP on 92-Feb-10.

    [1] file: germanl.Z
        size: 137591 bytes (328734 bytes uncompressed)
        author: Stefan Kutsch <sk@krabat.marco.de>
        from: reseq.regent.e-technik.tu-muenchen.de: /public/doc/dict

    [2] file: german-wordlist.new.Z
        size: 761910 bytes (2052775 bytes uncompressed)
        contact: Werner Icking <icking@gmdzi.gmd.de>
        from: reseq.regent.e-technik.tu-muenchen.de: /public/doc/dict

    [3] file: german-wordlist.Z
        size: 761528 bytes (2060734 bytes uncompressed)
        author: Werner Icking <icking@gmdzi.gmd.de>
        from: reseq.regent.e-technik.tu-muenchen.de: /public/doc/dict

    COMMENTS: Apparently, list [3] is an obsolete version of [2], with
    umlauts coded as "ae", "oe", "ue", and no distinction between
    "ss" and Eszett. This file can be found at many other sites,
    under the name "words.german.Z".

    The following USENET message refers to the above lists:

        Archive-name: text/dictionary/german-wordlist/1991-05-15
        Archive: faui43.informatik.uni-erlangen.de:
          /portal/doc/dict/german-wordlist.Z [131.188.1.43]
        Original-posting-by: icking@gmdzi.gmd.de (Werner Icking)
        Original-subject: Re: Liste deutscher Worte / Spellchecker
        Reposted-by: emv@msen.com (Edward Vielmetti, MSEN)
  
        Peter.Bruells@arbi.informatik.uni-oldenburg.de writes:
  
        > I would also need an ASCII list of German words, in order to
        > train or retrain a spellchecker if necessary.
  
        This drew two replies (one of them from me) pointing to
        existing wordlists:
  
        sk@krabat.marco.de (Stefan Kutsch) writes:
  
        > the ASCII list is available at
        >   ftp.informatik.tu-muenchen.de 131.159.0.110
        >   /public/doc/dict/germanl.Z
  
        icking@gmdzi.gmd.de (Werner Icking) writes:
  
        > In the course of a similar discussion in ...  I received from
        > From: squirrel@bart.cs.mcgill.ca (Alexander OKAPUU-VON VEH)
        > Subject: Re:  German Spell-Checker
  
        > a pointer to a wordlist which unfortunately contains umlauts
        > only as Uemlaeutoe:
  
        >   You can fetch the list:
  
        >   It is at faui43.informatik.uni-erlangen.de under 
        >   /portal/doc/dict   and is called  german-wordlist.Z
  
        >   About 760 K compressed, 2.1 MB otherwise.  
  
        Since demand is so great, I turned my NI (natural-intelligence)
        spellchecker loose on both lists (the AI spellchecker is, after
        all, still in the works).  A very first inspection already
        brought the following to light:
  
        The ***Munich*** wordlist knows such useful words as
  
            ANBE ANLLO" ARGE PM2 PM3 PME PU
  
        The "Aasgeier" is not missing either (though the Aasgeier's
        genitive is).
  
        It also knows umlauts (U"mla"uto") and Eszett:
  
          AdresSbit AdresSbusses AdresSgenerierung Adressberechnungen
          Adressbus Adresserweiterung Adressgenerator Adressoffsets
          Adressraum Adressraumerweiterung Anschlu"sS AnschlusS
          Anschlu"ss Anschluss
  
        although the spelling leaves something to be desired.
  
        The thoroughly German words "Padrta Paella Paintadditiv" appear
        one right after another.
  
        Nor are such popular words as "zusammenaddiert" or
        "aufaddiert" missing.
  
        When it comes to precision ("Genauigkeit"), one finds, among others,
  
          Genau Genaue Genauere Genauikeit
  
  
        The **Erlangen** wordlist knows AT ATP and many other useful
        letter combinations.
  
        For "Aasgeiern Aasgeiers" the nominative is missing, just as
        the singular is for "Abenddaemmerungen".  The same goes for
        "Bliebe Blinddaerme Blinddaermen Blinddarms Blinde" or
        "Daemmerstunde Daemmerungen".
  
        This is the kind of thing one needs for "Buechermachen"
  
        Here too the search for precision paid off: "Genannten
        Genaratoren Genau"
  
        And if one really looks very closely, one finds
  
          genaueste genauesten genausowenig genaustem genausten
          genauster
  
        There is almost nothing left to "zuaddieren" to that!
  
          Genau Genaue Genauere Genauikeit
  
        Something like this is quite usable as *input* for a spellchecker.
  
          Werner Icking
          icking@gmdzi.gmd.de
          (+49 2241) 14-2443
          Gesellschaft fuer Mathematik und Datenverarbeitung mbH (GMD)
          Schloss Birlinghoven, P.O.Box 1240, 
          D-5205 Sankt Augustin 1, FRGermany
          "Der Dativ ist dem Genitiv sein Tod."

COMPILATION PROCESS

    The file german.words is basically the union of the files
    "germanl" [1] and "german-wordlist.new" [2], obtained from
    reseq.regent.e-technik.tu-muenchen.de, with a few thousand
    corrections.

    The table below gives the number of words in each original list
    ("lcase"), and how many of those words were included ("accept")
    and not included ("reject") in the final file german.words:

        ref  file                       lcase  accept  reject
        ---  -----------------------  -------  ------  ------
        [1]  germanl                    27342   26929     413
        [2]  german-wordlist.new       159510  158186    1324
        [3]  german-wordlist           160086  120921   39165

    In the case of file [3], the "reject" count includes most of the 
    words with occurrences of ae/oe/ue/ss.
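
    The figures in the table can be cross-checked with a few lines of
    Python.  The sketch below copies the counts from the table,
    verifies that accept + reject equals the first numeric column, and
    prints the rejected fraction for each list.  The names counts and
    reject_rate are hypothetical, introduced only for this example.

```python
# Counts copied from the table above: (total, accept, reject).
counts = {
    "germanl": (27342, 26929, 413),
    "german-wordlist.new": (159510, 158186, 1324),
    "german-wordlist": (160086, 120921, 39165),
}


def reject_rate(total, accept, reject):
    """Return the rejected fraction after checking that the row adds up."""
    if accept + reject != total:
        raise ValueError("accept + reject does not equal the total")
    return reject / total


for name, row in counts.items():
    # Print the share of each original list that was left out.
    print(f"{name:22s} {100 * reject_rate(*row):5.1f}% rejected")
```

    All three rows add up; list [3] has by far the largest rejected
    share, consistent with the remark above about ae/oe/ue/ss.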


(NON-)COPYRIGHT STATUS

  To the best of my knowledge, all the files I used to build these
  wordlists were available for public distribution and use, at least
  for non-commercial purposes.  I have confirmed this assumption with
  the authors of the lists, whenever they were known.
  
  Therefore, it is safe to assume that the wordlists in this package
  can also be freely copied, distributed, modified, and used for
  personal, educational, and research purposes.  (Use of these files in
  commercial products may require written permission from DEC and/or
  the authors of the original lists.)
  
  Whenever you distribute any of these wordlists, please distribute
  also the accompanying README file.  If you distribute a modified
  copy of one of these wordlists, please include the original README
  file with a note explaining your modifications.  Your users will
  surely appreciate that.

(NO-)WARRANTY DISCLAIMER

  These files, like the original wordlists on which they are based,
  are still very incomplete, uneven, and inconsistent, and probably
  contain many errors.  They are offered "as is" without any warranty
  of correctness or fitness for any particular purpose.  Neither I nor
  my employer can be held responsible for any losses or damages that
  may result from their use.

