LINGUIST List 17.1214
Fri Apr 21 2006
Review: Software/Morphology: Alchemist 2.0 (1st review)
Editor for this issue: Lindsay Butler
Message 1: Alchemist 2.0
From: Oliver Streiter <ostreiterweb.de>
Subject: Alchemist 2.0
CREATORS: Sprague, Colin; Hu, Yu
AVAILABLE AT: http://linguistica.uchicago.edu/alchemist.html
Oliver Streiter, Department of Western Languages and Literature, National
University of Kaohsiung, Taiwan
Alchemist is a tool that allows users to read in raw text files and create a
morphological analysis in XML format that can be used as a ''gold standard'' for
evaluating the results of an unsupervised morphological analyzer. The user
manually identifies morphemes and categorizes them as root or affix, optionally
recording the analyst's degree of certainty. In addition, morphemes
can be assigned morphosyntactic features, such as part of speech, person,
number, and gender. The tool is intended for researchers who want to perform a
linguistic analysis and store their data in a standard format. Given its clear
and attractive interface, the tool might be effectively used for in-class
exercises on morphological analysis.
Alchemist can be freely downloaded from http://linguistica.uchicago.edu/, in a
binary version for Windows and Mac OS X, or as source code, e.g. for Linux/Unix
environments. The license under which the tool is distributed, however, is not
specified. The software is documented in a 19-page PDF file, accessible from
the same site.
The first step in using Alchemist is to create a new GOLD standard collection
from an input text file (RTF or plain text). The user then specifies the
maximum number of words in the list, e.g. 200 or 500 words. Before the word list
is created, the user can select or create rules to 'scrub' the text, for
example to remove unwanted numbers or HTML markup. New scrubbing rules can
be added in the form of regular expressions. Once created, scrubbing rules can
be saved and loaded in a later session.
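
The scrubbing step described above can be illustrated with a minimal Python
sketch. It is not Alchemist's implementation: the rule list SCRUB_RULES and the
functions scrub() and word_list() are hypothetical names, and reading the word
limit as a cap on distinct words is an assumption.

```python
import re

# Hypothetical scrubbing rules: (pattern, replacement) pairs applied in order.
SCRUB_RULES = [
    (re.compile(r"<[^>]+>"), " "),  # drop HTML-like tags
    (re.compile(r"\d+"), " "),      # drop runs of digits
]

def scrub(text):
    """Apply every scrubbing rule to the text, one after the other."""
    for pattern, replacement in SCRUB_RULES:
        text = pattern.sub(replacement, text)
    return text

def word_list(text, max_words):
    """Return up to max_words distinct words in order of first occurrence.

    Assumption: the limit counts distinct words, split on white space.
    """
    seen = []
    for word in scrub(text).split():
        if word not in seen:
            seen.append(word)
        if len(seen) == max_words:
            break
    return seen

print(word_list("<p>walked 12 walking</p> walked", 200))
```

Keeping the rules as data rather than code mirrors the reviewed feature that
rules can be saved and reloaded in a later session.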
Using the white space character as the predefined word delimiter, the tool
creates a word list from the input text. This word list, called the Word
Collection, can be sorted from left to right or from right to left to discover
prefixes or suffixes respectively. To facilitate the discovery of morphemes, the
words in the Word Collection can also be filtered using regular expressions. The
documentation contains a number of interesting examples of possible filters.
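
Why the two sort orders help can be seen in a small Python sketch. The function
names are illustrative assumptions and say nothing about how Alchemist
implements sorting or filtering internally.

```python
import re

def sort_left_to_right(words):
    """Ordinary alphabetical order: words sharing a prefix end up adjacent."""
    return sorted(words)

def sort_right_to_left(words):
    """Order by reversed spelling: words sharing a suffix end up adjacent."""
    return sorted(words, key=lambda w: w[::-1])

def filter_words(words, pattern):
    """Keep only the words in which the regular expression matches."""
    return [w for w in words if re.search(pattern, w)]

words = ["walking", "redo", "talked", "rewrite", "walked", "talking"]
print(sort_left_to_right(words))     # redo and rewrite are neighbours
print(sort_right_to_left(words))     # talked/walked and talking/walking are neighbours
print(filter_words(words, r"ing$"))  # only words ending in -ing
```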
Using the mouse pointer, the user can mark roots and affixes in the Word
Collection. Roots and affixes are then highlighted in different colors in the
Word Collection. In addition, the morphemes, their types ('root' or 'affix') and
the words in which they occur show up automatically in a list of morphemes,
called the Morpheme Explorer. The same morpheme derived from different words and
allomorphs can be merged in this Morpheme Explorer in a fairly intuitive way.
The morphemes in the Morpheme Explorer can also be used as a filter on the Word
Collection. Clicking on one or more affixes and then on the button 'Show
Filtered' lists all words containing these affixes in the Word Collection.
Using this filter, the user can jump easily and efficiently from an affix to a
root, from the root to other affixes, etc.
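
The navigation between affixes, words and roots can be pictured as a two-way
index. The following Python sketch is a toy model under that assumption; the
class MorphemeIndex and its methods are hypothetical and do not reflect
Alchemist's actual data model.

```python
from collections import defaultdict

class MorphemeIndex:
    """Toy two-way index between morphemes and the words containing them."""

    def __init__(self):
        self._by_morpheme = defaultdict(set)  # (form, kind) -> words
        self._by_word = defaultdict(set)      # word -> (form, kind) pairs

    def mark(self, word, form, kind):
        """Record that `form` was marked as a root or affix in `word`."""
        if kind not in ("root", "affix"):
            raise ValueError("kind must be 'root' or 'affix'")
        if form not in word:
            raise ValueError(f"{form!r} does not occur in {word!r}")
        self._by_morpheme[(form, kind)].add(word)
        self._by_word[word].add((form, kind))

    def words_with(self, form, kind):
        """Filter: all words in which this morpheme was marked."""
        return sorted(self._by_morpheme.get((form, kind), ()))

    def morphemes_of(self, word):
        """All morphemes marked in one word."""
        return sorted(self._by_word.get(word, ()))

index = MorphemeIndex()
for word, root, affix in [("walked", "walk", "ed"),
                          ("talked", "talk", "ed"),
                          ("walking", "walk", "ing")]:
    index.mark(word, root, "root")
    index.mark(word, affix, "affix")

print(index.words_with("ed", "affix"))   # from an affix to its words
print(index.morphemes_of("walked"))      # from a word to its morphemes
print(index.words_with("walk", "root"))  # from the root to other words and affixes
```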
The Word Collection and its analysis can be stored in XML according to the GOLD
standard (General Ontology for Linguistic Descriptions,
http://www.linguistics-ontology.org/). In later sessions, the user can
open this XML document and continue the analysis. Merging two analyses or adding
a text to an existing analysis does not seem to be possible.
The software documentation is well written and contains a detailed description
of all functions of Alchemist. However, it mentions neither the license nor the
installation process. While installation on Mac OS X and Windows was as smooth
as it can be, I abandoned the compilation of Alchemist under Linux after it
stopped with a cryptic message and neither the software documentation, nor the
contact person, nor a Web search provided any helpful information.
The documentation also lacks a discussion of the wider contexts in which the
tool can be used. The user's acquaintance with the GOLD standard, or at least
the willingness to use it, is taken for granted. Explaining the GOLD standard
and its usefulness in the introduction of the documentation would increase the
relevance of the tool.
The web-page of Alchemist does not contain additional information. There doesn't
seem to be any active user group, help desk, mailing list or any other kind of
information structure through which users and developers might interact.
The design of the interface is excellent. It integrates a nice help function.
The usage is as intuitive as it can be. Individual windows, however, cannot be
resized. Additional space might be gained by putting the R,A,C buttons after the
When the tool is tested in different contexts, however, it no longer seems as
mature as its interface and documentation suggest. The most serious problem is
related to the encoding of the input text file. Unlike in a web browser, there
is no way to specify the encoding of the input file. The tool uniformly assumes
that files are encoded in Latin-1 (ISO 8859-1). Alchemist thus produces broken
graphical representations for all other encodings, e.g. German, Spanish or
French in Unicode. Characters using more than one byte are split into
meaningless symbols. As a consequence, the tool is limited to the Latin script,
and within the Latin script to those writing systems which fall within the
scope of ISO 8859-1.
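
The splitting of multi-byte characters is easy to reproduce. The Python sketch
below assumes that 'Unicode' here means UTF-8 encoded files; the helper
misread_as_latin1() is an illustrative name, not part of any tool.

```python
def misread_as_latin1(text):
    """Encode text as UTF-8, then decode the bytes as if they were Latin-1.

    Every non-ASCII character occupies two or more bytes in UTF-8, and a
    Latin-1 decoder turns each of those bytes into a separate symbol.
    """
    return text.encode("utf-8").decode("latin-1")

for word in ["Müller", "niño", "été"]:
    garbled = misread_as_latin1(word)
    print(word, len(word), "->", repr(garbled), len(garbled))
```

Pure ASCII words pass through unchanged, which is why the problem goes
unnoticed with English input.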
Thus not only many East and Central Asian languages but also many richly
accented African languages cannot be processed by Alchemist unless they are
transliterated into a form which falls within the scope of ISO 8859-1. To be
clear, this excludes writing systems using the Arabic script, abugida
scripts, the Chinese script and the Cyrillic scripts, in addition to about 100
other scripts. Also excluded are many languages that use the Latin script but
are not covered by ISO 8859-1, e.g. Czech (ISO 8859-2) and Turkish (ISO 8859-9).
This failure to support Unicode should be corrected in future versions if the
tool is to have any relevance.
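
What letting the user state the input encoding could look like is sketched
below in Python. This is one possible design under stated assumptions, not a
description of any planned Alchemist feature; read_text() and its parameters
are hypothetical.

```python
def read_text(raw, encoding=None, fallbacks=("utf-8", "latin-1")):
    """Decode raw bytes into a Unicode string.

    A user-declared encoding is tried first, then the fallbacks in order.
    Returns a pair (text, encoding_used).
    """
    candidates = ([encoding] if encoding else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")

czech = "Žlutý".encode("iso-8859-2")

# With the encoding declared, the Czech letters survive.
print(read_text(czech, encoding="iso-8859-2"))

# Without it, the Latin-1 fallback succeeds silently but yields wrong letters.
print(read_text(czech))
```

The second call shows why guessing is not enough: single-byte encodings such
as ISO 8859-1 and ISO 8859-2 cannot be told apart by decoding errors alone, so
the user has to be able to state the encoding.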
The input and output functions reveal additional problems. Although the input
text file can be an RTF file or a plain text file, the RTF file I created with
OpenOffice was not processed correctly, and RTF tags showed up in the Word
Collection. Thus using plain text input files seems to be the only feasible
option. The XML output contains huge amounts of rubbish characters. Strictly
speaking, this is fool's gold and not XML. An inexperienced user would discard
the output and with it the entire tool.
A problematic procedure is the transformation of the input text into word lists.
Although this transformation is relatively easy for English, there is no general
procedure which can do this transformation for all writing systems of the world
without consulting a linguistic database. The white space character, the
hyphen and the apostrophe may or may not be part of a word, depending on the
writing system. Thus even common languages like French or Italian are processed
incorrectly in Alchemist, as two words joined by an apostrophe (') are not
split. Languages that can have a white space character within a word, e.g.
Vietnamese and Sesotho (Roux 2005, Streiter & Stuflesser 2005), and languages
without a word separator require more advanced techniques.
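
The apostrophe problem can be made concrete with a short Python sketch. Both
functions are illustrative assumptions: the first mimics white-space splitting
as described in this review, the second is a naive elision rule and not a
recommended tokenizer.

```python
import re

def split_on_whitespace(text):
    """White-space tokenization, as the review describes for Alchemist."""
    return text.split()

def split_elisions(text):
    """Naive rule: end a token after an apostrophe (straight or typographic)."""
    return re.findall(r"\w+['\u2019]|\w+", text)

print(split_on_whitespace("l'homme arrive"))  # l'homme stays one token
print(split_elisions("l'homme arrive"))       # l' and homme are separated
print(split_elisions("don't"))                # the same rule mis-splits English
```

The last call illustrates the point made above: a rule that suits French
elision breaks English contractions, so the choice has to depend on
language-specific knowledge.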
While transforming the Word Collection into a collection of morphemes, I
encountered the following problems. In some cases I would have liked a link
back to the text in which a word occurred. Ambiguous words, e.g. 'reports', can
be understood only in context, and a KWIC (keyword in context) view of the word
might reveal whether it is a verb or a noun. In addition, when I tried to undo
an analysis and deleted the affix from the Morpheme Explorer, the affix also
disappeared from the data in the Word Collection. Clicking on one character in
the Word Collection and deleting the related root in the Morpheme Explorer
splits the root into two roots. I do not know whether this is intended behavior.
Overall, there is no way to undo an analysis or to go back to an earlier stage
of the analysis.
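
A KWIC view of the kind wished for above needs little more than the original
token sequence. The following Python sketch is a minimal illustration; the
function kwic() and its width parameter are hypothetical and not part of
Alchemist.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every occurrence.

    `width` is the number of tokens shown on each side of the keyword.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

tokens = "she reports the news and the reports were late".split()
for left, key, right in kwic(tokens, "reports", width=2):
    print(f"{left:>10} [{key}] {right}")
```

In the first line 'reports' follows a pronoun and reads as a verb; in the
second it follows an article and reads as a noun.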
Finally, there are some minor problems:
* Using the Help function, the tool crashed several times after a few (maybe
inexact) mouse movements on Mac OS X. Unsaved data were then lost.
* The filter on the Word Collection and the Sorting of the Word Collection do
not interact in a meaningful way. When a filter is used, words are sorted from
left to right. When words are sorted from right to left, no filter can be used.
I can however think of no linguistic motivation why both techniques should not
be used in combination.
* Sometimes the system shows an unexpected outcome, e.g. after the deletion of a
word from the Word Collection, the system falls back on the last morpheme-based
Overall, Alchemist is a very promising tool which will certainly find its way
onto the linguist's desktop. It is well designed, easy to use and produces
output in an important standard. However, the tool is not as solid as one would
wish it to be. The main problem is that it does not support Unicode. This,
however, might be solved easily in future releases. Non-Unicode encoded files
could then be converted on the fly to Unicode using functionality similar to
iconv. Overcoming the difficulties in the creation of word lists will require
more linguistic intelligence, e.g. in the form of a linguistic database.
Finally, it is to be hoped that the development team will succeed in building a
community around the tool, so that new users can join discussion groups when
seeking support. This will also provide the feedback necessary to overcome the
remaining problems with buttons, windows and file formats. After all, alchemy
was not that unsuccessful, except in the production of gold. The Alchemist,
however, promises something better: it will help you to produce a gold
standard.
Roux, J. C. (2005), Results of the African Speech Technology (AST) Project,
Streiter, O. & Stuflesser, M. (2005), XNLRDF, the Open Source Framework for
Multilingual Computing, Lesser Used Languages & Computer Linguistics, European
Academy Bozen/Bolzano, Italy, 27th-28th October 2005,
ABOUT THE REVIEWER
Oliver Streiter teaches computational linguistics and corpus linguistics at the
National University of Kaohsiung, Taiwan. His current research focuses on the
compilation and annotation of linguistic resources to support low density