96 • 2011 - Hamburger Zentrum für Sprachkorpora - Universität ...
ARBEITEN ZUR MEHRSPRACHIGKEIT
WORKING PAPERS IN MULTILINGUALISM
Folge B • Series B
96 • 2011
Hanna Hedeland, Thomas Schmidt, Kai Wörner (eds.)
Multilingual Resources and Multilingual Applications
Proceedings of the Conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011
Sonderforschungsbereich Mehrsprachigkeit
ISSN 0176-599X
© Hanna Hedeland, Thomas Schmidt, Kai Wörner
Hamburger Zentrum für Sprachkorpora
Max Brauer-Allee 60
D-22765 Hamburg
ARBEITEN ZUR MEHRSPRACHIGKEIT
WORKING PAPERS IN MULTILINGUALISM
Folge B, Nr. 96 • 2011
Hanna Hedeland, Thomas Schmidt, Kai Wörner (eds.)
Multilingual Resources and Multilingual Applications
Proceedings of the Conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011
The "Arbeiten zur Mehrsprachigkeit – Folge B" (Working Papers in Multilingualism, Series B) publish research from the Collaborative Research Centre 538 on Multilingualism, established by the German Research Foundation (DFG) at the University of Hamburg in July 1999. We thank the DFG for its support.
The "Arbeiten zur Mehrsprachigkeit – Folge B" are registered with the Deutsche Bibliothek in Frankfurt/Main under the serial number ISSN 0176-599X.
Editorial team:
Martin Elsig, Svenja Kranich, Thomas Schmidt, Manuela Schönenberger
Technical implementation:
Thomas Schmidt
Collaborative Research Centre: Multilingualism
Sonderforschungsbereich 538: Mehrsprachigkeit
University of Hamburg
Founded in July 1999, the Collaborative Research Centre on Multilingualism conducts research on patterns of language use in multilingual environments, bilingual language acquisition, and the role of multilingualism and language contact in language change.
In the current, fourth funding period (2008–2011), the Centre comprises two main research branches, each of which focuses on a central set of common issues, and a third branch of projects dealing with practical application solutions. Branch E, Multilingual Language Acquisition, consists of four projects with a common focus on the nature of “critical phases” in language acquisition. Branch H, Historical Aspects of Multilingualism and Variation, consists of eight projects dealing with questions of language change and language contact. This branch also comprises projects of the former separate Branch K, Multilingual Communication. Since 2007, a new Branch T, Transfer Projects, has been active. It consists of five projects whose goal is to develop concrete solutions for practical problems relating to multilingual situations, based on results derived from the Centre's research activities.
Languages currently studied at the Research Centre include Danish, Catalan, English, Faroese, French, German, German Sign Language, Icelandic, Irish, Polish, Portuguese, Spanish, Swedish, and Turkish, as well as several historical or regional sub-varieties of some of these languages.
Chair:
Prof. Dr. Christoph Gabriel
christoph.gabriel@uni-hamburg.de
Co-chairs:
Prof. Dr. Kurt Braunmüller
braunmueller@rrz.uni-hamburg.de
Prof. Dr. Barbara Hänel-Faulhaber
Barbara.Haenel@uni-hamburg.de
University of Hamburg, SFB 538, Max-Brauer-Allee 60, D-22765 Hamburg.
Tel. +49 40-42838-6432. http://www.uni-hamburg.de/sfb538/
Local Organizing Committee
• Thomas Schmidt
• Kai Wörner
• Timm Lehmberg
• Hanna Hedeland
Program Committee
• Maja Bärenfänger (Universität Gießen)
• Stefanie Dipper (Universität Bochum)
• Kurt Eberle (Lingenio Heidelberg)
• Alexander Geyken (Berlin-Brandenburgische Akademie der Wissenschaften)
• Ullrich Heid (Universität Hildesheim)
• Claudia Kunze (Qualisys GmbH)
• Lothar Lemnitzer (Berlin-Brandenburgische Akademie der Wissenschaften)
• Henning Lobin (Universität Gießen)
• Ernesto de Luca (Technische Universität Berlin)
• Cerstin Mahlow (Universität Zürich)
• Alexander Mehler (Universität Bielefeld)
• Wolfgang Menzel (Universität Hamburg)
• Georg Rehm (Deutsches Forschungszentrum für Künstliche Intelligenz)
• Josef Ruppenhofer (Universität Saarbrücken)
• Thomas Schmidt (Universität Hamburg)
• Roman Schneider (Institut für Deutsche Sprache Mannheim)
• Bernhard Schröder (Universität Duisburg)
• Manfred Stede (Universität Potsdam)
• Angelika Storrer (Universität Dortmund)
• Maik Stührenberg (Universität Bielefeld)
• Thorsten Trippel (Universität Tübingen)
• Cristina Vertan (Universität Hamburg)
• Andreas Witt (Institut für Deutsche Sprache Mannheim)
• Christian Wolff (Universität Regensburg)
• Kai Wörner (Universität Hamburg)
Call for Papers
The Conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011 will take place from 28 to 30 September 2011 at the University of Hamburg. The main conference theme is Multilingual Resources and Multilingual Applications.
Contributions on any topic related to computational linguistics and language technology are invited, but we especially encourage submissions related to the main theme. The topic "Multilingual Resources and Multilingual Applications" comprises all aspects of computational linguistics and speech and language technology in which issues of multilingualism, of language contrasts, or of language-independent representations play a major role. This includes, for instance:
• representation and analysis of parallel corpora and comparable corpora
• multilingual lexical resources
• machine translation
• annotation and analysis of learner corpora
• linguistic variation in linguistic data and applications
• localisation and internationalisation
Conference languages are English and German; contributions are welcome in both languages. Three types of submission are invited:
• Regular talk – submission of an extended abstract
• Poster – submission of an abstract
• System presentation – submission of an abstract
Only contributions in electronic form will be accepted. We do not provide style sheets for submissions at this stage; constraints on length are given below. Contributions must be submitted via the electronic conference system. Accepted abstracts will be published as a book of abstracts on the occasion of the conference. Extended versions of the papers will be published as a special issue of the Journal for Language Technology and Computational Linguistics after the conference.
German Society for Computational Linguistics & Language Technology
Contents
I Invited Talks
Constructing Parallel Lexicon Fragments Based on English FrameNet Entries: Semantic and Syntactic Issues ..... 9
Hans C. Boas
The Multilingual Web: Opportunities, Borders and Visions ..... 19
Felix Sasaki
Combining Various Text Analysis Tools for Multilingual Media Monitoring ..... 25
Ralf Steinberger
II Regular Papers
Generating Inflection Variants of Multi-Word Terms for French and German ..... 33
Simon Clematide, Luzia Roth
Tackling the Variation in International Location Information Data: An Approach Using Open Semantic Databases ..... 39
Janine Wolf, Manfred Stede, Michaela Atterer
Towards Multilingual Biographical Event Extraction – Initial Thoughts on the Design of a New Annotation Scheme ..... 45
Michaela Geierhos, Jean-Leon Bouraoui, Patrick Watrin
The Corpus of Academic Learner English (CALE): A New Resource for the Study of Lexico-Grammatical Variation in Advanced Learner Varieties ..... 51
Marcus Callies, Ekaterina Zaytseva
From Multilingual Web-Archives to Parallel Treebanks in Five Minutes ..... 57
Markus Killer, Rico Sennrich, Martin Volk
Querying Multilevel Annotation and Alignment for Detecting Grammatical Valence Divergencies ..... 63
Oliver Čulo
SPIGA – A Multilingual News Aggregator ..... 69
Leonhard Hennig, Danuta Ploch, Daniel Prawdzik, Benjamin Armbruster, Christoph Büscher, Ernesto William De Luca, Holger Düwiger, Şahin Albayrak
From Historic Books to Annotated XML: Building a Large Multilingual Diachronic Corpus ..... 75
Magdalena Jitca, Rico Sennrich, Martin Volk
Visualizing Dependency Structures ..... 81
Chris Culy, Verena Lyding, Henrik Dittmann
A Functional Database Framework for Querying Very Large Multi-Layer Corpora ..... 87
Roman Schneider
Hybrid Machine Translation for German in taraXÜ: Can Translation Costs Be Decreased Without Degrading Quality? ..... 93
Aljoscha Burchardt, Christian Federmann, Hans Uszkoreit
Annotation of Explicit and Implicit Discourse Relations in the TüBa-D/Z Treebank ..... 99
Anna Gastel, Sabrina Schulze, Yannick Versley, Erhard Hinrichs
Devil’s Advocate on Metadata in Science ..... 105
Christina Hoppermann, Thorsten Trippel, Claus Zinn
Improving an Existing RBMT System by Stochastic Analysis ..... 111
Christian Federmann, Sabine Hunsicker
Terminology Extraction and Term Variation Patterns: A Study of French and German Data ..... 117
Marion Weller, Helena Blancafort, Anita Gojun, Ulrich Heid
Ansätze zur Verbesserung der Retrieval-Leistung kommerzieller Translation-Memory-Systeme ..... 123
Dino Azzano, Uwe Reinke, Melanie Sauer
WikiWarsDE: A German Corpus of Narratives Annotated with Temporal Expressions ..... 129
Jannik Strötgen, Michael Gertz
Translation and Language Change with Reference to Popular Science Articles: The Interplay of Diachronic and Synchronic Corpus-Based Studies ..... 135
Sofia Malamatidou
A Comparable Wikipedia Corpus: From Wiki Syntax to POS Tagged XML ..... 141
Noah Bubenhofer, Stefanie Haupt, Horst Schwinn
A German Grammar for Generation in OpenCCG ..... 145
Jean Vancoppenolle, Eric Tabbert, Gerlof Bouma, Manfred Stede
Multilingualism in Ancient Texts: Language Detection by Example of Old High German and Old Saxon ..... 151
Zahurul Islam, Roland Mittmann, Alexander Mehler
Multilinguale Phrasenextraktion mit Hilfe einer lexikonunabhängigen Analysekomponente am Beispiel von Patentschriften und nutzergenerierten Inhalten ..... 157
Daniela Becks, Julia Maria Schulz, Christa Womser-Hacker, Thomas Mandl
Die Digitale Rätoromanische Chrestomathie – Werkzeuge und Verfahren für die Korpuserstellung durch kollaborative Volltexterschließung ..... 163
Claes Neuefeind, Jürgen Rolshoven, Fabian Steeg
Ein umgekehrtes Lehnwörterbuch als Internetportal und elektronische Ressource: Lexikographische und technische Grundlagen ..... 169
Peter Meyer, Stefan Engelberg
Localizing a Core HPSG-Based Grammar for Bulgarian ..... 175
Petya Osenova
III Poster Presentations
Autorenunterstützung für die Maschinelle Übersetzung ..... 183
Melanie Siegel
Experimenting with Corpus-Based MT Approaches ..... 187
Monica Gavrila
Method of POS-Disambiguation Using Information about Words Co-Occurrence (For Russian) ..... 191
Edward Klyshinsky, Natalia Kochetkova, Maxim Litvinov, Vadim Maximov
Von TMF in Richtung UML: In drei Schritten zu einem Modell des übersetzungsorientierten Fachwörterbuchs ..... 197
Georg Löckinger
Annotating for Precision and Recall in Speech Act Variation: The Case of Directives in the Spoken Turkish Corpus ..... 203
Şükriye Ruhi, Thomas Schmidt, Kai Wörner, Kerem Eryılmaz
The SoSaBiEC Corpus: Social Structure and Bilinguality in Everyday Conversation ..... 207
Veronika Ries, Andy Lücking
DIL, ein zweisprachiges Online-Fachwörterbuch der Linguistik (Deutsch-Italienisch) ..... 211
Carolina Flinz
Knowledge Extraction and Representation: The EcoLexicon Methodology ..... 215
Pilar León Araúz, Arianne Reimerink
Processing Multilingual Customer Contacts via Social Media ..... 219
Michaela Geierhos, Yeong Su Lee, Matthias Bargel
ATLAS – A Robust Multilingual Platform for the Web ..... 223
Diman Karagiozov, Svetla Koeva, Maciej Ogrodniczuk, Cristina Vertan
Multilingual Corpora at the Hamburg Centre for Language Corpora ..... 227
Hanna Hedeland, Timm Lehmberg, Thomas Schmidt, Kai Wörner
The English Passive and the German Learner – Compiling an Annotated Learner Corpus to Investigate the Importance of Educational Settings ..... 233
Verena Möller, Ulrich Heid
Register, Genre, Rhetorical Functions: Variation in English Native-Speaker and Learner Writing ..... 239
Ekaterina Zaytseva
Tools to Analyse German-English Contrasts in Cohesion ..... 243
Kerstin Kunz, Ekaterina Lapshinova-Koltunski
Comparison and Evaluation of Ontology Extraction Systems ..... 247
Stefanie Reimers
IV System Presentations
New and Future Developments in EXMARaLDA ..... 253
Thomas Schmidt, Kai Wörner, Hanna Hedeland, Timm Lehmberg
Der VLC Language Index ..... 257
Dirk Schäfer, Jürgen Handke
Topological Fields, Constituents and Coreference: A New Multi-Layer Architecture for TüBa-D/Z ..... 259
Thomas Krause, Julia Ritz, Amir Zeldes, Florian Zipser
MT Server Land Translation Services ..... 263
Christian Federmann
Invited Talks
Multilingual Resources and Multilingual Applications - Invited Talks<br />
Constructing parallel lexicon fragments based on English FrameNet entries: Semantic and syntactic issues
Hans C. Boas
The University of Texas at Austin
Department of Germanic Studies and Department of Linguistics
1 University Station, C3300, Austin, TX 78712-0304, U.S.A.
E-mail: hcb@mail.utexas.edu
Abstract
This paper investigates how semantic frames from FrameNet can be re-used for constructing FrameNets for other languages. Section 1 provides a brief overview of Frame Semantics (Fillmore, 1982). Section 2 introduces the main structuring principles of the Berkeley FrameNet project. Section 3 presents a typology of FrameNets for different languages, highlighting a number of important issues surrounding the universal applicability of semantic frames. Section 4 shows that while it is often possible to re-use semantic frames across languages in a principled way, this is not always straightforward because of systematic syntactic differences in how lexical units express the semantics of frames. Section 5 summarizes the issues discussed in this paper.
Keywords: Computational Lexicography, FrameNet, Frame Semantics, Syntax
1. Frame Semantics
Research in Frame Semantics (Fillmore, 1982; 1985) is empirical, cognitive, and ethnographic in nature. It seeks to describe and analyze what users of a language understand about what is communicated by their language (Fillmore & Baker, 2010). Central to this line of research is the notion of semantic frame, which provides the basis for the organization of the lexicon, thereby linking individual word senses, relationships between the senses of polysemous words, and relationships among semantically related words. In this conception of the lexicon, there is a network of hierarchically organized and intersecting frames through which semantic relationships between collections of concepts are identified (Petruck et al., 2004). A frame is any system of concepts related in such a way that to understand any one concept it is necessary to understand the entire system; introducing any one concept results in all of them becoming available. In Frame Semantics, word meanings are thus characterized in terms of experience-based schematizations of the speaker's world, i.e. frames. It is held that understanding any element in a frame requires access to an understanding of the whole structure (Petruck & Boas, 2003). 1 The following section shows how the concept of semantic frame has been used to structure the lexicon of English for the purpose of creating a lexical database.
2. The Berkeley FrameNet Project
The Berkeley FrameNet Project (Lowe et al., 1997; Baker et al., 1998; Fillmore et al., 2003a; Ruppenhofer et al., 2010) is building a lexical database that aims to provide, for a significant portion of the vocabulary of contemporary English, a body of semantically and syntactically annotated sentences from which reliable information can be reported on the valences or combinatorial possibilities of each item targeted for analysis (Fillmore & Baker, 2001). The method of inquiry is to find groups of words whose frame structures can be described together, by virtue of their sharing common schematic backgrounds and patterns of expressions that can combine with them to form larger phrases or sentences. In the typical case, words that share a frame can be used in paraphrases of each other. The general purposes of the project are both to provide
1 See Petruck (1996), Ziem (2008), and Fillmore & Baker (2010) on how different theories employ the notion of “frame.”
reliable descriptions of the syntactic and semantic combinatorial properties of each word in the lexicon, and to assemble information about alternative ways of expressing concepts in the same conceptual domain (Fillmore & Baker, 2010).
To illustrate, consider the sentence Joe stole the watch from Michael. The verb steal is said to evoke the Theft frame, which is also evoked by a number of semantically related verbs such as snatch, shoplift, pinch, filch, and thieve, among others, as well as nouns such as thief and stealer. 2 The Theft frame represents a scenario with different Frame Elements (FEs) that can be regarded as instances of more general semantic roles such as AGENT, PATIENT, INSTRUMENT, etc. More precisely, the Theft frame describes situations in which a PERPETRATOR (the person or other agent that takes the GOODS away) takes GOODS (anything that can be taken away) that belong to a VICTIM (the person, or other sentient being or group, that owns the GOODS before they are taken away by the PERPETRATOR). Sometimes more specific information is given about the SOURCE (the initial location of the GOODS before they change location). 3 Interpreting steal and other semantically related verbs as evoking the Theft frame also requires background knowledge of illegal activities, property ownership, taking things, and a great deal more (see Boas, 2005b; Bertoldi et al., 2010; Dux, 2011).
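The frame-and-FE structure described above lends itself to a simple data model. The following is a minimal sketch (not the actual FrameNet schema; all class and attribute names are invented here) of how the Theft frame and its core and peripheral FEs might be represented:

```python
from dataclasses import dataclass, field

@dataclass
class FrameElement:
    name: str
    definition: str
    core: bool = True  # core FEs vs. peripheral FEs such as TIME or MANNER

@dataclass
class Frame:
    name: str
    elements: dict = field(default_factory=dict)

    def add(self, fe: FrameElement) -> None:
        self.elements[fe.name] = fe

# The Theft frame as described above.
theft = Frame("Theft")
theft.add(FrameElement("PERPETRATOR", "agent that takes the GOODS away"))
theft.add(FrameElement("GOODS", "anything that can be taken away"))
theft.add(FrameElement("VICTIM", "owner of the GOODS before the theft"))
theft.add(FrameElement("SOURCE", "initial location of the GOODS"))
theft.add(FrameElement("TIME", "e.g. 'two days ago'", core=False))
theft.add(FrameElement("MANNER", "e.g. 'quietly'", core=False))

core_fes = sorted(fe.name for fe in theft.elements.values() if fe.core)
```

Distinguishing core from peripheral FEs in the data model mirrors the distinction drawn in footnote 3.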
Based on the frame concept, FrameNet researchers follow a lexical analysis process that typically consists of the following steps according to Fillmore & Baker (2010:321-322): (1) Characterizing the frames, i.e. the situation types for which the language has provided special expressive means; (2) Describing and naming the Frame Elements (FEs), i.e. the aspects and components of individual frames that are likely to be mentioned in the phrases and sentences that are instances of those frames; (3) Selecting lexical units (LUs) that belong to the frame, i.e. words from all parts of speech that evoke and depend on the conceptual background associated with the individual frames; (4) Creating annotations of sentences sampled from a very large corpus showing the ways in which individual LUs in the frame allow frame-relevant information to be linguistically presented; (5) Automatically generating lexical entries, and the valence descriptions contained in them, that summarize observations derivable from them (see also Atkins et al., 2003; Fillmore & Petruck, 2003; Fillmore et al., 2003b; Ruppenhofer et al., 2010).
2 Names of frames are in courier font. Names of Frame Elements (FEs) are in small caps font.
3 Besides so-called core Frame Elements, there are also peripheral Frame Elements that describe more general aspects of a situation, such as MEANS (e.g. by trickery), TIME (e.g. two days ago), MANNER (e.g. quietly), or PLACE (e.g. in the city).
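Steps (4) and (5) of this process can be sketched in a few lines: each annotated sentence records FE spans with their phrase type and grammatical function, and a lexical entry summarizes how often each FE configuration is attested. The sentences and layer names below are invented for illustration and only loosely follow FrameNet's annotation layers:

```python
from collections import Counter

# Step (4): annotated sentences for steal.v; each FE span is recorded with
# its frame element, phrase type, and grammatical function.
annotated = [
    ("Joe stole the watch from Michael.",
     [("PERPETRATOR", "NP", "Ext"), ("GOODS", "NP", "Obj"), ("SOURCE", "PP_from", "Dep")]),
    ("My watch was stolen.",
     [("GOODS", "NP", "Ext")]),
    ("She stole the keys.",
     [("PERPETRATOR", "NP", "Ext"), ("GOODS", "NP", "Obj")]),
]

def valence_pattern(layers):
    """Collapse one sentence's FE annotations into a hashable FE configuration."""
    return tuple(sorted((fe, f"{pt}.{gf}") for fe, pt, gf in layers))

# Step (5): count how often each FE configuration is attested, as in the
# leftmost column of a FrameNet valence table.
patterns = Counter(valence_pattern(layers) for _, layers in annotated)
```

Under this sketch, generating the valence description of an LU is just an aggregation over its annotation set.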
The results of this workflow are stored in FrameNet (http://framenet.icsi.berkeley.edu), an online lexical database (Baker et al., 2003) currently containing information about more than 1,000 frames and more than 10,000 LUs. 4 Users can access FrameNet data in a variety of ways. The most prominent methods include searching for individual frames or specific LUs.
Figure 1: Partial valence table for steal.v in the Theft frame
Each entry for a LU in FrameNet consists of the following parts: (1) A description of the frame together with definitions of the relevant FEs, annotated example sentences illustrating the relevant FEs in context, and a list of other LUs evoking the same frame; (2) An annotation report displaying all the annotated corpus sentences for a given LU; (3) A lexical entry report which summarizes the syntactic realization of the FEs and the valence patterns of the LU in two separate tables (see Fillmore et al., 2003b; Fillmore, 2007).
4 For differences between FrameNet and other lexical databases such as WordNet, see Boas (2005a/2005b/2009).
Figure 1 above illustrates an excerpt from the valence patterns in the lexical entry report of steal in the Theft frame. The column on the far left lists the number of annotated example sentences (in the annotation report) illustrating the individual valence patterns. The rows represent so-called frame element configurations together with their syntactic realizations in terms of phrase type and grammatical function. For example, the third frame element configuration from the top lists the FEs GOODS, MANNER, and PERPETRATOR. The GOODS are realized syntactically as an NP Object, the MANNER as a dependent ADVP, and the PERPETRATOR as an external NP. Such systematic valence tables allow researchers to gain a better understanding of how the semantics of frames are realized syntactically. 5
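A valence table of this kind is also easy to query programmatically. The sketch below uses invented counts and configurations in the spirit of Figure 1 (not the actual FrameNet data) to show how one might aggregate all attested realizations of a single FE:

```python
# FE configurations with hypothetical attestation counts, loosely modelled
# on the valence table for steal.v in Figure 1.
valence_table = [
    (11, {"GOODS": "NP.Obj", "PERPETRATOR": "NP.Ext"}),
    (3,  {"GOODS": "NP.Obj", "MANNER": "ADVP.Dep", "PERPETRATOR": "NP.Ext"}),
    (2,  {"GOODS": "NP.Ext"}),  # e.g. passive uses
]

def realizations(fe, table):
    """Aggregate every attested syntactic realization of one FE across configurations."""
    out = {}
    for count, config in table:
        if fe in config:
            out[config[fe]] = out.get(config[fe], 0) + count
    return out

goods = realizations("GOODS", valence_table)
```

Such a query corresponds to collapsing the rows of the valence table into the per-FE realization summary found in the lexical entry report.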
3. FrameNets for other languages
3.1. Similarities and differences
Following the success of the Berkeley FrameNet for English, a number of FrameNets for other languages were developed over the past ten years. Based on ideas outlined in Heid (1996), Fontenelle (1997), and Boas (2001/2002/2005a), researchers aimed to create parallel FrameNets by re-using frames constructed by the Berkeley FrameNet project for English. While FrameNets for other languages aim to re-use English FrameNet frames to the greatest extent possible, they differ in a number of important points from the original FrameNet (see Boas, 2009).
For example, projects such as SALSA (Burchardt et al., 2009) aim to create full-text annotation of an entire German corpus instead of finding isolated corpus sentences to identify lexicographically relevant information, as is the case with the Berkeley FrameNet and Spanish FrameNet (Subirats, 2009). FrameNets for other languages also differ in what types of resources they use as data pools. That is, besides exploiting a monolingual corpus, as is the case with Japanese FrameNet (Ohara, 2009) or Hebrew FrameNet (Petruck, 2009), projects such as French FrameNet (Pitel, 2009) or BiFrameNet (Fung & Chen, 2004) also employ multilingual corpora and other existing lexical resources.
5 For details about the different phrase types and grammatical functions, including the different types of null instantiation (CNI, DNI, and INI) (Fillmore 1986), see Fillmore et al. 2003b, Boas 2009, Fillmore & Baker 2010, and Ruppenhofer et al. 2010.
Another difference concerns the tools used for data extraction and annotation. While the Japanese and Spanish FrameNets adopted the Berkeley FrameNet software (Baker et al., 2003) with slight modifications, other projects such as SALSA developed their own tools to conduct semi-automatic annotation on top of existing syntactic annotations found in the TIGER corpus, or they integrate off-the-shelf software, as is the case with French FrameNet or Hebrew FrameNet. FrameNets for other languages also differ in the methodology used to produce parallel lexicon fragments. While German FrameNet (Boas, 2002) and Japanese FrameNet (Ohara, 2009) rely on manual annotations, French FrameNet and BiFrameNet use semi-automatic and automatic approaches to create parallel lexicon fragments for French and Chinese. FrameNets for other languages also differ in their semantic domains and the goals they pursue. While most non-English FrameNets aim to create databases with broad coverage, other projects focus on specific lexical domains such as football (a.k.a. soccer) language (Schmidt, 2009) or the language of criminal justice (Bertoldi et al., 2010). Finally, while the data from almost all non-English FrameNets are intended to be used by a variety of audiences, Multi FrameNet 6 is intended to support vocabulary acquisition in the foreign language classroom (see Atzler, 2011).
3.2. Re-using (English) semantic frames<br />
To exemplify how English FrameNet frames can be reused<br />
for the creation of parallel lexicon fragments<br />
consider Boas' (2005a) discussion of the English verb<br />
answer evoking the Communication_Response<br />
frame and its counterpart responder in Spanish<br />
FrameNet. The basic idea is that since the two verbs are<br />
translation equivalents they should evoke the same<br />
semantic frame, which should in turn be used as a<br />
common structuring device for combining the respective<br />
6 http://www.coerll.utexas.edu/coerll/taxonomy/term/627<br />
11
Multilingual Resources and Multilingual Applications - Invited Talks<br />
English and Spanish lexicon fragments. Since the<br />
MySQL databases representing each of the non-English<br />
FrameNets are similar in structure to the English<br />
MySQL database in that they share the same type of<br />
conceptual backbone (i.e., the semantic frames and<br />
frame relations), this step involves determining which<br />
English LUs are equivalent to corresponding non-English LUs.
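Since the databases share this conceptual backbone, the first pass of this step can be sketched in code. The following Python fragment is purely illustrative: the data layout and the inventories of LUs are invented stand-ins for this example, not the actual FrameNet database schema.

```python
# Hypothetical sketch: two FrameNet lexicons represented as mappings
# from frame names to the lexical units (LUs) that evoke them.
english_framenet = {
    "Communication_Response": {"answer.v", "respond.v", "reply.v"},
    "Sending": {"send.v", "dispatch.v"},
}
spanish_framenet = {
    "Communication_Response": {"responder.v", "contestar.v"},
    "Sending": {"enviar.v"},
}

def candidate_equivalents(frame, source, target):
    """Pair every source-language LU with every target-language LU
    that evokes the same frame. These are only *candidates*: each
    pair still requires manual comparison of valence patterns."""
    return [(s, t)
            for s in sorted(source.get(frame, ()))
            for t in sorted(target.get(frame, ()))]

pairs = candidate_equivalents("Communication_Response",
                              english_framenet, spanish_framenet)
```

The shared frame serves as the join key; everything beyond this cross-product of candidates is, as the text notes, careful manual work.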
However, before creating parallel lexicon fragments for<br />
Spanish and linking them to their English counterparts<br />
via their semantic frame it is necessary to first conduct a<br />
detailed comparison of the individual LUs and how they<br />
realize the semantics of the frame. To begin, consider<br />
the different ways in which the FEs of the<br />
Communication_response frame are realized with<br />
answer.<br />
FE Name     Syntactic Realization
SPEAKER     NP.Ext, PP_by.Comp, CNI
MESSAGE     INI, NP.Obj, PP_with.Comp, QUO.Comp, Sfin.Comp
ADDRESSEE   DNI
DEPICTIVE   PP_with.Comp
MANNER      AVP.Comp, PPing_without.Comp
MEANS       PPing_by.Comp
MEDIUM      PP_by.Comp, PP_in.Comp, PP_over.Comp
TRIGGER     NP.Ext, DNI, NP.Obj, Swh.Comp
Table 1: Partial realization table for the verb answer
(Boas 2005a)
Table 1 shows that there is a significant amount of
variation in how FEs of the Communication_Response
frame are realized with answer. For
example, the FE DEPICTIVE has only one option for its<br />
syntactic realization, i.e. a PP complement headed by<br />
with. Other FEs such as SPEAKER and MANNER exhibit<br />
more flexibility in how the FEs of the frame are realized<br />
syntactically while yet another set of FEs such as<br />
MESSAGE and TRIGGER exhibit the highest degree of<br />
syntactic variation. Now that we know the full range of<br />
how the FEs of the Communication_Response<br />
frame are realized syntactically with answer we can take<br />
the next step towards creating a parallel lexical entry for<br />
its Spanish counterpart responder.<br />
This step involves the use of bilingual dictionaries and<br />
parallel corpora in order to identify possible Spanish<br />
translation equivalents of answer. While this procedure<br />
may seem trivial, it is a rather lengthy and complicated<br />
process because it is necessary to consider the full range<br />
of valence patterns (the combination of FEs and their<br />
syntactic realizations) of the English LU answer listed in
FrameNet. The FrameNet entry lists a total of 22 different frame element
configurations, totaling 32 different combinations in
which these configurations may be realized syntactically. As
the full valence table for answer is rather long we focus<br />
on only one out of the 22 frame element configurations,<br />
namely that of SPEAKER (Sp), MESSAGE (M), TRIGGER (Tr),<br />
and ADDRESSEE (A) in Table 2.<br />
Sp M Tr A<br />
a. NP.Ext NP.Obj DNI DNI<br />
b. NP.Ext PP_with.Comp DNI DNI<br />
c. NP.Ext QUO.Comp DNI DNI<br />
d. NP.Ext Sfin.Comp DNI DNI<br />
Table 2: Excerpt from the Valence Table for answer<br />
(Boas 2005a)<br />
As Table 2 shows, the frame element configuration<br />
exhibits a certain amount of variation in how the FEs are<br />
realized syntactically: All four valence patterns have the<br />
FE SPEAKER realized as an external noun phrase, and the<br />
FEs TRIGGER and ADDRESSEE not realized overtly at the<br />
syntactic level, but null instantiated as Definite Null<br />
Instantiation (DNI). In other words, in sentences such as<br />
He answered with another question the FEs TRIGGER and<br />
ADDRESSEE are understood in context although they are<br />
not realized syntactically.<br />
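The valence patterns discussed above can be made concrete with a small sketch. The encoding below is hypothetical (FrameNet stores this information in a different format); it simply restates rows (a)-(d) of Table 2 and the notion of null instantiation.

```python
# Rows (a)-(d) of Table 2: the configuration SPEAKER, MESSAGE,
# TRIGGER, ADDRESSEE for the English verb "answer".
# Each valence pattern maps an FE to its syntactic realization;
# "DNI" marks definite null instantiation (understood in context
# but not realized syntactically).
answer_patterns = [
    {"Speaker": "NP.Ext", "Message": "NP.Obj",       "Trigger": "DNI", "Addressee": "DNI"},
    {"Speaker": "NP.Ext", "Message": "PP_with.Comp", "Trigger": "DNI", "Addressee": "DNI"},
    {"Speaker": "NP.Ext", "Message": "QUO.Comp",     "Trigger": "DNI", "Addressee": "DNI"},
    {"Speaker": "NP.Ext", "Message": "Sfin.Comp",    "Trigger": "DNI", "Addressee": "DNI"},
]

def overt_fes(pattern):
    """Return the FEs that are realized syntactically,
    i.e. not null instantiated (DNI, INI, or CNI)."""
    return {fe for fe, real in pattern.items()
            if real not in ("DNI", "INI", "CNI")}

# In every row of Table 2, only SPEAKER and MESSAGE are overt.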
With the English-specific information about answer and<br />
the more general frame information in place we are now
in a position to search for the corresponding frame<br />
element configuration of its Spanish translation<br />
equivalent responder. Taking a look at the lexical entry<br />
of responder in Spanish FrameNet we see that the<br />
variation of syntactic realizations of FEs is similar to<br />
that of answer in Table 1.<br />
FE Name     Syntactic Realizations
SPEAKER     NP.Ext, NP.DObj, CNI, PP_por.Comp
MESSAGE     AVP.AObj, DNI, QUO.DObj, queSind.DObj, queSind.Ext
ADDRESSEE   NP.Ext, NP.IObj, PP_a.IObj, DNI, INI
DEPICTIVE   AJP.Comp
MANNER      AVP.AObj, PP_de.AObj
MEANS       VPndo.AObj
MEDIUM      PP_en.AObj
TRIGGER     PP_a.PObj, PP_de.PObj, DNI
Table 3: Partial realization table for the verb responder
(Boas 2005a)
Spanish FrameNet also offers a valence table that<br />
includes for responder a total of 23 different frame<br />
element configurations. Among these, we find a<br />
combination of FEs and their syntactic realization that is<br />
comparable in structure to that of its English counterpart<br />
in Table 2 above.<br />
Sp M Tr A<br />
a. NP.Ext QUO.DObj DNI DNI<br />
b. NP.Ext queSind.DObj DNI DNI
Table 4: Excerpt from the Valence Table for responder<br />
(Boas 2005a)<br />
Comparing Tables 2 and 4 we see that answer and<br />
responder exhibit comparable valence combinations<br />
with the FEs SPEAKER and MESSAGE realized syntactically<br />
while the FEs TRIGGER and ADDRESSEE are not realized<br />
syntactically, but are instead implicitly understood (they<br />
are definite null instantiations). With a Spanish<br />
counterpart in place it now becomes possible to link the<br />
Spanish set of frame element configurations in Table 4<br />
with its English counterpart in Table 2 via the<br />
Communication_Response frame as the following<br />
Figure illustrates.<br />
Figure 2: Linking partial English and Spanish lexicon<br />
fragments via semantic frames (Boas 2005a)<br />
Figure 2 shows how the lexicon fragments of answer
and responder are linked via the Communication_<br />
Response frame. The 'a' index points to the<br />
respective first lines in the valence tables of the two LUs<br />
(cf. Tables 2 and 4) and identifies the two syntactic<br />
frames as being translation equivalents of each other. At<br />
the top of Figure 2 we see the verb answer with one of<br />
its 22 frame element configurations, i.e. SPEAKER,<br />
TRIGGER, MESSAGE, and ADDRESSEE. Figure 2 shows for<br />
this configuration one possible set of syntactic<br />
realizations of these FEs, that given in row (a) in Table 2<br />
above. The 9a designation following answer indicates<br />
that this lexicon fragment is the ninth configuration of<br />
FEs out of a total of 22 frame element configurations<br />
listed in the complete realization table. Within the ninth
frame element configuration, 'a' indicates the first of
several possible syntactic realizations of these FEs
(there are a total of four, cf. Table 2 above).
As already pointed out, the FE SPEAKER is realized<br />
syntactically as an external NP, MESSAGE as an object NP,<br />
and both TRIGGER and ADDRESSEE are null instantiated.<br />
The bottom of Figure 2 shows responder with the 17th
of its frame element configurations (recall that there
are a total of 23). For this configuration, we
see one subset of syntactic realizations of these FEs,
namely the first row catalogued by Spanish FrameNet
for this configuration (see row (a) in Table 4).
The two parallel lexicon fragments at the top and the<br />
bottom of Figure 2 are linked by indexing their specific<br />
semantic and syntactic configurations as equivalents<br />
within the Communication_Response frame. This<br />
linking is indicated by the arrows pointing from the top<br />
and the bottom of the partial lexical entries to the midsection<br />
in Figure 2, which symbolizes the<br />
Communication_Response frame at the<br />
conceptual level, i.e. without any language-specific<br />
specifications. Note that this procedure does not<br />
automatically link the entire lexical entries of answer<br />
and responder to each other. Establishing such a<br />
correspondence link connects only the relevant frame<br />
element configurations and their syntactic realizations in<br />
Tables 2 and 4 via the common semantic frame, because<br />
they can be regarded as translation equivalents.<br />
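As a rough illustration of this linking step, the following sketch pairs two valence patterns when they share a frame element configuration under the same frame and null-instantiate the same FEs. The indices ('9a', '17a') and the data layout are assumptions made for this example, not the actual database format.

```python
# Each entry: (LU name, configuration/row index, FE -> realization).
english = ("answer", "9a",
           {"Speaker": "NP.Ext", "Message": "NP.Obj",
            "Trigger": "DNI", "Addressee": "DNI"})
spanish = ("responder", "17a",
           {"Speaker": "NP.Ext", "Message": "QUO.DObj",
            "Trigger": "DNI", "Addressee": "DNI"})

def link_via_frame(frame, entry_a, entry_b):
    """Link two valence patterns as translation equivalents if they
    share the same FE configuration and null-instantiate the same
    FEs. The syntactic realizations themselves may differ, since
    they are language-specific."""
    name_a, idx_a, pat_a = entry_a
    name_b, idx_b, pat_b = entry_b
    same_config = set(pat_a) == set(pat_b)
    same_nulls = ({fe for fe, r in pat_a.items() if r == "DNI"} ==
                  {fe for fe, r in pat_b.items() if r == "DNI"})
    if same_config and same_nulls:
        return {"frame": frame,
                "links": [(name_a, idx_a), (name_b, idx_b)]}
    return None

link = link_via_frame("Communication_Response", english, spanish)
```

Note that, exactly as in the text, only these two indexed rows are linked; the remaining rows of each lexical entry stay unconnected until they are compared individually.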
Although linking the two lexicon fragments this way<br />
results in a systematic way of creating parallel lexicon<br />
fragments based on semantic frames (which serve as<br />
interlingual representations), it is not yet possible to<br />
automatically create or connect such parallel lexicon<br />
fragments. This means that one must carefully compare<br />
each individual part of the valence table of a LU in the<br />
source language with each individual part of the valence<br />
table of a LU in the target language. This step is<br />
extremely time intensive because it involves a detailed<br />
comparison of bilingual dictionaries as well as<br />
electronic corpora to ensure matching translation<br />
equivalents. Recall that Figure 2 represents only a very<br />
small set of the full lexical entries of answer and<br />
responder. The procedure outlined above will have to be<br />
repeated for each of the 32 different valence patterns of<br />
answer – and its (possible) Spanish equivalents. The<br />
following section addresses a number of other issues<br />
that need to be considered carefully when creating<br />
parallel lexicon fragments based on semantic frames.<br />
4. Cross-linguistic problems<br />
Creating parallel lexicon entries for existing English<br />
FrameNet entries and linking them to their English<br />
counterparts raises a number of important issues, most<br />
of which require careful (manual) linguistic analysis.<br />
While some of these issues apply to the creation of<br />
parallel entries across the board, others differ depending<br />
on the individual languages or the semantic frame. The<br />
following subsections, based on Boas (to appear),<br />
briefly address some of the most important issues, which<br />
all have direct bearing on how the semantics of a frame<br />
are realized syntactically across different languages.<br />
4.1. Polysemy and profiling differences<br />
While translation equivalents evoking the same frame<br />
are typically taken to describe the same types of scenes,<br />
they sometimes differ in how they profile FEs. For<br />
example, Boas (2002) discusses differences in how<br />
announce and various German translation equivalents<br />
evoke the Communication_Statement frame.<br />
When announce occurs with the syntactic frame [NP.Ext<br />
_ NP.Obj] to realize the SPEAKER and MESSAGE FEs as in<br />
They announced the birth of their child, German offers a<br />
range of different translation equivalents, including<br />
bekanntgeben, bekanntmachen, ankündigen, or anzeigen.<br />
Each of these German LUs comes with its own<br />
specific syntactic frames that express the semantics of<br />
the Communication_ Statement frame. When<br />
announce is used to describe situations in which a<br />
message is communicated via a medium such as a<br />
loudspeaker (e.g. Joe announced the arrival of the pizza<br />
over the intercom), German offers ansagen and<br />
durchsagen as more specific translation equivalents of<br />
announce besides the more general ankündigen. Thus,<br />
by providing different LUs German offers the option of<br />
profiling particular FEs of the<br />
Communication_Statement frame, thereby<br />
allowing for the representation of subtle meaning<br />
differences of the frame and the perspective given of a<br />
situation (see Ohara, 2009 on similar profiling<br />
differences between English and Japanese LUs evoking<br />
the Risk frame).<br />
4.2. Differences in lexicalization patterns<br />
Languages differ in how they lexicalize particular types
of concepts (see Talmy, 1985), which may directly
influence how the semantics of a particular frame are
realized syntactically. For example, in a comparative
study of English, Spanish, Japanese, and German motion<br />
verbs in The Hound of the Baskervilles (and its<br />
translations), Ellsworth et al. (2006) find that there are a<br />
number of differences in how the various concepts of<br />
motion are associated with different types of semantic<br />
frames. More specifically, they show that English return<br />
(cf. The wagonette was paid off and ordered to return to<br />
Coombe Tracey forthwith, while we started to walk to<br />
Merripit House) and Spanish regresar both evoke the<br />
Return frame, whereas the corresponding German<br />
zurückschicken evokes the Sending frame. These<br />
differences demonstrate that although the concept of<br />
motion is incorporated into indirect causation, the<br />
frames expressing indirect causation may vary from<br />
language to language (see Burchardt et al., 2009 for a<br />
discussion of more fine-grained distinctions between<br />
verbs evoking the same frame in English and German).<br />
4.3. Polysemy and translation equivalents
Finding proper translation equivalents is typically a<br />
difficult task because one has to consider issues<br />
surrounding polysemy (Fillmore & Atkins, 2000; Boas,<br />
2002), zero translations (Salkie, 2002; Boas 2005a;<br />
Schmidt, 2009), and contextual and stylistic factors<br />
(Altenberg & Granger, 2002; Hasegawa et al., 2010),<br />
among others. To illustrate, consider Bertoldi's (2010)<br />
discussion of contrastive legal terminology in English<br />
and Brazilian Portuguese. Based on the English<br />
Criminal_Process frame (see FrameNet), Bertoldi
finds that while there are some straightforward<br />
translation equivalents of English LUs in Portuguese,<br />
others involve a detailed analysis of the relevant<br />
polysemy patterns.<br />
Consider Figure 3, which compares English and<br />
Portuguese LUs in the Notification_of_<br />
charges frame. The first problem discussed by
Bertoldi (2010) concerns the fact that although there are
corresponding Portuguese LUs such as denunciar, they
do not evoke the same semantic frame as the English
LUs, but rather a frame best characterized as the
Accusation frame. The second
problem is that six Portuguese translation equivalents of<br />
the English LUs evoking only the Notification_<br />
of_charges frame, i.e. acusar, acusação, denunciar,<br />
denuncia, pronunciar, and pronuncia, potentially evoke<br />
three different frames.<br />
Figure 3: English LUs from the Notification_of_charges
frame and their Portuguese translation
equivalents (Bertoldi, 2010: 6)<br />
Figure 4: LUs evoking multiple frames in the Portuguese<br />
Crime_scenario frame (Bertoldi, 2010:7)<br />
This leads Bertoldi to claim that the LUs acusar,<br />
acusação, denunciar, and denuncia may evoke two<br />
different Criminal_Process sub-frames, besides<br />
other general-language frames not specific to the legal
domain, as illustrated by Figure 4. Bertoldi's analysis shows that
finding translation equivalents is not always an easy task<br />
and that one needs to pay close attention to different<br />
polysemy networks across languages, which may<br />
sometimes be influenced by systematic differences such<br />
as differences between legal systems.<br />
4.4. Universal frames?
Claims about the universality of certain linguistic<br />
features are abundant in the literature. When it comes to<br />
semantic frames the question is whether frames derived<br />
on the basis of English are applicable to the description<br />
and analysis of other languages (and vice versa). While<br />
a number of studies on motion verbs (Fillmore & Atkins,<br />
2000; Boas, 2002; Burchardt et al., 2009; Ohara, 2009)<br />
and communication verbs (Boas, 2005a; Subirats, 2009),<br />
among other semantic domains, suggest that there are<br />
frames that can be re-used for the description and<br />
analysis of other languages, there also seem to be<br />
culture-specific frames that may not be re-usable<br />
without significant modification.<br />
One set of examples comes from the English<br />
Personal_Relationship frame, whose semantics<br />
appears to be quite culture-specific. Atzler (2011) shows
that concepts such as dating (to date) seem to be quite<br />
specific to Anglo culture and may not be directly<br />
applicable to the description of similar activities in<br />
German. Another, perhaps more extreme, example is the
term sugar daddy, which has no exact German counterpart
and instead requires a lengthy paraphrase to render the
concept of this particular type of relationship.
A second example comes from the intransitive Finnish
verb saunoa (literally 'to sauna'), which has no direct
English counterpart because it is very culture-specific, and
in effect evokes a particular type of frame. To this end,
Leino (2010:131) claims that this verb (and<br />
correspondingly the Finnish Sauna frame) “expresses a<br />
situation in which the referent of the subject goes to the<br />
sauna, is in the sauna, participates in the sauna event, or<br />
something of the like." Dealing with such culture-specific
frames thus requires quite lengthy paraphrases
to arrive at an approximation of the semantics of the<br />
frame in English.<br />
5. Conclusions and outlook<br />
This paper has outlined some of the basic steps<br />
underlying the creation of parallel lexicon fragments.<br />
Employing semantic frames for this purpose is still a<br />
work in progress, but the successful compilation of<br />
several FrameNets for languages other than English is a<br />
good indication that this methodology should be pursued<br />
further.<br />
Clearly, the problems outlined in the previous section<br />
need to be solved. The first problem, polysemy and<br />
profiling differences, is perhaps the most daunting one.<br />
Decades of linguistic research into these issues (see, e.g.<br />
Leacock & Ravin, 2000; Altenberg & Granger, 2002)<br />
seem to suggest that there is no easy solution that could<br />
be implemented to arrive at an automatic way of<br />
analyzing, comparing, and classifying different<br />
polysemy and lexicalization patterns across languages.<br />
This means that, for now, these issues need to be
addressed manually, in the form of careful linguistic
analysis.
The same can be said about the problems surrounding<br />
lexicalization patterns, zero translations, and the<br />
universality of frames. Without a detailed catalogue of<br />
linguistic analyses of these phenomena in different<br />
languages, and a comparison across language pairs, any<br />
efforts regarding the effective linking of parallel lexicon<br />
fragments, whether on the basis of semantic frames or<br />
not, will undoubtedly hit many roadblocks.<br />
6. Acknowledgements<br />
Work on this paper was supported by a fellowship for<br />
experienced researchers from the Alexander von<br />
Humboldt Foundation, as well as by Title VI grant<br />
#P229A100014 (Center for Open Educational Resources<br />
and Language Learning) to the University of Texas at<br />
Austin.<br />
7. References<br />
Altenberg, B., Granger, S. (2002): Recent trends in<br />
cross-linguistic studies. In B. Altenberg & S. Granger<br />
(Eds.), Lexis in Contrast. Amsterdam/Philadelphia:<br />
John Benjamins, pp. 3-50.<br />
Atkins, B.T.S., Fillmore, C.J., Johnson, C.R. (2003):<br />
Lexicographic relevance: Selecting information from
corpus evidence. International Journal of<br />
Lexicography, 16(3), pp. 251-280.<br />
Atzler, J. (<strong>2011</strong>): Twist in the line: Frame Semantics as a<br />
vocabulary teaching and learning tool. Doctoral<br />
Dissertation, The University of Texas at Austin.<br />
Baker, C.F., Fillmore, C.J., Lowe, J.B. (1998): The<br />
Berkeley FrameNet Project. In COLING-ACL '98:<br />
Proceedings of the Conference, pp. 86-90.<br />
Baker, C.F., Fillmore, C.J., Cronin, B. (2003): The<br />
Structure of the FrameNet Database. In International<br />
Journal of Lexicography, 16(3), pp. 281-296.
Bertoldi, A. (2010): When translation equivalents do not<br />
find meaning equivalence: a contrastive study of the<br />
frame Criminal_Process. Manuscript. UT<br />
Austin.<br />
Bertoldi, A., Chishman, R., Boas, H.C. (2010): Verbs of<br />
judgment in English and Portuguese: What<br />
contrastive analysis can say about Frame Semantics.<br />
Calidoscopio, 8 (3), pp. 210-225.<br />
Boas, H.C. (2001): Frame Semantics as a framework for<br />
describing polysemy and syntactic structures of<br />
English and German motion verbs in contrastive
computational lexicography. In Proceedings of<br />
Corpus Linguistics 2001, pp. 64-73.<br />
Boas, H.C. (2002): Bilingual FrameNet dictionaries for<br />
machine translation. In: Proceedings of the Third<br />
International Conference on Language Resources and<br />
Evaluation, Vol. IV, pp. 1364-1371.<br />
Boas, H.C. (2005a): Semantic frames as interlingual<br />
representations for multilingual lexical databases.<br />
International Journal of Lexicography, 18(4), pp. 445-<br />
478.<br />
Boas, H.C. (2005b): From theory to practice: Frame<br />
Semantics and the design of FrameNet. In S. Langer<br />
& D. Schnorbusch (Eds.), Semantik im Lexikon.<br />
Tübingen: Narr, pp. 129-160.<br />
Boas, H.C. (2009): Recent trends in multilingual<br />
lexicography. In H.C. Boas (Ed.), Multilingual<br />
FrameNets in Computational Lexicography: Methods<br />
and Applications. Berlin/New York: Mouton de<br />
Gruyter, pp. 1-36.<br />
Boas, H.C. (to appear): Frame Semantics and<br />
Translation. In I. Antunano and A. Lopez (Eds.),<br />
Translation in Cognitive Linguistics. Berlin/New<br />
York: Mouton de Gruyter.<br />
Burchardt, A., Erk, K., Frank, A., Kowalski, A., Pado,<br />
S., & Pinkal, M. (2009): Using FrameNet for the<br />
semantic analysis of German: annotation,<br />
representation, and automation. In H.C. Boas (Ed.),<br />
Multilingual FrameNets in Computational<br />
Lexicography: Methods and Applications. Berlin/New<br />
York: Mouton de Gruyter, pp. 209-244.<br />
Dux, R. (<strong>2011</strong>): A frame-semantic analysis of five<br />
English verbs evoking the Theft frame. M.A. Report.<br />
The University of Texas at Austin.<br />
Ellsworth, M, Ohara, K., Subirats, C., & Schmidt, T.<br />
(2006): Frame-semantic analysis of motion scenarios<br />
in English, German, Spanish, and Japanese.<br />
Presentation given at the 4th International Conference<br />
on Construction Grammar (ICCG-4), Tokyo.<br />
Fillmore, C.J. (1982): Frame Semantics. In Linguistic<br />
Society of Korea (Ed.), Linguistics in the Morning<br />
Calm. Seoul: Hanshin, pp. 111-138.<br />
Fillmore, C.J. (1985): Frames and the semantics of<br />
understanding. Quaderni di Semantica, 6, pp. 222-
254.<br />
Fillmore, C.J. (2006): Pragmatically controlled zero<br />
anaphora. BLS, 12, pp. 95-107.<br />
Fillmore, C.J. (2007): Valency issues in FrameNet. In: T.<br />
Herbst & K. Götz-Vetteler (Eds.), Valency:<br />
theoretical, descriptive, and cognitive issues.<br />
Berlin/New York: Mouton de Gruyter, pp. 129-160.<br />
Fillmore, C.J., Atkins, B.T.S. (2000): Describing<br />
polysemy: The case of 'crawl'. In Y. Ravin and C.
Leacock (Eds.), Polysemy. Oxford: Oxford University
Press, pp. 91-110.<br />
Fillmore, C.J., Baker, C.F. (2010): A frames approach to<br />
semantic analysis. In: B. Heine and H. Narrog (Eds.),<br />
The Oxford Handbook of Linguistic Analysis.<br />
Oxford: Oxford University Press, pp. 313-340.<br />
Fillmore, C.J., Petruck, M.R.L. (2003): FrameNet<br />
Glossary. International Journal of Lexicography,<br />
16(3), pp. 359-361.<br />
Fillmore, C.J., Johnson, C.R., Petruck, M.R.L. (2003a):
Background to FrameNet. International Journal of<br />
Lexicography, 16(3), pp. 235-251.<br />
Fillmore, C.J., Petruck, M.R.L., Ruppenhofer, J.,<br />
Wright, A. (2003b): FrameNet in action: The case of<br />
Attaching. International Journal of Lexicography,<br />
16(3), pp. 297-333.<br />
Fontenelle, T. (1997): Using a bilingual dictionary to
create semantic networks. International Journal of
Lexicography, 10(4), pp. 275-303.
Fung, P., Chen, B. (2004): BiFrameNet: Bilingual frame<br />
semantics resource construction by cross-lingual<br />
induction. In Proceedings of COLING 2004.<br />
Hasegawa, Y., Lee-Goldman, R., Ohara, K., Fujii, S.,<br />
Fillmore, C.J. (2010): On expressing measurement<br />
and comparison in English and Japanese. In H.C.<br />
Boas (Ed.), Contrastive Studies in Construction<br />
Grammar. Amsterdam/Philadelphia: John Benjamins,<br />
pp. 169-200.<br />
Heid, U. (1996): Creating a multilingual data collection
for bilingual lexicography from parallel monolingual
lexicons. In Proceedings of the VIIth EURALEX
International Congress, pp. 559-573.<br />
Leino, J. (2010): Results, cases, and constructions:<br />
Argument structure constructions in English and<br />
Finnish. In H.C. Boas (Ed.), Contrastive Studies in
Construction Grammar. Amsterdam/Philadelphia:<br />
John Benjamins, pp. 103-136.<br />
Lowe, J.B., Baker, C.F., Fillmore, C.J. (1997): A frame-<br />
semantic approach to semantic annotation. In<br />
Proceedings of the SIGLEX Workshop on Tagging<br />
Text with Lexical Semantics: Why, What, and How?<br />
Held April 4-5, in Washington, D.C., in conjunction<br />
with ANLP-97.<br />
Ohara, K. (2009): Frame-based contrastive lexical<br />
semantics in Japanese FrameNet: The case of risk and<br />
kakeru. In H.C. Boas (Ed.), Multilingual FrameNets<br />
in Computational Lexicography: Methods and<br />
Applications. Berlin/New York: Mouton de Gruyter,<br />
pp. 163-182.<br />
Petruck, M.R.L. (1996): Frame Semantics. In J.
Verschueren, J.-O. Östman, J. Blommaert, C. Bulcaen<br />
(Eds.), Handbook of Pragmatics.<br />
Amsterdam/Philadelphia: John Benjamins, pp. 1-13.<br />
Petruck, M.R.L. (2009): Typological considerations in<br />
constructing a Hebrew FrameNet. In H.C. Boas (Ed.),<br />
Multilingual FrameNets in Computational<br />
Lexicography: Methods and Applications. Berlin/New<br />
York: Mouton de Gruyter, pp. 183-208.<br />
Petruck, M.R.L, Boas, H.C. (2003): All in a day's week.<br />
In E. Hajicova, A. Kotesovcova, J. Mirovsky (Eds.),
Proceedings of CIL 17. CD-ROM. Prague:<br />
Matfyzpress.<br />
Petruck, M.R.L., Fillmore, C.J., Baker, C.F., Ellsworth,<br />
M., Ruppenhofer, J. (2004): Reframing FrameNet<br />
data. In Proceedings of the 11th EURALEX
International Conference, pp. 405-416.<br />
Pitel, G. (2009): Cross-lingual labeling of semantic<br />
predicates and roles: A low-resource method based on<br />
bilingual L(atent) S(emantic) A(nalysis). In H.C. Boas<br />
(Ed.), Multilingual FrameNets in Computational<br />
Lexicography: Methods and Applications. Berlin/New<br />
York: Mouton de Gruyter, pp. 245-286.<br />
Ruppenhofer, J., Ellsworth, M., Petruck, M.R.L.,<br />
Johnson, C., Scheffczyk, J. (2010): FrameNet II:<br />
Extended theory and practice. Available at http://<br />
framenet.icsi.berkeley.edu<br />
Salkie, R. (2002): Two types of translation equivalence.<br />
In B. Altenberg and S. Granger (Eds.), Lexis in<br />
Contrast. Amsterdam/Philadelphia: John Benjamins,<br />
pp. 51-72.<br />
Schmidt, T. (2009): The Kicktionary – A multilingual<br />
lexical resource of football language. In H.C. Boas<br />
(Ed.), Multilingual FrameNets in Computational<br />
Lexicography: Methods and Applications. Berlin/New<br />
York: Mouton de Gruyter, pp. 101-134.<br />
Subirats, C. (2009): Spanish FrameNet: A frame-<br />
semantic analysis of the Spanish lexicon. In H.C.<br />
Boas (Ed.), Multilingual FrameNets in Computational<br />
Lexicography: Methods and Applications. Berlin/New<br />
York: Mouton de Gruyter, pp. 135-162.<br />
Talmy, L. (1985): Lexicalization patterns: semantic<br />
structures in lexical forms. In T. Shopen (Ed.),<br />
Language Typology and Syntactic Description.<br />
Cambridge: Cambridge University Press, pp. 57-149.<br />
Ziem, A. (2008): Frames und sprachliches Wissen.<br />
Berlin/New York: Mouton de Gruyter.
The Multilingual Web: Opportunities, Borders and Visions<br />
Felix Sasaki<br />
DFKI, LT-Lab / Univ. of Applied Sciences Potsdam<br />
Alt-Moabit 91c, 10559 Berlin<br />
E-mail: felix.sasaki@dfki.de<br />
Abstract<br />
The Web is growing more and more in languages other than English, leading to the opportunity of a truly multilingual, global<br />
information space. However, the full potential of multilingual information creation and access across language borders has not yet<br />
been developed. We report about a workshop series and project called “MultilingualWeb” that aims at analyzing borders or “gaps”<br />
within Web technology standardization that hinder multilinguality on the Web. MultilingualWeb targets scientific, industrial and
user communities, who need to collaborate closely. We conclude with a first concrete outcome of MultilingualWeb: the upcoming
formation of a cross-community W3C standardization activity that will close some of the gaps that have already been recognized.
Keywords: Multilinguality, Web, standardization, language technology, metadata<br />
1. Introduction: Missing Links between<br />
Languages on the Web<br />
A recent blog post discussed “Languages of the World<br />
(Wide Web)"1. Via impressive visualizations, it showed
the amount of content per language and number of links<br />
between languages. Unsurprisingly, English is the dominant
language on the Web, and every other language has a<br />
certain number of links to English web pages.<br />
Nevertheless, the amount of content in many other<br />
languages is continuously and rapidly growing.<br />
Unfortunately, the links between these languages, other
than those to English, are rather few.
What does this mean? First, it demonstrates that English<br />
is a lingua franca on the Web. Users who are unable or
unwilling to use this lingua franca cannot communicate
with others and are not part of the global information<br />
society; they are residents of local silos on the Web.<br />
Second, the desire to communicate in one’s own<br />
language is high and is growing.<br />
Several issues need to be resolved to tear down the walls<br />
between language communities on the web. One key<br />
issue is the availability of standardized technologies to<br />
create content in your own language, and to access<br />
1 See
http://googleresearch.blogspot.com/2011/07/languages-of-world-wide-web.html
content across languages. The need to resolve this issue<br />
led to the creation of the “MultilingualWeb” project.<br />
2. MultilingualWeb: Overview<br />
MultilingualWeb (http://www.multilingualweb.eu/) is an
EU-funded thematic network project exploring standards<br />
and best practices that support the creation, localization<br />
and use of multilingual web-based information. It is led
by the World Wide Web Consortium (W3C), the major
stakeholder for creating the technological building blocks
of the web. MultilingualWeb encompasses 22
partners (http://www.multilingualweb.eu/partners), from
both research and various industries related to content
creation, localization, software provision, etc. The
project's main part is a series of four public workshops,
which discuss what standards and best practices currently exist,
and what gaps need to be filled. The project started in<br />
April 2010; as of writing, two workshops have been held.<br />
They have been enormously successful in terms of the<br />
number of participants, awareness especially in social media,<br />
and the outcome of the discussions. In the remainder of this<br />
abstract, we will discuss current findings 2<br />
of the project<br />
and take a look at what the two upcoming workshops<br />
and future projects might bring.<br />
2<br />
More details on the findings can be found in workshop reports<br />
on the project website http://www.multilingualweb.eu .<br />
Multilingual Resources and Multilingual Applications - Invited Talks<br />
3. About Terminology and Communities<br />
One gap concerns the communities, industries and<br />
technology stacks that need to be aware of standards<br />
related to the multilingual Web. Internationalization<br />
deals with the prerequisites to create content in many<br />
languages. This involves technologies and standards<br />
related to character encoding, language identification,<br />
font selection etc. The proper internationalization of (web)<br />
technologies is required for localization: the adaptation<br />
to local markets and cultures. Localization often involves<br />
translation. With more and more content that needs to be<br />
translated and a growing number of target languages, the<br />
use of language technologies (e.g. machine translation)<br />
comes into play.<br />
A huge success of the MultilingualWeb project is that<br />
major stakeholders from the areas of internationalization,<br />
localization and language technologies have been<br />
brought together. This is important since, in both<br />
research and industry projects, these communities have so<br />
far not overlapped. The same is true for conference series:<br />
see e.g. the (non-)overlap of attendees at Localization World,<br />
LREC or the Unicode conferences.<br />
4. Workshop Sessions and Topics<br />
MultilingualWeb provides a common umbrella for these<br />
communities via a set of labels used for the workshop<br />
sessions:<br />
• Developers provide the basic technological building<br />
blocks (e.g. browsers) for multilingual web content<br />
creation.<br />
• Creators use the technologies to create multilingual<br />
content.<br />
• Localizers adapt the content to regions and cultures.<br />
• Machines are used to support multilingual content<br />
creation and access, e.g. via machine translation.<br />
• Users increasingly do not only consume content<br />
but at the same time become contributors - see<br />
e.g. the growing number of users in social networks.<br />
• Policy makers decide about strategies for fostering<br />
multilinguality on the Web. They play an important<br />
role in big multinational companies, regional or<br />
international governmental or non-governmental<br />
bodies, standardization bodies etc.<br />
Of course, the above labels serve only as a rough<br />
orientation, but especially for the detection of gaps (see<br />
below) they have proven quite useful. The following<br />
subsections provide a brief summary of some outcomes,<br />
ordered by these labels and based on the workshop<br />
reports. For further details, the reader is referred to<br />
these reports.<br />
4.1. Developers<br />
Developers are providing the technological building<br />
blocks that are needed for multilingual content creation<br />
and content access on the web. Many of these building<br />
blocks are still under development, and web browsers<br />
play a crucial role. During the workshops, many<br />
presentations dealt with enhanced character and font<br />
support, locale data formats, internationalized<br />
domain names and typographic support.<br />
Gaps in this area are also related to handling of<br />
translations: although more and more web content is<br />
being translated, the key web technology HTML so far<br />
has no means to support this process. Here it is important<br />
that the need for such means is brought to the<br />
standards development organizations, namely the W3C, and<br />
especially to the browser implementers.<br />
Another gap concerns which technology stacks are being<br />
developed and how content providers are actually<br />
adopting them. HTML5 plays a crucial role in the future<br />
of web technology development, but for many content<br />
providers its relation to other parts of the technology<br />
ecosystem is not yet clear.<br />
4.2. Creators<br />
Creators increasingly need to bring content to many,<br />
especially mobile, devices. Since these devices lack computing<br />
power, many aspects of multilinguality (e.g. the usage of<br />
large fonts) need to be handled in a specific manner.<br />
“Content” does not only mean text. It also encompasses<br />
information for multimodal and voice applications, or<br />
SMS, especially in developing countries. Navigation of<br />
content, especially across languages, is another area without<br />
standardized approaches or best practices.<br />
As in the developer area, translation is important for<br />
content creation too. There is no standardized way to<br />
identify non-translatable content, the tools used<br />
for translation, translation quality etc.<br />
4.3. Localizers<br />
Localizers deal with the process of localization, which<br />
involves many aspects: content creation, the distribution
of content to language service providers, further<br />
distribution to translators, etc. To organize this process<br />
there is a need to improve standards and better integrate<br />
them. Metadata plays a crucial role in this respect, as we<br />
will discuss later.<br />
Content itself is becoming more complex and<br />
fast-changing, and localization approaches need to be<br />
adapted accordingly. In the area of localization, many<br />
standards have been developed: for the representation of<br />
content in the translation process, for terminology<br />
management, translation memories etc. The gap here is to<br />
understand how the standards interplay. This is not an<br />
easy task, since sometimes there are competing<br />
technologies available. Hence, currently there are quite a<br />
few initiatives dedicated to interoperability in the<br />
localization area, including the integration with web<br />
content creation and machine translation.<br />
4.4. Machines<br />
For machines, that is, applications based on language<br />
technology, standardization, especially related to<br />
metadata and the localization process, is of utmost<br />
importance. Language resources are crucial in this area,<br />
including their standardized representation and means to<br />
share resources. The META-SHARE infrastructure<br />
currently being developed is expected to play an<br />
important role in this area.<br />
Machine translation has already been mentioned in the<br />
discussion of developers, creators and localizers. It has<br />
become clear that a close integration of machine<br />
translation technologies into these areas is a major<br />
requirement for better translation quality.<br />
Machines play a crucial role in building bridges between<br />
smaller and larger languages, and to change the picture<br />
about “languages on the web” that we mentioned at the<br />
beginning of this paper.<br />
4.5. Users<br />
Users normally have no strong voice in the development<br />
of multilingual or other technologies. At the<br />
MultilingualWeb workshops, it became clear that the<br />
worldwide interest in multilingual content is high, but<br />
significant organizational and technical challenges need<br />
to be addressed to reach people on continents such<br />
as Africa and Asia.<br />
Multilingual social media are becoming more important<br />
and can be supported by language technology<br />
applications like on-the-fly machine translation.<br />
However, it is important to draw a clear border between<br />
controlled and uncontrolled environments of content<br />
creation. Only in this way can the right tools be chosen to<br />
achieve high-quality translation of small amounts of text,<br />
versus gist translation for larger text bodies.<br />
4.6. Policy Makers<br />
The topic of policy makers was not discussed as a<br />
separate session in the first workshop, but only in the<br />
second one. Nevertheless, it is of high importance: many gaps<br />
related to the multilingual web are not technical ones, but<br />
are related to, e.g., political decisions about the adoption of<br />
standards. Especially in the localization and language<br />
technology area, proprietary solutions prevailed for a<br />
long time. We are now on the verge of a radical change, and<br />
MultilingualWeb will play a crucial role in bringing the<br />
right people together.<br />
Some technological components also have political aspects.<br />
The META-SHARE infrastructure mentioned before is a<br />
good example. A key aspect of this infrastructure is the<br />
licensing model it will provide, since not everybody will<br />
be willing to share language resources for free.<br />
5. Metadata for Language Related<br />
Technology in the Web<br />
5.1. Introduction<br />
After this broad overview of the various gaps that have been<br />
detected, we will now dive deeper into gaps related to<br />
metadata. All the communities mentioned above have<br />
been using such metadata for a while:<br />
• in internationalization, metadata is used to identify<br />
character encoding or language;<br />
• in localization, metadata helps to organize the<br />
localization workflow, e.g. to identify parts of<br />
content that need to be translated;<br />
• in language technology, metadata serves as a heuristic<br />
to complement language technology applications.<br />
Such heuristics can be useful, for example, for the<br />
automatic detection of the language of content. The<br />
heuristic here can be, e.g., the language<br />
identifier given in a web page. However, to be able to<br />
judge its reliability, it is important that many stakeholders<br />
work together and that there are stable bridges between<br />
internationalization, localization and language<br />
technology. As one concrete outcome of the<br />
MultilingualWeb project, a project has been prepared that<br />
will work on creating these bridges. The basic project<br />
idea is summarized below.<br />
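As a minimal sketch of the heuristic just described, the following Python fragment extracts the language identifier declared in a web page and weighs it against a content-based guess. The function names and the simple agreement rule are illustrative assumptions, not part of any project deliverable:

```python
import re

def declared_language(html):
    """Extract the language identifier declared in the page markup
    (the lang attribute of the root element). A regular expression
    stands in for a real HTML parser in this sketch."""
    m = re.search(r'<html[^>]*\blang="([A-Za-z-]+)"', html)
    return m.group(1).lower() if m else None

def judge_language(declared, detected):
    """Combine the declared identifier with a content-based guess.
    `detected` would come from a statistical language detector;
    agreement lets us trust the richer declared tag (e.g. "de-de"),
    disagreement makes us fall back on the content-based result."""
    if declared is None:
        return detected
    if declared.split("-")[0] == detected:
        return declared
    return detected

page = '<html lang="de-DE"><body>Willkommen auf unserer Seite.</body></html>'
print(declared_language(page))                        # de-de
print(judge_language(declared_language(page), "de"))  # de-de
```

In practice the declared identifier is only a heuristic of uncertain reliability, which is exactly why the cross-community bridges described above are needed.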
5.2. Three gaps related to Metadata<br />
Language technology applications (machine translation,<br />
automatic summarization, cross-language information<br />
retrieval, automatic quality assurance etc.) and resources<br />
(grammars, translation memories, corpora, lexica etc.)<br />
are increasingly becoming available on the web: integrated<br />
into HTML and web-based content, and accessible<br />
via web applications and web service APIs.<br />
This approach has partially been successful in fostering<br />
interoperability between language technology resources<br />
and applications. However, it lacks the integration with<br />
the “Open Web Platform”, i.e.: with the growing set of<br />
technologies used for creating and consuming the Web in<br />
many applications, on many devices, for many (and more<br />
and more) users.<br />
From the view of this current platform, language<br />
technology is a black box: Services like online machine<br />
translation receive textual input, and produce some<br />
output. The end users have no means to adjust language<br />
technology to their needs, and they are not able to<br />
influence language technology based processes in detail.<br />
On the other hand, providers of language technology face<br />
difficulties in adapting to specific demands by users in a<br />
timely and cost-effective manner, a problem also<br />
experienced by Language Service Providers as they<br />
increasingly adopt language technologies.<br />
To address the “black box” problem, three gaps that have<br />
been detected during the MultilingualWeb workshops<br />
need to be filled. They play a role in the chain of<br />
multilingual content processing and consumption on the<br />
Web:<br />
• An online machine translation service might make<br />
mistakes such as translating fixed terminology or<br />
named entities. This demonstrates gap no. 1:<br />
language technology does not know about metadata<br />
in the source content, e.g. “Which parts of the input<br />
should be translated?”<br />
• In the database from which the translated text has<br />
been generated, the information about translatability<br />
might have been available. However, the machine<br />
translation service does not know about that kind of<br />
“hidden Web” information. This reveals gap no. 2:<br />
there is no available description of the processes<br />
that were the basis for generating “surface Web”<br />
pages.<br />
• Gap no. 3 concerns a standardized approach<br />
to identification. This means, first, that the information<br />
needed to fill gaps 1 and 2 is so far not<br />
described in a standardized manner. For example,<br />
there is no commonly identified translate flag<br />
available in core web technologies like HTML.<br />
Second, it means that so far the resources used by<br />
language technology applications (e.g. “what<br />
lexicon is used for machine translation?”) and the<br />
applications themselves (e.g. “general-purpose<br />
machine translation versus an application tailored<br />
towards a specific domain”) cannot be identified<br />
uniquely. This hinders the ad hoc creation of<br />
language technology applications on the Web, i.e.<br />
the re-combination of resources and application<br />
modules.<br />
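To make gap no. 1 concrete: one common pre-processing pattern is to shield non-translatable spans behind placeholders before calling an MT service and to restore them afterwards. Since, as noted above, core web technologies offer no commonly identified translate flag, the `translate="no"` attribute below is a hypothetical convention used purely for illustration, not the mechanism specified by any project:

```python
import re

# Hypothetical markup convention: spans carrying translate="no" must
# not be altered by machine translation. The shielding pattern below is
# a generic sketch, not a standardized design.
PROTECTED = re.compile(r'<span translate="no">.*?</span>')

def shield(text):
    """Replace protected spans with numbered placeholders and return
    the shielded text together with the original spans."""
    spans = []
    def repl(match):
        spans.append(match.group(0))
        return "__PROT%d__" % (len(spans) - 1)
    return PROTECTED.sub(repl, text), spans

def unshield(text, spans):
    """Re-insert the protected spans after machine translation."""
    for i, span in enumerate(spans):
        text = text.replace("__PROT%d__" % i, span)
    return text

src = 'Download <span translate="no">FooTrans 2.0</span> today.'
shielded, spans = shield(src)
# A real MT call would go here; simple replacements simulate it.
translated = shielded.replace("Download", "Laden Sie").replace("today.", "herunter.")
print(unshield(translated, spans))
```

The product name survives the (simulated) translation untouched, which is precisely the behaviour that standardized metadata would enable for real MT services.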
5.3. Addressing the Gaps: MultilingualWeb-LT<br />
To close the gaps mentioned above, a project called<br />
MultilingualWeb-LT has been formed, which is planned to<br />
start in early 2012. The consortium of<br />
MultilingualWeb-LT consists of 14 partners from the<br />
areas of CMS systems, localization service providers,<br />
the language technology industry, research etc. As the<br />
forum for its work, the project will start a working<br />
group within the W3C.<br />
The goal of MultilingualWeb-LT is to define a standard<br />
that fills the gaps, including three mostly open source<br />
reference implementations around three topic areas, in<br />
which metadata is being used:<br />
• Integration of CMS and Localization Chain.<br />
Modules for the Drupal CMS system will be built<br />
that support the creation of the metadata. The<br />
metadata will then be taken up in web-based tools<br />
that support the localization chain: from the process<br />
of gathering localizable content, via the distribution to<br />
translators, to the re-aggregation of the results into<br />
localized output.<br />
• Online MT Systems. MT systems will be made<br />
aware of the metadata, which will lead to more<br />
satisfactory translation results. An online MT system<br />
will be made sensitive to the outputs of the modified<br />
CMS described above.<br />
• MT Training. Metadata-aware tools for training MT<br />
systems will be built. Again, these are closely related<br />
to CMS that produce the necessary metadata. They<br />
will lead to better quality of MT training corpora<br />
harvested from the Web.<br />
The above description shows that CMS systems play a<br />
crucial role in MultilingualWeb-LT. The usage of<br />
language identifiers for deciding about the language of<br />
content (see sec. 4) can be enhanced e.g. by the MT<br />
training module mentioned above. However, since<br />
MultilingualWeb-LT will be a regular W3C working<br />
group, other W3C member organizations might join that<br />
group. This is highly desirable: the hope is not only that<br />
further implementations will be built, but also that consensus<br />
about and usage of the metadata will spread to the web<br />
community at large.<br />
5.4. MultilingualWeb-LT: Already a Success Story<br />
Although MultilingualWeb-LT has not started yet, it is<br />
already a success story: It is a direct outcome of the<br />
MultilingualWeb project and of two other projects that<br />
play an important role - among others - for community<br />
building in the area of language technology research and<br />
industry.<br />
• FLaReNet (Fostering Language Resources Network)<br />
has developed a common vision for the area of<br />
language resources. The FLaReNet “Blueprint of<br />
Actions and Infrastructures” is a set of<br />
recommendations to support this vision in terms of<br />
(technical) infrastructure, R&D, and politics. As part<br />
of these recommendations, the task of “putting<br />
standards in action” has been described as highly<br />
important; MultilingualWeb-LT is a direct<br />
implementation of this task.<br />
• META-NET is dedicated to fostering the<br />
technological foundations of a multilingual<br />
European information society, by building a shared<br />
vision and strategic research agenda, an open<br />
distributed facility for the sharing and exchange of<br />
resources (META-SHARE), and by building bridges<br />
to relevant neighbouring technology fields.<br />
MultilingualWeb-LT is a bridge to support the<br />
exchange between the language technology<br />
community and the web community at large.<br />
These projects and the formation of<br />
MultilingualWeb-LT itself demonstrate that a holistic<br />
view prevails, in which the differences between<br />
internationalization, localization and language<br />
technology mentioned before become less important<br />
in pursuit of the common aim of a truly multilingual web.<br />
6. Upcoming Workshops and the Future<br />
At the time of writing, two further workshops are planned for<br />
the MultilingualWeb project. A workshop in September 2011<br />
will take place in Ireland. Naturally, it will have a focus on<br />
localization, since many software-related companies in<br />
Ireland work on this topic.<br />
The last workshop will take place in Luxembourg in<br />
March 2012 and will wrap up the MultilingualWeb<br />
project. However, the holistic view of a multilingual web,<br />
including the communities of internationalization,<br />
localization, language technology and the web<br />
community itself, will be put forward using the<br />
MultilingualWeb brand. The MultilingualWeb-LT project<br />
is one means to carry on that brand. It is the hope of the<br />
author that other activities will follow and that<br />
cross-community collaboration will become<br />
commonplace. Only in this way will we be able to tear down<br />
language barriers on the web and achieve a truly global<br />
information society.<br />
7. Acknowledgements<br />
This extended abstract has been supported by the<br />
European Commission as part of the Competitiveness<br />
and Innovation Framework Programme and through ICT<br />
PSP Grants: Agreement No. 250500 (MultilingualWeb<br />
contract) and 249119 (META-NET T4ME contract).<br />
Combining various text analysis tools for multilingual media monitoring<br />
Ralf Steinberger<br />
European Commission – Joint Research Centre (JRC)<br />
21027 Ispra (VA), Italy<br />
E-mail: Ralf.Steinberger@jrc.ec.europa.eu, URL: http://langtech.jrc.ec.europa.eu/<br />
Abstract<br />
There is ample evidence that information contained in media reports is complementary across countries and languages. This holds<br />
both for facts and for opinions. Monitoring multilingual and multinational media therefore gives a more complete picture of the<br />
world than monitoring the media of only one language, even if it is a world language like English. Wide coverage and highly<br />
multilingual text processing is thus important. The JRC-developed Europe Media Monitor (EMM) family of applications gathers<br />
about 100,000 media reports per day in 50 languages from the internet, groups related articles, classifies them, detects and follows<br />
trends, produces statistics and issues automatic alerts. For a subset of 20 languages, it also extracts and disambiguates entities<br />
(persons, organisations and locations) and reported speech, links related news over time and across languages, gathers historical<br />
information about entities and produces various types of social networks. More recent R&D efforts focus on event scenario template<br />
filling, opinion mining, multi-document summarisation, and machine translation. This extended abstract gives an overview of EMM<br />
from a functionality point of view rather than providing technical detail.<br />
Keywords: news analysis; multilingual; automatic alerting; text mining; information extraction.<br />
1. EMM: Background and Objectives<br />
The JRC, with its 2,700 employees working at five<br />
different European locations in a wide variety of<br />
scientific-technical fields, is a Directorate General of the<br />
European Commission (EC). It is thus a governmental<br />
body free of national interests and without commercial<br />
objectives. Its main mandate is to provide scientific<br />
advice and technical know-how to European Union (EU)<br />
institutions and its international partners, as well as to EU<br />
member state organisations, with the purpose of<br />
supporting a wide range of EU policies. Lowering the<br />
language barrier in order to increase European integration<br />
and competitiveness is a declared EU objective.<br />
The JRC-developed Europe Media Monitor (EMM) is a<br />
publicly accessible family of four news gathering and<br />
analysis applications consisting of NewsBrief, the<br />
Medical Information System MedISys, NewsExplorer and<br />
EMM-Labs. They are accessible via the single<br />
URL http://emm.newsbrief.eu/overview.html. The first<br />
EMM website went online in 2002 and it has since been<br />
extended and improved continuously. The initial<br />
objective was to complement the manual news clipping<br />
services of the EC, by searching for news reports online,<br />
categorising them according to user needs, and providing<br />
an interface for human moderation (selection and<br />
re-organisation of articles; creation of layout to print<br />
in-house newspapers). EMM users thus typically have a<br />
specific information need and want to be informed about<br />
any media reports concerning their subject of interest.<br />
Monitoring the media for events that are dangerous to<br />
public health (PH) is a typical example. EMM thus<br />
continuously gathers news from the web, automatically<br />
selects PH-related news items (e.g. on chemical,<br />
biological, radiological and nuclear (CBRN) threats<br />
including disease outbreaks, natural disasters and more),<br />
presents the information on targeted web pages, detects<br />
unexpected information spikes and alerts users about<br />
them. In addition to PH, EMM categories cover a very<br />
wide range of further subject areas, including the<br />
environment, politics, finance, security, various scientific<br />
and policy areas, general information on all countries of<br />
the globe, etc. For an overview of EMM, see Steinberger<br />
et al. (2009).<br />
Figure 1. Various aggregated statistics and graphs showing category-based information for one category<br />
(ECOLOGY) derived from reports in multiple languages.<br />
2. Information complementarity across<br />
languages and countries; news bias<br />
While national EMM clients are mostly interested in the<br />
news of their own country and that of surrounding<br />
countries (e.g. for disease outbreak monitoring), they also<br />
need to follow mass gatherings (e.g. for religious,<br />
sport-related or political reasons) because participants<br />
may bring back diseases. In addition to the news in the 23<br />
official EU languages, EMM thus also monitors news in<br />
Arabic, Chinese, Croatian, Farsi, Swahili, etc., to<br />
mention just a few of the altogether 50 languages. While<br />
major events such as wars or natural disasters are usually<br />
well-covered in world languages such as English, French<br />
and Spanish, many small events are typically only<br />
mentioned in the national or even in regional press. For<br />
instance, disease outbreaks, small-scale violent events<br />
and accidents, fraud cases, etc. are usually not reported<br />
outside the national borders. The study by Piskorski et al.<br />
(2011), comparing media reports in six languages, showed<br />
that only 51 out of 523 events (of the event types violence,<br />
natural disasters and man-made disasters) were reported<br />
in more than one language. 350 out of the 523 events<br />
were found in non-English news.<br />
Due to this information complementarity across<br />
languages and countries, it is crucial that monitoring<br />
systems like EMM process texts in many different<br />
languages. Using Machine Translation (MT) into one<br />
language (usually English) and filtering the news in that<br />
language is only a partial solution because specialist<br />
terms and names are often badly translated. The benefits<br />
of processing texts in the original language were also<br />
formulated by Larkey et al. (2004) in their native<br />
language hypothesis.<br />
We observed the following benefits of applying<br />
multilingual text mining tools:<br />
1) Different languages cover different geographical<br />
areas of the world, for specific subject areas as well<br />
as generally. EMM-NewsBrief's news clouds<br />
(see http://emm.newsbrief.eu/geo?type=cluster&format=html&language=all)<br />
show this clearly.<br />
2) More information on entities (persons and organisations;<br />
see NewsExplorer entity pages) can be<br />
extracted from multilingual text. This is due to<br />
different contents found, but also to varying<br />
linguistic coverage of the text mining software.<br />
3) Many more named entity variant spellings (including<br />
across scripts) are found when analysing different<br />
languages (see NewsExplorer entity pages). These<br />
variant spellings can then be used for improved<br />
retrieval, for generating multilingual social networks,<br />
and more.<br />
4) News bias – regarding the choice of facts as well as<br />
the expression of opinions – will be reduced by<br />
looking at the media coming from different countries.<br />
News bias becomes visually evident when looking at<br />
automatically generated social networks (see, e.g.<br />
Pouliquen et al., 2007, and Tanev, 2007). For instance,<br />
mentions of national politicians are usually preferred<br />
in national news, resulting in an inflated view of the<br />
importance of one’s own political leaders.<br />
From the point of view of an organisation with a close<br />
relationship to many international users, there is thus no<br />
doubt that highly multilingual text mining applications<br />
are necessary and useful.<br />
3. Ways to harness the benefits<br />
of multilinguality<br />
Extracting information from multilingual media reports<br />
and merging the information into a single view is<br />
possible, but developing text mining tools for each of the<br />
languages costs effort and is time-consuming. However,<br />
there are various ways to limit the effort per language (for<br />
an overview of documented methods, see Steinberger,<br />
forthcoming). Some monitoring and automatic alerting<br />
functionality can even be achieved with relatively simple<br />
means. This section summarises the main multilingual<br />
media monitoring functionality provided by the EMM<br />
family of applications.<br />
3.1. Multilingual category alerting<br />
EMM categorises all incoming news items into over<br />
1,000 categories, using mostly Boolean combinations or<br />
weighted lists of search words and regular expressions.<br />
Figure 2. Visual alerting of country-category<br />
combinations for all Public-Health-related categories.<br />
The alert level decreases from left to right.<br />
Figure 3. English news cluster with 26 articles and<br />
automatically generated links pointing to equivalent<br />
news in the other 19 NewsExplorer languages.<br />
As the categories are the same across all languages,<br />
simple statistics can show differences of reporting across<br />
languages and countries and highlight any bias (see<br />
Figure 1 for some examples). Even automatic alerting<br />
about imported events reported in any of the languages is<br />
possible: EMM keeps two-week averages for the number<br />
of articles falling into any country-category combination<br />
(e.g. POLAND-TUBERCULOSIS) so that a higher influx of<br />
articles in only one of these combinations can trigger an<br />
alert even if the overall number of articles about this<br />
category has hardly changed. That way, users are visually<br />
alerted to a sudden increase of articles in that<br />
combination, even for languages they cannot read (see<br />
Figure 2). Once aware, they can translate the articles or<br />
search for the cause of the news spike via their<br />
professional contacts. This functionality is much used<br />
and appreciated by centres for disease prevention and<br />
control around the world.<br />
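The mechanics just described can be sketched in a few lines of Python. The keyword weights, the score threshold and the spike rule below are illustrative assumptions; EMM's actual category definitions and alerting statistics are more elaborate:

```python
from statistics import mean

# Toy category definition: a weighted list of search words, as in the
# description above (real EMM categories also use Boolean combinations
# and regular expressions).
CATEGORIES = {
    "TUBERCULOSIS": {"tuberculosis": 2.0, "tb": 1.0, "outbreak": 0.5},
}

def categorise(text, threshold=2.0):
    """Assign every category whose weighted word score reaches the threshold."""
    words = text.lower().split()
    return [cat for cat, weights in CATEGORIES.items()
            if sum(weights.get(w, 0.0) for w in words) >= threshold]

def spike_alert(two_weeks_of_counts, today, factor=3.0, floor=5):
    """Trigger an alert for one country-category combination when today's
    article count clearly exceeds the running two-week average."""
    baseline = mean(two_weeks_of_counts[-14:])
    return today >= floor and today > factor * baseline

history = [1, 0, 2, 1, 1, 0, 1, 2, 1, 0, 1, 1, 0, 1]
print(categorise("new tuberculosis outbreak reported"))  # ['TUBERCULOSIS']
print(spike_alert(history, today=9))                     # True
```

Because the category labels are language-independent, the same spike test can run over counts aggregated from articles in any of the monitored languages.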
3.2. Linking of related news across languages<br />
Every 24 hours, EMM-NewsExplorer clusters the related<br />
news of the day, separately for each of the 20 languages it<br />
covers, and then links the news clusters to the equivalent<br />
clusters in the other languages (see Figure 3). Following<br />
the links allows users – for any news cluster of choice –<br />
to investigate how, and how intensely, the same event is<br />
reported in the different languages. For each news cluster,<br />
the number of articles – and meta-information such as<br />
entity names found (and more) – are displayed. Links to<br />
Google Translate allow the users to get a rough<br />
translation so that they can judge the relevancy of the<br />
articles and get an idea of what actually happened.<br />
The software additionally tracks related news over time,<br />
produces time lines and displays extracted<br />
meta-information about the news event. For details about the linking<br />
of related news items across languages and over time, see<br />
Pouliquen et al. (2008).<br />
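One plausible simplification of such cross-language linking, useful for seeing the principle, is to compare clusters via the language-independent entity identifiers they share; the production system combines several kinds of meta-information (see Pouliquen et al., 2008). The identifiers and the similarity threshold below are invented for illustration:

```python
def jaccard(a, b):
    """Set overlap between two collections of entity identifiers."""
    return len(a & b) / len(a | b) if a | b else 0.0

def link_cluster(entities, candidates, min_sim=0.3):
    """Link a news cluster (given by its entity identifiers) to the
    best-matching cluster in another language, if any is similar enough."""
    best_id, best_sim = None, min_sim
    for cand_id, cand_entities in candidates.items():
        sim = jaccard(entities, cand_entities)
        if sim > best_sim:
            best_id, best_sim = cand_id, sim
    return best_id

# Language-independent identifiers make the comparison possible even
# though the underlying name spellings differ across languages.
en_cluster = {"id:merkel", "id:sarkozy", "id:brussels"}
de_clusters = {
    "de-17": {"id:merkel", "id:sarkozy", "id:brussels", "id:eu"},
    "de-42": {"id:obama", "id:washington"},
}
print(link_cluster(en_cluster, de_clusters))   # de-17
```

Returning `None` when no candidate passes the threshold matters in practice: many events are covered in only one language, so most clusters have no cross-language counterpart at all.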
3.3. Multilingual information gathering on<br />
named entities<br />
EMM-NewsExplorer identifies references to person and<br />
organisation names in twenty languages. It automatically<br />
identifies whether newly found names (within the same<br />
script or across different scripts) are simply spelling<br />
variants of another name or whether they are new names<br />
(for details, see Pouliquen & Steinberger, 2009). The<br />
EMM database currently contains up to 400 different<br />
automatically collected spellings for the same entity. Any<br />
EMM application making use of named entity<br />
information uses unique entity identifiers instead of<br />
concrete name spellings, making it possible to merge information
across documents, languages and scripts. The EMM<br />
software furthermore keeps track of titles and other<br />
expressions found next to the name, keeps statistics on<br />
where and when the names were found, and which<br />
entities get frequently mentioned together.

Figure 4. Information automatically gathered over time by EMM-NewsExplorer from media reports in twenty or more languages on one named entity.

The latter information is used to generate social networks that are
derived from the international media, thus being<br />
independent of national viewpoints. EMM software also<br />
detects quotations by and about each entity. The<br />
accumulated multilingual results are displayed on the<br />
NewsExplorer entity pages (see Figure 4), through which<br />
users can explore entities, their relations and related news.<br />
Clicking on an entity name in any of the EMM applications leads to the corresponding entity page.
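The variant-versus-new-name decision can be illustrated with a toy sketch. The real system (Pouliquen & Steinberger, 2009) uses much richer normalisation and cross-script transliteration rules; the normalisation pairs and similarity threshold below are invented for illustration only:

```python
from difflib import SequenceMatcher

def normalize(name):
    # Crude transliteration-style normalization; the real system applies
    # much richer rules, including cross-script transliteration.
    name = name.lower().replace("-", " ")
    for src, tgt in [("ou", "u"), ("ck", "k"), ("ph", "f")]:
        name = name.replace(src, tgt)
    return name

def assign_entity_id(name, known, next_id, threshold=0.85):
    """Return the ID of a known entity if `name` looks like a spelling
    variant of one of its recorded spellings; otherwise register a new
    entity. known: dict entity_id -> set of spellings."""
    norm = normalize(name)
    for eid, spellings in known.items():
        if any(SequenceMatcher(None, norm, normalize(s)).ratio() >= threshold
               for s in spellings):
            spellings.add(name)          # remember the new variant
            return eid, next_id
    known[next_id] = {name}              # genuinely new entity
    return next_id, next_id + 1

known = {1: {"Muammar Gaddafi"}}
eid, next_id = assign_entity_id("Moammar Gadhafi", known, next_id=2)
print(eid)  # → 1 (recognised as a variant, not a new entity)
```

Because downstream applications only ever see the numeric entity ID, information about the same person can be merged regardless of which spelling a given article used.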
3.4. Multilingual event scenario template filling<br />
For a smaller subset of currently seven languages, the<br />
EMM-NEXUS software extracts structured descriptions<br />
of events relevant for global crisis monitoring, such as<br />
natural disasters; accidents; violent, medical and<br />
humanitarian events, etc. (Tanev et al., 2009; Piskorski et<br />
al., 2011). For each news cluster about any such event,
the software detects the event type; the event location; the<br />
count of dead, wounded, displaced, arrested etc. persons;<br />
the perpetrator in the event, as well as the weapons used,<br />
if applicable. Contradictory information found in different news articles (such as differing victim counts) is resolved to produce a best guess. The aggregated event information is then displayed on NewsBrief (in text form) and on EMM-Labs in the form of a geographic map (see Figure 5); continuously updated live maps are available at http://emm.newsbrief.eu/geo?type=event&format=html&language=all.

Figure 5. EMM-Labs geographical visualisation of events extracted from media reports in seven languages.
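The paper does not spell out how the best guess over contradictory victim counts is computed; purely as an illustration, one might take the most frequently reported figure and fall back to the median when no figure dominates:

```python
from collections import Counter
from statistics import median

def best_guess(counts):
    """Aggregate differing victim counts from several articles into one
    figure. The actual EMM-NEXUS strategy is not specified in the text;
    as an illustration we take the most frequently reported number and
    fall back to the median when there is no clear majority."""
    if not counts:
        return None
    tally = Counter(counts).most_common()
    top, freq = tally[0]
    if freq > 1 and (len(tally) == 1 or freq > tally[1][1]):
        return top
    return int(median(sorted(counts)))

print(best_guess([12, 12, 15, 12, 20]))  # → 12 (clear majority)
print(best_guess([5, 8, 30]))            # → 8 (no majority; median)
```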
4. JRC’s multilingual text mining resources<br />
The previous section gave a rather brief overview of<br />
EMM functionality without giving technical detail.<br />
Scientific-technical details and evaluation results for all<br />
applications have been described in various publications<br />
available at http://langtech.jrc.ec.europa.eu/.<br />
The four main EMM applications are freely accessible<br />
for everybody. Additionally, the JRC has made available<br />
a number of resources (via the same website) that will<br />
hopefully be useful for developers of multilingual text<br />
mining systems. The JRC-Acquis parallel corpus in<br />
22 languages (Steinberger et al., 2006), comprising altogether over 1 billion words, was publicly released in 2006, followed by the DGT-Translation Memory in 2007.
A new resource that can be used both as a translation<br />
memory and as a parallel corpus for text mining use is<br />
currently under preparation. JRC-Names, a collection of<br />
over 400,000 entity names and their multilingual spelling<br />
variants gathered in the course of seven years of daily<br />
news analysis (see Section 3.3), has been released in<br />
September 2011 (Steinberger et al., 2011). JRC-Names
also comprises software to look up these known entities<br />
in multilingual text. Finally, the JRC Eurovoc Indexing software JEX, which categorises text in 23 different languages according to the thousands of subject domain categories of the Eurovoc thesaurus (see http://eurovoc.europa.eu/), will also be released soon.
5. Ongoing and forthcoming work<br />
EMM customers have been making daily use of the<br />
media monitoring software for years. While being<br />
generally satisfied with the service, they would like to<br />
have more functionality and even higher language<br />
coverage. JRC’s ongoing research and development work<br />
focuses on three main text mining areas: (1) Multilingual
multi-document summarisation: The purpose is to<br />
automatically summarise the thousands of news clusters<br />
generated every day; (2) Machine Translation (MT):<br />
While commercial MT software currently translates<br />
Arabic and Chinese EMM texts into English and<br />
hyperlinks to Google Translate are offered for all other<br />
languages, the JRC is working on developing its own MT<br />
software, based on Moses (Koehn et al., 2007);<br />
(3) Opinion mining / sentiment analysis: EMM users are<br />
not only interested in receiving contents, but they would<br />
also like to see opinions on certain subjects. They would<br />
like to see differences of opinions across different<br />
countries and media sources, as well as trends showing<br />
changes over time. See the JRC’s Language Technology<br />
website for publications showing the current progress in<br />
these fields.<br />
6. Acknowledgements<br />
Developing the EMM family of applications was a major<br />
multi-annual team effort. We would like to thank our<br />
present and former colleagues in the OPTIMA group for<br />
all their hard work.<br />
7. References<br />
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E. (2007): Moses: Open Source Toolkit for Statistical Machine Translation. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.
Larkey, L., Feng, F., Connell, M., Lavrenko, V. (2004):<br />
Language-specific Models in Multilingual Topic<br />
Tracking. Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 402-409.
Piskorski, J., Belyaeva, J., Atkinson, M. (<strong>2011</strong>):<br />
Exploring the usefulness of cross-lingual information<br />
fusion for refining real-time news event extraction: A<br />
preliminary study. Proceedings of the 8th International Conference ‘Recent Advances in Natural Language Processing’. Hissar, Bulgaria, 14-16 September 2011.
Pouliquen B., Steinberger, R. (2009): Automatic<br />
Construction of Multilingual Name Dictionaries. In: C.<br />
Goutte, N. Cancedda, M. Dymetman & G. Foster (eds.),<br />
Learning Machine Translation. MIT Press - Advances<br />
in Neural Information Processing Systems Series<br />
(NIPS), pp. 59-78.<br />
Pouliquen B., Steinberger, R., Deguernel, O. (2008):<br />
Story tracking: linking similar news over time and<br />
across languages. In Proceedings of the 2nd workshop
‘Multi-source Multilingual Information Extraction and<br />
Summarization’ (MMIES'2008) held at CoLing'2008.<br />
Manchester, UK, 23 August 2008.<br />
Pouliquen, B., Steinberger, R., Belyaeva, J. (2007):<br />
Multilingual multi-document continuously updated<br />
social networks. Proceedings of the Workshop<br />
‘Multi-source Multilingual Information Extraction and<br />
Summarization’ (MMIES'2007) held at RANLP'2007,<br />
pp. 25-32. Borovets, Bulgaria, 26 September 2007.<br />
Steinberger R. (forthcoming): A survey of methods to<br />
ease the development of highly multilingual Text<br />
Mining applications. Language Resources and<br />
Evaluation Journal LRE.<br />
Steinberger R., Pouliquen, B., van der Goot, E. (2009):<br />
An Introduction to the Europe Media Monitor Family<br />
of Applications. In: F. Gey, N. Kando & J. Karlgren<br />
(eds.): Information Access in a Multilingual World -<br />
Proceedings of the SIGIR 2009 Workshop<br />
(SIGIR-CLIR'2009), pp. 1-8. Boston, USA. 23 July<br />
2009.<br />
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., Varga, D. (2006): The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006), pp. 2142-2147. Genoa, Italy, 24-26 May 2006.
Steinberger, R., Pouliquen, B., Kabadjov, M., van der Goot, E. (2011): JRC-Names: A freely available, highly multilingual named entity resource. Proceedings of the 8th International Conference ‘Recent Advances in Natural Language Processing’. Hissar, Bulgaria, 14-16 September 2011.
Tanev, H. (2007): Unsupervised Learning of Social<br />
Networks from a Multiple-Source News Corpus.<br />
Proceedings of the Workshop ‘Multi-source<br />
Multilingual Information Extraction and<br />
Summarization’ (MMIES'2007), held at RANLP'2007,<br />
pp. 33-40. Borovets, Bulgaria, 26 September 2007.<br />
Tanev, H., Zavarella, V., Linge, J., Kabadjov, M.,<br />
Piskorski, J., Atkinson, M., Steinberger, R. (2009):<br />
Exploiting Machine Learning Techniques to Build an<br />
Event Extraction System for Portuguese and Spanish.<br />
LinguaMÁTICA, 2, pp. 55-66.
Regular Papers
Generating Inflection Variants of Multi-Word Terms for French and German<br />
Simon Clematide, Luzia Roth<br />
Institute of Computational Linguistics, University of Zurich<br />
Binzmühlestr. 14, 8050 Zürich<br />
E-mail: simon.clematide@uzh.ch, luzia.roth@access.uzh.ch<br />
Abstract<br />
We describe a free Web-based service for the inflection of single words and multi-word terms for French and German. Its primary<br />
purpose is to provide glossary authors (instructors or students) of an open electronic learning management system with a practical<br />
way to add inflected variants for their glossary entries. The necessary morpho-syntactic processing for analysis and generation is<br />
implemented by finite-state transducers and a unification-based grammar framework in a declarative and principled way. The<br />
techniques required for German and French terms cover two typologically different types of term creation, and both can be easily
transferred to other languages.<br />
Keywords: morphological generation, morphological analysis, multi-word terms, syntactic analysis, syntactic generation<br />
1. Introduction<br />
In the age of electronic media and rapid proliferation of<br />
technical terms and concepts, the use of glossaries and<br />
their dynamic linkage into running text seems to be<br />
important and self-evident in the area of e-learning.<br />
However, depending on the morphological properties of a<br />
language, e.g. the use of compounds or multi-word terms<br />
or the degree of surface modification that inflection<br />
imposes on words, constructing inflected term variants from typically uninflected glossary entries is not trivial.
In this article, we describe two Web services for inflected<br />
term variant generation that illustrate the different<br />
requirements regarding morphological and syntactic<br />
processing. Whereas French shows modest inflectional<br />
variation in comparison to German, it requires more effort regarding the syntactic analysis of complex nominal
phrases. For German, guessing the correct inflection<br />
class of unknown compounds is more important.<br />
A linguistically informed method for inflected term<br />
variant generation involves morphological and<br />
syntactical analysis and generation. In order to ensure<br />
this bidirectional processing, declarative linguistic<br />
frameworks such as finite-state transducers and<br />
rule-based unification grammars are beneficial. For a<br />
practical system, however, one wants to be able to<br />
analyze a wider range of expressions than should actually be generated and presented to the user, e.g.
entries in the form of back-of-the-book indexes should be<br />
understood by the system, but these forms will not appear<br />
in running text.<br />
Figure 1: Screenshot of the glossary author interface<br />
The main application domain for our services is the e-Learning Management Framework OLAT, where we provide glossary authors with an easy but fully controllable way to add inflected variants for their glossary entries. (See http://www.olat.org for further information about the open source project OLAT, Online Learning and Training.) Our free Web-based generation service is only called once for a given term, viz. when the glossary author edits an entry. (The service is realized as a Common Gateway Interface (CGI) and delivers a simple XML document customized for further processing in the glossary back-end of the e-learning management software OLAT; see http://kitt.cl.uzh.ch/kitt/olat.) As shown in Fig. 1, the glossary author is free to select or deselect any of the generated word forms.
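The paper only says that the CGI service delivers "a simple XML document" for the glossary back-end; a hypothetical consumer could look like the following sketch, in which the element and attribute names are our own invention, not the service's documented format:

```python
import xml.etree.ElementTree as ET

# Hypothetical response format -- invented for illustration; the actual
# XML schema of the service is not specified in the paper.
response = """<term lemma="endlicher Automat" lang="de">
  <variant>endlicher Automat</variant>
  <variant>endlichen Automaten</variant>
  <variant>endlichem Automaten</variant>
</term>"""

def extract_variants(xml_text):
    """Collect the generated inflected variants so a glossary back-end
    could offer them to the author for selection (cf. Fig. 1)."""
    root = ET.fromstring(xml_text)
    return root.get("lemma"), [v.text for v in root.findall("variant")]

lemma, variants = extract_variants(response)
print(lemma, len(variants))  # → endlicher Automat 3
```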
2. Methods and Resources<br />
In this section, we first describe the lexical and<br />
morphological resources used for French and German. In<br />
section 2.2 we discuss the implementation of the<br />
syntactic processing module.<br />
2.1. Lexical Resources<br />
2.1.1. Lexical resources for French<br />
Morphalou, a lexicon of inflected word forms in French (95,810 lemmata, 524,725 inflected forms), was used as a lexical resource to automatically build the finite-state transducer which provides all lexical information, including word forms and morphological tags. (See http://www.cnrtl.fr/lexiques/morphalou for this resource, which is freely available for educational and academic purposes. For the transducers, we use the Xerox Finite State Tools, XFST (Beesley & Karttunen, 2003), which seamlessly integrate with the Xerox Linguistic Environment (XLE), see http://www2.parc.com/isl/groups/nltt/xle.)
After the first evaluation of our development set, some modifications were made to extend the vocabulary: As derivations with neo-classical elements are quite common in terminological expressions, all adjectives were additionally combined with the prefixes from a list (http://fr.wiktionary.org/wiki/Catégorie:Préfixes_en_français) to create derivational forms such as audiovisuel, interethnique or biomédical.
Furthermore, from every lexicon entry containing a hyphen, the initial part up to and including the hyphen was extracted. This string was taken as a prefix and combined with nouns to cover cases like demi-charge.
2.1.2. Lexical resources for German
We use the lexicon molifde (Clematide, 2008), which was mainly built by us by exploiting a full form lexicon generated by Morphy (Lezius, 2000), the German lexicon of the translation system OpenLogos (containing approx. 120,000 entries with inflection class categorizations of varying quality; see http://logos-os.dfki.de), and the morphological resource Morphisto (Zielinski & Simon, 2008). The manually curated resource contains roughly 40,000 lemmas (nouns, adjectives, verbs), and by applying automatic rules for derivation and conversion an additional set of 100,000 lemmas is created.
As noun compounds are the most common and<br />
productive form of terms in German, a suffix-based<br />
inflection class guesser for nouns is necessary. In an<br />
evaluation experiment with 200 randomly selected nouns<br />
from a sociology lexicon (http://www.socioweb.org), about 40% of the entries were
unknown. We implemented a finite-state based ending<br />
guesser by exploiting frequency counts of lemma endings<br />
(3 up to 5 characters) from our curated lexicon. Roughly<br />
80% of the 73 unknown singular nouns got their correct<br />
inflection class. The finite-state based ending guesser is<br />
tightly coupled with the finite-state transducer derived<br />
from our lexicon. See Clematide (2009) for technical<br />
implementation details.<br />
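The ending-guesser idea can be sketched in a few lines. The real implementation is finite-state and tightly coupled with the lexicon transducer (Clematide, 2009); the toy lexicon and inflection-class labels below are invented for illustration:

```python
from collections import defaultdict

def build_ending_stats(lexicon, min_len=3, max_len=5):
    """Count how often each lemma ending (3-5 characters) co-occurs
    with an inflection class in a curated lexicon."""
    stats = defaultdict(lambda: defaultdict(int))
    for lemma, infl_class in lexicon:
        for n in range(min_len, max_len + 1):
            if len(lemma) > n:
                stats[lemma[-n:]][infl_class] += 1
    return stats

def guess_class(noun, stats, min_len=3, max_len=5):
    """Guess the inflection class of an unknown noun from its longest
    ending that was observed in the lexicon."""
    for n in range(max_len, min_len - 1, -1):  # prefer longer endings
        classes = stats.get(noun[-n:])
        if classes:
            return max(classes, key=classes.get)
    return None

# Toy lexicon: (lemma, inflection class) -- the class labels are
# illustrative only, not molifde's actual inventory.
lex = [("Bildung", "f_en"), ("Zeitung", "f_en"),
       ("Lehrer", "m_0"), ("Fahrer", "m_0")]
stats = build_ending_stats(lex)
print(guess_class("Globalisierung", stats))  # → f_en
```

The minimal-length requirement mentioned in Section 3.1.2 corresponds to returning `None` when no ending of at least `min_len` characters was seen in the lexicon.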
2.2. Morpho-syntactic Analysis and Generation<br />
While the generation of inflected variants for single<br />
words can be easily done with the help of finite-state<br />
techniques only, this is not the case for a proper treatment<br />
of complex multi-word terms. Therefore, we decided to<br />
use a unification-based grammar framework for syntactic<br />
processing.<br />
The Xerox Linguistic Environment (XLE) has several<br />
benefits for our purposes:<br />
Firstly, finite-state transducers for morphological<br />
processing integrate in a seamless and efficient way.<br />
Additionally, different tokenizer transducers can be<br />
specified for analysis and generation. This proved to be<br />
useful for the treatment of French, e.g. regarding the<br />
treatment of hyphenated compounds.<br />
Secondly, there are predefined commands in XLE for<br />
parsing a term to its functional structure, neutralizing<br />
certain morpho-syntactic features, and generating all<br />
possible strings out of an underspecified functional<br />
structure.<br />
Thirdly, the implementation of optimality theory in XLE<br />
allows a principled way of specifying preference<br />
heuristics, for instance for the part of speech of an<br />
ambiguous word. Additionally, using optimality marks<br />
makes it possible to analyze more constructions than should be
generated, e.g. terms in the format of back-of-the-book<br />
indexes as Automat, endlich. With the same technique<br />
different lexical specification conventions for French adjectives can be handled by the XLE grammar. Lexicon entries like grand, e or grand/e or grand(e) are parsed and will result in the same output grand, grande, grands, grandes.

                  Terms   Correct Generation   Incorrect Generation                     Accuracy
 Development Set  400     376                  24 (parse failure: 19; wrong parse: 5)   94%
 Test Set         50      48                   2 (parse failure: 1; wrong parse: 1)     98%

Table 1: Evaluation results for French from the development set and test set
Lastly, dealing with unknown words is supported in XLE<br />
in a way that parts of a multi-word term that do not<br />
undergo inflection may be analyzed and regenerated<br />
verbatim. This is useful for the treatment of postnominal<br />
prepositional phrases.<br />
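The parse-neutralize-generate cycle that XLE provides can be mimicked on a toy scale. The feature structures and the miniature form table below are our own illustration, not XLE's actual data structures or commands:

```python
# Toy analogue of XLE's parse -> neutralize -> generate cycle:
# analyse a word into a feature structure, underspecify NUM, then
# generate every string compatible with the underspecified structure.
FORMS = {
    ("droit", "noun", "sg"): "droit",   ("droit", "noun", "pl"): "droits",
    ("grand", "adj", "sg"):  "grande",  ("grand", "adj", "pl"):  "grandes",
}

def analyse(form):
    for (lemma, pos, num), surface in FORMS.items():
        if surface == form:
            return {"LEMMA": lemma, "POS": pos, "NUM": num}
    return None

def generate(fstruct, neutralize=("NUM",)):
    """Generate all surface forms from a feature structure in which
    the features listed in `neutralize` are underspecified."""
    results = []
    for (lemma, pos, num), surface in FORMS.items():
        if lemma == fstruct["LEMMA"] and pos == fstruct["POS"]:
            if "NUM" in neutralize or num == fstruct["NUM"]:
                results.append(surface)
    return sorted(results)

print(generate(analyse("droit")))  # → ['droit', 'droits']
```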
The use of a full-blown grammar engineering framework<br />
for the generation of inflected term variants might be<br />
seen as too much machinery at first sight. However, the<br />
experience we gained with this approach is definitely<br />
positive. Despite the expressivity of the framework, the time needed for processing one multi-word term is about 200 ms on an AMD Opteron
2200 MHz. Given the fact that our service is only called<br />
when an entry is created by a glossary author, this<br />
performance is adequate.<br />
2.2.1. French multi-word terms<br />
As French is more analytic than German, compounding is<br />
less prominent. The words in a multi-word term are syntactically dependent on each other and require
syntactic processing. The most common construction for<br />
multi-word terms is a noun combined with a preposition<br />
and a noun phrase (e.g. droit de l’individu). Such<br />
constructions typically correspond to German<br />
compounds. Each noun may be modified by one or more<br />
adjectives. For a correct generation of all inflected<br />
variants, the core noun and its core adjectives have to be<br />
identified, as these are the only parts to be altered for<br />
inflected variants. The core part of a French multi-word<br />
term is typically the one preceding the preposition (e.g.<br />
droit de l’individu → droits de l’individu). Due to this<br />
fact, even terms with unknown words can be handled as<br />
long as they follow the preposition. In our XLE grammar,<br />
a default parsing strategy for unknown words occurring<br />
after a preposition is built in, and on the generation side such input is copied unchanged.
Further constructions for multi-word terms are: a noun<br />
with one or more adjectives, expressions with a hyphen<br />
(e.g. éthylène-glycol), noun-noun combinations (e.g.<br />
assurance maladie) or combinations of several nouns<br />
with et or ou (e.g. cause et effet). For our development set<br />
of 400 terms (see section 3.1.1 for further details), we get<br />
the following distribution: terms with prepositions (190),<br />
terms with adjectives (183), noun-noun combinations<br />
(16), terms with hyphens (9), combination of type noun et<br />
noun (2).<br />
2.2.2. Preference heuristics for French<br />
If the parsing of a one-word input term results in<br />
ambiguous structures, nouns are preferred to adjectives<br />
and verbs, as glossary entries often are nouns. For<br />
ambiguous structures of multi-word input terms the<br />
sequence noun-adjective is preferred to noun-noun, e.g.<br />
église moderne = noun + adjective instead of noun +<br />
noun. If a term is a combination of two nouns, only the<br />
first one is inflected, e.g. assurance maladie →<br />
assurances maladie.<br />
In expressions with a hyphen, inflection is carried out by<br />
treating the hyphenated part of the term as normal word:<br />
Core adjectives or nouns with a hyphen are inflected, all<br />
others are not, e.g. éthylène-glycol → éthylène-glycols,<br />
or document quasi-négociable → documents quasi-négociables.
In these two examples, the second part of<br />
the hyphenated expression is a core noun and has to be<br />
inflected. But there are cases where both parts of the<br />
hyphenated expression are non-core nouns. They are not<br />
inflected as in the example égalité homme-femme →<br />
égalités homme-femme. This example follows the<br />
construction of a noun-noun multi-word term and is<br />
treated as such.<br />
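The core-part heuristics above can be sketched as follows; the preposition list and the naive "-s" pluralisation are placeholders for the Morphalou-derived transducer the service actually uses, and the sketch only covers the preposition and noun-noun cases (adjective agreement is left out):

```python
PREPOSITIONS = {"de", "du", "des", "d'", "à", "en"}

def pluralize_term(tokens):
    """Apply the core-part heuristics: only tokens before the first
    preposition are inflected; in a noun-noun compound only the first
    noun is. Pluralization itself is a naive '-s' here."""
    try:
        cut = next(i for i, t in enumerate(tokens)
                   if t.lower() in PREPOSITIONS)
    except StopIteration:
        cut = 1  # no preposition: treat the first token as the core noun
    return [t + "s" if i < cut and not t.endswith("s") else t
            for i, t in enumerate(tokens)]

print(" ".join(pluralize_term(["droit", "de", "l'individu"])))  # → droits de l'individu
print(" ".join(pluralize_term(["assurance", "maladie"])))       # → assurances maladie
```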
2.2.3. German multi-word terms<br />
A detailed technical report on the XLE-based generation<br />
and analysis part for German can be found in Clematide<br />
(2009). Currently, German multi-word terms are<br />
restricted to the combination of an attributive adjective<br />
and a noun that may be given in the textual form of<br />
’adjective noun’ or as back-of-the-book index entry<br />
’noun, adjective’. For instance, the lexicon entry<br />
endlicher Automat (finite state automaton) leads to the<br />
following 6 inflected forms: endlichem Automaten,<br />
endlicher Automat, endlicher Automaten, endlichen<br />
Automaten, endliche Automat, endliche Automate.<br />
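A toy sketch of this combination step (the ending list and the noun paradigm are heavily simplified and invented for illustration; the real service derives forms from full finite-state paradigms and filters them with the XLE grammar):

```python
# Combine adjective endings with noun forms for a German
# adjective + noun multi-word term.
ADJ_ENDINGS = ["er", "en", "em", "e"]
NOUN_FORMS = {"Automat": ["Automat", "Automaten"]}

def inflect_pair(adj_stem, noun_lemma):
    """Combine every adjective ending with every noun form; a grammar
    would additionally filter out non-agreeing pairs, which this
    sketch deliberately skips."""
    return sorted({f"{adj_stem}{end} {noun}"
                   for end in ADJ_ENDINGS
                   for noun in NOUN_FORMS[noun_lemma]})

forms = inflect_pair("endlich", "Automat")
print(len(forms))  # 8 raw combinations before agreement filtering
```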
2.2.4. Related work<br />
As far as term structures in French are concerned, Daille<br />
(2003) gives an overview that provided a base for our<br />
own analysis of multi-word terms structures. This<br />
classification was adapted and extended according to our<br />
potential glossary entries.<br />
Jacquemin (2001) developed FASTR, a system for<br />
identifying morphological and syntactical term variants<br />
for French and English where also minor lexical<br />
modifications may take place. We did not use this system for two main reasons: we also had to treat German, and the creation of lexical variants was of minor importance for us.
In her contrastive study, Savary (2008) discusses<br />
different approaches to the computational inflection of multi-word units. She emphasizes the lexical
and sometimes idiosyncratic nature of multi-word<br />
expressions that may lead to problems for simple<br />
rule-based syntactic systems. However, our small-scale<br />
evaluation presented in the next section does not indicate<br />
severe problems for our approach.<br />
3. Evaluation<br />
In this section, we present results of our tools derived<br />
from two small-scale evaluations.<br />
3.1.1. French<br />
A development set with 400 and a test set with<br />
50 glossary entries were taken randomly from EuroVoc (http://eurovoc.europa.eu/drupal), the EU's multilingual thesaurus. Table 1 shows the
results for both data sets. Parsing failures were due to<br />
unknown vocabulary entries such as abbreviations (e.g. CEC, P et T) or compounds (e.g. désoxyribonucléique, spéctrométrie). Surprisingly, quite common French words like jetable and environnemental (which appeared 5 times in the development set) were not covered by the lexicon. To alleviate the problem of missing vocabulary, additional open resources may be exploited, e.g. wiktionaries (http://fr.wiktionary.org/wiki/Wiktionnaire) or lexica with inflected forms such as lefff, the lexique des formes fléchies du français (http://www.labri.fr/perso/clement/lefff), the Dictionnaire DELA fléchi du français (http://infolingu.univ-mlv.fr), or Lexique3 (http://www.lexique.org), a lexicon with lemmata and grammatical categories. Wrong
parses were caused by ambiguities between nouns and<br />
adjectives.<br />
3.1.2. German<br />
50 German multi-word terms were selected randomly<br />
from the preferred terms in EuroVoc. Without the<br />
unknown word guesser, the generation of inflected<br />
variants fails for 10 terms, resulting in an accuracy of<br />
80%. Applying the unknown word guesser for nouns<br />
allows a correct generation in 5 cases, thus giving an<br />
accuracy of 90%. 2 cases are due to unknown short nouns<br />
(the guesser requires a minimal length), 2 cases are due to<br />
unknown adjectives, and 1 case originated from an<br />
implementation error concerning adjectival nouns such as Beamter (civil servant).
4. Conclusions<br />
We have built a practical morphological generation<br />
service for French and German terms based on<br />
linguistically motivated processing. For multi-word<br />
terms, more constructions can be easily added through<br />
modifications of the syntactic term grammar.<br />
In order to achieve a higher lexical coverage, other<br />
resources can be integrated. In our French system, there<br />
is already an interface that allows for simple addition of<br />
new regular nouns and adjectives. For German,<br />
additional syntactic constructions for multi-word terms<br />
will be added.<br />
In order to resolve ambiguities on the level of parts of speech within multi-token terms, a part-of-speech tagging approach is feasible. However, for that purpose a specifically trained tagger is necessary. In a future step, we plan to extract nominal groups from a syntactically annotated corpus and use that material for the training of a part-of-speech tagger.
5. Acknowledgements<br />
The University of Zurich supported this work by IIL<br />
grant funds. Luzia Roth implemented the French part<br />
under the supervision of Simon Clematide. The<br />
implementation of the lexicographic interface in OLAT<br />
was realized by Roman Haag under the supervision of<br />
Florian Gnägi.<br />
6. References<br />
Beesley, K.R., Karttunen, L. (2003): Finite-State<br />
Morphology: Xerox Tools and Techniques. CSLI<br />
Publications.<br />
Clematide, S. (2008): An OLIF-based open inflectional<br />
resource and yet another morphological system for<br />
German. In A. Storrer et al. (Eds.), Text Resources<br />
And Lexical Knowledge: selected papers from the 9th<br />
Conference on Natural Language Processing,<br />
KONVENS, Mouton de Gruyter, pp. 183-194.<br />
Clematide, S. (2009): A morpho-syntactic generation<br />
service for German glossary entries. In S. Clematide,<br />
M. Klenner, and M. Volk (Eds.), Searching Answers:<br />
Festschrift in Honour of Michael Hess on the Occasion<br />
of His 60th Birthday, Münster, Germany: Monsenstein<br />
und Vannerdat, pp. 33-43.<br />
Daille, B. (2003): Conceptual Structuring Through Term<br />
Variations. In Proceedings of the ACL 2003 workshop<br />
on multiword expressions analysis acquisition and<br />
treatment, pp. 9-16.<br />
Jacquemin, C. (2001): Spotting and Discovering Terms<br />
through Natural Language Processing. Massachusetts<br />
Institute of Technology.<br />
Lezius, W. (2000): Morphy - German morphology,<br />
Part-of-Speech tagging and applications. In<br />
Proceedings of the 9th EURALEX International<br />
Congress, Stuttgart, pp. 619-623.<br />
Savary, A. (2008): Computational Inflection of<br />
Multi-Word Units. A contrastive study of lexical<br />
approaches. Linguistic Issues in Language Technology<br />
- LiLT, 1(2).<br />
Zielinski, A., Simon C. (2008): Morphisto: An<br />
Open-Source Morphological Analyzer for German. In<br />
Proceedings of the FSMNLP 2008, pp. 177-184.<br />
Tackling the Variation in International Location Information Data: An<br />
Approach Using Open Semantic Databases<br />
Janine Wolf 1 , Manfred Stede 2 , Michaela Atterer 1<br />
1 Linguistic Search Solutions R&D GmbH, Rosenstraße 2, 10178 Berlin
2 Universität Potsdam, Karl-Liebknecht-Straße 24-25, 14476 Potsdam
E-mail: janine@wolf-velten.de, stede@uni-potsdam.de, michaela.atterer@lssrd.de<br />
Abstract<br />
International location information ranges from mere relational descriptions of places or buildings through semi-structured address-like information to fully structured postal address data. In order to be utilized, e.g. for associating events or people with
geographical information, these location descriptions have to be decomposed and the relevant semantic information units have to<br />
be identified. However, they show a high degree of variation in the order, occurrence and presentation of these semantic information units. In this work we present a new approach that uses a semantic database and a rule-based algorithm to tackle the variation in
such data and segment semi-structured location information strings into pre-defined elements. We show that our method is highly<br />
suitable for data cleansing and classifying address data into countries, reaching an f-score of up to 97 for the segmentation task, an<br />
f-score of 91 for the labelled segmentation task, and a success rate of 99% in the classification task.<br />
Keywords: address parsing, OpenStreetMap, address segmentation, data cleansing<br />
1. Introduction
Databases of international location information, as maintained by most companies, often contain incomplete address data, variation in the order of elements, a mixture of international conventions for address formatting, or even semi-translated address parts. Moreover, the address data can be structured insufficiently or erroneously with respect to the database fields, which makes the data unusable for further classification, querying and data cleansing tasks. Table 1 shows a number of possible variations of the same German address.
address string                               problem description
Willy-Brandt Street 1, Berlin                partial translations
#1 Willy-Brandt Street, Berlin 1000          non-standard format
Willy-Brand-Str. 1                           incorrect spelling
Willy-Brandt-Str. 1, 1000 Berlin 20          politically outdated
Willy-Brandt-Str.1, Haus 1, 3.Et., Zi. 101   presence of more detailed information
In der Willy-Brandt-Str in Berlin            incomplete, e.g. extracted from free text

Table 1: Examples of variation in postal addresses based on the German address Willy-Brandt-Str. 1, 10557 Berlin
Apart from this kind of variation we also face variation in the description of location objects, such as colloquial variants (Big Apple for New York), historical variants (Chemnitz/Karl-Marx-Stadt), transcription variants (Peking/Beijing) or translation variants (München/Munich).
International addresses create further variation in address data, as the typical Japanese address shown in Table 2 exemplifies.
part of description string   element type
11-1                         street number (mixed information: estate and building no.)
Kamitoba-hokotate-cho        city district
Minami-ku                    ward of a city (town)
Kyoto                        city (here: also prefecture)
601-8501                     postal code

Table 2: Address elements of the Japanese postal address example 11-1 Kamitoba-hokotate-cho, Minami-ku, Kyoto 601-8501
All these variations pose major problems for data warehousing tasks such as deduplication, record linkage and identity matching.
In this work we propose a method which is highly suitable for data cleansing. Tests on German, Australian and Japanese data show that it is moreover suitable for classifying address data into countries.
Our approach is rule-based and uses the open geographical database OpenStreetMap¹ as well as country-specific rules and patterns. It is robust and easily extensible to further languages.
2. Related Work
Most work concerned with the segmentation of location information is based on statistical techniques (Borkar et al., 2001; Agichtein & Ganti, 2004; Christen & Belacic, 2005; Christen et al., 2002; Peng & McCallum, 2003; Marques & Gonçalves, 2004; Cortez & De Moura, 2010).
However, as the experiments by Borkar et al. (2001) show, these methods drastically decrease in performance once confronted with a mixture of location strings from different countries. While an experiment on uniformly formatted address data from the U.S. reaches 99.6% accuracy, performance drops to 88.9% when the system is trained and tested on addresses from mixed countries².
There are only a few published approaches to rule-based systems (Appelt et al., 1992; Riloff, 1993). Rule-based systems are generally thought to be less robust to noise and harder to adapt to other languages, and are thus considered suitable mainly for small domains. However, a comparison of a rule-based and a statistical system (Borkar et al., 2001) showed that rules can compete with statistical approaches, especially on inhomogeneous data. Given that huge geographical databases have become available in recent years, high-quality rule-based systems can be developed for large unrestricted domains with relatively little effort and can easily be extended to more languages by adding more databases for the relevant countries.
3. Location Information Segmentation
Figure 1 shows the general architecture of our system. In a preprocessing step the location information string is tokenised and normalised according to country-specific normalisation patterns (e.g. str becomes straße). Initial grouping is done if applicable, i.e. if indications for grouping already exist. These steps are necessary for the later OpenStreetMap query because abbreviations or partial street names cannot be found in the database.
¹ http://www.openstreetmap.org
² The accuracy measure used in this article is an overall measure of all element-wise measurements for the address elements under consideration and is similar to the labelled recall measure used in Section 4.
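The normalisation step can be sketched as a small pattern-substitution pass. The two German rules below are illustrative assumptions (the paper's actual pattern files are not published); the mechanism simply expands abbreviations so that the later database query can match full names.

```python
import re

# Toy country-specific normalisation patterns for German. These two rules
# are invented examples in the spirit of "str becomes straße".
GERMAN_PATTERNS = [
    (re.compile(r"\bstr\b\.?", re.IGNORECASE), "straße"),
    (re.compile(r"strasse\b", re.IGNORECASE), "straße"),
]

def normalise(tokens, patterns):
    """Expand abbreviations so the later OpenStreetMap query can match
    full street names instead of abbreviated or variant spellings."""
    out = []
    for tok in tokens:
        for pat, repl in patterns:
            tok = pat.sub(repl, tok)
        out.append(tok)
    return out

print(normalise(["Willy-Brandt-Str.", "1"], GERMAN_PATTERNS))
# ['Willy-Brandt-straße', '1']
```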
Figure 1: The system architecture
In the succeeding tagging step all identifiable geographical names are tagged by querying OpenStreetMap, allowing tagging ambiguities. Country-specific string patterns aid the tagging of elements containing numbers. One of the difficulties within this step is that address elements often consist of more than one token. The challenge lies in querying for Oxford Street and not erroneously tagging Oxford as the place name while leaving Street untagged. This is achieved by a longest-match policy. However, all match information is preserved. Multiple queries are used to account for diacritical variation such as umlauts in German (e.g. ü), parentheses as parts of official geographical names (as in Frankfurt (Main)/Frankfurt Main) and other non-alphanumeric marks such as hyphens.
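The longest-match policy can be illustrated with a greedy lookup over a token window. The gazetteer below is a toy in-memory stand-in for the real OpenStreetMap queries; all names, types and the maximum name length are invented for illustration.

```python
# Toy stand-in for OpenStreetMap lookups: multi-token keys map to sets of
# candidate types (ambiguities are preserved, as described in the text).
GAZETTEER = {
    ("oxford", "street"): {"street"},
    ("oxford",): {"city"},
    ("berlin",): {"city"},
}
MAX_NAME_LEN = 3  # longest multi-token name we attempt to match

def tag_longest_match(tokens):
    """Greedily tag the longest known token sequence at each position,
    keeping the full set of candidate types."""
    tagged, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_NAME_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                tagged.append((tokens[i:i + n], GAZETTEER[key]))
                i += n
                break
        else:
            tagged.append((tokens[i:i + 1], set()))  # unlabelled token
            i += 1
    return tagged

# "Oxford Street" comes out as one street element, not city + untagged token
print(tag_longest_match(["Oxford", "Street", "Berlin"]))
```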
As a result of this step, the elements are tagged with OpenStreetMap (OSM) internal types and not yet with the address element types we are looking for. OSM types are often ambiguous. The string Potsdam, Brandenburg is tagged as Potsdam (county/city/hamlet) Brandenburg (town, state), for instance³.
³ For a human reader familiar with the location it is clear that this denotes the city of Potsdam within the state of Brandenburg, even though there is, for instance, also a city called Brandenburg in the state of Brandenburg, and a hamlet called Potsdam in the state of Schleswig-Holstein.
The following step maps OSM types to address elements. In the OpenStreetMap project, every country-specific subproject, e.g. the Japanese or the German OSM project, has its own guidelines about how to tag locations according to their administrative unit status (as being a city, town or hamlet⁴). Therefore we use country-specific mappings from OSM internal types to one or more of the desired target address element types we define.
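Such a mapping could take the shape of a per-country dictionary from OSM internal types to sets of candidate address element types. The concrete tables are not given in the paper; the entries below are invented examples.

```python
# Assumed shape of the country-specific OSM-type -> address-element mappings.
# All entries here are illustrative, not the paper's actual tables.
OSM_TO_ADDRESS_ELEMENT = {
    "DE": {
        "city": {"city"},
        "town": {"city", "city_district"},
        "hamlet": {"city", "city_district"},
        "state": {"state"},
    },
    "JP": {
        "town": {"ward"},  # per-country guidelines tag wards differently
        "city": {"city", "prefecture"},
    },
}

def map_types(country, osm_types):
    """Return the union of candidate address element types for a set of
    ambiguous OSM types; unknown OSM types are passed through unchanged."""
    mapping = OSM_TO_ADDRESS_ELEMENT[country]
    return set().union(*(mapping.get(t, {t}) for t in osm_types))

print(sorted(map_types("DE", {"town", "state"})))
# ['city', 'city_district', 'state']
```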
The enrichment step provides rules for labelling address elements which have not been attributed a tag by a previous step, e.g. because they were not found in the knowledge base due to spelling errors. The completion rules are of the following form:
(type₁, type₂, ..., typeₙ) --> targetAddressElement
If for each typeₓ, x = 1..n, the respective type can be found in the list of possible types of the token at index x, the tokens in the sequence are grouped and labelled with the type targetAddressElement. Examples of language-specific completion token types are found in Table 3. A token tagged with one of these affix types indicates a (possibly still unlabelled) preceding/following location name, and the token group is labelled appropriately, including the marker token.
compl. type      examples                  description
town_suf         ku                        Suffix marking a town/ward (Japan)
station_suf      Station, Ekimae, Meieki   Word marking a train station (Japan)
village_suf      mura, son                 Suffix marking a village (Japan)
city_dist_pref   Aza, Koaza                Prefix usually preceding a city district or sub-district (Japan)
street_suf       Avenue, Road              Suffix marking a street name (Australia)
state_pref       Freistaat                 Prefix marking a state name (Germany)

Table 3: Completion types
Some examples of completion rules are listed in Table 4. The left-hand side of a rule specifies the token type pattern, the right-hand side defines the target address element. An @ means that the token at the respective position must not have possible types other than the specified one.
⁴ A hamlet is a small town or village.
completion rule                                              matching example
(city_prefix, city) --> city                                 Hansestadt Hamburg
(orientation_prefix, other, street_suffix) --> street_name   Lower Geoge Street (instead of George Street)
(orientation_prefix, city) --> city                          East Launcheston
(contains_street_suffix) --> street_name                     Ratausstraße (instead of Rathausstraße)
(city, loc_suffix) --> city_district                         Berlin Mitte
(state_prefix, state) --> state                              Freistaat Bayern
(@city, @city) --> city                                      Munich (München)
(street_number, street_number_ext) --> street_number         34a
(street_number, sep_last_alphanum) --> street_number         34 - 36

Table 4: Example completion rules
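The completion rules above can be applied with a simple sliding-window matcher. The sketch below mirrors two rules from Table 4, but the matching logic itself is our assumption (the @-constraint on exclusive types is omitted for brevity).

```python
# Minimal sketch of completion-rule application over tagged tokens.
COMPLETION_RULES = [
    (("city_prefix", "city"), "city"),
    (("state_prefix", "state"), "state"),
]

def apply_completion_rules(tokens):
    """tokens: list of (string, set of candidate types).
    Returns token groups labelled with a target address element (or None)."""
    groups, i = [], 0
    while i < len(tokens):
        for pattern, target in COMPLETION_RULES:
            window = tokens[i:i + len(pattern)]
            if len(window) == len(pattern) and all(
                    t in types for t, (_, types) in zip(pattern, window)):
                groups.append(([w for w, _ in window], target))
                i += len(pattern)
                break
        else:
            groups.append(([tokens[i][0]], None))
            i += 1
    return groups

tokens = [("Freistaat", {"state_prefix"}), ("Bayern", {"state", "city"})]
print(apply_completion_rules(tokens))  # [(['Freistaat', 'Bayern'], 'state')]
```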
The final disambiguation step provides rules which decide which of the attributed types is selected for each element. In the aforementioned example, Brandenburg would thus be tagged as a state and not as a city.
The disambiguation rules take the form
(leftNeighbourType, currentType, rightNeighbourType)
where currentType is the target address element type of the token group under consideration. Either rightNeighbourType or leftNeighbourType may be empty (i.e. any type is allowed). If such a rule can be applied, the token group under consideration is labelled with currentType.
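The neighbour-based disambiguation can be sketched as follows. The single rule triple below is an invented example in the spirit of the paper, not its actual rule set, and the fallback for unmatched groups is our simplification.

```python
# Sketch of neighbour-based disambiguation over candidate-typed groups.
DISAMBIGUATION_RULES = [
    # (leftNeighbourType, currentType, rightNeighbourType); None = any type
    ("city", "state", None),  # a candidate following a city is read as a state
]

def disambiguate(groups):
    """groups: list of (text, set of candidate types). Selects one type per
    group; without a matching rule, the first (sorted) candidate is kept."""
    resolved = []
    for i, (text, candidates) in enumerate(groups):
        left = min(groups[i - 1][1]) if i > 0 else None
        right = min(groups[i + 1][1]) if i + 1 < len(groups) else None
        chosen = next((cur for l, cur, r in DISAMBIGUATION_RULES
                       if cur in candidates
                       and (l is None or l == left)
                       and (r is None or r == right)), None)
        resolved.append((text, chosen or min(candidates)))
    return resolved

groups = [("Potsdam", {"city"}), ("Brandenburg", {"city", "state"})]
print(disambiguate(groups))  # [('Potsdam', 'city'), ('Brandenburg', 'state')]
```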
4. Experiments
4.1. Data
We conducted our experiments using two different datasets. The first dataset was collected from the Internet; the second corpus was a company-internal database. Eleven external annotators collected variations of location information data from the Internet and annotated them according to the annotation guidelines given in Wolf (2011). They collected 154 strings for German, 35 of which were used for development and the rest for testing. For Australia they collected 143 strings, 34 of which were used for development. The Japanese data were collected and annotated by the first author; 76 of the 242 data points were used for development.
The company-internal database contained 57 examples for Germany, 162 for Australia and 56 for Japan. They were already (sometimes incorrectly) attributed to the 3 database fields address, postal code and city. To obtain a gold standard, a correct re-ordering of the elements was done manually by the first author.
4.2. Segmentation
Our first experiment consisted in correctly segmenting the internet data with our system. As a baseline we used unsophisticated systems for each language, which took about 1.5 hours each to program and which use patterns for the postal code, a small list of endings for street names, and knowledge about the typical order of address elements in the country. Our evaluation should thus reflect the superiority of a full-fledged system compared to an ad-hoc solution.
Tables 5, 6 and 7 show the evaluation results for the segmentation task for each country using f-scores based on recall and precision as computed by the PARSEVAL measures (cf. Manning & Schütze, 1999), which are suitable for evaluating systems generating bracketed structures with labels.
F-score type   baseline   system
unlabelled     87.36      96.91
labelled       70.23      91.36

Table 5: Evaluation results for German data

F-score type   baseline   system
unlabelled     68.05      95.85
labelled       64.93      86.60

Table 6: Evaluation results for Australian data

F-score type   baseline   system
unlabelled     75.45      91.80
labelled       45.47      73.50

Table 7: Evaluation results for Japanese data
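The distinction between unlabelled and labelled scores can be made concrete with a small bracket-based scorer: unlabelled scoring compares segment boundaries only, labelled scoring also requires the correct element type. This is a sketch of the general PARSEVAL idea, not the exact evaluation code used in the paper; scores are scaled to 0-100 as in Tables 5-7.

```python
def f_score(gold, predicted, labelled=True):
    """gold/predicted: sets of (start, end, label) spans over token indices."""
    if not labelled:
        gold = {(a, b) for a, b, _ in gold}
        predicted = {(a, b) for a, b, _ in predicted}
    correct = len(set(gold) & set(predicted))
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    # harmonic mean, scaled by 100 to match the tables
    return 200 * precision * recall / (precision + recall)

# Hypothetical gold vs. predicted segmentation of a 4-token address
gold = {(0, 2, "street_name"), (2, 3, "postal_code"), (3, 4, "city")}
pred = {(0, 2, "street_name"), (2, 3, "city"), (3, 4, "city")}
print(f_score(gold, pred, labelled=False))  # 100.0: all boundaries correct
print(round(f_score(gold, pred), 2))        # 66.67: 2 of 3 labels correct
```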
The baseline systems showed problems above all with multi-token address elements (Frankfurt (Main), Bad Homburg) and with addresses that did not conform to the standard ordering.
The full-fledged system clearly outperforms the baselines, with a difference in f-score (when counting correct labels and not only correct element boundaries) of 21 points for Germany, 12 for Australia and 28 for Japan.
The contribution of the completion patterns was an increase in f-score of up to 13.03 points for the Japanese data (unlabelled) and a minimum of 0.28 for Australia (labelled).
4.3. Data cleansing
In a second experiment we tested whether the system is suitable for data cleansing. A problem already mentioned in the introduction is the erroneous structuring of data according to the fields of a database. By using the system to attribute address elements to the database fields, we could reduce the rate of elements in an incorrect database field of the company-internal database by 16.77 percentage points (pp) for Germany, 19.31 pp for Australia, and 29.84 pp for Japan.
4.4. Address classification
We also conducted an experiment to find out whether the system is able to correctly guess the country of a location information string. Our testing method ignores country information (Japan, Germany, Australia) if present, and computes, for each candidate country, the rate of tokens in the input which could not be classified by the system, neither by the database nor by the country-specific patterns for suffixes, prefixes, special words or alphanumeric strings. The system then selects the country with the lowest rate of unlabelled tokens. For this experiment, we used 518 location information strings from both the Internet and the company-internal data (166 for Germany, 271 for Australia, 81 for Japan), 99.22% of which were correctly attributed to their country.
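The country-guessing heuristic reduces to an argmin over per-country unlabelled-token rates. The token sets below are toy stand-ins for the real OSM databases and country-specific patterns; everything except the argmin idea is an assumption.

```python
# Toy per-country knowledge; the real system uses OSM data plus patterns.
COUNTRY_RESOURCES = {
    "DE": {"willy-brandt-straße", "berlin", "10557"},
    "AU": {"george", "street", "sydney", "nsw", "2000"},
}

def unlabelled_rate(tokens, known):
    """Fraction of tokens the country's resources cannot explain."""
    return sum(1 for t in tokens if t.lower() not in known) / len(tokens)

def guess_country(tokens):
    """Select the country whose resources leave the fewest tokens unlabelled."""
    return min(COUNTRY_RESOURCES,
               key=lambda c: unlabelled_rate(tokens, COUNTRY_RESOURCES[c]))

print(guess_country(["Willy-Brandt-Straße", "1", "10557", "Berlin"]))  # DE
```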
5. Discussion and Future Work
We present a system that successfully deals with the high variability in international textual location information by classifying the components of location strings. The implemented system is robust and easily extensible to more countries. We tested the system on 3 countries with strongly diverging standards for the expression of location information (Germany, Australia and Japan). New countries can be added within a few hours, as only certain country-specific files have to be edited and the corresponding OpenStreetMap knowledge base has to be plugged in. Most European countries are similar to Germany, and the U.S. and Canada are almost identical to the Australian system, so that a large part of the world can easily be covered.
The system was shown to successfully improve the address element segmentation in a company-internal database with high variation in orthography and formatting, even containing translated names. Moreover, the system is able to almost always correctly guess the country that textual location information can be attributed to.
In future work, the system can be further improved to deal with a greater variety of typographical or transcription errors by using phonetic indexing algorithms such as Soundex for English or Traphoty matching rules (Lisbach, 2010) for international languages.
6. Acknowledgements
We would like to thank all external annotators who helped gathering and annotating the test data, and LSS R&D GmbH for making a company-internal address database available to us for testing the system.
7. References
Agichtein, E., Ganti, V. (2004): Mining Reference Tables for Automatic Text Segmentation. In: KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, ACM.
Appelt, D.E., Hobbs, J.R., Bear, J., Israel, D., Tyson, M. (1992): FASTUS: A Finite-state Processor for Information Extraction from Real-world Text.
Borkar, V., Deshmukh, K., Sarawagi, S. (2001): Automatic Segmentation of Text into Structured Records.
Christen, P., Belacic, D. (2005): Automated Probabilistic Address Standardisation and Verification. In: Australasian Data Mining Conference 2005 (AusDM05).
Christen, P., Churches, T., Zhu, J.X. (2002): Probabilistic Name and Address Cleaning and Standardisation. In: The Australasian Data Mining Workshop 2002.
Cortez, E., De Moura, E.S. (2010): ONDUX: On-Demand Unsupervised Learning for Information Extraction. In: Proceedings of the 2010 International Conference on Management of Data (SIGMOD '10), pp. 807-818.
Lisbach, B. (2010): Linguistisches Identity Matching. Vieweg+Teubner. ISBN 978-3-8348-9791-6. URL http://dx.doi.org/10.1007/978-3-8348-9791-6_11.
Manning, C.D., Schütze, H. (1999): Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.
Marques, N.C., Gonçalves, S. (2004): Applying a Part-of-Speech Tagger to Postal Address Detection on the Web.
Peng, F., McCallum, A. (2003): Accurate Information Extraction from Research Papers Using Conditional Random Fields. In: Information Processing & Management.
Riloff, E. (1993): Automatically Constructing a Dictionary for Information Extraction Tasks. AAAI Press / MIT Press, pp. 811-816.
Wolf, J. (2011): Classifying the Components of Textual Location Information. Diploma Thesis, Department für Linguistik, Universität Potsdam.
Towards Multilingual Biographical Event Extraction
– Initial Thoughts on the Design of a new Annotation Scheme –
Michaela Geierhos*, Jean-Leon Bouraoui§, Patrick Watrin§
* CIS, Ludwig-Maximilians-Universität München, Geschwister-Scholl-Platz 1, D-80539 München, Germany
§ CENTAL, Université Catholique de Louvain, place Blaise Pascal 1, B-1348 Louvain-la-Neuve, Belgium
E-mail: micha@cis.uni-muenchen.de, mehdi.bouraoui@uclouvain.be, patrick.watrin@uclouvain.be
Abstract
Within this paper, we describe the special requirements of a semantic annotation scheme used for biographical event extraction in the framework of the European collaborative research project Biographe. This annotation scheme supports interlingual search for people due to its multilingual support covering four languages: English, German, French and Dutch.
Keywords: biographical event extraction for interlingual people search, semantic annotation scheme
1. Introduction
In everyday life, people search is frequently used for private interests such as locating classmates and old friends, finding partners for relationships or checking someone's background.
1.1. People Search within a Business Context
In a business context, finding the right person with the appropriate skills and knowledge is often crucial to the success of projects being undertaken (Mockus & Herbsleb, 2002). For instance, an employee may want to ascertain who worked on a particular project to find out why particular decisions were made without having to crawl through documentation (if there is any). Or, he may require a highly trained specialist to consult about a very specific problem in a particular programming language, standard, law, etc. Identifying experts may reduce costs and facilitate a better solution than could be achieved otherwise.
Possible scenarios could be the following:
• A personnel officer wants to find information about a person who applied for a specific position and has to collect additional career-related information about the applicant;
• A company requires a description of the state-of-the-art in some field and therefore wants to locate an expert in this knowledge area;
• An enterprise has to set up an additional team supporting an existing group and has to find new employees with similar expertise;
• Organizers of a conference have to match submissions with reviewers;
• Job centers or even labor bureaus are interested in mapping appropriate job offers to personal data sheets.
These scenarios demonstrate that it is a real challenge within any commercial, scientific, or governmental organization to manage the expertise of employees such that experts in a particular area can be identified.
1.2. Background: The Biographe Project
Going a step beyond document retrieval, people search is restricted to person-related facts. The Biographe project¹ develops grammar-based analysis tools to extract person-related facts in four languages (English, German, French, and Dutch). The project received the Eurostars² label in 2009. Kick-off was in March 2010 and the project lasts for 24 months. The research consortium is composed of four companies and two public research departments, based in four European countries (France, Belgium, Germany, and Austria). The team creates a multipurpose people search platform able to reconstruct biographies of people. It uses all available information sources such as profiles on social websites, press articles, CVs or private documents. The platform collects, extracts and structures this multilingual information in indexes and relational databases ready to be used by different task-oriented people search applications.
¹ http://www.biographe.org
² http://www.eurostars-eureka.eu
In this context, a semantic annotation scheme is commonly used. But conceiving such a scheme entails several technical, scientific and task-specific issues, especially when the platform is multilingual, which is still quite rare.
1.3. Multilinguality
One innovative feature of our people search platform is its multilinguality, or, to be precise, its ability to structure information coming from the four different European languages detailed above in a common database. By using this multilingual database, it is possible to create applications searching for people through queries and documents in different languages, a feature known as interlingual search (or Cross-Language Information Retrieval, CLIR). Creating a common multilingual database allows the development of a pan-European and wholly accessible search engine offering interfaces in English and in several major European languages. Besides, this people search engine is able to handle diacritical marks such as accents (circumflex, trema, tilde, double grave accent, etc.). This apparently simple feature is very rare, due to the dominance of American search engines neglecting all accents. Accents, diacritics and non-Latin symbols are very important in order to differentiate between people.
1.4. Objectives of the Paper
This extended abstract states our initial thoughts on the design of a new annotation scheme in such a specific framework. Since we cooperate with companies providing business and people search solutions, which already have established parsing technologies, our annotation scheme has to fulfill their technical requirements. Therefore, we only mention one of the main state-of-the-art schemes that is commonly used for biographical annotation tasks. We do not give a critical overview of all existing schemes because we have to develop an integrated solution and therefore, to some extent, reinvent the wheel. Furthermore, we discuss the particular context of multilingual annotation and finally give an example of our annotation scheme.
2. Yet another Annotation Scheme?
2.1. Linguistic Notion of Biographical Events
We define biographical events as predicative relations linking several arguments, out of which one is an instance belonging to the argument type PERSON. There is no restriction on the selection of the other elements participating in a biographical relationship. However, we observed that the other arguments are typically instances of a small set of further semantic classes, such as those illustrated by the examples below:
a. John Miller retired as senior accountant in 1909.
b. Michael Caine won the Academy Award for Best Supporting Actor.
c. Jim Sweeney will also be joining AmeriQuest as Vice President.
2.2. Events in the Information Extraction Task
One approach to defining events is used for Information Extraction (IE), being "the automatic identification of selected types of entities, relations, or events in free text" (Grishman, 2003:545). In general, information extraction tasks use surface-based patterns to identify concepts and relations between them. Patterns may be handcrafted or learned automatically, but typically include a combination of character strings, part-of-speech or phrasal information (Grishman, 1997). A succession of regular expressions is normally used to identify these structures; they are applied when triggered by keywords (McDonald, 1996). Most information extraction systems either use hand-written extraction patterns or a machine learning algorithm that is trained on a manually annotated corpus. Both of these approaches require massive human effort and hence prevent information extraction from becoming more widely applicable.
The problem that we are addressing is related to the traditional IE task covered by the sixth and seventh Message Understanding Conferences (MUC)³ and later replaced by the Automatic Content Extraction (ACE) campaigns. According to the MUC campaigns, identifying an IE event means extracting fillers for a predefined event template. In this framework, IE events were identified by rule-based, lexicon-driven, machine learning or other systems.
³ http://www-nlpir.nist.gov/related_projects/muc/
2.3. The ACE Annotation Guidelines for Events
Since 1999, ACE (Automatic Content Extraction)⁴ has replaced MUC and has extended the task definition for the campaigns, including more and more scenarios. For the ACE task (Doddington et al., 2004), the participating systems are supposed to recognize several predefined semantic types of events (life, movement, transaction, business, conflict, personnel, etc.) together with the constituent parts corresponding to these events (agent, object, source, target, time, location, etc.). For example, Table 1 provides an overview of the LIFE event type (with several subtypes including BE-BORN, DIE, etc.), together with the arguments which should be extracted for these events.
There exist approaches that identify events according to the TimeML annotation guidelines using rule-based (Saurí et al., 2005) or machine learning approaches (Bethard & Martin, 2006). The TimeML specification language was used to create the TimeBank corpus (Pustejovsky et al., 2003).

Life event subtype   Arguments
BE-BORN              Person, Time, Place
MARRY                Person, Time, Place
DIVORCE              Person, Time, Place
INJURE               Agent, Victim, Instrument, Time, Place
DIE                  Agent, Victim, Instrument, Time, Place

Table 1: An overview of ACE LIFE event subtypes
2.4. Limits of the ACE Annotation Scheme
Since we dedicated our research to biographical events, we only address the LIFE and PERSONNEL event types defined by the ACE English Annotation Guidelines for Events (Linguistic Data Consortium, 2005, p. 65 et seq.). In these guidelines, the number of arguments considered relevant is quite limited. For example, the BE-BORN event type disregards useful information such as the birth name, family background, or birth defects. Birth names in particular are useful to distinguish between people, e.g. by identifying that Stefani Joanne Angelina Germanotta and Lady Gaga are the same person in the following context:
d. Lady Gaga was born as Stefani Joanne Angelina Germanotta on March 28, 1986.
Since we need more detailed information about people, their work and occupations, we dismiss the ACE annotation standard for biographical event types. Hence we propose a more suitable one, which we present in the next sections.
⁴ http://projects.ldc.upenn.edu/ace/
3. Requirements of the Annotation Scheme
3.1. Compatibility with Local Grammars
Within the Biographe project, we focus on a linguistic description of biographical events. For example, the (born ... died ...) parentheses typically used in biographical articles help us to spot the dates of birth and death in the first line of a biography. However, there are variations in expressing a lifetime period, e.g. Jane Smith (June 1965 – September 14, 2001). In this case, the keywords born and died are totally missing. There are many syntactic variations in heterogeneous text expressing the same types of biographical information (e.g. birth, death), which are reduced to the basics in a structured representation. Our project partners created local grammars (Gross, 1997) using the free software tool Unitex⁵ (Paumier, 2010) in order to describe the syntactic and lexical structures of biographical information. Formally, local grammars are recursive transition networks (Woods, 1970), symbolized by graphs (cf. Figure 1).
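One path of such a grammar can be illustrated on a toy scale with a regular expression that spots "Name (June 1965 – September 14, 2001)" lifetime spans even when the keywords born/died are missing. The real system uses Unitex graphs, not regexes; the pattern below merely mirrors the idea.

```python
import re

# Month names and date formats covered here are a deliberate simplification.
MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")
DATE = MONTH + r"(?: \d{1,2},)? \d{4}"  # "June 1965" or "September 14, 2001"
LIFETIME = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) "
    r"\((?:born )?(?P<birth>" + DATE + r")"
    r"(?: ?[-\u2013] ?(?P<death>" + DATE + r"))?\)"
)

m = LIFETIME.search("Jane Smith (June 1965 \u2013 September 14, 2001) was a ...")
print(m.group("name"), "|", m.group("birth"), "|", m.group("death"))
# Jane Smith | June 1965 | September 14, 2001
```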
In the framework described above, the need for named entity annotation is evident: the extraction relies on the accurate identification of named entities and of the relations between them. Consequently, it is necessary to design an annotation scheme that can be integrated into the local grammar concept and applied to all languages supported by our system.
5 http://www-igm.univ-mlv.fr/~unitex<br />
Multilingual Resources and Multilingual Applications - Regular Papers<br />
Figure 1: Local grammar for the extraction of persondata fields belonging to the event “Birth”<br />
3.2. Definition of Annotation Units<br />
As stated above, the scheme is used for biographical<br />
information annotation. Hence, we defined a set of<br />
named entity categories as well as the relations between<br />
them. More precisely, this scheme follows three<br />
principles:<br />
1) The definition of “entity patterns”: these are the basic components of the annotation scheme. They capture the main characteristics that can be used to describe an entity, e.g. “location”, “date”, etc. So far, we have defined 20 different entity patterns;
2) The next higher level is the definition of “event patterns”: these are composed of two or more entity patterns. Within an event pattern, entity patterns play different roles: one is always the head of the pattern, and other optional or mandatory patterns can be attached to this head. For instance, the event pattern “awards” has the entity pattern “person” as its head, whose arguments are the entity patterns “domain” and “date”, and optionally another “person” if there is more than one award winner. At the same level, we also define so-called “relation patterns”, which express the type of relation holding between different entity patterns;
3) The highest level embodies the definition of “template sets”. These are derived from different event patterns and/or relation patterns. For example, the template set “career” comprises two event patterns, “profession” and “awards”, which themselves consist of different entity patterns, as explained above.
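The three-level hierarchy amounts to a simple containment structure. The sketch below renders it as hypothetical Python classes, using the “awards”/“career” example from the text; the class and field names are our own illustration, not part of the scheme's specification:

```python
from dataclasses import dataclass, field

@dataclass
class EntityPattern:
    name: str                      # e.g. "person", "date", "domain"

@dataclass
class EventPattern:
    name: str
    head: EntityPattern            # every event pattern has exactly one head
    mandatory: list = field(default_factory=list)
    optional: list = field(default_factory=list)

@dataclass
class TemplateSet:
    name: str
    events: list = field(default_factory=list)

# The "awards" event pattern described in the text:
person, date, domain = (EntityPattern(n) for n in ("person", "date", "domain"))
awards = EventPattern("awards", head=person,
                      mandatory=[domain, date],
                      optional=[person])       # a possible co-winner
profession = EventPattern("profession", head=person, mandatory=[date])
career = TemplateSet("career", events=[profession, awards])
```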
3.3. Annotation Sample<br />
Here is an instance of the use of this annotation scheme.<br />
e. Elio Di Rupo was born on July, 18th, 1951, at<br />
Morlanwelz, from Italian parents who arrived in<br />
Belgium in 1947.<br />
After applying the annotation scheme described above<br />
we get the following result (use of one event pattern and<br />
of two entity patterns):<br />
<br />
{Elio Di Rupo,.N+comp+PERS} was born<br />
on {July 18th 1951,.ADV+Time+Moment}<br />
at Morlanwelz,
from Italian {parents,.N+comp+FAMILY+IMMEDIATE}<br />
who arrived in Belgium {in 1947,.ADV+Time+Moment<br />
+Imprecis}.
3.4. Annotation Features<br />
The sample sentence (e) annotated in Section 3.3 shows that the scheme anticipates the future application of anaphora resolution tools. At present, it only handles anaphoric pronouns, but we plan to extend its capabilities to more complex anaphoric expressions.
The {} notation is used alongside XML because the annotation scheme has to be processed by the Unitex 6 system (Paumier, 2010:44-46), which expects this kind of meta-syntax in order to treat multi-word expressions (e.g. “July 18th 1951”) on the one hand and to assign lexico-semantic types (e.g. ADV+Time) to text units on the other.
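The {text,.TYPE} meta-syntax can be taken apart with a few lines of code. This is a simplified reading of the Unitex tag format, assuming no nested braces:

```python
import re

# One tag looks like {surface form,.TYPE+feature+feature}
TAG = re.compile(r"\{([^,}]+),\.([^}]+)\}")

def parse_tags(annotated):
    """Return (surface form, [semantic features]) pairs for each tag."""
    return [(m.group(1), m.group(2).split("+")) for m in TAG.finditer(annotated)]

sample = ("{Elio Di Rupo,.N+comp+PERS} was born "
          "on {July 18th 1951,.ADV+Time+Moment}")
```

Applied to the sample, the parser yields the multi-word surface forms together with their feature lists, which is exactly the information the lexico-semantic typing encodes.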
Moreover, the attribute “variable” can be assigned the value 0 or 1 depending on whether syntactic variation is possible for a recognized unit. Since the city name given in our sample sentence (here: “Morlanwelz”) cannot change its structure, we assign 0 to its “variable” attribute. However, “Elio Di Rupo” can appear elsewhere in the text as “Di Rupo”, “Elio”, “Di Rupo, Elio”, “Mr Di Rupo” and so on; we therefore assign 1 to its “variable” attribute.
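The enumeration of surface variants for a variable unit can be sketched as follows. This is a hypothetical helper for Western-style two-part names, offered only as an illustration; in the project itself such variation is handled inside the Unitex local grammars:

```python
def name_variants(full_name, titles=("Mr", "Mrs", "Dr")):
    """Generate plausible surface variants of a first-name + last-name string."""
    parts = full_name.split()
    first, last = parts[0], " ".join(parts[1:])
    variants = {full_name, first, last, f"{last}, {first}"}
    variants.update(f"{t} {last}" for t in titles)
    return variants
```

A unit with variable=1 would then be matched against any member of this set on later mentions, while a variable=0 unit is matched only in its fixed form.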
3.5. Technical Basis<br />
The scheme is defined in XML format. It will be applied to the text to be annotated in conjunction with the DBpedia ontology 7. Recall that this is an ontology based on the extraction and organisation of information from Wikipedia 8. The ontology features different categories, each corresponding to the description of a characteristic of an object or concept. These categories are linked using the Web Ontology Language (OWL), defined by the World Wide Web Consortium (W3C) 9.
This fine-grained ontology would be largely sufficient to cover all of the task's needs. Besides, it could easily be used for producing annotations in the Resource Description Framework (RDF) 10 triple format, also defined by the W3C, which means it could readily serve as the basis for conceiving and implementing a database.
Besides, such a database could be queried using the
6 http://www-igm.univ-mlv.fr/~unitex
7 http://dbpedia.org
8 http://www.wikipedia.org
9 http://www.w3.org/TR/owl-features
10 http://www.w3.org/RDF
SPARQL language 11, an SQL-like query language especially designed by the W3C to be compatible with OWL and RDF.
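Conceptually, the RDF/SPARQL pairing boils down to storing subject-predicate-object triples and matching patterns with variables against them. A toy in-memory version of that idea (the triples and property names below are illustrative; a real deployment would use a triple store and the W3C SPARQL language itself):

```python
# A handful of DBpedia-style triples (property names are illustrative).
triples = [
    ("Elio_Di_Rupo", "birthDate",  "1951-07-18"),
    ("Elio_Di_Rupo", "birthPlace", "Morlanwelz"),
    ("Lady_Gaga",    "birthName",  "Stefani Joanne Angelina Germanotta"),
]

def match(pattern, store):
    """Match one (s, p, o) pattern; strings starting with '?' are variables."""
    results = []
    for triple in store:
        binding = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                binding[want] = have       # bind the variable
            elif want != have:
                binding = None             # constant mismatch: reject triple
                break
        if binding is not None:
            results.append(binding)
    return results
```

A query such as `match(("Elio_Di_Rupo", "?p", "?o"), triples)` returns one variable binding per matching triple, which mirrors how a SPARQL basic graph pattern produces a result set.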
This solution meets all of the specifications required by<br />
the project: knowledge representation, indexation for<br />
most of the human languages (beyond English, German,<br />
French, and Dutch), updatability of the database, etc.<br />
4. Conclusion and Future Work
This paper has described an annotation scheme conceived and implemented in the framework of a European project. Compared to other schemes, its main advantages are its multilingual support and its generality with respect to any named-entity-related task.
Our short-term perspective is to evaluate its robustness, especially when it is applied automatically by local grammars. In the future, we will adapt it to other named-entity-related tasks and to additional natural languages.
5. Acknowledgements<br />
This work is supported by the Eurostars Programme, an R&D initiative funded by the European Community, the Brussels Institute for Research and Innovation (INNOVIRIS), and the German Federal Ministry of Education and Research (Grant No. 01QE0902B). We express our sincere thanks to all for financing this research within the collaborative research project Biographe E!4621 (http://www.biographe.org).
6. References<br />
Bethard, S., Martin, J. (2006): Identification of event<br />
mentions and their semantic class, in Proceedings of<br />
the Conference on Empirical Methods in Natural<br />
Language Processing (EMNLP–2006), Association for<br />
Computational Linguistics, Sydney, Australia,<br />
pp. 146-154.<br />
Doddington, G., Mitchell, A., Przybocki, M., Ramshaw,<br />
L., Strassel, S., Weischedel, R. (2004): The Automatic<br />
Content Extraction (ACE) Program. Tasks, Data, and<br />
Evaluation, in Proceedings of the Fourth International<br />
Conference on Language Resources and Evaluation<br />
(LREC 2004), Canary Islands, Spain.<br />
Grishman, R. (1997): Information Extraction:<br />
Techniques and Challenges, in M. T. Pazienza (ed.),<br />
Proceedings of the Information Extraction<br />
11 http://www.w3.org/TR/rdf-sparql-query<br />
International Summer School (SCIE-97),<br />
Springer-Verlag.<br />
Grishman, R. (2003): Information Extraction, in R.<br />
Mitkov (ed.), The Oxford Handbook of Computational<br />
Linguistics, Oxford University Press, pp. 545-559.<br />
Gross, M. (1997): The Construction of Local Grammars,<br />
in E. Roche & Y. Schabes (eds), Finite-State Language<br />
Processing, MIT Press, Cambridge, Massachusetts,<br />
USA: 329-354.<br />
Linguistic Data Consortium (2005): ACE English<br />
Annotation Guidelines for Events, Version 5.4.3<br />
2005.07.01,<br />
http://www.ldc.upenn.edu/Projects/ACE/docs/English<br />
-Events-Guidelines_v5.4.3.pdf<br />
McDonald, D. (1996): Internal and External Evidence in
the Identification and Semantic Categorization of<br />
Proper Names, in Corpus Processing for Lexical<br />
Acquisition: MIT Press, pp. 31-43.<br />
Mockus, A., Herbsleb, J.D. (2002): Expertise browser: a<br />
quantitative approach to identifying expertise. In<br />
ICSE’02: Proceedings of the 24th International<br />
Conference on Software Engineering, pp. 503–512.<br />
Paumier, S. (2010): Unitex User Manual 2.1,<br />
http://igm.univ-mlv.fr/~unitex/UnitexManual2.1.pdf.<br />
Pustejovsky, J., Castaño, J., Ingria, R., Saurí, R.,
Gaizauskas, R., Setzer, A., Katz, G., Radev, D. (2003):<br />
TimeML: A specification language for temporal and<br />
event expressions, in Proceedings of the International<br />
Workshop of Computational Semantics (IWCS–2003),<br />
Tilburg, The Netherlands.<br />
Saurí, R., Verhagen, M., Pustejovsky, J. (2005), Evita: A<br />
robust event recognizer for QA systems, in<br />
Proceedings of the Joint Human Language Technology<br />
Conference and Conference on Empirical Methods in<br />
Natural Language Processing (HLT/EMNLP-2005),<br />
Vancouver, Canada, pp. 700-707.<br />
Woods, W. A. (1970): Transition network grammars for<br />
natural language analysis, in Communications of the<br />
ACM, n° 10, vol. 13, ACM, New York, NY, USA,<br />
pp. 591–606.
The Corpus of Academic Learner English (CALE): A new resource for the study<br />
of lexico-grammatical variation in advanced learner varieties<br />
Marcus Callies, Ekaterina Zaytseva<br />
Johannes-Gutenberg-Universität Mainz, Department of English and Linguistics
Jakob-Welder-Weg 18, 55099 Mainz<br />
E-mail: mcallies@uni-mainz.de, zaytseve@uni-mainz.de
Abstract<br />
This paper introduces the Corpus of Academic Learner English (CALE), a Language for Specific Purposes learner corpus that is<br />
currently being compiled for the quantitative and qualitative study of lexico-grammatical variation patterns in advanced learners'<br />
written academic English. CALE is designed to comprise seven academic genres produced by learners of English as a foreign<br />
language in a university setting and thus contains discipline- and genre-specific texts. The corpus will serve as an empirical basis to<br />
produce detailed case studies that examine individual (or the interplay of several) determinants of lexico-grammatical variation, e.g.<br />
semantic, structural, discourse-motivated and processing-related ones, but also those that are potentially more specific to the<br />
acquisition of L2 academic writing such as task setting, genre and writing proficiency. Another major goal is to develop a set of<br />
linguistic criteria for the assessment of advanced proficiency conceived of as "sophisticated language use in context". The research<br />
findings will be applied to teaching English for Academic Purposes by creating a web-based reference tool that will give students<br />
access to typical collocational patterns and recurring phrases used to express rhetorical functions in academic writing.<br />
Keywords: learner English, academic writing, lexico-grammatical variation, advanced proficiency<br />
1. Introduction<br />
Recently, second language acquisition (SLA) research has<br />
seen an increasing interest in advanced stages of<br />
acquisition and questions of near-native competence.<br />
Corpus-based research into learner language (Learner<br />
Corpus Research, LCR) has contributed to a much clearer<br />
picture of advanced interlanguages, providing evidence<br />
that learners of various native language (L1) backgrounds<br />
have similar problems and face similar challenges on their<br />
way to near-native proficiency. Despite the growing<br />
interest in advanced proficiency, the fields of SLA and<br />
LCR are still struggling with i) a definition and<br />
clarification of the concept of "advancedness", ii) an<br />
in-depth description of advanced learner varieties (ALVs), and iii) the
operationalization of such a description in terms of criteria<br />
for the assessment of advancedness. In this paper, we<br />
introduce the Corpus of Academic Learner English<br />
(CALE), a Language for Specific Purposes learner corpus<br />
that is currently being compiled for the quantitative and<br />
qualitative study of lexico-grammatical variation patterns<br />
in advanced learners' written academic English.<br />
2. Corpus design and composition<br />
Existing learner corpora such as the International Corpus of Learner English (Granger et al., 2009) include learner writing of a general argumentative, creative or literary nature, and thus not academic writing in a narrow sense. Consequently, several patterns of variation that
predominantly occur in academic prose (or are subject to<br />
the characteristic features of this register) are not<br />
represented at all or not frequently enough in general<br />
learner corpora. CALE is designed to comprise academic<br />
texts produced by learners of English as a foreign<br />
language (EFL) in a university setting. CALE may<br />
therefore be considered a Language for Specific Purposes<br />
learner corpus, containing discipline- and genre-specific<br />
texts (Granger & Paquot, forthcoming). Similar corpora<br />
that contain native speaker (NS) writing and may thus<br />
serve as control corpora for CALE are the Michigan<br />
Corpus of Upper-Level Student Papers (MICUSP, Römer<br />
& Brook O'Donnell, forthcoming) and the British<br />
Academic Written English corpus (BAWE, Alsop &<br />
Nesi, 2009).<br />
CALE's seven academic text types ("genres") are written<br />
as assignments by EFL learners in university courses, see<br />
Figure 1.<br />
Figure 1: Academic text types in CALE<br />
We are currently collecting texts and bio data from<br />
German, Chinese and Portuguese students, and are<br />
planning to include data from EFL learners of other L1<br />
backgrounds to be able to draw cross-linguistic and<br />
typological comparisons as to potential L1 influence.<br />
The text classification we have developed for CALE is<br />
comparable with the NS control corpora mentioned<br />
above, but we have created clear(er) textual profiles,<br />
adopting the situational characteristics and linguistic<br />
features identified for academic prose by Biber and<br />
Conrad (2009). A text's communicative purpose or goal<br />
serves as the main classifying principle, which helps to<br />
set apart the seven genres in terms of<br />
a) the text's general purpose,
b) its specific purpose(s),
c) the skills the author demonstrates, and
d) the author's stance.
In addition, we list the major features of each text type<br />
as to<br />
a) structural features<br />
b) length, and<br />
c) functional features.<br />
3. Corpus annotation<br />
Students submit their texts in electronic form (typically<br />
in .doc, .docx or .pdf file format). Thus, some manual<br />
pre-processing of these incoming files is necessary.<br />
Extensive "non-linguistic" information (such as table of<br />
contents, list of references, tables and figures, etc.) is<br />
deleted and substituted by placeholder tags around their<br />
headings or captions. The body of the text is then<br />
annotated for meta-textual, i.e. underlying structural<br />
features (section titles, paragraphs, quotations, examples,<br />
etc.) with the help of annotation tools. The texts are also<br />
annotated (in a file header) for metadata, i.e. learner<br />
variables such as L1, age, gender, etc. which are collected<br />
through a written questionnaire. The file header also<br />
includes metadata that pertain to each individual text<br />
such as genre, type of course and discipline the text was<br />
written in, the setting in which the text was produced etc.<br />
This information is also collected with the help of a<br />
questionnaire that accompanies each text submitted to the<br />
corpus. In the future, we also intend to implement further<br />
linguistic levels of annotation, e.g. for rhetorical function<br />
or sentence type.<br />
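A file header of the kind described could be assembled with standard XML tooling. The element names below are hypothetical, since the paper does not specify the header format; the sketch only shows how learner variables and per-text metadata from the two questionnaires might be combined:

```python
import xml.etree.ElementTree as ET

def build_header(learner, text_meta):
    """Build a hypothetical CALE file header from questionnaire data."""
    header = ET.Element("header")
    learner_el = ET.SubElement(header, "learner")
    for key, value in learner.items():       # e.g. L1, age, gender
        ET.SubElement(learner_el, key).text = str(value)
    text_el = ET.SubElement(header, "text")
    for key, value in text_meta.items():     # e.g. genre, course, discipline
        ET.SubElement(text_el, key).text = str(value)
    return ET.tostring(header, encoding="unicode")

xml = build_header({"L1": "German", "age": 23},
                   {"genre": "research paper", "discipline": "linguistics"})
```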
4. Research program<br />
In the following sections, we outline our research<br />
program. We adopt a variationist perspective on SLA,<br />
combining a learner corpus approach with research on<br />
interlanguage variation and near-native competence.<br />
4.1. The study of variation in SLA research<br />
Interlanguages (ILs) as varieties in their own right are<br />
characterized by variability even more than native<br />
languages. Research on IL-variation since the late 1970s<br />
has typically focused on beginning and intermediate<br />
learners and on variational patterns in pronunciation and<br />
morphosyntax, i.e. the (un-)successful learning of<br />
actually invariant linguistic forms and the occurrence of<br />
alternations between native and non-native equivalent<br />
forms. Such studies revealed developmental patterns,<br />
interpreted as indicators of learners' stages of acquisition,<br />
and produced evidence that IL-variation co-varies with<br />
linguistic, social/situational and psycholinguistic context,<br />
and is also subject to a variety of other factors like<br />
individual learner characteristics and biographical<br />
variables (e.g. form and length of exposure to the L2).<br />
Since the early 2000s there has been an increasing<br />
interest in issues of sociolinguistic and sociopragmatic<br />
variation in advanced L2 learners (frequently referred to<br />
as sociolinguistic competence), e.g. learners' use of<br />
dialectal forms or pragmatic markers (mostly in L2<br />
French, see e.g. Mougeon & Dewaele, 2004; Regan,<br />
Howard & Lemée, 2009). This has marked both a shift
from the study of beginning and intermediate to advanced<br />
learners, and a shift from the study of norm-violations to<br />
the investigation of differential knowledge as evidence of<br />
conscious awareness of (socio-)linguistic variation.<br />
4.2. Advanced Learner Varieties (ALVs)<br />
There is evidence that advanced learners of various<br />
language backgrounds have similar problems and face<br />
similar challenges on their way to near-native proficiency.<br />
In view of these assumed similarities, some of which will<br />
be discussed in the following, we conceive of the<br />
interlanguage of these learners as Advanced Learner<br />
Varieties (ALVs).<br />
In a recent overview of the field, Granger (2008:269)<br />
defines advanced (written) interlanguage as "the result of<br />
a highly complex interplay of factors: developmental,<br />
teaching-induced and transfer-related, some shared by<br />
several learner populations, others more specific".<br />
According to her, typical features of ALVs are overuse of<br />
high frequency vocabulary and a limited number of<br />
prefabs, a much higher degree of personal involvement,<br />
as well as stylistic deficiencies, "often characterized by<br />
an overly spoken style or a somewhat puzzling mixture of<br />
formal and informal markers".<br />
Moreover, advanced learners typically struggle with the<br />
acquisition of optional and/or highly L2-specific<br />
linguistic phenomena, often located at interfaces of<br />
linguistic subfields (e.g. syntax-semantics, syntax-pragmatics, see e.g. DeKeyser, 2005:7ff). As to academic
writing, many of their observed difficulties are caused by<br />
a lack of understanding of the conventions of academic<br />
writing, or a lack of practice, but are not necessarily a<br />
result of interference from L1 academic conventions<br />
(McCrostie, 2008:112).<br />
4.3. Patterns and determinants of variation in L2<br />
academic writing<br />
Our research program involves the study of how L2 learners acquire the influence of several factors on constituent order and on the choice of constructional variants (e.g. the genitive and dative alternations, verb-particle placement, focus constructions). One
reason for this is that such variation is often located at the<br />
interfaces of linguistic subsystems, an area where<br />
advanced learners still face difficulties. Moreover,<br />
grammatical variation in L2 has not been well researched<br />
to date and is only beginning to attract researchers'<br />
attention (Callies, 2008, 2009; Callies & Szczesniak,<br />
2008).<br />
There are a number of semantic, structural,<br />
discourse-motivated and processing-related determinants<br />
that influence lexico-grammatical variation, whose interplay and influence on speakers' and writers' constructional choices have been widely studied in
corpus-based research on L1 English. Generally speaking,<br />
in L2 English these determinants play together with<br />
several IL-specific ones such as mother tongue (L1) and<br />
proficiency level, and in (academic) writing, some<br />
further task-specific factors like imagined audience (the<br />
people to whom the text is addressed), setting, and genre<br />
add to this complex interplay of factors, see Figure 2.<br />
Figure 2: Determinants of variation in L1 and L2<br />
academic writing<br />
It is important to note at this point that differences<br />
between texts produced by L1 and L2 writers that are<br />
often attributed to the influence of the learners' L1 may in<br />
fact turn out to result from differences in task-setting<br />
(prompt, timing, access to reference works, see Ädel,<br />
2008), and possibly task-instruction and imagined<br />
audience (see Ädel, 2006:201ff for a discussion of corpus<br />
comparability). Similarly, research findings as to<br />
learners' use of features that are more typical of speech<br />
than of academic prose have been interpreted as<br />
unawareness of register differences, but there is some<br />
evidence that the occurrence of such forms may also be<br />
caused by the influence of factors like the development of<br />
writing proficiency over time (novice writers vs. experts,<br />
see Gilquin & Paquot, 2008; Wulff & Römer, 2009),<br />
task-setting and -instruction, imagined audience and<br />
register/genre (e.g. academic vs. argumentative writing,<br />
see Zaytseva, 2011).
4.4. Case study<br />
In this section, we provide an example of how<br />
lexico-grammatical variation plays out in L2 academic<br />
writing. In a CALE pilot study of the (non-)<br />
representation of authorship in research papers written by<br />
advanced German EFL learners, Callies (2010) examined<br />
agentivity as a determinant of lexico-grammatical<br />
variation in academic prose. He hypothesized that even<br />
advanced students were insecure about the representation<br />
of authorship due to a mixture of several reasons:<br />
conflicting advice by teachers, textbooks and style guides,<br />
the diverse conventions of different academic disciplines,<br />
students' relative unfamiliarity with academic text types<br />
and lack of linguistic resources to report events and<br />
findings without mentioning an agent. Interestingly, the<br />
study found not only an overrepresentation of the first person pronouns I and we, but also an overrepresentation of the
highly impersonal subject-placeholders it and there<br />
(often used in the passive voice) as default strategies to<br />
suppress the agent, see examples (1) and (2).<br />
(1) There are two things to be discussed in this section.<br />
(2) It has been shown that…<br />
While this finding seems to be contradictory, it can be<br />
explained by a third major finding, namely the significant<br />
underrepresentation of inanimate subjects which are,<br />
according to Biber and Conrad (2009:162), preferred<br />
reporting strategies in L1 academic English, exemplified<br />
in (3) and (4).<br />
(3) This paper discusses…<br />
(4) Table 5 shows that…<br />
Callies (2010) concluded that L2 writers have a narrower<br />
inventory of linguistic resources to report events and<br />
findings without an overt agent, and their insecurity and<br />
unfamiliarity with academic texts adds to the observed<br />
imbalanced clustering of first person pronouns,<br />
dummy-subjects and passives. The findings of this study<br />
also suggest that previous studies that frequently explain<br />
observed overrepresentations of informal, speech-like<br />
features by pointing to learners' higher degree of<br />
subjectivity and personal involvement (Granger, 2008) or<br />
unawareness of register differences (Gilquin & Paquot,<br />
2008), may need to be supplemented by studies taking<br />
into account a more complex interplay of factors that also<br />
includes the limited choice of alternative strategies<br />
available to L2 writers.<br />
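Over- and underrepresentation of the kind reported in this case study is typically established by comparing normalized frequencies between a learner corpus and a reference corpus. A minimal version of such a comparison (the counts below are invented for illustration, not taken from Callies, 2010):

```python
def freq_per_10k(count, corpus_size):
    """Normalized frequency per 10,000 tokens."""
    return count / corpus_size * 10_000

def representation_ratio(learner_count, learner_size, ref_count, ref_size):
    """> 1 means overrepresented in the learner corpus, < 1 underrepresented."""
    return (freq_per_10k(learner_count, learner_size)
            / freq_per_10k(ref_count, ref_size))

# e.g. first-person "I" appearing 150 times in 50,000 learner tokens
# versus 60 times in 80,000 expert tokens:
ratio = representation_ratio(150, 50_000, 60, 80_000)   # 30 / 7.5 = 4.0
```

Normalization is essential here, since raw counts from corpora of different sizes are not comparable; a significance test would normally be applied on top of such ratios.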
5. Implications for language teaching<br />
and assessment<br />
The project we have outlined in this paper has some<br />
major implications for EFL teaching and assessment. The<br />
research findings will be used to provide<br />
recommendations for EFL teachers and learners by<br />
developing materials for teaching units in practical<br />
language courses on academic writing and English for<br />
Academic Purposes. In the long run, we plan to create a<br />
web-based reference tool that will help students look up<br />
typical collocations and recurring phrases used to express<br />
rhetorical moves/functions in academic writing (e.g.<br />
giving examples, expressing contrast, drawing<br />
conclusions etc.). This application will be geared towards<br />
students' needs and can be used as a self-study reference<br />
tool at all stages of writing an academic text. Users will<br />
be able to access information in two ways:<br />
1) form-to-function, i.e. looking up words and phrases in<br />
an alphabetical index to see how they can express<br />
rhetorical functions, and 2) function-to-form, i.e.<br />
accessing a list of rhetorical functions to find words and<br />
phrases that are typically used to encode them.<br />
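The two access routes amount to one mapping and its automatically derived inverse. A minimal sketch of the underlying data structure (the phrase-function pairs are invented for illustration; they are not the tool's actual inventory):

```python
from collections import defaultdict

# function-to-form: rhetorical function -> typical phrases
function_to_form = {
    "expressing contrast": ["by contrast", "on the other hand"],
    "drawing conclusions": ["to conclude", "in sum"],
    "giving examples":     ["for instance", "for example"],
}

# form-to-function: built as the inverse index, so the two views
# can never drift apart.
form_to_function = defaultdict(list)
for function, phrases in function_to_form.items():
    for phrase in phrases:
        form_to_function[phrase].append(function)
```

Deriving the form-to-function view rather than maintaining it by hand keeps the alphabetical index and the function list consistent as new collocations are added.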
Most importantly, the tool will present in a comparative<br />
manner structures that emerged as problematic in<br />
advanced learners' writing, for example untypical lexical<br />
co-occurrence patterns and over- or underrepresented<br />
words and phrases, side by side with those structures that<br />
typically occur in expert academic writing. This will<br />
include information on the immediate and wider context<br />
of use of single items and multi-word-units.<br />
While the outcome is thus particularly relevant for future<br />
teachers of English, it may also be useful for students and<br />
academics in other disciplines who have to write and<br />
publish in English. Unlike in the Anglo-American<br />
education system, German secondary schools and<br />
universities do not usually provide courses in academic<br />
writing in the students' mother tongue, so that first-year<br />
students have basically no training in academic writing<br />
at all.<br />
It has been mentioned earlier that the operationalization<br />
of a quantitatively and qualitatively well-founded<br />
description of advanced proficiency in terms of criteria
for the assessment of advancedness is still lacking. Thus,<br />
a major aim of the project is to develop a set of linguistic<br />
descriptors for the assessment of advanced proficiency.<br />
The descriptors and can-do-statements of the Common<br />
European Framework of Reference (CEFR) often appear<br />
too global and general to be of practical value for<br />
language assessment in general, and for describing<br />
advanced learners' competence as to academic writing in<br />
particular. Ortega and Byrnes (2008) discuss four ways in<br />
which advancedness has commonly been operationalised,<br />
ultimately favouring what they call "sophisticated<br />
language use in context", a construct that includes e.g. the<br />
choice among registers, repertoires and voice. This<br />
concept can serve as a basis for the development of<br />
linguistic descriptors that are characteristic of academic<br />
prose, e.g. the use of syntactic structures like inanimate<br />
subjects, phrases to express rhetorical functions (e.g. by<br />
contrast, to conclude, in fact), reporting verbs (discuss,<br />
claim, suggest, argue, propose etc.), and lexical<br />
co-occurrence patterns (e.g. conduct, carry out and<br />
undertake as typical verbal collocates of experiment,<br />
analysis and research).<br />
6. References<br />
Ädel, A. (2006): Metadiscourse in L1 and L2 English.<br />
Amsterdam: Benjamins.<br />
Ädel, A. (2008): Involvement features in writing: do time<br />
and interaction trump register awareness? In G.<br />
Gilquin, S. Papp, & M.B. Diez-Bedmar (Eds.),<br />
Linking up Contrastive and Learner Corpus Research.<br />
Amsterdam: Rodopi, pp. 35-53.<br />
Alsop, S., Nesi, H. (2009): Issues in the development of<br />
the British Academic Written English (BAWE) corpus.<br />
Corpora, 4(1), pp. 71-83.<br />
Biber, D., S. Conrad (2009): Register, Genre, and Style.<br />
Cambridge: Cambridge University Press.<br />
Callies, M. (2008): Easy to understand but difficult to use?<br />
Raising constructions and information packaging in<br />
Multilingual Resources and Multilingual Applications - Regular Papers<br />
From Multilingual Web-Archives to Parallel Treebanks in Five Minutes<br />
Markus Killer, Rico Sennrich, Martin Volk<br />
University of Zurich<br />
Institute of Computational Linguistics, Binzmühlestrasse 14, CH-8050 Zurich, Switzerland<br />
E-mail: markus.killer@uzh.ch, sennrich@cl.uzh.ch, volk@cl.uzh.ch<br />
Abstract<br />
The Tree-to-Tree (t2t) Alignment Pipe is a collection of Python scripts that generates automatically aligned parallel treebanks from
multilingual web resources or existing parallel corpora. The pipe contains wrappers for a number of freely available NLP software
programs. Once these third-party programs have been installed and the system- and corpus-specific details have been updated, the
pipe is designed to generate a parallel treebank with a single program call from a Unix command line. We discuss alignment quality
on a fully automatically processed parallel corpus.
Keywords: parallel treebank, automatic tree-to-tree alignment, TreeAligner, Text-und-Berg<br />
1. Introduction<br />
The process of creating parallel treebanks used to be a<br />
tedious task, involving a tremendous amount of manual<br />
annotation (see e.g. Samuelsson & Volk, 2007). Zhechev<br />
and Way (2008:1) state that ”[b]ecause of this, only a<br />
few parallel treebanks exist and none are of sufficient<br />
size for productive use in any statistical MT<br />
application”. Since Zhechev (2009) introduced the Sub-<br />
Tree Aligner, a program for the automatic generation of<br />
parallel treebanks, the feasibility of obtaining large scale<br />
annotated parallel treebanks has increased. However, the<br />
amount of preprocessing needed as well as the missing<br />
conversion of the output into a more human readable<br />
format might have kept potential users of the Sub-Tree<br />
Aligner at a distance. The collection of Python scripts<br />
combined in the Tree-to-Tree Alignment Pipe (t2t-pipe)<br />
described below takes care of all necessary pre- and<br />
postprocessing of Zhechev’s Sub-Tree Aligner,<br />
supporting German, French and English as source and<br />
target languages. The focus of this paper is on the<br />
following two questions, both aimed at maximizing the<br />
quality of the automatic alignments:<br />
• How big does the parallel corpus have to be in order
to get satisfactory results?
• What can be said about the role of the text
domain/topic of the parallel corpus?
2. Related Work<br />
Zhechev (2009) and Koehn (2009) provide an overview<br />
of recent developments in tree-to-tree alignment, subtree<br />
alignment and the subsequent generation of parallel<br />
treebanks for use in statistical machine translation<br />
systems.<br />
Tiedemann and Kotzé (2009) and Tiedemann (2010)<br />
propose a supervised approach to tree-to-tree alignment,<br />
requiring a small manually aligned or manually<br />
corrected treebank of at least 100 sentence pairs1 for<br />
training purposes.<br />
In terms of script design, the training-script for the<br />
Moses SMT system (Koehn, 2010b) inspired the<br />
organization of the t2t-pipe into several steps that can be<br />
run independently.<br />
3. Parallel Corpora<br />
In an ideal world, one would be inclined to take a
number of parallel articles from a bilingual text
collection and let the t2t-pipe, combined with the Sub-
Tree Aligner, do the rest. Yet this is only possible if a
suitable word alignment model2 is available, as we will
show in section 5.
1 See http://stp.lingfil.uu.se/~joerg/Lingua/index.html<br />
(accessed: 21/08/11)<br />
2 All word alignment models used in this paper can be<br />
downloaded from: http://t2t-pipe.svn.sourceforge.net/<br />
(accessed: 21/08/11)<br />
With the aim of collecting information on the role of<br />
corpus size and text domain/topic in creating an<br />
automatically aligned parallel treebank, the following<br />
corpora were used:<br />
3.1. Corpus for Tree-to-Tree Alignment<br />
A subcorpus of the Text+Berg corpus (Volk et al., 2010)<br />
consisting of four parallel articles from the Swiss Alpine<br />
Club Yearbook 1977 served as test corpus (see [TUB-4-<br />
ART] in table 1). Details on the corpus with regard to<br />
the extraction of parallel articles and sentence pairs are<br />
described in Sennrich and Volk (2010). For the purpose<br />
of this paper it is sufficient to note that the vast majority<br />
of texts can be attributed to the journalistic textual<br />
domains article/report/review with a strong topical focus<br />
on activities performed by members of the Swiss Alpine<br />
Club (climbing, hiking, trekking) and the alpine<br />
environment in general. As the corpus has been digitised<br />
from printed books, it contains OCR errors.
Corpus        Lang.  Tokens      Sentence Pairs
[TUB-4-ART]   DE     21,689      1,171
              FR     25,388      (GIZA++: 1,023)
[TUB]         DE     1,617,301   92,518
              FR     1,921,583   (GIZA++: 80,698)
[EPARL]       DE     35,371,164  1,562,563
              FR     42,427,755  (GIZA++: 1,190,609)

Table 1: Parallel Corpora
[TUB-4-ART]: Text+Berg Corpus, 4 articles, SAC Yearbook 1977
[TUB]: Text+Berg Corpus, SAC Yearbooks 1957-1982
[EPARL]: Europarl Corpus 1996-2009
3.2. Corpora for Word Alignment<br />
Additionally, we used the complete Text+Berg corpus<br />
[TUB] , the Europarl corpus (Koehn, 2010a) [EPARL]<br />
and combinations of these two corpora to compute<br />
different word alignment models (see table 1 for basic<br />
corpus information). Word alignment is automatically<br />
computed through GIZA++ (Och & Ney, 2003), which<br />
implements the IBM word alignment models. For<br />
performance reasons, we set the maximum sentence<br />
length to 40 tokens3. Therefore, we used only 83% of
the [TUB] corpus and 76% of the [EPARL]
corpus to estimate word alignment probabilities (see
table 1 for the absolute values in brackets).
3 See http://www.statmt.org/wmt11/baseline.html
(accessed: 21/08/11)
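The 40-token cap amounts to a symmetric length filter over the sentence pairs before they are handed to GIZA++. A minimal sketch (the function name is hypothetical, not part of the t2t-pipe):

```python
def filter_by_length(sentence_pairs, max_tokens=40):
    """Keep only pairs where both sides are at most max_tokens long.

    sentence_pairs: iterable of (source_tokens, target_tokens) tuples.
    """
    return [(src, tgt) for src, tgt in sentence_pairs
            if len(src) <= max_tokens and len(tgt) <= max_tokens]

pairs = [
    (["Man", "versuche", "einmal"], ["Essayez", "donc"]),
    (["w"] * 50, ["w"] * 45),  # over the cap, dropped before word alignment
]
kept = filter_by_length(pairs)
```

Applied to [TUB] and [EPARL], such a filter retains the 83% and 76% of sentence pairs reported above.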
We used [EPARL] to test the impact of corpus size on<br />
the results. Moreover, texts from the [EPARL] corpus<br />
belong to a completely different textual domain<br />
(parliament proceedings) and cover a wide range of<br />
political, economic and cultural topics (see Koehn,<br />
2009:53), making it possible to use the data to figure out<br />
the role of text domain/topic in the alignment process.<br />
4. The t2t-pipe<br />
Taking an existing parallel corpus4 as input, the t2t-pipe<br />
runs through seven steps to generate automatic<br />
alignments for individual words and syntactic<br />
constituents in each parallel sentence pair. The<br />
configuration file is deliberately designed so that a
number of different third-party programs can be chosen
for most of the steps, enabling easy switching between<br />
different configurations. In the brief outline of the<br />
following steps, the configuration that worked best is<br />
indicated (please refer to the t2t-pipe README file5 for<br />
details on all 12 programs used):<br />
4.1. Steps 1-5 – Preprocessing<br />
1) Extraction of Parallel Articles<br />
2) Tokenization<br />
(Python NLTK Punkt-Tokenizer)<br />
Rudimentary OCR cleaning/<br />
Fixing of word division errors<br />
3) Sentence Alignment<br />
(Hunalign with dict.cc dictionary)<br />
4) Statistical Phrase Structure Parsing<br />
(Stanford Parser for German,<br />
Berkeley Parser for French)<br />
5) Word Alignment<br />
(GIZA++ through Moses training script,<br />
enhanced with dict.cc dictionary,<br />
see section 4.2 for an example),<br />
data not lower-cased<br />
4 If no parallel corpus is available, the pipe includes scripts for<br />
the on-the-fly construction of a parallel corpus from the web<br />
archives of the bilingual Swiss Alpine Club magazine<br />
(German-French).<br />
5 Available from: http://t2t-pipe.svn.sourceforge.net/<br />
(accessed: 21/08/11)
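As noted in section 2, the pipe is organized into steps that can be run independently, in the manner of the Moses training script. A toy sketch of such a step driver (all step and handler names are hypothetical; the real pipe wraps external tools such as Hunalign, the parsers and GIZA++):

```python
# Hypothetical sketch of a step-based pipeline driver.
STEPS = {
    1: "extract_articles",
    2: "tokenize",
    3: "align_sentences",
    4: "parse",
    5: "align_words",
    6: "align_trees",
    7: "convert_output",
}

def run_pipeline(handlers, first=1, last=7):
    """Run steps first..last; each handler is looked up by step name."""
    for step in range(first, last + 1):
        handlers[STEPS[step]]()

# Record which steps run when only a sub-range is requested.
log = []
handlers = {name: (lambda n=name: log.append(n)) for name in STEPS.values()}
run_pipeline(handlers, first=3, last=5)
```

Running a sub-range, as above, is what allows a user to re-run, say, word alignment without repeating tokenization.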
Multilingual Resources and Multilingual Applications - Regular Papers<br />
4.2. Step 6 - Tree-to-Tree Alignment<br />
This is the most important step in a complete run of the<br />
t2t-pipe, as the automatic alignments are generated by<br />
Zhechev's Sub-Tree Aligner. The process can best be
described by looking at a parallel sentence pair, taken<br />
from [TUB-4-ART]:<br />
1) German sentence: Man versuche einmal einen<br />
solchen Mann abzubremsen.<br />
2) French sentence: Essayez donc de freiner un tel<br />
homme. 6<br />
• Input:
a. Bracketed parse trees of source and target language<br />
(output of the two parsers combined into one file):<br />
(ROOT (NUR (S (PIS Man) (VVFIN versuche) (ADV<br />
einmal) (VP (NP (ART einen) (PIDAT solchen) (NN<br />
Mann)) (VVIZU abzubremsen))) ($. !))) \n<br />
(ROOT (SENT (VN (V Essayez)) (ADV donc) (VPinf (P<br />
de) (VN (V freiner)) (NP (D un) (A tel) (N<br />
homme))) (. !)))\n\n\n<br />
b. Two lexical translation files generated by the Moses<br />
training script and GIZA++, enhanced using a<br />
dict.cc dictionary:<br />
lex.e2f (French – German – Probability)<br />
Homme Mann 1.0000000<br />
homme Mann 1.0000000<br />
mari Mann 1.0000000<br />
ralentir abzubremsen 0.0666667<br />
freiner abzubremsen 0.0666667<br />
lex.f2e (German – French – Probability)<br />
abzubremsen ralentir 0.0053476<br />
abzubremsen freiner 0.0035842<br />
Mann Homme 1.0000000<br />
Mann homme 1.0000000<br />
Mann mari 1.0000000<br />
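The lexical translation files are plain whitespace-separated triples, one entry per line. A minimal loader (a hypothetical helper, not part of the t2t-pipe) could collect them into a nested dictionary keyed by source word:

```python
def load_lex_table(lines):
    """Parse 'target source probability' lines (lex.e2f style) into a dict."""
    table = {}
    for line in lines:
        tgt, src, prob = line.split()
        table.setdefault(src, {})[tgt] = float(prob)
    return table

lex_e2f = load_lex_table([
    "homme Mann 1.0000000",
    "mari Mann 1.0000000",
    "freiner abzubremsen 0.0666667",
])
```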
• Output:
Indexed bracketed parse trees of source and target<br />
language with alignment indices on a separate line<br />
(see Figure 1 for graphical alignments). In our<br />
example sentence, the Sub-Tree Aligner produced<br />
one wrong alignment, linking the German personal<br />
pronoun man to the French finite verb essayez<br />
(emphasised below):<br />
6 Sentences 1) and 2) translate roughly as: [(Why don't) you try<br />
to slow down a man like that (a heavy man)!]<br />
(ROOT::NUR-2 (S-3 (PIS-4 Man)(VVFIN-5 versuche)<br />
(ADV-6 einmal)(VP-7 (NP-8 (ART-9 einen)(PIDAT-10<br />
solchen)(NN-11 Mann))(VVIZU-12 abzubremsen)))($.-<br />
13 !)) \n<br />
(ROOT::SENT-2 (VN::V-4 Essayez)(ADV-5 donc)<br />
(VPinf-6 (P-7 de)(VN::V-9 freiner)(NP-10 (D-11<br />
un)(A-12 tel)(N-13 homme)))(.-14 !)) \n<br />
2 2 4 4 6 5 7 6 8 10 9 11 10 12 11 13 12 9 13 14<br />
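Assuming the flat index line alternates source and target node IDs (consistent with the example, where German node 4, Man, is wrongly linked to French node 4, Essayez), a small helper can recover the link pairs:

```python
def parse_alignment_line(line):
    """Turn a flat index line into (source_node, target_node) pairs."""
    nums = [int(n) for n in line.split()]
    return list(zip(nums[0::2], nums[1::2]))

links = parse_alignment_line(
    "2 2 4 4 6 5 7 6 8 10 9 11 10 12 11 13 12 9 13 14")
# (4, 4) is the erroneous Man-Essayez link discussed above
```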
4.3. Step 7 - Conversion to TigerXML/TMX<br />
We converted the output of Zhechev’s Sub-Tree Aligner<br />
into two language specific TigerXML files and an<br />
additional XML file containing information on node<br />
alignments. These files can be easily imported into the<br />
graphical interface of the Stockholm TreeAligner<br />
(Lundborg et al., 2007). Figure 1 shows the previously<br />
introduced sentence pair – including the automatically<br />
computed links – in the treebank browser perspective of<br />
the Stockholm TreeAligner.<br />
Figure 1: Automatically aligned sentence pair in<br />
Stockholm TreeAligner<br />
The second supported output format is TMX, a format
used by current translation memory systems (tested with
OmegaT7).
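A minimal TMX export of aligned sentence pairs can be sketched with the standard library; this is an illustrative stand-in for the pipe's converter, not its actual code:

```python
import xml.etree.ElementTree as ET

def to_tmx(pairs, src_lang="de", tgt_lang="fr"):
    """Build a minimal TMX 1.4 document from (source, target) segment pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "srclang": src_lang, "segtype": "sentence", "adminlang": "en",
        "datatype": "plaintext", "o-tmf": "none",
        "creationtool": "t2t-sketch", "creationtoolversion": "0.1"})
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

doc = to_tmx([("Man versuche einmal einen solchen Mann abzubremsen.",
               "Essayez donc de freiner un tel homme.")])
```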
5. Treebank Alignment Quality<br />
We ran six experiments (summarized in table 2) on the<br />
test corpus [TUB-4-ART] (see table 1). In each<br />
experiment, the corpus used to compute the lexical<br />
translation probabilities with GIZA++ either differed<br />
7 Available from: http://www.omegat.org (accessed: 21/08/11)<br />
Corpus                1 [TUB-4-ART]  2 [TUB-4-ART]  3 [EPARL]     4 [TUB]    5 [TUB-EPARL]  6 [TUB-EPARL]
Corpus Size GIZA++    1,023 SP       1,023 SP       1,190,609 SP  80,698 SP  258,971 SP     1,271,307 SP
In-domain (%)         100.0%         100.0%         0.0%          100.0%     31.0%          6.0%
Dict.cc SA/WA         NO             YES            YES           YES        YES            YES
Precision WA          57.8%          61.1%          51.3%         65.9%      69.1%          69.2%
Precision PhA         58.3%          65.4%          51.8%         81.7%      79.5%          80.4%
Precision allA        57.9%          62.1%          51.4%         69.2%      71.3%          71.7%
Correct links per SP  8.66           9.63           9.02          12.48      13.64          13.98

Table 2: Alignment precision and average number of correct links in the treebank of the [TUB-4-ART] corpus (1,171
sentence pairs) with respect to size, enhancement through additional lexical resources and textual domain of the
corpus used to compute the lexical translation probabilities.
Precision = Correct Alignments / Suggested Alignments; SP: Sentence Pair(s); SA: Sentence Alignment; WA: Word Alignment;
PhA: Phrase Alignment; allA: Word & Phrase Alignments; In-domain: domain correspondence of treebank and WA corpus
with respect to corpus size and textual domain or<br />
enhancement by external lexical resources (dict.cc<br />
dictionary).<br />
We manually checked an average of 545 alignments<br />
(77% word alignments, 23% phrase alignments) in 32
randomly selected sentence pairs8 for each of the six<br />
resulting treebanks, using the Stockholm TreeAligner.<br />
Our information on changes in recall is based on the<br />
absolute number of correct links in the manually<br />
checked sentence pairs (average no. of correct links =<br />
average no. of all links9 x precision10 ).<br />
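This recall proxy can be made explicit in a few lines; here the experiment 2 figures from table 2 are used to back out the aligner's implied average link count per sentence pair:

```python
def correct_links_per_sp(avg_links, precision):
    """Recall proxy from section 5: correct links = all links x precision."""
    return avg_links * precision

def implied_total_links(correct, precision):
    """Back out the aligner's average link count from the table 2 values."""
    return correct / precision

# Experiment 2: overall precision 62.1%, 9.63 correct links per sentence pair.
total = implied_total_links(9.63, 0.621)  # roughly 15.5 suggested links per pair
```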
5.1. Corpus Size<br />
Looking at the configuration outlined in section 4, three<br />
of the seven steps in the t2t-pipe directly depend on the<br />
corpus size (Tokenization (Dehyphenation), Sentence<br />
Alignment and Word Alignment). The analysis of the<br />
alignment quality in the resulting parallel treebank<br />
shows that roughly 1000 sentence pairs are not enough<br />
to get satisfactory results with an overall precision of<br />
57.9% (see table 2, experiment 1). Initial tests have<br />
shown that Zhechev’s Sub-Tree Aligner is highly
dependent on the quality of the word alignments
supplied. Even though the algorithm does not directly
replicate the GIZA++ alignments:
[M]y system uses a probabilistic bilingual
dictionary derived from the GIZA++ word
alignments, thus being able to side-step errors
present in the original word-alignment data and
to find new possible alignments that GIZA++
had skipped for the particular sentence pair.
(Zhechev, 2009:73)
8 This number proved to be sufficient to include at least 100
Phrase Alignments in the sample. The identity of the treebank
was masked for the manual evaluation.
9 Computed by the Sub-Tree Aligner for the whole treebank.
10 Computed from manually checked sentence pairs.
We employed two measures to increase the precision of<br />
the alignments:<br />
1) We enhanced the lexical translation probabilities<br />
computed by GIZA++ by extracting all 1-to-1 word<br />
translations from the freely available dict.cc<br />
dictionary (DE-FR), leading to a substantial<br />
increase in precision (+ 4.2%) and in recall (+ 0.97<br />
correct links per sentence pair).<br />
2) Step-by-step, we increased the corpus size, making<br />
use of all available resources. In experiment 3 it<br />
becomes clear that a huge increase of corpus size<br />
alone is no guarantee for better alignment results:<br />
When we use the 1,190,609 sentence pair [EPARL]<br />
corpus on its own, the recall drops by 0.61 correct
Multilingual Resources and Multilingual Applications - Regular Papers<br />
links per sentence pair and the precision by 10.7%<br />
compared to experiment 2. However, increasing the<br />
size of the [TUB] corpus from 1,023 to 80,698<br />
sentence pairs as a basis for the word alignment<br />
model leads to the biggest leap in the experiment<br />
sequence in both precision (+ 7.1%) and recall<br />
(+2.85 correct links per sentence pair) compared to<br />
experiment 2.<br />
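Measure 1 can be pictured as a merge of 1-to-1 dictionary pairs into the GIZA++ lexical table. The sketch below is hypothetical: the paper does not specify how dictionary entries are weighted, so a fixed probability that never overwrites learned values is assumed (the lex.e2f excerpt in section 4.2, with several 1.0 entries, is consistent with this):

```python
def enhance_lex_table(giza_table, dict_pairs, dict_prob=1.0):
    """Add 1-to-1 dictionary translations to a GIZA++ lexical table.

    Assumption (not stated in the paper): dictionary entries receive a
    fixed probability and do not overwrite learned entries.
    """
    table = {src: dict(tgts) for src, tgts in giza_table.items()}
    for src, tgt in dict_pairs:
        table.setdefault(src, {}).setdefault(tgt, dict_prob)
    return table

giza = {"abzubremsen": {"freiner": 0.0035842}}
enhanced = enhance_lex_table(
    giza, [("Mann", "homme"), ("abzubremsen", "freiner")])
```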
5.2. Domain/Topic Specific Content<br />
The data collected in table 2 suggests that when using<br />
the unsupervised approach proposed by Zhechev (2009)<br />
the domain of the corpus used to compute the lexical<br />
translation probabilities seems to be of great importance.<br />
In experiment 3, we observe the poorest precision of all<br />
experiments with the second biggest corpus [EPARL].<br />
Apart from a few common lexical items (e.g. mountain,<br />
valley, river, ...) there is hardly any overlap in terms of<br />
textual domain/topic (see section 3) and the
[TUB-4-ART] corpus itself was not used to compute lexical
probabilities in experiment 3 (hence the 0%<br />
correspondence between the two corpora).<br />
Comparing these results to the supervised approach by<br />
Tiedemann and Kotzé (2009), there seems to be an<br />
important difference, as they observe ”only a slight drop<br />
in performance when training on a different textual<br />
domain” (204). The main reason for this might be that
in the supervised approach the program trains phrase<br />
alignments from manually aligned training data<br />
(relatively domain/topic independent), whereas in the<br />
unsupervised approach the parallel corpus is used to<br />
compute lexical translation probabilities (heavily<br />
dependent on domain/topic).<br />
5.3. The Right Balance of Corpus Size and<br />
Domain/Topic Specific Content<br />
Bearing this difference of the two approaches in mind, it<br />
is not surprising that balancing (in terms of textual<br />
domain/topic - experiment 5) or expanding (maximising<br />
corpus size - experiment 6) the word alignment model<br />
affects the results in a different way:<br />
When using a better model for estimating<br />
lexical probabilities (more data:<br />
Europarl+SMULTRON) the performance<br />
improves only slightly to about 58.64% [F-<br />
Score compared to 57.57%]<br />
(Tiedemann & Kotzé, 2009:204)<br />
In the unsupervised approach (used in the t2t-pipe)<br />
however, the use of a better word alignment model<br />
[TUB-EPARL] increases the recall by another 1.16 and<br />
1.50 correct links per sentence pair, respectively<br />
(experiments 5/6), compared to the largest corpus with a<br />
100% domain correspondence (experiment 4). For<br />
phrase alignments, we achieved a precision of roughly<br />
80% from a corpus size of approx. 80,000 sentence pairs<br />
of the same domain (experiments 4-6). The maximum<br />
precision of word alignments in this set-up (data not<br />
being lower-cased) seems to be around 70% from a<br />
corpus size of about 250,000 sentence pairs, while the<br />
recall can still be slightly increased by supplying more<br />
and more data to estimate lexical probabilities. As long<br />
as there is a solid basis of several 10,000 sentence pairs<br />
belonging to the same textual domain as the parallel<br />
corpus to be aligned, expanding the corpus used to<br />
compute lexical probabilities with material of another<br />
textual domain does not seem to harm the results but can<br />
still help to increase overall precision and recall by a<br />
small margin.<br />
6. Conclusion and Outlook<br />
We designed the t2t-pipe considering the following areas<br />
of application:<br />
1) Assisting human annotators of a parallel treebank<br />
by supplying good alignment suggestions: The<br />
results discussed in section 5 have shown that this<br />
can be achieved by employing a large enough<br />
parallel corpus of approx. 250,000 sentence pairs<br />
with data of the same textual domain. If the corpus<br />
is not big enough, the results can be improved by<br />
adding language material of a completely different<br />
textual domain. We achieved an overall precision of<br />
71.7% (approx. 80% for phrase alignments). Using<br />
a corpus of 500-1,000 sentence pairs (a common<br />
size for human annotated parallel treebanks) or a<br />
word alignment model trained solely on a different<br />
textual domain does not lead to reasonable<br />
automatic alignments. However, if there already is a<br />
suitable word alignment model for a specific text<br />
domain/topic, the generation of a brand new<br />
treebank is just five minutes away.<br />
2) Visualisation/manual evaluation of the results of<br />
different components of a tree-based SMT system<br />
(e.g. Parsing, Word/Phrase Alignment): The data<br />
collected and analysed in section 5 is one possible<br />
application of the t2t-pipe in this category.<br />
3) As a by-product, the t2t-pipe produces phrase<br />
alignments for translation memory systems: With a<br />
corpus of approx. 80,000 sentence pairs, the<br />
precision of the alignments is around 80%. These<br />
alignments can be manually checked and a new<br />
TMX file can be easily generated from the corrected<br />
alignment data.<br />
In future versions of the program, the two approaches<br />
presented by Zhechev (2009) and Tiedemann and Kotzé<br />
(2009) could be combined. We see additional potential<br />
for improvement in using lower-cased data and a corpus<br />
free of OCR errors for word and subtree alignment.<br />
7. References<br />
Koehn, P. (2009): Statistical Machine Translation.<br />
Cambridge: Cambridge University Press.<br />
Koehn, P. (2010a): European Parliament Proceedings<br />
Parallel Corpus 1996-2009. Release v5. TXT-Format.
Description in: Europarl: A Parallel Corpus for
Statistical Machine Translation, Philipp Koehn, MT<br />
Summit 2005. URL: http://www.statmt.org/europarl.<br />
Koehn, P. (2010b): MOSES. Statistical Machine<br />
Translation System. User Manual and Code Guide,<br />
November. URL:<br />
http://www.statmt.org/moses/manual/manual.pdf.<br />
Lundborg J., Marek T., Mettler M., Volk, M. (2007):<br />
Using the Stockholm TreeAligner. In Proceedings of<br />
the Sixth International Workshop on Treebanks and<br />
Linguistic Theories (TLT’06). Bergen, Norway:<br />
Northern European Association for Language<br />
Technology, pp. 73–78.<br />
Och, F. J., Ney, H. (2003): A Systematic Comparison of<br />
Various Statistical Alignment Models. Computational<br />
Linguistics 29, pp. 19–51.<br />
Samuelsson, Y. , Volk, M. (2007): Alignment Tools for<br />
Parallel Treebanks. In Proceedings of the GLDV<br />
Frühjahrstagung, Tübingen, Germany.<br />
Sennrich R., Volk, M. (2010): MT-based Sentence<br />
Alignment for OCR-generated Parallel Texts. In<br />
Proceedings of the Ninth Conference of the<br />
Association for Machine Translation in the Americas<br />
(AMTA 2010).<br />
Tiedemann J., Kotzé, G. (2009): Building a Large<br />
Machine-Aligned Parallel Treebank. In Proceedings<br />
of the Eighth International Workshop on Treebanks<br />
and Linguistic Theories (TLT’08). Milano, Italy:<br />
EDUCatt: pp. 197–208.<br />
Tiedemann J. (2010): Lingua-Align: An Experimental<br />
Toolbox for Automatic Tree-to-Tree Alignment. In<br />
Proceedings of the 7th International Conference on<br />
Language Resources and Evaluation (LREC’2010),<br />
Valletta, Malta.
Volk, M., Bubenhofer, N., Althaus A., Bangerter, M.,<br />
Marek T., Ruef, B. (2010): Text+Berg-Korpus (Pre-<br />
Release 118+ Digitale Edition Die Alpen 1957-1982).<br />
XML-Format, May. Digitale Edition des Jahrbuch des<br />
SAC 1864-1923 und Die Alpen 1925-1995. URL:<br />
http://www.textberg.ch.<br />
Zhechev V., Way, A. (2008): Automatic Generation of<br />
Parallel Treebanks. In Proceedings of the 22nd<br />
International Conference on Computational
Linguistics. Manchester, UK, pp. 1105–1112.
Zhechev, V. (2009): Automatic Generation of Parallel<br />
Treebanks. An Efficient Unsupervised System.<br />
Dissertation, School of Computing, Dublin City<br />
University.
Querying multilevel annotation and alignment for<br />
detecting grammatical valence divergencies<br />
Oliver Čulo<br />
FTSK, <strong>Universität</strong> Mainz<br />
An der Hochschule 2, 76726 Germersheim<br />
E-mail: culo@uni-mainz.de<br />
Abstract<br />
The valence concept has been used in machine translation as well as in didactics in order to build up valence dictionaries for the
respective uses. Most valence dictionaries have been compiled manually, but given the growing number of parallel resources, it
would be desirable to exploit them automatically as a basis for building up bilingual valence dictionaries. The present contribution
reports on a pilot study on a German-English parallel corpus. In this study, patterns of verb plus grammatical functions were<br />
extracted from parallel sentences. The paper reports on some of the basic findings of this extraction, regarding divergencies both in<br />
valence patterns as well as syntactic realisations of the predicate, i.e. the verb. These findings set the agenda for further research,<br />
which should focus on how to detect semantic shifts of valence carriers in translation and how this affects valence.<br />
Keywords: valence, valence extraction, parallel corpora, translation<br />
1. Introduction<br />
The concept of valence (Tesnière, 1959) has been<br />
endorsed in multilingual research domains in various<br />
ways. Various machine translation systems use some<br />
notion of valence in the core of their analysis and<br />
transfer structures (see relevant descriptions e.g. for<br />
EUROTRA (Steiner, Schmidt & Zelinsky-Wibbelt,<br />
1988), METAL (Gebruers, 1988), Verbmobil (Emele et<br />
al., 2000) or TectoMT (Žabokrtský, Ptáček & Pajas<br />
2008)). For didactic purposes, various bilingual valence<br />
dictionaries have been compiled (D. Rall, Rall, &<br />
Zorrilla, 1980; Engel & Savin, 1983; Bianco, 1996;
Simon-Vandenbergen, Taeldeman & Willems, 1996).
Most of the valence resources mentioned are based on<br />
manually compiled valence dictionaries. Nowadays, as<br />
ever more and larger parallel corpus resources are<br />
available, it is desirable to exploit these in order to gain<br />
more data for bilingual valence dictionary creation.<br />
There have been various attempts at extracting bilingual<br />
valence dictionaries from parallel corpora. In some<br />
cases, the extraction process is tackled from a high-level<br />
semantic level, as in the case of bilingual frame<br />
semantic dictionaries (Boas, 2002; 2005). Other<br />
approaches choose a syntactic annotation, as in the case<br />
of the Prague Czech-English Dependency Treebank<br />
(Čmejrek et al., 2004). In both cases, the semantic or<br />
„deep“ dependency (or tectogrammatical, see (Sgall,<br />
Hajičová & Panevová, 1986)) annotation abstracts away<br />
from syntactic variation, making the extraction task<br />
somewhat less complex. In the course of the FUSE project
(Cyrus, 2006), predicate-argument annotation
and alignment between German and English sentences<br />
serves as basis for the study of both syntactic and<br />
semantic valence divergencies. Padó (2007) investigates<br />
the (frame) semantic dimension of valence divergencies.<br />
In the former case, the annotation is very specifically<br />
tailored to the project itself, making the methods harder<br />
to reproduce when applied to other corpora. In the latter<br />
study, the level of investigation again abstracts away<br />
from syntactic variation.<br />
The study presented here focusses on grammatical<br />
differences in valence pattern between German and<br />
English. Both for the detection and description of<br />
differences, top-level grammatical functions like subject,
direct object etc. are used. This follows the tradition of<br />
using grammatical functions rather than syntactic<br />
categories as e.g. in the previously listed bilingual<br />
valence dictionaries. Grammatical functions abstract<br />
away from syntactic variation but as compared to e.g.<br />
the tectogrammatical approach of (Čmejrek et al., 2004),<br />
no deep annotation is needed in order to retrieve<br />
grammatical functions of a sentence.<br />
The corpus used in the study is annotated and aligned on<br />
multiple linguistic levels, but not with a specific focus<br />
on valence. Also, the method of querying multiple<br />
annotation and alignment levels at once is outlined. On<br />
top of that, valence divergencies are discussed with<br />
respect to factors like contrastive differences, register or<br />
translation properties and strategies.<br />
2. Study setup<br />
2.1. The corpus<br />
The corpus used in the study was built to investigate<br />
contrastive commonalities and differences between<br />
English and German as well as peculiarities in<br />
translations. It consists of English originals (EO), their<br />
German translations (GTrans) as well as German<br />
originals (GO) and their English translations (ETrans).<br />
Both translation directions are represented in eight<br />
registers with at least 10 texts totalling 31,250 words per<br />
register. In the present paper, examples are taken from<br />
the registers SHARE (corporate communications),<br />
SPEECH (political speeches) and FICTION (fictional<br />
texts). Altogether, the corpus comprises one million<br />
words. Additionally, register-neutral reference corpora<br />
are included for German and English including 2,000<br />
word samples from 17 registers.<br />
All texts are annotated with part-of-speech information<br />
using the TnT tagger (Brants, 2000), morphology using<br />
MPRO (Maas, Rösener & Theofilidis, 2009), and<br />
grammatical functions and chunk categories, manually<br />
annotated with MMAX2 (Müller & Strube, 2006).<br />
Furthermore, all texts are aligned on word level using<br />
GIZA++ (Och & Ney, 2003), on chunk level indirectly<br />
by mapping the grammatical functions onto each other,<br />
on clause level manually again using MMAX2, and on<br />
sentence level using the WinAlign component of the<br />
Trados Translator’s Workbench (Heyn, 19<strong>96</strong>) with<br />
additional manual correction.<br />
2.2. A format independent API for multilevel<br />
queries<br />
The API designed for the corpus is made up of three<br />
parts. On top, there is the interface, containing control<br />
methods with basic read/write and iteration calls for the<br />
corpus. Under the hood, a package called CORETOOL is<br />
used to represent linguistic structures in stratified layers,<br />
and the parallel structures (e.g. aligned words,<br />
sentences, etc.) as sets of pairs. The intermediate level<br />
handles the XML-based data format of the corpus.<br />
Queries are mainly written using the format-independent<br />
CORETOOL data structures and are thus re-usable for<br />
other corpora as well. The layers dealing with corpus<br />
management and format handling can, in theory, be<br />
exchanged depending on the corpus used. This<br />
stratificational approach is a major difference between<br />
this corpus API and other APIs, where programming<br />
data structures and underlying data format are more<br />
closely linked.<br />
Fundamental within CORETOOL is the notion of TEXT. A<br />
CORPUS is made up of an ordered collection of TEXTS,
each of which is made up of an ordered collection of
SENTENCES, each of which is in turn made up of an ordered
collection of TOKENS. This structure is, so to speak, the
backbone of CORETOOL and the minimum of data that we<br />
expect in a corpus. In addition, a CORPUS can be divided<br />
into REGISTERS which also relate to collections of TEXTS<br />
(from the CORPUS). Likewise, a SENTENCE can contain<br />
CLAUSES or CHUNKS which relate to the TOKENS of the<br />
SENTENCE. For each of these sub-units of a text (including<br />
TOKENS), it is possible to have aligned counterparts.<br />
Every single alignment is represented as a pair; so if unit
U is aligned with U' and U'', there will be two pairs,
(U, U') and (U, U'').
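The backbone just described can be sketched as plain data structures; the class names below are our own illustration, not CORETOOL's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    form: str

@dataclass
class Sentence:
    tokens: list          # ordered collection of TOKENS

@dataclass
class Text:
    sentences: list       # ordered collection of SENTENCES

@dataclass
class Corpus:
    texts: list           # ordered collection of TEXTS
    # every single alignment is a pair, so a unit aligned with two
    # counterparts yields two pairs
    token_alignments: list = field(default_factory=list)

u = Token("machen")
u1, u2 = Token("make"), Token("do")
corpus = Corpus(texts=[], token_alignments=[(u, u1), (u, u2)])
print(len(corpus.token_alignments))  # 2
```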
Multilingual Resources and Multilingual Applications - Regular Papers<br />
for every wordPair in wordPairs
    slWord := getSlWord(wordPair)
    tlWord := getTlWord(wordPair)
    slChunk := getChunkForWord(slWord)
    tlChunk := getChunkForWord(tlWord)
    if not mappable(getGramFunc(slChunk), getGramFunc(tlChunk))
    then markCrossingLine(slWord, tlWord, slChunk, tlChunk)
    end if
end for
Figure 1: Pseudo-Code of the query for crossing lines between grammatical functions and<br />
words<br />
The linguistic representation of CORETOOL is currently
restricted to syntactic structures. However, the need to
extend the package with further functionality, e.g. in order
to be able to operate with semantic annotation as well,
may soon be rendered unnecessary by the latest
developments of query tools such as ANNIS2 1.
2.3. Querying for empty links and crossing<br />
lines<br />
Two concepts are used to detect instances of valence<br />
divergencies. These concepts are based on well-known<br />
concepts from translation studies. Elements which have<br />
no alignment exhibit an empty link. Such 0:1-equivalents<br />
have been described e.g. by Koller (2001). Elements<br />
which are aligned, but which are embedded in higher<br />
units that are not aligned, result in crossing lines. This<br />
would e.g. be the case for two aligned words which are<br />
embedded in different grammatical functions. Crossing<br />
lines relate to the concept of shifts (in the given example<br />
a shift in grammatical function) as described e.g. by<br />
Catford (1965).
The corpus is queried for empty links and crossing lines<br />
using the CORETOOL package. Empty links can be<br />
detected by simply querying one alignment level. For<br />
crossing lines, querying combinations of both annotation<br />
and alignment levels is necessary. A query for a shift in
function requires (1) going through pairs of aligned
words, (2) for each pair, getting the chunks the aligned
words are embedded in, and (3) checking the mapping of
these chunks, i.e. checking whether the grammatical
functions they have been assigned are compatible (cf.
figure 1). As the same set of grammatical functions was
used for German and English in this study setup, the
mapping was straightforward.
1 http://www.sfb632.uni-potsdam.de/d1/annis/
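The three steps can be sketched in runnable form; the data structures and the identity-based mappable check below are our stand-ins for the CORETOOL calls shown in figure 1.

```python
# toy alignment data: each word mapped to the grammatical function
# of the chunk containing it
chunk_func = {"Company": "COMPL", "Unternehmen": "PROBJ",
              "was": "FIN", "war": "FIN"}

def mappable(sl_func, tl_func):
    # the same label set is used for both languages,
    # so the mapping check reduces to identity
    return sl_func == tl_func

def crossing_lines(word_pairs):
    """Steps (1)-(3): walk aligned word pairs, fetch the functions of
    their chunks, and report incompatible (crossing) pairs."""
    hits = []
    for sl_word, tl_word in word_pairs:
        sl_func = chunk_func[sl_word]
        tl_func = chunk_func[tl_word]
        if not mappable(sl_func, tl_func):
            hits.append((sl_word, tl_word, sl_func, tl_func))
    return hits

pairs = [("Company", "Unternehmen"), ("was", "war")]
print(crossing_lines(pairs))
# [('Company', 'Unternehmen', 'COMPL', 'PROBJ')]
```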
3. Divergencies in valence patterns for<br />
grammatical functions<br />
The ideal situation for valence extraction from parallel<br />
corpora would be that of sentence pairs with equivalent<br />
verbs at their core and perfectly matching syntactic<br />
patterns. Minor shifts, e.g. in the type of grammatical<br />
functions governed by the verb, can easily be accounted<br />
for. However, besides differences in realisation of<br />
arguments, there may also be differences in the<br />
realisation of the predicate. Such a typical shift is the<br />
head switch, in examples like Ich schwimme gern – I<br />
like swimming, where the German adverb gern<br />
‘willingly, with pleasure’ becomes the full verb like in<br />
English. As we will see, there may be other factors for<br />
different kinds of shifts in the verb. We will be looking
at more semantically/pragmatically triggered shifts; for a
more syntactic investigation, especially of shifts in the
realisation of the predicate (e.g. support verb
constructions versus full verbs), see Čulo (2010).
Probably the simplest case of a valence divergency on
the level of grammatical functions is that of differences
in the kind of grammatical function as which an
argument is realised. Compare, for instance, the<br />
sentence pair in figure 2, with the English original on<br />
top and the German translation at the bottom, and let us<br />
focus on the phrase “Most admired Company in<br />
America”. This phrase is embedded in a predicative
complement (tag: COMPL) in English, as it is governed
by verbs like name, appoint, elect etc. The COMPL
function has no equivalent in German, resulting in an
empty link (indicated by the vertical lines linked to
only one box). In order to understand, though,
what is happening in that case, one has to evaluate the<br />
links from within the phrase: the word Company, for<br />
instance, is aligned with the equivalent word<br />
Unternehmen which is, however, embedded in a<br />
prepositional object (PROBJ) in German. The cause for<br />
this shift lies in a contrastive difference in the valence<br />
patterns of a whole class of verbs (namely the APPOINT<br />
class, following Levin (1993)). But, as there currently is<br />
no semantic annotation present in the corpus, there is no<br />
automatic way of linking the verb sense to this particular<br />
Figure 2: A crossing line for the words Company and Unternehmen and the grammatical<br />
functions COMPL and PROBJ<br />
shift. We will come back to this point when discussing<br />
the last example.<br />
A similar shift from COMPL to a different function is<br />
shown in figure 3. Here, however, the shift is not<br />
triggered by the fact that two equivalent verbs have<br />
different valence patterns, but by a change of the main<br />
verb which does not match known concepts like head<br />
switches.<br />
              be → sein     be → other verb
E2G_SHARE     37 % (126)    63 % (215)
E2G_FICTION   45 % (138)    54 % (168)
E2G_SPEECH    60 % (224)    40 % (147)
Table 1: Proportions of be translated either as sein or
with a different verb than sein
Figure 3: From English copular verb to German full verb
Figure 4: Multiple shifts as a result of translation strategies<br />
The English copular verb be is translated with the
transitive verb betragen 'to amount to' in German. This particular kind
of verb shift can be observed very often in the register<br />
SHARE, as shown in table 1. The reason for this lies in
stylistic differences between English and German
SHARE texts: English uses a more colloquial style,
whereas German prefers rather formulaic expressions,
using more full verbs than copular verbs.
Many of the shifts found in translations can be attributed<br />
to translation strategies as described e.g. by (Vinay &<br />
Darbelnet, 1958) for French and English. An example of<br />
a modulation can be seen in figure 4. Looking at the
surface realisation, what can be described is that the word
order of the German original has been kept in the English
translation, probably to preserve the stress which is put
on the phrase Die Frauen 'the women'. But while in
German the first constituent is a direct object, this order
of grammatical functions cannot easily be reproduced in
English. A possible solution, as presented in the given
example, is to shift the direct object to another function,
here the subject. In the given example, the verb is shifted,
too, from the transitive gemacht 'made' to the copular
weren't. One could hypothesise that this happens in order
to adapt to the different configuration of functions and
their semantic content. However, in order to really explain
the more complex cases of multiple shifts in one sentence,
further data/annotations may be needed.
If, for instance, we add frame semantic annotation, we<br />
may be able to describe the shift of the verb with<br />
relation to shifts in semantic content. In the example in<br />
figure 4, one could annotate the first sentence with the<br />
Cause_change frame (with das as Cause and Die<br />
Frauen as Entity), the second one with the<br />
State_of_entity frame. The English sentence could thus
be interpreted as a translation of only a partial<br />
component of the sense of the original sentence: the<br />
English translation focusses on the outcome of the<br />
Cause_change process in the German original, giving<br />
more stress to the Entity (the women) in the
State_of_entity by placing it in the sentence-initial
position. How to deal with such shifts – whether to
include them in an extraction process or not – remains a<br />
matter of discussion. Data from process-based<br />
translation experiments may prove helpful for shedding<br />
light on the reasons for such a “partial” translation.<br />
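As an illustration of what such frame semantic annotation could look like, the hypothetical sketch below encodes the two frames of the example; the frame and role labels follow the discussion above, while the data layout is entirely our own.

```python
# hypothetical frame-semantic annotation of the sentence pair in figure 4;
# frame and role names follow the discussion, the data layout is ours
german = {"frame": "Cause_change",
          "roles": {"Cause": "das", "Entity": "Die Frauen"}}
english = {"frame": "State_of_entity",
           "roles": {"Entity": "the women"}}

# the translation realises only the role material shared by both frames
shared = set(german["roles"]) & set(english["roles"])
print(shared)  # {'Entity'}
```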
4. Conclusion and outlook<br />
As has been shown, empty links and crossing lines have
proven to be reliable indicators for detecting, and in some
cases a basis for describing, differences in grammatical
valence patterns. Furthermore, it has been shown that
annotation and alignment on multiple levels can be used
for studying valence divergencies and possibly for
extracting bilingual valence dictionaries, without
resorting to an annotation scheme specialised for these
purposes only.
Future work shall concentrate on a broader<br />
categorisation of valence divergencies with respect to<br />
more factors than those listed in this paper. In order to<br />
be able to link verb senses and certain types of shifts, the<br />
next step is to add (frame) semantic annotation to the<br />
corpus. Also, the purely product-based data presented
here could be complemented by process-based studies in
the future, which should yield a sounder explanation of
shifts such as the one depicted in figure 4.
5. References<br />
Bianco, M. T. (1996): Valenzlexikon deutsch-italienisch.
Deutsch im Kontrast 17. Heidelberg: Julius Groos.<br />
Boas, H. C. (2002): Bilingual FrameNet dictionaries for
machine translation. In Proceedings of the third
international conference on language resources and
evaluation, 4:1364-1371. Las Palmas, Spain.
---- (2005): Semantic frames as interlingual<br />
representations for multilingual lexical databases.<br />
International Journal of Lexicography 4, no. 18: 445-<br />
478.<br />
Catford, J. C. (1965): A linguistic theory of translation:
an essay in applied linguistics. Oxford: Oxford
University Press.<br />
Čmejrek, M., Cuřín, J., Havelka, J., Hajič, J., Kubon, V.
(2004): Prague Czech-English dependency treebank:<br />
syntactically annotated resources for machine<br />
translation. In Proceedings of LREC 2004, 5:1597-<br />
1600. Lisbon, Portugal.<br />
Čulo, O. (2010): Valency, translation and the syntactic<br />
realisation of the predicate. In D. Vitaš and C. Krstev,<br />
Proceedings of the 29th International Conference on<br />
Lexis and Grammar (LGC), 73-82. Belgrade, Serbia.<br />
Cyrus, L. (2006): Building a resource for studying<br />
translation shifts. In Proceedings of LREC 2006.<br />
Emele, M. C., Dorna, M., Lüdeling, A., Zinsmeister, H.,<br />
Rohrer, C. (2000): Semantic-based transfer. In W.<br />
Wahlster (ed.), Verbmobil, 359-376. Artificial<br />
intelligence. Berlin ; Heidelberg [u.a.]: Springer.<br />
Engel, U., Savin, E. (1983): Valenzlexikon deutsch-rumänisch.
Deutsch im Kontrast 3. Heidelberg: Julius<br />
Groos.<br />
Gebruers, R. (1988): Valency and MT: recent<br />
developments in the METAL system. In Proceedings<br />
of the second conference on applied natural language<br />
processing, 168-175.<br />
Koller, W. (2001): Einführung in die<br />
Übersetzungswissenschaft. Narr Studienbücher.<br />
Tübingen: Gunter Narr.<br />
Levin, B. (1993): English verb classes and alternations.<br />
The University of Chicago Press.
Padó, S. (2007): Translational equivalence and cross-lingual
parallelism: the case of FrameNet frames. In
Proceedings of the NODALIDA workshop on building
frame semantics resources for Scandinavian and Baltic
languages. Tartu, Estonia.
Rall, D., Rall, M., Zorrilla, O. (1980): Diccionario de<br />
valencias verbales: aleman-español. Tübingen: Gunter<br />
Narr.<br />
Sgall, P., Hajičová, E., Panevová, J. (1986): The<br />
meaning of the sentence in its semantic and pragmatic<br />
aspects. Springer Netherlands.
Simon-Vandenbergen, A.-M., Taeldeman, J., Willems,
D. (eds) (1996): Aspects of contrastive verb valency.
Studia Germanica Gandensia 40.<br />
Steiner, E., Schmidt, P., Zelinsky-Wibbelt. C. (1988):<br />
From syntax to semantics: insights from machine<br />
translation. London: Francis Pinter.<br />
Tesnière, L. (1959): Éléments de syntaxe structurale.<br />
Paris: Klincksieck.
Vinay, J.-P., Darbelnet, J. (1958): Stylistique comparée<br />
du français et de l’anglais. Méthode de traduction.
Paris: Didier.<br />
Žabokrtský, Z., Ptáček, J., Pajas, P. (2008). TectoMT:
highly modular MT system with tectogrammatics<br />
used as transfer layer. In Proceedings of WMT 2008.
SPIGA - A Multilingual News Aggregator<br />
Leonhard Hennig † , Danuta Ploch † , Daniel Prawdzik § , Benjamin Armbruster § , Christoph<br />
Büscher § , Ernesto William De Luca † , Holger Düwiger § , Sahin Albayrak †<br />
† DAI-Labor, TU Berlin<br />
§ Neofonie GmbH<br />
Berlin, Germany
E-mail: {leonhard.hennig,danuta.ploch,ernesto.deluca,sahin.albayrak}@dai-labor.de,<br />
{daniel.prawdzik,benjamin.armbruster,christoph.buescher,holger.düwiger}@neofonie.de<br />
Abstract<br />
News aggregation web sites collect and group news articles from a multitude of sources in order to help users navigate and consume<br />
large amounts of news material. In this context, Topic Detection and Tracking (TDT) methods address the challenges of identifying<br />
new events in streams of news articles, and of threading together related articles. We propose a novel model for a multilingual news<br />
aggregator that groups together news articles in different languages, and thus allows users to get an overview of important events and<br />
their reception in different countries. Our model combines a vector space model representation of documents based on a multilingual<br />
lexicon of Wikipedia-derived concepts with named entity disambiguation and multilingual clustering methods for TDT. We describe<br />
an implementation of our approach on a large-scale, real-life data stream of English and German newswire sources, and present an<br />
evaluation of the Named Entity Disambiguation module, which achieves state-of-the-art performance on a German and an English<br />
evaluation dataset.<br />
Keywords: topic detection and tracking, named entity disambiguation, multilingual clustering, news personalization<br />
1. Introduction<br />
News aggregation web sites such as Google News 1 and
Yahoo! News 2 collect and group news articles from a
multitude of sources in order to help users navigate and<br />
consume large amounts of news material. Such systems<br />
allow users to stay informed on current events, and to<br />
follow a news story as it evolves over time. In this<br />
context, an event is defined as something that happens at<br />
a specific time and place (Fiscus & Doddington, 2002),<br />
e.g. “the earthquake that struck Japan on March 11th,<br />
2011”.
Topic Detection and Tracking (TDT) methods address<br />
two main challenges of such systems: The detection of<br />
new events (topics) and the tracking of articles related to<br />
a known topic in newswire streams (Allan, 2002).<br />
Addressing these tasks typically requires a comparison of<br />
text models. In topic tracking, the comparison is between<br />
a document and a topic, which is often represented as a<br />
centroid vector of the topic’s documents. Topic detection<br />
compares a document to all known topics, to decide if the
document is about a novel topic. Text models are often
based on the Vector Space Model, or are represented as
language models (Larkey, 2004).
1 http://news.google.com
2 http://news.yahoo.com
Going one step further, multilingual news aggregation<br />
enables users to get an overview of the press coverage of<br />
an event in different countries and languages, and has<br />
been a part of TDT evaluations since 1999 (Wayne, 2000).<br />
For multilingual TDT, topic and document comparisons<br />
require the use of multilingual text models, or<br />
alternatively the translation of documents (Larkey, 2004).<br />
Previous research has typically used machine translation<br />
to convert stories to a base language (Wayne, 2000).<br />
Machine-translated documents, however, are of lower<br />
quality than human-translated documents, and<br />
full-fledged machine translation of complete documents<br />
is costly in terms of required models and linguistic tools<br />
(Larkey, 2004). Moreover, real-life TDT systems have to<br />
filter large amounts of new documents as they arrive over<br />
time, and thus require the use of efficient, scalable<br />
approaches.<br />
As news stories typically revolve around people, places,<br />
and other named entities, Shah et al. (2006) show that<br />
using concepts, such as named entities and topical<br />
keywords, rather than all words for vector representations<br />
can lead to a higher TDT performance. While there are<br />
many ways to extract concepts from documents,<br />
Wikipedia has gained much interest recently as a lexical<br />
resource (Mihalcea, 2007), as it covers concepts from a<br />
wide range of domains and is freely available in many<br />
languages. Furthermore, Wikipedia’s inter-language<br />
links can be used to translate multilingual concepts.<br />
However, previous research in multilingual TDT has not<br />
attempted to utilize Wikipedia as a resource for concept<br />
extraction and translation.<br />
Representing documents as concept vectors raises the<br />
additional challenge of dealing with natural language<br />
ambiguities, such as ambiguous name mentions and the<br />
use of synonyms (Cucerzan, 2007). For example, the<br />
name mention ‘Jordan’ may refer to several different<br />
persons, a river, and a country. As these phenomena<br />
lower the quality of vector representations, it is necessary<br />
to resolve ambiguous name mentions against their correct<br />
real-world referent. This task is known as Named Entity<br />
Disambiguation (NED) (Bunescu & Pasca, 2006).<br />
State-of-the-art approaches to NED employ supervised<br />
machine learning algorithms to combine features based<br />
on document context knowledge with entity information<br />
stored in an encyclopedic knowledge base (KB)<br />
(Bunescu & Pasca, 2006; Zhang et al., 2010). Common<br />
features include popularity (Dredze et al., 2010),<br />
similarity metrics exploring Wikipedia’s concept<br />
relations (Han & Zhao, 2009), and string similarity. In<br />
current research, NED has mainly been considered as an<br />
isolated task (Ji & Grishman, <strong>2011</strong>), and has not yet been<br />
applied in the context of TDT.<br />
The contributions of this paper are twofold: We propose a<br />
novel model for a multilingual news aggregator that<br />
combines Wikipedia-based concept extraction, named<br />
entity disambiguation, and multilingual TDT (Section 2).<br />
Our model is based on a representation of documents and<br />
topics as vectors of concepts. This choice of<br />
representation, combined with concept translation,<br />
enables the application of a wide range of well-known<br />
TDT algorithms regardless of the language of the input<br />
documents, and leads to efficient and scalable<br />
implementations. We also describe an implementation of<br />
our model on a large-scale, multilingual news stream.<br />
Furthermore, we extend our NED algorithm previously<br />
proposed in (Ploch, 2010) to a German KB, and present<br />
an evaluation of the Named Entity Disambiguation<br />
module on a newly-created German dataset (Section 3).<br />
2. Multilingual News Aggregation Model<br />
Our approach to multilingual TDT is schematically<br />
outlined in Figure 1. For each news article, we<br />
successively perform language-dependent concept<br />
extraction (Section 2.1), NED (Section 2.2) and<br />
multilingual TDT (Section 2.3). In addition, we outline<br />
an algorithm for news personalization in Section 2.4.<br />
Finally, we give details of the implementation of our<br />
model in Section 2.5, and describe a user interface for the<br />
presentation of news stories in Section 2.6.<br />
Figure 1: Multilingual News Aggregation Model<br />
2.1. Concept extraction<br />
We create a lexicon of terms, phrases and named entities<br />
by collecting titles, internal anchor texts, and redirects,<br />
from Wikipedia articles. The use of Wikipedia as the<br />
basis of our lexicon allows us to construct concept<br />
vectors for news articles in different languages, and<br />
facilitates the creation of new lexicons. We utilize the<br />
inter-language tables of Wikipedia to create a mapping<br />
between concepts in different languages. In the final<br />
lexicon, each concept is represented by an identifier, which
is used to uniquely identify the concept, and a list of<br />
linguistic variants (inflected forms, synonyms and<br />
abbreviations). For example, the concept ‘Jordan<br />
(Country)’ may be referred to by ‘Jordan’, ‘Urdun’, or<br />
‘Hashemite Kingdom of Jordan’.<br />
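As a sketch of how such a lexicon could be organised, the entries below are purely illustrative (the concept id 'Q810' and the lookup helper are our inventions, not the system's actual data model).

```python
# illustrative lexicon keyed by a language-independent concept id
# (the id 'Q810' and all entries are invented for this sketch)
lexicon = {
    "Q810": {  # 'Jordan (Country)'
        "en": {"title": "Jordan",
               "variants": ["Jordan", "Urdun",
                            "Hashemite Kingdom of Jordan"]},
        "de": {"title": "Jordanien", "variants": ["Jordanien"]},
    },
}

def lookup(surface, lang):
    """Map a surface form to concept ids via the variant lists."""
    return [cid for cid, entry in lexicon.items()
            if surface in entry.get(lang, {}).get("variants", [])]

print(lookup("Urdun", "en"))  # ['Q810']
```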
After concept extraction, each news article is represented<br />
as a weighted bag-of-concepts. All other words contained
in the document are discarded. We weight concepts using<br />
a variant of the traditional tf.idf-weighting scheme (Allan,<br />
2005). The document frequency is calculated over a<br />
sliding time window in order to better reflect the<br />
changing significance of terms in a dynamic collection of<br />
news articles:<br />
w(c_i, d_j) = n(c_i, d_j) / (n(c_i, d_j) + 0.5 + 1.5 × |d_j| / avgdl)
              × log((|D| + 0.5) / n_D(c_i)) / log(|D| + 1),
where w(c_i, d_j) is the weight of concept i in document j,
D is the collection of documents, n(c_i, d_j) is the
frequency of concept i in document j, n_D(c_i) is the
number of documents containing c_i, |d_j| is the length of
document j, and avgdl is the average document length in
the collection.
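A minimal sketch of this weighting; the function name and the avgdl-based length normalisation are our reconstruction of the (partly garbled) formula above, following the length-normalised tf.idf variant common in TDT work.

```python
import math

def concept_weight(n_cd, doc_len, avg_len, num_docs, df):
    """Length-normalised tf.idf over a sliding document window.
    n_cd: frequency of concept c in document d; doc_len/avg_len:
    document length and window-average length; num_docs: documents
    in the window; df: documents in the window containing c."""
    tf = n_cd / (n_cd + 0.5 + 1.5 * doc_len / avg_len)
    idf = math.log((num_docs + 0.5) / df) / math.log(num_docs + 1)
    return tf * idf

w = concept_weight(n_cd=3, doc_len=200, avg_len=200, num_docs=1000, df=10)
print(round(w, 3))  # ≈ 0.4
```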
2.2. Multilingual Named Entity Disambiguation<br />
The concept vector of a document may initially<br />
encompass ambiguous concepts, and in particular<br />
ambiguous name mentions. If a document contains e.g.
the name mention ‘Michael Jordan’, the real-world
referent might be the famous basketball player, but also
the machine-learning researcher known under this name.
The same document may also refer to ‘Air Jordan’,
which is a synonymous name for the basketball player. In
both cases the challenge is to determine the correct
meaning of the name mention in order to construct the
concept vector of the document correctly.
Our approach to NED is based on our earlier work<br />
described in (Ploch, 2010), which we extend here to a<br />
German KB. We disambiguate name mentions found in a<br />
text by utilizing an encyclopedic reference knowledge<br />
base (KB) to link a name mention to at most one entry in<br />
the KB (Bunescu & Pasca, 2006). Furthermore, we also<br />
determine if a name mention refers to an entity not<br />
covered by the KB, which is known as Out-of-KB<br />
detection (Dredze et al., 2010). This may occur for less<br />
popular but still newsworthy entities with no<br />
corresponding KB entry. Especially challenging is the
disambiguation of common names, like for instance ‘Paul
Smith’, borne by unknown entities sharing their name
with a popular namesake.
Our approach to NED is based on the observation that<br />
entities in texts co-occur with other entities. We therefore<br />
utilize the entities surrounding an ambiguous name for<br />
their resolution. On the basis of Wikipedia’s internal link<br />
graph we create a reference KB containing for each entity<br />
its known surface forms (i.e. name variants) and its links<br />
to other entities and concepts (Wikipedia articles).<br />
Given a name mention identified in a document, the<br />
candidate selection component retrieves a set of<br />
candidate entities from the KB, using a fuzzy, weighted<br />
search on index fields storing article titles, redirect titles,<br />
and name variants. We cast NED as a supervised<br />
classification task and train two Support Vector Machine<br />
(SVM) classifiers (Vapnik, 1995). The first classifier<br />
ranks the candidate KB entities for a given surface form.<br />
Subsequently, the second classifier determines whether<br />
the surface form refers to an Out-of-KB entity. Besides<br />
calculating well-known NED features like the<br />
bag-of-words similarity, the popularity of an entity given<br />
a specific surface form and the string similarity (baseline<br />
feature set), we implement features that exploit<br />
Wikipedia’s link graph. To this end, we represent the<br />
document context of an ambiguous entity and each<br />
candidate as a vector of links that are associated with the<br />
candidate entities in our KB, and compute several<br />
similarity features using the resulting bag-of-links<br />
vectors. The full approach is described in more detail in<br />
(Ploch, 2010).<br />
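As a sketch of the bag-of-links idea, the toy KB and cosine scoring below are our simplification; in the actual system such similarity values serve as features for the two SVM classifiers rather than as a decision rule on their own.

```python
import math

# simplified KB: candidate entities with their outgoing Wikipedia links
kb = {
    "Michael Jordan (basketball)": {"NBA", "Chicago Bulls", "Air Jordan"},
    "Michael Jordan (scientist)": {"machine learning", "Berkeley"},
}

def bag_of_links_similarity(context_links, candidate):
    """Cosine similarity between binary bag-of-links vectors."""
    cand_links = kb[candidate]
    if not context_links or not cand_links:
        return 0.0
    overlap = len(context_links & cand_links)
    return overlap / math.sqrt(len(context_links) * len(cand_links))

# links found in the document context of the ambiguous mention 'Jordan'
context = {"NBA", "Chicago Bulls"}
best = max(kb, key=lambda c: bag_of_links_similarity(context, c))
print(best)  # Michael Jordan (basketball)
```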
2.3. Multilingual Topic Detection and Tracking<br />
Given the disambiguated concept vector representation<br />
of a document, we employ a hierarchical agglomerative<br />
clustering approach for TDT. The centroid vector of a<br />
topic is created by averaging the concept weights of the<br />
documents assigned to that topic. The clustering<br />
algorithm then compares a new document to the centroid<br />
vectors of existing topics using a combination of the two<br />
vectors’ cosine similarity and a time-dependent penalty.<br />
The time factor is included to prefer assigning new<br />
documents to more recent events, and to limit the infinite<br />
growth of old events (Nallapati et al., 2004). If a
document’s similarity to all clusters is lower than a
predefined threshold, we assume that the document deals
with a new event, and start a new cluster.
In order to cluster documents from different languages,<br />
we utilize the inter-language mappings and translate the<br />
concept vectors to a single language. Thus, the document<br />
concept vectors as well as the cluster centroid vectors<br />
share a common space of concepts, to which we can<br />
apply our clustering approach.<br />
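A minimal sketch of this clustering step, with an illustrative linear time penalty; the decay constant, the threshold, and the omission of centroid re-averaging are our simplifications.

```python
import math

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def assign(doc_vec, doc_time, topics, threshold=0.3, decay=0.01):
    """Assign a document to the best-scoring topic, or start a new one.
    Score = cosine similarity minus a time-dependent penalty that
    favours recent topics. (Centroid re-averaging is omitted here.)"""
    best, best_score = None, threshold
    for topic in topics:
        score = cosine(doc_vec, topic["centroid"]) \
                - decay * (doc_time - topic["last_time"])
        if score > best_score:
            best, best_score = topic, score
    if best is None:
        best = {"centroid": dict(doc_vec), "last_time": doc_time}
        topics.append(best)
    return best

topics = []
assign({"earthquake": 1.0, "Japan": 0.8}, 0, topics)
assign({"earthquake": 0.9, "Tokyo": 0.5}, 1, topics)  # joins the first topic
assign({"election": 1.0}, 2, topics)                  # starts a new topic
print(len(topics))  # 2
```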
2.4. News Personalization<br />
The Personal News Agent (PNA) enables the user to<br />
personalize the news stream to match her information<br />
need. We define a user profile as a weighted vector u
consisting of the components u+ and u−, which represent
the concepts that a user is and is not interested in,
respectively. We include u− to allow for a more
fine-grained control of news selection. Similar to the
centroid vectors of document clusters, this approach<br />
enables a language-independent representation of a<br />
user’s information needs.<br />
The process of identifying relevant news articles is<br />
performed analogously to the TDT algorithm described<br />
in the previous section. The relevance of a new document<br />
with respect to the user profile is calculated as the cosine
similarity of the document’s concept vector and u.
Documents with a similarity higher than a predefined<br />
threshold are assumed to match a user’s information need,<br />
and presented to the user.<br />
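A sketch of the relevance decision, assuming a single profile vector whose negative weights encode the u− component; the profile contents and the threshold are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# user profile: positive weights = interests (u+),
# negative weights = explicit disinterests (u-)
profile = {"football": 1.0, "Bundesliga": 0.6, "celebrity gossip": -0.8}

def is_relevant(doc_vec, threshold=0.2):
    return cosine(doc_vec, profile) > threshold

print(is_relevant({"football": 0.9, "Bundesliga": 0.4}))   # True
print(is_relevant({"celebrity gossip": 1.0}))              # False
```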
2.5. System Implementation<br />
Our implementation of the approach described in the<br />
previous sections consists of three main components, and<br />
is shown in Figure 1. We use a crawler that collects
news articles and associated metadata from<br />
approximately 1400 German and English newswire<br />
sources. The news articles are processed in a pipeline<br />
based on the Apache UIMA framework 3. Events and the
news articles associated with them are presented to the<br />
user via a web interface. The system is geared towards<br />
large-scale processing of newswire streams in near<br />
real-time. It processes approximately 70,000 news
articles per day, and manages up to 200,000 event
clusters over a time span of four weeks.<br />
The current system processes English and German news,<br />
using a lexicon of 1.5 and 1.1 million concepts<br />
respectively, and is planned to include French, Italian and<br />
Spanish news sources. The usable intersection between<br />
the German and English lexicons amounts to 700K<br />
concepts. Concepts are identified in text with a<br />
longest-matching substring strategy (Gusfield, 1999).<br />
The concept weighting uses a time span of 4 weeks to<br />
determine document frequency.<br />
Our implementation of the NED module utilizes
classifier models trained on the TAC-KBP 2009 dataset
and a German dataset (see Section 3), both of which are
based on newswire documents.
3 Apache UIMA – Unstructured Information Management
Architecture (http://uima.apache.org/)
The TDT component’s parameters, such as cluster<br />
similarity thresholds and time penalty values, are<br />
currently tuned manually based on an analysis of the<br />
clusters produced by the algorithm. We utilize the<br />
concept set of the German Wikipedia as the basis for<br />
translating the concept vectors of English news articles.<br />
In addition, concept types are weighted differently,
since, for example, places and person names are more
helpful than general topics for detecting events in news
streams.
For the news personalization component, the creation of<br />
a user profile is based on the selection of news articles by<br />
the user according to her interests. Concept vectors are<br />
extracted from user-selected articles as described in<br />
Section 2.1. The concept vectors are then merged and
weighted to create a centroid vector u; concepts with
negative weights form the component u⁻.
The news personalization module uses a slightly different
weighting scheme than the TDT component, assigning a
higher weight to general topics (e.g. elections, tax cuts)
than to named entities.
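A minimal sketch of the profile construction described above. The paper does not specify how concepts acquire negative weights, so explicit positive/negative feedback per article is assumed here; function and variable names are illustrative:

```python
from collections import defaultdict

def build_profile(article_vectors, feedback):
    """Merge the concept vectors of user-selected articles into a
    centroid u. Articles marked as not relevant contribute with a
    negative sign (an assumption); concepts whose merged weight ends
    up negative form the component u-."""
    centroid = defaultdict(float)
    for vec, liked in zip(article_vectors, feedback):
        sign = 1.0 if liked else -1.0
        for concept, weight in vec.items():
            centroid[concept] += sign * weight
    u_plus = {c: w for c, w in centroid.items() if w > 0}
    u_minus = {c: w for c, w in centroid.items() if w < 0}
    return u_plus, u_minus
```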
2.6. User Interface<br />
We present events and news articles to users via a web<br />
interface. The interface includes a start page giving an<br />
overview of the most important events in several news<br />
categories, as well as pages for each category. Given the<br />
large number of news stories published every day, our
system implements several methods to rank event<br />
clusters for presentation to the user. These include<br />
measures based on cluster novelty, size, and hotness. The<br />
hotness measure is calculated as a weighted combination<br />
of a cluster’s total growth since its creation time, and its<br />
recent growth in a sliding time window. For our system,<br />
we determined the weights experimentally over a range<br />
of settings. This approach ensures that breaking news is
presented first both on the start page and on category<br />
pages. In addition, we implement a filtering strategy for<br />
news articles to provide users with an in-depth,<br />
diversity-oriented overview of each event, instead of<br />
merely listing an event’s news articles in order of their<br />
age. Figure 2 shows the overview page of an example<br />
event, displaying the event’s lead article as well as two<br />
earlier news articles in German and English.
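The hotness measure described above can be sketched as follows. The weight alpha and the window size are illustrative placeholders, since the paper only states that the weights were determined experimentally:

```python
def hotness(cluster, now, alpha=0.6, window_hours=24):
    """Rank score for an event cluster: a weighted combination of its
    total growth since creation and its recent growth in a sliding
    time window. Timestamps are unix seconds; alpha and the window
    size are assumed values, not the paper's tuned settings."""
    timestamps = cluster["article_times"]
    total_growth = len(timestamps)                        # all articles so far
    window_start = now - 3600 * window_hours
    recent_growth = sum(1 for t in timestamps if t >= window_start)
    return alpha * recent_growth + (1 - alpha) * total_growth
```

Ranking clusters by this score pushes clusters that are both large and currently growing (breaking news) to the top.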
Multilingual Resources and Multilingual Applications - Regular Papers<br />
Figure 2: A sample multilingual news cluster<br />
3. Evaluation of NED<br />
We evaluate the quality of our NED approach on two<br />
datasets to examine how its performance compares to<br />
other state-of-the-art systems, and which accuracy it
achieves for different languages.<br />
The first dataset is the TAC-KBP 2009 dataset for<br />
English (Simpson et al., 2009). It consists of 3,904<br />
queries (name mention-document pairs), 57% of which
target Out-of-KB entities. Of the KB queries, 69% refer
to organizations, and 15% each to persons and
geopolitical entities. In
addition to the English NED dataset we created a German<br />
dataset with 2,359 queries. This dataset consists of 30%<br />
Out-of-KB queries and 70% KB queries, where 46% of
the queries relate to organizations, 27% to persons, 24%
to geopolitical entities, and 3% to an unknown type
(‘UKN’).
Figure 3: Micro-averaged accuracy of different
approaches to English NED for the TAC-KBP 2009
dataset on all, KB and Out-of-KB queries. (Bar chart;
y-axis: micro-averaged accuracy, 0.50–1.00; compared
systems: baseline, best feature set, Dredze et al.,
Zheng et al.)
For both datasets, we perform 10-fold cross-validation by<br />
training the SVM classifiers on 90% of the queries and<br />
testing on the remaining 10%. Results reported in this<br />
paper are then averaged across the test folds. We utilize<br />
the official TAC-KBP 2009 evaluation measure of<br />
micro-averaged accuracy, which is computed as the<br />
fraction of correctly answered queries.<br />
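The evaluation measure is simple enough to state directly in code; this is a straightforward implementation of the definition above:

```python
def micro_accuracy(gold, predicted):
    """Micro-averaged accuracy as used in TAC-KBP 2009: the fraction
    of queries whose predicted KB entry (or NIL for Out-of-KB
    entities) equals the gold answer."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted answers must align")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)
```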
Figure 3 and Figure 4 show the micro-averaged<br />
accuracies for all, KB and Out-Of-KB queries. As shown<br />
in Figure 3 for the English dataset, our best feature set<br />
improves the accuracy of the baseline model by 2.7%,<br />
and achieves a micro-averaged accuracy of 0.84.<br />
Compared with other systems evaluated on the same
dataset (Dredze et al., 2010; Zheng et al., 2010), our
results are favorable. In particular, the detection of
Out-of-KB entities outperforms that of other systems.<br />
The experiments confirm our assumption that<br />
co-occurring entities and their relations are suitable for<br />
NED. Similar results are obtained for the German dataset,<br />
as shown in Figure 4. The overall accuracy of 0.77 on this<br />
dataset is slightly lower than for the TAC 2009 dataset.<br />
Again, the accuracy for Out-of-KB queries is higher than<br />
the disambiguation accuracy for KB queries, but<br />
compared to TAC 2009 the results are more balanced.<br />
Figure 4: Comparison of micro-averaged NED accuracy
on the English TAC-KBP 2009 and the German dataset.
(Bar chart; y-axis: micro-averaged accuracy, 0.50–1.00;
groups: all queries, KB, Out-of-KB.)
4. Conclusions<br />
We described a model for a multilingual news aggregator<br />
which combines Wikipedia-based concept extraction,<br />
named entity disambiguation and multilingual TDT to<br />
detect and track events in multilingual news streams. Our<br />
approach exploits Wikipedia as a large-scale,<br />
multilingual knowledge source both for representing<br />
documents as concept vectors and for resolving<br />
ambiguous named entities. We also described a<br />
fully operational implementation of our approach on a
real-life, large-scale multilingual news stream. Finally,
we presented an evaluation of the Named Entity<br />
Disambiguation module on a German and an English<br />
dataset. Our approach achieves state-of-the-art results on<br />
the TAC-KBP 2009 dataset, and shows similar<br />
performance on a German dataset.<br />
In future work, we plan to evaluate the Topic Detection<br />
and Tracking component using the TDT 3 dataset (Wayne,<br />
2000), in order to verify the validity of our overall<br />
approach. We also plan to evaluate the effect of NED on<br />
the performance of the TDT algorithm.<br />
Furthermore, we intend to include more languages to<br />
provide a pan-European overview of news events. This<br />
will raise additional challenges related to the mapping of<br />
concepts in different languages, the disambiguation of<br />
named entities, and the clustering strategies applicable to<br />
the resulting vector representation, since many Wikipedia
versions are significantly smaller than the English one.
For example, we plan to extend our link-based NED
approach by exploiting cross-lingual information.<br />
5. Acknowledgments<br />
The authors wish to thank the Neofonie GmbH team,
who contributed substantially to this work. The project
SPIGA is funded by the German Federal Ministry of
Economics and Technology (BMWi).
6. References<br />
Allan, J. (2002): Introduction to topic detection and<br />
tracking. In: Topic detection and tracking, pp. 1–16.<br />
Kluwer Academic Publishers.<br />
Allan, J., Harding, S., Fisher, D., Bolivar, A.,<br />
Guzman-Lara, S., Amstutz, P. (2005): Taking topic<br />
detection from evaluation to practice. In: Proc. of<br />
HICSS ’05.<br />
Bunescu, R., Pasca, M. (2006): Using encyclopedic<br />
knowledge for named entity disambiguation. In: Proc.<br />
of EACL-06, pp. 9–16.<br />
Cucerzan, S. (2007): Large-Scale named entity<br />
disambiguation based on Wikipedia data. In: Proc. of<br />
EMNLP-CoNLL’07, pp. 708–716.<br />
Dredze, M., McNamee, P., Rao, D., Gerber, A., Finin, T.<br />
(2010): Entity disambiguation for knowledge base<br />
population. In: Proc. of Coling 2010, pp. 277–285.<br />
Fiscus, J., Doddington, G. (2002): Topic detection and
tracking evaluation overview. In: Topic detection and
tracking, pp. 17–31. Kluwer Academic Publishers.
Gusfield, D. (1999): Algorithms on Strings, Trees and<br />
Sequences: Computer Science and Computational<br />
Biology. Cambridge University Press.<br />
Han, X., Zhao, J. (2009): Named entity disambiguation<br />
by leveraging wikipedia semantic knowledge. In: Proc.<br />
of CIKM 2009, pp. 215–224.<br />
Ji, H., Grishman, R. (2011): Knowledge Base Population:
Successful Approaches and Challenges. In: Proc. of
ACL 2011, pp. 1148–1158.
Larkey, L.S., Feng, F., Connell, M., Lavrenko, V. (2004):
Language-specific models in multilingual topic
tracking. In: Proc. of SIGIR '04, pp. 402–409.
Mihalcea, R., Csomai, A. (2007): Wikify!: linking
documents to encyclopedic knowledge. In: Proc. of
CIKM '07, pp. 233–242.
Nallapati, R., Feng, A., Peng, F., Allan, J. (2004): Event<br />
threading within news topics. In: Proc. of CIKM 2004,<br />
pp. 446–453.<br />
Ploch, D. (2011): Exploring Entity Relations for Named
Entity Disambiguation. In: Proc. of ACL 2011, pp.
18–23.
Shah, C., Croft, W., Jensen, D. (2006): Representing
documents with named entities for story link detection
(SLD). In: Proc. of CIKM ’06, pp. 868–869.
Simpson, H., Strassel, S., Parker, R., McNamee, P. (2009):<br />
Wikipedia and the web of confusable entities:<br />
Experience from entity linking query creation for TAC<br />
2009 knowledge base population. In: Proc. of<br />
LREC ’10.<br />
Vapnik, V.N. (1995): The nature of statistical learning<br />
theory. Springer-Verlag, New York, NY, USA.<br />
Wayne, C. (2000): Multilingual topic detection and<br />
tracking: Successful research enabled by corpora and<br />
evaluation. In: Proc. of LREC ’00.<br />
Zhang, W., Su, J., Lim, C., Tan W., Wang, T. (2010):<br />
Entity linking leveraging automatically generated<br />
annotation. In: Proc. of Coling 2010, pp. 1290–1298.<br />
Zheng, Z., Li, F., Huang, M., Zhu, X. (2010): Learning to<br />
link entities with knowledge base. In: Proc. of<br />
NAACL-HLT ’10, pp. 483–491.
From Historic Books to Annotated XML:<br />
Building a Large Multilingual Diachronic Corpus<br />
Magdalena Jitca, Rico Sennrich, Martin Volk<br />
Institute of Computational Linguistics, University of Zurich<br />
Binzmühlestrasse 14, 8050 Zürich<br />
E-mail: {mjitca, sennrich, volk}@ifi.uzh.ch
Abstract<br />
This paper introduces our approach towards annotating a large heritage corpus, which spans over 100 years of alpine literature. The<br />
corpus consists of over 16,000 articles from the yearbooks of the Swiss Alpine Club, 60% of which represent German texts, 38%
French, 1% Italian and the remaining 1% Swiss German and Romansh. The present work describes the inherent difficulties in<br />
processing a multilingual corpus by referring to the most challenging annotation phases such as article identification, correction of<br />
optical character recognition (OCR) errors, tokenization, and language identification. The paper aims to raise awareness for the<br />
efforts in building and annotating multilingual corpora rather than to evaluate each individual annotation phase.<br />
Keywords: multilingual corpora, cultural heritage, corpus annotation, text digitization<br />
1. Introduction<br />
In the project Text+Berg 1 we are digitizing publications<br />
of the Alpine clubs from various European countries,<br />
which consist mainly of reports on mountain
expeditions, Alpine culture, and the flora, fauna and
geology of the mountains.
The resulting corpus is a valuable knowledge base to<br />
study the changes in all these areas. Moreover, it enables<br />
the quantitative analysis of diachronic language changes<br />
as well as the study of typical language structures,<br />
linguistic topoi, and figures of speech in the<br />
mountaineering domain.<br />
This paper describes the particularities of our corpus and<br />
gives an overview of the annotation process. It presents<br />
the most interesting challenges that our multilingual<br />
corpus brought up, such as text structure identification,<br />
optical character recognition (OCR), tokenization, and<br />
language identification. We focus on how the<br />
multilingual nature of the text collection poses new<br />
problems in apparently trivial processing steps (e.g.<br />
tokenization).<br />
1 See www.textberg.ch<br />
2. The Text+Berg Corpus<br />
The focus of the Text+Berg project is to digitize the<br />
yearbooks of the Swiss Alpine Club from 1864 until<br />
today. The resulting corpus contains texts which focus<br />
on conquering and understanding the mountains and<br />
covers a wide variety of text genres such as expedition<br />
reports, (popular) scientific papers, book reviews, etc.<br />
The corpus is multilingual and contains articles in<br />
German (some also in Swiss German), French, Italian<br />
and even Romansh. Initially, the yearbooks contained
mostly German articles and a few in French. Since 1957
the books have appeared in parallel German and French
versions (with some Italian articles), amounting to a
total of 53 parallel German-French editions and 90
additional multilingual yearbooks. The corpus contains
16,000 articles, 60% of which represent German texts,
38% French, 1% Italian and the remaining 1% Swiss
German and Romansh. This brings our corpus to 35.75
million words extracted from almost 87,000 book pages,
10% of which represent parallel texts. This feature of
the corpus allows for interesting cross-language<br />
comparisons and has been used as training material for<br />
Statistical Machine Translation systems (Sennrich &<br />
Volk, 2010).<br />
3. The Annotation Phases<br />
This section introduces our pipeline for processing and<br />
annotating the Text+Berg corpus. More specifically, the<br />
input consists of HTML files containing the scanned<br />
yearbooks (for yearbooks in paper format), as they are<br />
exported by the OCR software. We work with two
state-of-the-art OCR programs (Abbyy FineReader 7 and
OmniPage 17) in order to convert the scan images into<br />
text and then export the files in HTML format. Our<br />
processing pipeline takes them through ten consecutive<br />
stages: 1) HTML cleanup, 2) structure reducing, 3) OCR<br />
merging, 4) article identification, 5) parallel book<br />
combination, 6) tokenization, 7) correction of OCR<br />
errors, 8) named entity recognition, 9) Part of Speech<br />
(POS) tagging and 10) additional lemmatization for<br />
German. The final output consists of XML documents<br />
which mark the article structure (title, author), as well as<br />
sentence boundaries, tokens, named entities (restricted<br />
to mountain, glacier and cabin names), POS tags and<br />
lemmas. Our document processing approach is similar to<br />
other annotation pipelines, such as GATE (Cunningham<br />
et al., 2002), but it is customized for our alpine corpus.<br />
In terms of storage, the annotated output files require
almost three times as much space as the input HTML
files, and 2.3 times as much as the tokenized XML files.
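The ten consecutive stages amount to a simple function composition; the sketch below illustrates that structure with named placeholders (the stage implementations are of course not reproduced here):

```python
def run_pipeline(document, stages):
    """Thread the working document through consecutive processing
    stages; each stage is a function that takes and returns the
    document in its current representation."""
    for stage in stages:
        document = stage(document)
    return document

# The ten stages named in the text, as placeholders; real
# implementations would replace each identity function.
STAGE_NAMES = [
    "html_cleanup", "structure_reducing", "ocr_merging",
    "article_identification", "parallel_book_combination",
    "tokenization", "ocr_error_correction",
    "named_entity_recognition", "pos_tagging", "german_lemmatization",
]
STAGES = [lambda doc: doc for _ in STAGE_NAMES]
```

A stage can be swapped or tested in isolation, which is the main benefit of structuring the workflow this way.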
In the following subsections we expand on the<br />
processing stages that are especially challenging for a<br />
multilingual corpus.<br />
3.1. Article Identification<br />
The identification of articles in the text is performed<br />
during the fourth processing stage. The text is annotated<br />
conforming to an XML schema which marks the article<br />
boundaries (start, end), its title and author, paragraphs,<br />
page breaks, footnotes and captions. Some of the text<br />
structure information can be checked against the table of<br />
contents (ToC) and table of figures (where available),<br />
which are manually corrected in order to have a clean<br />
database of all articles in the corpus. Another relevant<br />
resource for the article boundary identification is the<br />
page mapping file that is automatically generated in the<br />
second stage, which relates the number printed on the<br />
original book page with the page number assigned<br />
during scanning. The process of matching entries from<br />
the table of contents to the article headers in the books is<br />
not trivial, as it requires that the article title, the author<br />
name(s) and the page number in the book are correctly<br />
recognized. We allow small variations and OCR errors,<br />
as long as they are below a specific threshold (usually a<br />
maximum deviation of 20% of characters is allowed).<br />
For example, the string K/aIbard -Eine Reise in die<br />
Eiszeit. will be considered a match for the ToC entry<br />
Svalbard - Eine Reise in die Eiszeit, although not all<br />
their characters coincide.<br />
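The tolerant matching between article headers and ToC entries can be sketched with an edit-distance threshold. The paper states only the 20% maximum character deviation; measuring it as Levenshtein distance relative to the entry length is an assumption:

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution
                            ))
        prev = curr
    return prev[-1]

def toc_match(header, toc_entry, max_deviation=0.20):
    """Accept an OCR'd article header as a match for a ToC entry when
    the edit distance stays within 20% of the entry's length."""
    return edit_distance(header, toc_entry) <= max_deviation * len(toc_entry)
```

On the example from the text, the OCR-garbled header still matches its ToC entry.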
Proper text structuring relies on the accurate<br />
identification of layout elements such as article<br />
boundaries, graphics and captions, headers and<br />
footnotes. Over the 145 years the layout of the<br />
yearbooks has changed significantly. Therefore we had<br />
to adapt different processing steps for all the various<br />
designs. The particularities of these layouts have been<br />
discussed in (Volk et al., 2010a).<br />
The yearbooks since 1996 are a collection of monthly
editions and their pagination is no longer continuous (it<br />
starts over every month). This change affects the page<br />
mapping process, which performs well only when page<br />
numbers are monotonically increasing. Moreover, article<br />
boundaries are hard to determine when a single page<br />
contains several small articles and not all of them<br />
specify their author's name. These particularities are also<br />
reflected in the layout, as the header lines (where<br />
existing) no longer contain information about author or<br />
title, but about the article genre. Under these
circumstances, we still identified 80% of the articles in
these new yearbooks, a value comparable to the overall
percentage for the corpus.
3.2. Correction of OCR Errors<br />
The correction process aims to detect and overcome the<br />
errors introduced by the OCR systems and is carried out<br />
in two different stages of the annotation process. The<br />
first revision is done in the third stage (OCR merging),<br />
where the input is still raw text, with no additional<br />
information about either the structure or the language of<br />
the articles. At this stage we combine the output of our<br />
two OCR systems. The algorithm computes the<br />
alignments in a page-level comparison of the input files<br />
provided by each system and searches the Longest<br />
Common Subsequence in an n-character window. In case
of mismatch, the system disambiguates among the<br />
different candidates and selects the word with the<br />
highest probability in that context (computed based on<br />
the word's frequency in the Text+Berg corpus). The<br />
implemented algorithm and the evaluation results are<br />
thoroughly discussed in (Volk et al., 2010b).<br />
OCR-merging is a worthwhile approach since there are<br />
many situations where one system can fix the other's<br />
errors. Our experience has shown that Abbyy<br />
FineReader performs the better OCR, with over 99%<br />
accuracy (Volk et al., 2010b). But there are also cases<br />
where it fails to provide the correct output, whereas<br />
OmniPage provides the right one. For example, the<br />
sequence Cependant, les cartes disponibles sont squvent<br />
approximatives (English: However, the available maps<br />
are often approximate) is provided by FineReader. The<br />
system has introduced the spelling mistake squvent,
which does not appear in the output of the second system
(here souvent). This triggers the replacement of the
non-word squvent with the correct version souvent.
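The frequency-based disambiguation between the two OCR outputs can be sketched at token level. The actual system aligns page-level text with a Longest Common Subsequence search in n-character windows; the `difflib`-based alignment below is a simplified analogue used only to illustrate the candidate selection:

```python
from difflib import SequenceMatcher

def merge_ocr(tokens_a, tokens_b, freq):
    """Align the token streams of two OCR systems; where they agree,
    keep the token, and where they disagree, pick the variant that is
    more frequent in the corpus."""
    merged = []
    sm = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            merged.extend(tokens_a[i1:i2])
        else:
            cand_a, cand_b = tokens_a[i1:i2], tokens_b[j1:j2]
            score_a = sum(freq.get(t, 0) for t in cand_a)
            score_b = sum(freq.get(t, 0) for t in cand_b)
            merged.extend(cand_a if score_a >= score_b else cand_b)
    return merged
```

On the squvent/souvent example, the variant with 3000 corpus hits wins over the non-word.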
During the seventh annotation stage, after tokenization,<br />
we correct errors caused by graphemic similarities. The<br />
automatic correction is performed at the word-level by<br />
pattern matching over sequences of characters. In order<br />
to achieve this, we have compiled lists of common error<br />
patterns and their possible replacements. For example, a<br />
word-initial 'R' is often misinterpreted as 'K', resulting in<br />
words such as Kedaktion instead of Redaktion (English:<br />
editorial office). For each tentative replacement we<br />
check against the word frequency list in order to decide<br />
whether a candidate word appears in the corpus more<br />
frequently than the original or the other possible<br />
replacement candidates. In this case, Redaktion has 1127
occurrences in the corpus, whereas Kedaktion has only 9.
Reynaert (2008) describes a similar statistical approach<br />
for both historical and contemporary texts.<br />
As the yearbooks until 1957 contained articles written in<br />
several languages, we have used a single word<br />
frequency dictionary for all of them (German, French<br />
and Italian). The dictionary has been built from the<br />
Text+Berg corpus and thus contains all the encountered<br />
word types and their corresponding frequencies,<br />
computed over the same corpus. The interesting aspect<br />
about this dictionary is its reliability, in spite of being<br />
trained with noisy data (text containing OCR-errors).<br />
Correctly spelled words will typically have a higher<br />
frequency than the ones containing OCR errors. The list<br />
contains predominantly German words due to the high<br />
percentage of German articles in the first 90 yearbooks,<br />
thus the frequency of German words is usually higher<br />
than that of French words. This can lead to wrong<br />
substitution choices, such as a German word in a French<br />
sentence (e.g. Neu (approx. 4400 hits) instead of lieu<br />
(approx. 3000 hits)). Therefore we have decided to<br />
create a separate frequency dictionary for French words,<br />
which is used only for the monolingual French editions.<br />
3.3. Tokenization<br />
In this stage the paragraphs of the text are split into<br />
sentences and words, respectively. Tokenization is<br />
considered to be a straightforward problem that can be
solved by a simple strategy such as splitting on all
non-alphanumeric characters (e.g. spaces, punctuation
marks). Studies have shown, however, that this is not a<br />
trivial issue when dealing with hyphenated compound<br />
words or other combinations of letters and special<br />
characters (e.g. apostrophes, slashes, periods etc.). He<br />
and Kayaalp (2006) present a comparative study of<br />
several tokenizers for English, showing that their output<br />
varies widely even for the same input language. We<br />
would expect a similar performance from a general<br />
purpose tokenizer dealing with several languages.<br />
We will exemplify the language-specific issues with the<br />
use of apostrophes. In many languages, they are used for<br />
contractions between different parts of speech, such as<br />
verb + personal pronoun es in German (e.g. hab's →<br />
habe + es) or determiner and noun in French or Italian<br />
(e.g. l'abri → le + abri). On the other hand, in German
texts written until 1900, as in modern English, it
can also express possession (e.g. Goldschmied's,
Theobald's, Mozart's). Under these circumstances,<br />
which is the desired tokenization, before or after the<br />
apostrophe? The answer is language-dependent and this<br />
underlies our approach towards tokenization.<br />
We use a two-step tokenization and perform the<br />
language recognition in between. The advantage of this<br />
approach is that we can deliver a language-specific<br />
tokenization of any input text (given that it is written in<br />
the supported languages). In the first step we carry out a<br />
rough tokenization of the text and then identify sentence<br />
boundaries. Once this is achieved, we can proceed to the<br />
language identification, which will be discussed in<br />
section 3.4.<br />
Afterwards we do another round of tokenization focused<br />
on word-level, where the language-specific rules come<br />
into play. We have implemented a set of heuristic rules<br />
in order to deal with special characters in a multilingual<br />
context, such as abbreviations, apostrophes or hyphens.<br />
For example, each acronym whose letters are separated<br />
by periods (e.g. C.A.S. or A.A.C.Z.) is considered a<br />
single token, if it is listed in our abbreviations<br />
dictionary. A German apostrophe is split from the<br />
preceding word (e.g. geht's → geht + 's), whereas in<br />
French and Italian it remains with the first word (e.g.<br />
dell'aqua → dell' + aqua, l'eau → l' + eau). Besides, we<br />
have compiled a small set of French apostrophe words
which should not be split at all (e.g. aujourd'hui).
Disambiguation for hyphens occurring in the middle of a<br />
word is performed by means of the general word<br />
frequency dictionary. For example, if nordouest has 14<br />
hits and nord-ouest 957 hits, we conclude that the<br />
hyphen is part of the compound and thus nord-ouest<br />
should be regarded as a single token. On the other hand,<br />
hyphens marking line breaks may also appear
word-internally, as in rou-te. In this case, the
hyphenated word appears 3 times in the dictionary,<br />
whereas the one without, route, 6335 times. Therefore<br />
the hyphen will be removed from the word.<br />
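The language-specific apostrophe rules and the frequency-based hyphen disambiguation described in this section can be sketched as follows (a minimal illustration; the exception list and function names are assumptions):

```python
# French apostrophe words that must never be split.
FRENCH_KEEP = {"aujourd'hui"}

def split_apostrophe(token, lang):
    """Language-specific apostrophe handling: German clitics are split
    off (geht's -> geht + 's), while French/Italian elided articles
    keep the apostrophe on the first part (l'eau -> l' + eau)."""
    if "'" not in token or token in FRENCH_KEEP:
        return [token]
    head, _, tail = token.partition("'")
    if lang == "de":
        return [head, "'" + tail]
    return [head + "'", tail]

def resolve_hyphen(token, freq):
    """Decide whether a mid-word hyphen marks a compound or a line
    break by comparing corpus frequencies of the two readings."""
    joined = token.replace("-", "")
    return token if freq.get(token, 0) >= freq.get(joined, 0) else joined
```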
3.4. Language Identification<br />
The accuracy of the language identification is crucial for<br />
the automatic text analysis performed during the<br />
annotation process, such as tokenization, part-of-speech<br />
tagging, lemmatization or named entity identification.<br />
Therefore we perform a fine-grained analysis, at<br />
sentence level. We work with a statistical language
identifier² based on the approach presented in (Dunning,
1994). The module uses two classifiers: one to
distinguish between German, French, English and Italian<br />
and another one in order to discriminate between Italian<br />
and Romansh. If the identified language is German, a
further analysis based on the frequency dictionary is
carried out in order to decide whether
or not it is Swiss German (CH-DE). This dictionary<br />
² http://search.cpan.org/dist/Lingua-Ident/Ident.pm
contains frequently used Swiss German dialect words<br />
which do not have homographs in standard German.<br />
Whenever a sentence contains more than 10% dialect<br />
words from this list, the language of the sentence is set<br />
to CH-DE.<br />
However, the statistical language identification is not<br />
reliable for very short sentences. In order to achieve<br />
higher accuracy, we apply the heuristic rule that only<br />
sentences longer than 40 characters are fed to the<br />
language identifier. All the others are assigned the<br />
language of the article, as it appears in the ToC. The<br />
correctness of this decision relies on the fact that all ToC<br />
files are proofed manually, so that we do not introduce<br />
noisy data.<br />
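The decision logic described above combines the 40-character length gate, the statistical identifier, and the 10% dialect-word check. A minimal sketch, with the identifier passed in as a function and illustrative names:

```python
def assign_language(sentence, article_lang, identify, dialect_words,
                    min_chars=40, dialect_ratio=0.10):
    """Sentences shorter than 40 characters inherit the article
    language from the (manually proofed) ToC; longer ones go to the
    statistical identifier. German results are re-checked against a
    Swiss German dialect word list."""
    if len(sentence) < min_chars:
        return article_lang
    lang = identify(sentence)
    if lang == "de":
        tokens = sentence.split()
        hits = sum(1 for t in tokens if t.lower() in dialect_words)
        if tokens and hits / len(tokens) > dialect_ratio:
            return "ch-de"
    return lang
```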
Table 1 gives an overview of the distribution of the<br />
identified languages in the articles from the Text+Berg<br />
corpus. We present here only the composition of
German and French articles, as they represent the vast
majority of our corpus (approximately 98%). The
values are not 100% accurate, as they are computed
automatically by statistical methods. However, they
mirror the global tendency that over 95% of the
sentences in an article are in the language of the article,
which corresponds to our expectations. An interesting
finding is the percentage
variation of foreign-language sentences. For example,
German sentences are twice as frequent in French
articles as French sentences in German articles (in
percentage terms). One reason for this is the fact that
some French articles are translated from German and<br />
preserve the original bibliographical references, captions<br />
or footnotes. Other sources of language mixture are<br />
quotations and direct speech, aspects which can be<br />
encountered in both German and French articles.<br />
3.5. Linguistic Processing<br />
In the last two annotation stages we perform some<br />
linguistic processing, namely lemmatization and
part-of-speech tagging. The markup is done by the
TreeTagger³.
For our corpus, we have applied the standard<br />
configuration files for German, English and Italian. In<br />
the case of French we adopted a different approach, and<br />
we have trained our own parameter files based on the Le<br />
Monde-Treebank (Abeillé, 2003).<br />
³ www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger
Article language    Number of sentences per language
                    de          en     fr        it     rm     ch-de   total
DE                  1,166,141   1035   11,607    1481   1490   799     1,182,553
FR                  12,392      607    670,599   1187   1277   2       686,064
Table 1: The language distribution of the sentences in the Text+Berg corpus
Figure 1: An annotation snippet<br />
Romansh is not yet supported due to the lack of a<br />
sufficiently large annotated corpus for training the<br />
corresponding parameter file. Figure 1 shows a sample<br />
output: an annotated sentence in XML format.<br />
The TreeTagger assigns lemmas only to word forms
that it knows (i.e. those encountered during training).
This results in a substantial number of word
forms with unknown lemmas. Therefore we use an<br />
additional lemmatization tool, in order to increase the<br />
coverage of lemmatization. This approach has been<br />
implemented for German only because of its large<br />
number of compounds.<br />
We use the system Gertwol⁴ to insert missing German
lemmas. Towards this goal we collect all word form<br />
types from the corpus and have Gertwol analyse them. If<br />
the TreeTagger does not assign a lemma to a word,<br />
whereas Gertwol provides an appropriate alternative, we<br />
choose the output of the latter system. This has resulted
in approximately 700,000 additional lemmas, 80% of
which represent noun lemmas, 15% adjectives and the
remaining 5% other parts of speech.
After performing this step, the remaining unknown<br />
⁴ http://www2.lingsoft.fi/cgi-bin/gertwol
lemmas are mostly names and words containing OCR<br />
errors. We are interested in extending this strategy for<br />
French and Italian, in order to further increase the<br />
coverage of the annotation.<br />
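The backoff from TreeTagger to Gertwol lemmas amounts to a simple merge over the token stream; the sketch below assumes an `<unknown>` marker for missing TreeTagger lemmas and a precomputed Gertwol lookup table (both names are illustrative):

```python
def merge_lemmas(tokens, treetagger_lemmas, gertwol_lemmas,
                 unknown="<unknown>"):
    """Fill lemmas left unknown by the TreeTagger with Gertwol's
    analysis, preferring the TreeTagger wherever it has an answer."""
    merged = []
    for tok, tt_lemma in zip(tokens, treetagger_lemmas):
        if tt_lemma != unknown:
            merged.append(tt_lemma)
        else:
            merged.append(gertwol_lemmas.get(tok, unknown))
    return merged
```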
4. Tools for Accessing the Corpus<br />
The Text+Berg corpus can be accessed through several<br />
search systems. For example, we have stored our<br />
annotated corpus in the Corpus Query Workbench<br />
(Christ, 1994), which allows us to browse it via a web interface.5 The queries follow the POSIX EGREP syntax for regular expressions. Thanks to our detailed annotations, this system provides more precise results than standard search engines, which perform a full-text search. For example, it is possible to query
for all mountain names ending in horn that were<br />
mentioned before 1900. Moreover, it is also possible to<br />
restrict queries to particular languages or POS tags.<br />
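The gain in precision over plain full-text search comes from filtering on annotations rather than on surface strings alone. A toy sketch of such a query over annotated tokens follows; the attribute names and values are illustrative assumptions, not the actual CQW schema.

```python
import re

# Each token carries annotations; attribute names here are illustrative.
tokens = [
    {"word": "Matterhorn", "pos": "NE", "lang": "de", "year": 1895},
    {"word": "Wetterhorn", "pos": "NE", "lang": "de", "year": 1902},
    {"word": "Alphorn",    "pos": "NN", "lang": "de", "year": 1880},
]

# "All mountain names ending in 'horn' mentioned before 1900":
# restrict by POS (proper noun) and year, not just by the string.
hits = [t["word"] for t in tokens
        if re.search(r"horn$", t["word"])
        and t["pos"] == "NE"
        and t["year"] < 1900]
print(hits)
```

A plain full-text search for `horn` would also return the common noun "Alphorn"; the POS and metadata constraints are what filter it out.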
In addition, we have built a tool for word alignment<br />
searches in our parallel corpus.6 Given a German search term, the tool displays all hits in the German part of the corpus together with the corresponding French sentences, with the aligned word(s) highlighted. Besides being a word alignment visualization tool, it also serves as a bilingual concordance tool for finding mountaineering terminology in usage examples. In this way it is easy to
determine the appropriate translation for words like<br />
Haken (English: hook) or Steigeisen (English: crampon).<br />
Moreover, it enables a consistent view of the possible translations of ambiguous words such as Kiefer (English: jaw, pine) or Mönch (English: monk, mountain name). Figure 2 depicts the output of the system for the word Leiter, which can refer to either a leader or a ladder.
5 Access to the CQW is password-protected. See<br />
http://www.textberg.ch/index.php?id=4&lang=en for<br />
registration.<br />
6 http://kitt.ifi.uzh.ch/kitt/alignsearch/<br />
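Such a bilingual concordance lookup can be sketched over sentence pairs with token-level alignment links. The data layout and function below are illustrative assumptions, not the tool's actual implementation.

```python
# One aligned sentence pair: token lists plus (src_index, tgt_index) links.
pair = {
    "de": ["Der", "Haken", "hielt"],
    "fr": ["Le", "piton", "a", "tenu"],
    "align": [(0, 0), (1, 1), (2, 3)],
}

def aligned_words(pair, term):
    """Return the French tokens aligned to each occurrence of `term`
    in the German side of a sentence pair."""
    hits = []
    for i, tok in enumerate(pair["de"]):
        if tok == term:
            hits.extend(pair["fr"][j] for (s, j) in pair["align"] if s == i)
    return hits

print(aligned_words(pair, "Haken"))
```

In the real tool, this lookup would run over all sentence pairs containing the search term, with the returned target tokens highlighted in context.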
Figure 2: Different translations of the German word Leiter in the Text+Berg corpus<br />
5. Conclusion<br />
In this paper we have given an overview of the<br />
annotation workflow of the Text+Berg corpus. The<br />
pipeline is capable of processing multilingual documents<br />
and dealing with both diachronic varieties in language<br />
and noisy data (OCR errors). The flexible architecture of<br />
the pipeline allows us to extend the corpus with more<br />
alpine literature and to process it in a similar manner,<br />
with little overhead.<br />
We have provided insights into the multilingual<br />
challenges in the annotation process, such as OCR correction, tokenization, and language identification. We
intend to further reduce the number of OCR errors by<br />
launching a crowd-correction wiki, where members of the Swiss Alpine Club will be able to correct such mistakes. Regarding linguistic processing,
we will continue investing efforts in improving the<br />
quality of the existing annotation tools with language-specific
resources (e.g. frequency dictionaries,<br />
additional lemmatizers). We will also work on<br />
improving the language models for Romansh and Swiss<br />
German dialects, in order to increase the reliability of<br />
the language identifier.<br />
6. References<br />
Abeillé, A., Clément, L., Toussenel, F. (2003): Building<br />
a Treebank for French. In Building and Using Parsed<br />
Corpora, Text, Speech and Language Technology (20), pp. 165–187.
Christ, O. (1994): The IMS Corpus Workbench<br />
Technical Manual. Institut für maschinelle Sprachverarbeitung, Universität Stuttgart.
Cunningham, H., Maynard, D., Bontcheva, K. (2002):<br />
GATE: A framework and graphical development<br />
environment for robust NLP tools and applications. In<br />
Proceedings of the 40th Anniversary Meeting of the<br />
Association for Computational Linguistics.<br />
Dunning, T. (1994): Statistical identification of<br />
language. Technical Report MCCS-94-273, New<br />
Mexico State University.<br />
He, Y., Kayaalp, M. (2006): A comparison of 13<br />
tokenizers on MEDLINE. Technical Report<br />
LHNCBC-TR-2006-003, The Lister Hill National<br />
Center for Biomedical Communications.<br />
Reynaert, M. (2008): Non-interactive OCR post-correction
for giga-scale digitization projects. In A.<br />
Gelbukh (Ed.), Proceedings of the Computational<br />
Linguistics and Intelligent Text Processing 9th<br />
International Conference, Lecture Notes in Computer<br />
Science. Berlin, Springer, pp. 617–630.<br />
Sennrich, R., Volk, M. (2010): MT-based sentence<br />
alignment for OCR-generated parallel texts. In<br />
Proceedings of AMTA. Denver.<br />
Volk, M., Bubenhofer, N., Althaus, A., Bangerter, M.,<br />
Furrer, L., Ruef, B. (2010a): Challenges in building a<br />
multilingual alpine heritage corpus. In Proceedings of<br />
the Seventh international conference on Language<br />
Resources and Evaluation (LREC).<br />
Volk, M., Marek, T., Sennrich, R. (2010b): Reducing<br />
OCR errors by combining two OCR systems. In<br />
Proceedings of the ECAI 2010 Workshop on<br />
Language Technology for Cultural Heritage, Social<br />
Sciences, and Humanities (LaTeCH 2010).
Visualizing Dependency Structures<br />
Chris Culy, Verena Lyding, Henrik Dittmann<br />
European Academy of Bozen/Bolzano<br />
viale Druso 1, 39100 Bolzano, Italy<br />
E-mail: chris@chrisculy.net, verena.lyding@eurac.edu, henrik.dittmann@eurac.edu<br />
Abstract<br />
In this paper we present an advanced visualization tool specialized for the presentation and interactive analysis of language structure,<br />
namely dependency structures. Extended Linguistic Dependency Diagrams (xLDDs) is a flexible tool that provides for the visual<br />
presentation of dependency structures and connected information according to the users’ preferences. We will explain how xLDD<br />
makes use of visual variables like color, shape and position to display different aspects of the data. We will provide details on the<br />
technical background and discuss issues with the conversion of dependency structures from different dependency banks. Insights<br />
from a small user study will be presented and we will discuss future directions and application contexts for xLDD.<br />
Keywords: dependency structures, dependency diagrams, visualization<br />
1. Introduction<br />
Dependency banks, and hence dependency structures, are<br />
becoming ever more widely available for different<br />
languages and are popular for a range of applications,<br />
from theoretical and applied linguistics research to<br />
pedagogy in linguistics and language learning (cf. e.g. the<br />
VISL project1; Hajič et al., 2001; Nivre et al., 2007). In
this context, a number of (usually static) visualizations of dependency structures have also been presented (Gerdes & Kahane, 2009; Nivre et al., 2006).
Generally, visualizations of language and linguistic<br />
information (“LInfoVis”, from “Linguistic Information<br />
Visualization”) are becoming more widespread (see<br />
(Rohrdantz et al., 2010) for an overview), but<br />
visualizations targeted specifically at linguists and their<br />
informational needs are still not very common. Current<br />
attempts to visualize language data are usually either<br />
visually very simple or linguistically uninformed, and<br />
often very much bound to a specific application context.<br />
We are trying to improve this situation with a series of<br />
advanced LInfoVis tools. In this paper, we present<br />
Extended Linguistic Dependency Diagrams (xLDDs), an<br />
example of a LInfoVis tool which combines advanced<br />
visualization techniques with linguistic knowledge to<br />
create a new kind of interactive dependency diagram.<br />
This tool can be easily adapted for a variety of uses in a variety of environments and can be used with a range of dependency structure formats.
1 http://visl.sdu.dk/visl/en/parsing/automatic/dependency.php
2. Dependency Structures and<br />
Dependency Diagrams<br />
We will distinguish between dependency structures,<br />
which are mathematical objects (graphs), and<br />
dependency diagrams, which are visual representations<br />
of dependency structures. Unfortunately, the linguistics<br />
literature does not always maintain this distinction, but it<br />
is an important one, since the same dependency structure<br />
can have many different visual representations (see e.g.<br />
ANNIS2 2 for multiple visual representations of the same structure).
While there is no standard, or even general agreement,<br />
about what information should or should not be included<br />
in a dependency structure, essentially dependency<br />
structures are directed (usually acyclic) graphs that<br />
indicate binary head-dependent relations between parts<br />
of a sentence (see (Hudson, 1984) for early examples of<br />
dependency structures). We will call a dependency<br />
structure basic if it consists only of the tokens of the<br />
sentence and the relations between them, without any<br />
additional information. However, almost all dependency<br />
structures have more information than just relations<br />
between tokens (e.g. often there is lemma or part of<br />
speech (POS) information associated with the tokens).<br />
2 http://www.sfb632.uni-potsdam.de/~d1/annis<br />
We will refer to these dependency structures as<br />
advanced.<br />
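The basic/advanced distinction can be made concrete with a small data sketch (the representation is our own, not a standard format): a basic structure holds only the tokens and the head-dependent relations between them, while an advanced one attaches further annotation such as POS to the tokens.

```python
# Basic: tokens plus binary head-dependent relations,
# each given as (head_index, dependent_index, label).
basic = {
    "tokens": ["die", "Absage", "erteilten"],
    "relations": [(1, 0, "det"), (2, 1, "oa")],
}

# Advanced: the same graph, with extra information per token.
advanced = {
    "tokens": [
        {"form": "die", "pos": "ART"},
        {"form": "Absage", "pos": "NN"},
        {"form": "erteilten", "pos": "VVFIN"},
    ],
    "relations": [(1, 0, "det"), (2, 1, "oa")],
}

# The underlying graph is identical; only the per-token payload differs.
assert basic["relations"] == advanced["relations"]
```

Note that both variants are structures, not diagrams: the same data above could be rendered as many different dependency diagrams.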
We will call a dependency diagram linearized if it shows<br />
the tokens of the sentence in their typical presentation<br />
direction (e.g. left to right for German, right to left for<br />
Arabic). Basic dependency structures allow for basic<br />
diagrams only, as the information to visualize is restricted<br />
to tokens and dependency relations. Figure 1 shows an<br />
advanced dependency diagram of an advanced<br />
dependency structure, in that it includes a variety of<br />
information, including POS information in addition to the<br />
tokens and dependency relations. The dependency<br />
relations are indicated by directed arcs between the<br />
tokens, and the directions of the arrows follow the<br />
EAGLES3 recommendation of having the arrow pointing
towards the head.<br />
It goes beyond the presentation of a typical linearized<br />
diagram in the use of color and in the positioning of the<br />
arcs. The POS of the words are encoded by colored nodes<br />
and tokens, and hovering over a token shows a tooltip<br />
with the POS type, as in Figure 1: NN (noun) for the word "Absage". Color is also used to distinguish different dependency relations: blue arcs indicate verb–object relations, red arcs indicate verb–subject relations, green is used for modifier relations, gray for determiner–noun relations, and black for the root dependency. Furthermore,
the positioning of the arcs above and below the text<br />
visually separates subject and object relations (arcs<br />
below text) from any other type of relation (arcs above<br />
text). The example in Figure 1 is based on Boyd et al.'s (2007)4 reanalysis (to Decca-XML format) of sentences
from the Tiger Dependency Bank (TiGerDB) (Brants et<br />
al., 2002). We will have more to say about it shortly.<br />
3 http://www.ilc.cnr.it/EAGLES96/segsasg1/node44.html
4 We would like to thank Adriane Boyd and Detmar Meurers<br />
for kindly providing us with the data they describe in<br />
(Boyd et al., 2007).<br />
Figure 1: Basic linearized xLDD with color coding of parts of speech and<br />
dependency types; TiGerDB 8046, structure as in (Boyd et al., 2007)<br />
3. Extended Linguistic Dependency<br />
Diagrams<br />
3.1. Visual encoding of information<br />
One of the key ideas of information visualization is that<br />
we can use different visual features to encode different<br />
aspects of the information being visualized. Dependency<br />
structures, especially advanced dependency structures,<br />
provide lots of information that we can represent in<br />
various ways. xLDDs use three main visual properties to<br />
encode information in addition to the basic token and<br />
dependency information: position, color and size. These<br />
three visual variables are preattentive, meaning that we<br />
perceive strong differences without having to search for<br />
them actively. Information that is encoded in this way<br />
stands out among the other information present in the<br />
diagram and hence is much easier to locate and identify<br />
by the user. For example, in Figure 1 we can immediately<br />
find the verbal argument relations by the position of their<br />
arcs below the text, and the subject relation by its<br />
red color.<br />
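The encoding just described, relation types mapped to arc color and placement as in the Figure 1 scheme, amounts to a small lookup from linguistic properties to visual variables. A sketch of such a mapping follows; the function and the relation labels are our own illustrative names.

```python
# Relation type -> (arc color, arc placement), following the color scheme
# described for Figure 1. Labels are illustrative, not a fixed tagset.
ENCODING = {
    "subject":    ("red",   "below"),
    "object":     ("blue",  "below"),
    "modifier":   ("green", "above"),
    "determiner": ("gray",  "above"),
    "root":       ("black", "above"),
}

def encode(relation, default=("gray", "above")):
    """Map a dependency relation to its visual variables."""
    return ENCODING.get(relation, default)

# Verbal argument arcs go below the text, so they pop out preattentively.
print(encode("subject"))
```

Keeping this mapping as data rather than hard-coding it is what makes the encodings user-configurable, as discussed in Section 3.2.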
Position is used in two ways. First, we can position the<br />
arcs above or below the text, using any kind of property,<br />
simple or calculated, to determine which arcs are below<br />
and which above (as in Figure 1). The second use of<br />
position is that of the vertical placement of tokens. By<br />
varying the standard vertical placement of tokens (i.e. not<br />
all on the same horizontal line) we can also encode<br />
certain kinds of information, as e.g. in Figure 2, where<br />
words that are split into several tokens are placed one<br />
level below the other text. This example shows an<br />
alternative reanalysis of the sentence from Figure 1, here<br />
based on By’s (2009) reanalysis of sentences from the<br />
TiGerDB. By made different choices from Boyd et al. He<br />
did not include POS information, and he split compound<br />
nouns into several tokens. Hence, we are provided with
Figure 2: Advanced xLDD, encoding by levels words that are split into multiple<br />
tokens; TiGerDB 8046, structure as in (By, 2009)<br />
different information for the visualization. While in<br />
Figure 1 color is used on nodes and tokens to encode<br />
token-related information and on arcs to encode<br />
information on the dependency relations, in Figure 2 the<br />
coloring of nodes indicates the linear position of<br />
subject/object nodes relative to their heads:<br />
subject/object nodes left of their head are colored in red,<br />
right of their head in green, between multiple heads in<br />
yellow. Nodes of other relations are colored gray.<br />
The visual feature size is employed in xLDDs in the form of arc line thickness. In Figure 2 it is used to
distinguish arcs between sub-words (thin) and any other<br />
arc (thick). As in Figure 1, arcs of subject and object<br />
relations are placed below the text and others above.<br />
There are several other visual aspects that we could use to<br />
encode information. We could, for example, also use the<br />
size or style/font of the text, or the shape of the nodes<br />
corresponding to the tokens to encode other information.<br />
All of these visual encodings in xLDDs, especially the preattentive ones, potentially help the user see patterns more quickly and more accurately than in a monochrome, uniformly positioned dependency diagram.
3.2. Visual presentation and interaction<br />
Another major hallmark of contemporary visualizations<br />
is their adjustability and interactivity. Some aspects of the<br />
visualization may not encode information but can be<br />
modified to improve readability, or cater to the individual<br />
user’s preferences. These include the curvature and style of arcs, the positioning of words, text size, and the shape of arrow heads.
More or less circular arcs, staggered words, and smaller<br />
text size help to create compact displays that fit more<br />
information on the screen, which can be an advantage for<br />
displaying long sentences. Note that the same visual<br />
property, e.g. arc width, may either be facultative (when<br />
it doesn’t vary within one xLDD diagram) or may be used<br />
to encode information (when it does vary). Which setup<br />
is most helpful depends on the data to be visualized as<br />
well as on the user. Giving the user the flexibility to set<br />
those variables, besides setting variables for the visual<br />
encoding, is a major benefit of xLDD.<br />
In addition, by interacting with the visualization the user<br />
can get more information about the underlying data than<br />
can be seen in a static diagram. In the case of xLDDs, the<br />
application can provide different kinds of information in<br />
response to actions aimed at different parts of the<br />
diagram, for example clicking on a token, or its<br />
corresponding node, or moving the mouse over an arc or<br />
token. In Figure 2, we see that hovering over an arc<br />
brings up a tooltip with its relation type (here oa (direct<br />
object) between “Absage” and “erteilten”).<br />
Double-clicking on the node for “Absage”, shows<br />
token-related information, that is case, number, gender<br />
and index information, but no POS information, since it<br />
is not available in the underlying data. Since this<br />
information does not involve two tokens, it is not<br />
represented via arcs in the main diagram. It would also be<br />
possible to interactively suppress information, e.g.<br />
eliminating all arcs except the ones of interest. As with<br />
the visual features, which kinds of interaction serve what<br />
kinds of information depends on the particular<br />
application, the particular data, as well as on user<br />
preferences.<br />
3.3. Architecture and technical details<br />
xLDD is implemented in JavaScript, using the Protovis<br />
toolkit (Bostock & Heer, 2009). We have created a simple<br />
JSON exchange format for dependency structures (JSDS).<br />
Input dependency structures, whether from a fixed local<br />
source or from a dynamic web service, are converted into<br />
JSDS before being visualized by the xLDD framework.<br />
The xLDD framework contains an extensible visual<br />
encoding and interactive component, which allow the<br />
application developer complete control over what kinds<br />
of information are visually encoded and how, and<br />
similarly, what kinds of interactions there are. xLDD is<br />
thus intended as a tool that will be incorporated into a<br />
website or web application.<br />
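JSDS itself is not specified in this excerpt; purely as an illustration of what a linearized dependency structure might look like in a JSON exchange format, consider the following sketch. All field names here are invented for illustration and may differ from actual JSDS.

```python
import json

# Hypothetical JSON serialization of a linearized dependency structure.
# Field names are invented; actual JSDS may use a different schema.
structure = {
    "tokens": [
        {"id": 0, "form": "die"},
        {"id": 1, "form": "Absage"},
        {"id": 2, "form": "erteilten"},
    ],
    "relations": [
        {"head": 1, "dep": 0, "label": "det"},
        {"head": 2, "dep": 1, "label": "oa"},
    ],
}

serialized = json.dumps(structure)
assert json.loads(serialized) == structure  # round-trips losslessly
```

A JSON format of this kind is directly consumable by a JavaScript visualization layer such as the Protovis-based xLDD framework, which is presumably the motivation for choosing JSON as the exchange format.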
Unfortunately, not all dependency structures contain the<br />
tokens of the source sentence or their order. Dependency<br />
structures following the example of the PARC 700 (King<br />
et al., 2003), for example, do not. These structures cannot<br />
be visualized as linearized dependency diagrams since<br />
they lack the relevant information, and since xLDDs are<br />
necessarily linearized, structures of this type cannot be<br />
visualized using xLDD. However, often these<br />
non-linearizable structures can be converted into<br />
linearizable ones. In fact, both of the presented examples<br />
are based on the TiGerDB, which does not contain the<br />
original tokens, following the model of the PARC 700. In<br />
both cases, the original dependency structures have been<br />
reanalyzed by other researchers to include the original<br />
token and token order information, cf. (Boyd et al., 2007)<br />
for Figure 1 and (By, 2009) for Figure 2. However, these<br />
conversions to a linearizable form are not trivial, and<br />
cannot necessarily be fully automated. An additional<br />
point is that the two conversions make different decisions<br />
about things like tokenization and POS, and so the<br />
resulting dependency structures are different from each<br />
other as well as from the original structures.<br />
Thus, in order for a dependency structure to be usable by<br />
xLDD, it must meet two conditions: first it must be<br />
linearizable (or converted to a linearizable form), and<br />
second it must be converted to the JSDS exchange format.<br />
Regarding the required exchange format, we have<br />
already written converters to JSDS for the CoNLL 2007<br />
Dependency Parsing format5, as well as for By’s formats
and the Decca-XML format (Boyd et al., 2007). Our<br />
target format (JSDS format) is quite simple, so that<br />
converters for other (linearizable) formats to JSDS (e.g. MALT-XML6) would be easy to write.
5 http://nextens.uvt.nl/depparse-wiki/DataFormat<br />
6 http://w3.msi.vxu.se/~nivre/research/MaltXML.html<br />
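A converter from the tab-separated CoNLL 2007 format (columns ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, plus two projective columns) to a JSON-style structure is indeed simple to write. The sketch below uses our own illustrative output field names rather than the actual JSDS schema.

```python
def conll_to_dict(conll_text):
    """Convert one sentence in CoNLL 2007 format (10 tab-separated
    columns per token; token IDs are 1-based, HEAD 0 marks the root)
    into a simple token/relation dictionary. The output field names
    are illustrative, not the actual JSDS schema."""
    tokens, relations = [], []
    for line in conll_text.strip().splitlines():
        cols = line.split("\t")
        tid, form, head, deprel = int(cols[0]), cols[1], int(cols[6]), cols[7]
        tokens.append({"id": tid, "form": form, "lemma": cols[2], "pos": cols[4]})
        relations.append({"head": head, "dep": tid, "label": deprel})
    return {"tokens": tokens, "relations": relations}

sample = (
    "1\tdie\tdie\tART\tART\t_\t2\tdet\t_\t_\n"
    "2\tAbsage\tAbsage\tN\tNN\t_\t3\toa\t_\t_\n"
    "3\terteilten\terteilen\tV\tVVFIN\t_\t0\troot\t_\t_\n"
)
result = conll_to_dict(sample)
print(len(result["tokens"]))  # 3
```

Because the CoNLL format always contains tokens and their order, any structure expressed in it is linearizable and thus visualizable by xLDD once converted.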
4. User evaluation, future directions,<br />
and conclusion<br />
In other work (Culy et al., 2011), we report on an evaluation study of an earlier version of xLDD. Two usability tests plus the collection of
subsequent evaluative feedback were carried out with<br />
four subjects with linguistics and language didactics<br />
background. To test the use of the different visual features in xLDD, the subjects were asked to find
specified dependency relations in nine different xLDD<br />
displays (e.g. with and without the coloring of arcs, with<br />
different types of leveled and staggered text, etc.). In the<br />
tests the users’ reactions to xLDD (thinking aloud) and<br />
their performance (time and errors for task completion)<br />
were recorded. In general, users preferred visual cues<br />
over text-based indications (e.g. details in the pop-up<br />
window for each lemma) for solving the given tasks.<br />
They found color-coding and placement of the arcs to be<br />
very useful, with vertical positioning of the text<br />
somewhat less so. They also would have preferred to<br />
have some control over the visual encodings, which was<br />
not possible in the test situation, but is integrated into<br />
some of the current sample applications of xLDD in<br />
response to the users' requests. Since users did not<br />
understand what, if anything, was being encoded by<br />
vertical positioning, giving them control over the vertical<br />
positioning might have made it more useful. The main<br />
negative reaction was to problems with overlapping<br />
arrows and text, especially when the figure is zoomed out<br />
(i.e. gets smaller). Back on the positive side, there was<br />
consensus that xLDD would be useful in language<br />
learning and teaching.<br />
Finally, there are issues about how to visualize<br />
mismatches between the dependency structure and the<br />
original sentence (which are also issues for linearization).<br />
One case is that of punctuation, which may not be<br />
included in the dependency structure, but which is in the<br />
original sentence. While we might visualize only the<br />
dependency structure proper, it seems useful for some<br />
applications (e.g. language learning) to include the<br />
original punctuation.<br />
A second case is that of null elements of a sentence that<br />
are included in some dependency structures, e.g. the<br />
TiGerDB. For example, the dependency structure for<br />
“Was nicht zur Politik wird, hat keinen Zweck.”
Figure 3: Presentation of corpus query results in the prototype thumbnails application; sentences matching the query<br />
(here “Heute” in corpus of German press releases) are presented as small xLDDs side by side with plain text.<br />
(TiGerDB 8247) has a null subject of “hat”. Since these<br />
null elements are not visible parts of the original sentence<br />
(no token represents them), it is not clear how to
visualize them. A similar question arises in dealing with<br />
multiple information contained within a single token. By<br />
(2009) and Boyd et al. (2007) make different decisions in<br />
how they handle these cases. For example, By (2009)<br />
inserts a null token following “zur” in the same example,<br />
corresponding to an empty determiner “der” (dative form<br />
of “die”) in the original Tiger structure, but Boyd et al.<br />
(2007) do not. This underscores our earlier comment that<br />
there is no agreement about the nature of dependency<br />
structures. A related issue has to do with abstract nodes,<br />
nodes which correspond to a syntactic category rather<br />
than to a null token. For example, the dependency<br />
diagram in TiGerDB for “Dazu bedarf es Kompetenz und<br />
eines gewissen Apparates.” (TiGerDB 8020) contains a<br />
node “coord” which is the head of a “coord_form”<br />
dependency with “und” as the dependent. “coord” is also<br />
the head of two “cj” dependencies with “Apparat” and<br />
“Kompetenz” as the dependents. Since “coord” is not a token in the sentence, it is not clear how to visualize
it and its relations.<br />
A third visualization issue is where tokenization does not<br />
agree with orthographic boundaries (e.g. compounds in<br />
Tiger, where the compounds are separate elements in the<br />
original and in (By, 2009), but not in (Boyd et al., 2007)).<br />
We have done some preliminary experiments concerning<br />
these mismatches, but we plan on testing a wider range of<br />
examples. Finally, we note that all of these mismatches arise from conceptions of dependency structures that depart from the simple idea of representing relations between words.
In addition to addressing the functional difficulties<br />
evident in the evaluation, we have created a series of<br />
examples and prototype applications using xLDD that<br />
also take into account some of the other results of the<br />
evaluation. Several of the examples allow the user to<br />
specify which linguistic properties are encoded by which<br />
visual variables. While we can give the user full control<br />
over these encodings, often it is sufficient to use simple<br />
specifications of arc position and/or color of the arcs or<br />
tokens. Using too many visual variables is just as<br />
confusing as using none, or even more so. The specific<br />
choices of visual encodings depend on what the user is<br />
interested in – there is no single best encoding that<br />
encompasses all tasks and interests.<br />
One of the prototypes is an interactive diagram<br />
constructor for an on-line textbook. Given a sentence, the<br />
student can specify the relations among tokens, and the<br />
diagram will be constructed incrementally. It can also be<br />
verified against a correct diagram provided by the<br />
instructor. A second prototype combines a corpus query<br />
engine with xLDD. The search results (obtained via a<br />
web service) are presented as a table of the sentences and<br />
small versions of the diagrams (as shown in Figure 3). All<br />
these small diagrams can (simultaneously) have their<br />
visual encodings adjusted, and on clicking on any of them<br />
a larger version of that diagram is presented. These two<br />
prototypes underline the point that xLDD is a component<br />
which can be customized and used in any number of<br />
ways, and we hope that it will be adopted and adapted by<br />
others (e.g. in the context of CLARIN7).
In sum, xLDD is a new way of visualizing dependency<br />
structures, which incorporates advanced visualization<br />
techniques and provides flexibility for customizing the<br />
visualization. Color and position are used to encode<br />
information which is omitted or difficult to see in other<br />
dependency diagrams. Interaction provides even more<br />
opportunities to efficiently explore the structure. The<br />
preliminary results of a small-scale user study are<br />
promising, and give indications about what needs to be<br />
focused on for integration into specialized applications.<br />
5. References<br />
Bostock, M., Heer, J. (2009): Protovis: A Graphical<br />
Toolkit for Visualization. IEEE Transactions on<br />
Visualization and Computer Graphics, 15(6), pp.<br />
1121–1128.<br />
Boyd, A., Dickinson, M., Meurers, D. (2007): On<br />
representing dependency relations – Insights from<br />
converting the German TiGerDB. In Proceedings of<br />
the Sixth International Workshop on Treebanks and<br />
Linguistic Theories (TLT 2007, Bergen, Norway), pp.<br />
31–42.<br />
Brants, S., Dipper, S., Hansen, S., Lezius, W., Smith, G.<br />
(2002): The TIGER Treebank. In Proceedings of the<br />
First Workshop on Treebanks and Linguistic Theories<br />
(TLT 2002, Sozopol, Bulgaria), pp. 24–41.<br />
Buchholz, S., Marsi, E. (2006): CoNLL-X Shared Task<br />
on Multilingual Dependency Parsing. In Proceedings<br />
of the Tenth Conference on Computational Natural<br />
Language Learning (CoNLL-X, New York City, NY,<br />
USA), pp. 149–164.<br />
By, T. (2009): The TiGer Dependency Bank in Prolog<br />
format. In Proceedings of Recent Advances in<br />
Intelligent Information Systems (IIS’09, Warsaw,<br />
Poland), pp. 119–129.<br />
Culy, C., Lyding, V., Dittmann, H. (2011): xLDD: Extended Linguistic Dependency Diagrams. In Information Visualization: Proceedings of the 15th International Conference on Information Visualization (IV 2011, London, UK), pp. 164–169.
7 European Research Infrastructure CLARIN, http://www.clarin.eu
Gerdes, K., Kahane, S. (2009): Speaking in Piles:<br />
Paradigmatic annotation of French spoken corpus. In<br />
Proceedings of the Corpus Linguistics Conference<br />
(CL2009, Liverpool, UK).<br />
Hajič, J., Vidová Hladká, B., Pajas, P. (2001): The Prague<br />
Dependency Treebank: Annotation Structure and<br />
Support. In Proceedings of the IRCS Workshop on<br />
Linguistic Databases (Philadelphia, PA, USA), pp.<br />
105–114.<br />
Hudson, R. (1984): English Word Grammar. London:<br />
Blackwell.<br />
King, T.H., Crouch, R., Riezler, S., Dalrymple, M.,<br />
Kaplan, R.M. (2003): The PARC 700 Dependency<br />
Bank. In Proceedings of the 4th International<br />
Workshop on Linguistically Interpreted Corpora<br />
(LINC-03, Budapest, Hungary).<br />
Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J.,<br />
Riel, S., Yuret, D. (2007): The CoNLL 2007 Shared<br />
Task on Dependency Parsing. In Proceedings of the<br />
CoNLL Shared Task Session of EMNLP-CoNLL 2007<br />
(Prague, Czech Republic), pp. 915–932.<br />
Nivre, J., Hall, J., Nilsson, J. (2006): Maltparser: A<br />
data-driven parser-generator for dependency parsing.<br />
In Proceedings of the Fifth International Conference<br />
on Language Resources and Evaluation (LREC 2006,<br />
Genoa, Italy), pp. 2216–2219.<br />
Rohrdantz, C., Koch, S., Jochim, C., Heyer, G.,<br />
Scheuermann, G., Ertl, T., Schütze, H., Keim, D.A.<br />
(2010): Visuelle Textanalyse. Informatik-Spektrum,<br />
33(6), pp. 601–611.
A functional database framework for querying very large multi-layer corpora<br />
Roman Schneider<br />
Institut für deutsche Sprache
R5 6-13, D-68161 Mannheim<br />
schneider@ids-mannheim.de<br />
Abstract<br />
Linguistic query systems are special-purpose IR applications. We present a novel approach to the efficient exploitation of very large linguistic corpora, combining the advantages of relational database management systems (RDBMS) with
the functional MapReduce programming model. Our implementation uses the German DEREKO reference corpus with multi-layer<br />
linguistic annotations and several types of text-specific metadata, but the proposed strategy is language-independent and adaptable<br />
to large-scale multilingual corpora.<br />
Keywords: corpus storage, multi-layer corpora, corpus retrieval, database systems<br />
1. Introduction<br />
In recent years, the quantitative examination of natural<br />
language phenomena has become one of the<br />
predominant paradigms within (computational)<br />
linguistics. Both fundamental research on the basic<br />
principles of human language and the<br />
development of speech and language technology<br />
development of speech and language technology,<br />
increasingly rely on the empirical verification of<br />
assumptions, rules, and theories. More data are better<br />
data (Church & Mercer, 1993): Consequently, we notice<br />
a growing number of national initiatives related to the<br />
building of large representative datasets for<br />
contemporary world languages. Besides written (and<br />
sometimes spoken) language samples, these corpora<br />
usually contain vast collections of morphosyntactic,<br />
phonetic, semantic, etc. annotations, plus text- or corpus-specific<br />
metadata. The downside of this trend is<br />
obvious: Even with specialized applications, our ability<br />
to store linguistic data often outstrips our ability to<br />
process it.<br />
Much of the essential work towards querying<br />
linguistic corpora goes into data representation,<br />
integration of different annotation systems, and the<br />
formulation of query languages (e.g., Rehm et al., 2008;<br />
Zeldes et al., 2009; Kepser, Mönnich & Morawietz,<br />
2010). But the scaling problem still remains: As we go<br />
beyond corpus sizes of some million words, and at the<br />
same time increase the number of annotation systems<br />
and search keys, query costs rise disproportionately.<br />
This is because, unlike traditional IR systems,<br />
corpus retrieval systems not only have to deal with the<br />
“horizontal” representation of textual data, but with<br />
heterogeneous metadata on all levels of linguistic<br />
description. And, of course, the exploration of interrelationships<br />
between annotations becomes more and<br />
more challenging as the number of annotation systems<br />
increases. Given this context, we present a novel<br />
approach to scale up to billion-word corpora, using the<br />
example of the multi-layer annotated German Reference<br />
Corpus DEREKO.<br />
2. The Data<br />
The German Reference Corpus DEREKO currently<br />
comprises more than four billion words and constitutes<br />
the largest linguistically motivated collection of<br />
contemporary German. It contains fictional, scientific,<br />
and newspaper texts – as well as several other text types<br />
– and is annotated morphosyntactically with three<br />
competing systems (Connexor, Xerox, TreeTagger). The<br />
automated enrichment with additional metadata is<br />
underway.<br />
Figure 1: Response times for nested SQL queries with three search keys (logarithmically scaled axis)<br />
3. Existing Approaches<br />
We empirically evaluated the most prominent existing<br />
querying approaches, and contrasted them with our<br />
functional model (the full paper will contain our detailed<br />
series of measurements). Given the reasonable<br />
assumptions that XML/SGML-based markup languages<br />
are more suitable for data exchange than for efficient<br />
storing and retrieval, and that traditional file-based data<br />
storage is less robust and powerful than database<br />
management systems, we focused on the following<br />
strategies:<br />
i. In-Memory Search: Due to the fact that a<br />
computer’s main memory is still the fastest form of<br />
data storage, there are attempts to implement in-memory<br />
databases even for considerably large<br />
corpora (Pomikálek, Rychlý & Kilgarriff, 2009).<br />
These indexless systems perform well for unparsed<br />
texts, but are strongly limited in terms of storage<br />
size and therefore cannot deal with data-intensive<br />
multi-layer annotations.<br />
ii. N-Gram Tables: In order to overcome physical<br />
limitations, newer approaches use database<br />
management systems and decompose sequences of<br />
strings into indexed n-gram tables (Davies, 2005).<br />
This allows queries over a limited number of search<br />
expressions, but space requirements for increasing<br />
values of n are enormous. Sentence-external queries<br />
with regular expressions or NOT-queries – both are<br />
crucial for comprehensive linguistic exploration –<br />
cannot use the n-gram-based indexes and thus<br />
perform rather poorly.<br />
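The index-locality argument can be made concrete with a toy bigram table (schema and column names are invented for illustration; Davies (2005) and similar systems differ in detail): an exact n-gram lookup hits the composite index, while a suffix pattern of the kind a regular-expression query needs gives the index no usable prefix.<br />

```python
import sqlite3

# Hypothetical bigram table (all names invented for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb_bigram (w1 TEXT, w2 TEXT, sentence_id INTEGER)")
conn.execute("CREATE INDEX ix_bigram ON tb_bigram (w1, w2)")

sentences = ["der Hund bellt", "die Katze schläft"]
for sid, s in enumerate(sentences):
    tokens = s.split()
    conn.executemany(
        "INSERT INTO tb_bigram VALUES (?, ?, ?)",
        [(tokens[i], tokens[i + 1], sid) for i in range(len(tokens) - 1)],
    )

# An exact bigram lookup can use the composite index ...
hits = conn.execute(
    "SELECT sentence_id FROM tb_bigram WHERE w1 = ? AND w2 = ?",
    ("der", "Hund"),
).fetchall()

# ... but a suffix pattern (the simplest regex-like query) cannot,
# and forces a scan of the whole table.
scan = sorted(
    conn.execute("SELECT sentence_id FROM tb_bigram WHERE w2 LIKE '%t'")
)
```
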
iii. Advanced SQL: Another strategy is to make use of<br />
the relational power of sub-queries and joins within<br />
a RDBMS. Chiarcos et al. (2008) use an<br />
intermediate language between query formulation<br />
and database backend; Bird et al. (2005) present an<br />
algorithm for the direct translation of linguistic<br />
queries into SQL. This approach uses absolute word<br />
positions, and therefore allows proximity queries<br />
without limitation of word distances. But again,<br />
even with the aid of the integrated cost-based<br />
optimizer (CBO), response times for increasing<br />
numbers of search keys become extremely long. We<br />
evaluated the proposed strategy on 1, 10, 100, 1000,
Figure 2: MapReduce processes for a concatenated query with eight search keys<br />
and 4000 million word corpora with rare-, low-,<br />
mid-, high-, and top-level search keys and found<br />
that concatenated queries soon exceed the capability<br />
of our reference server because nested loops<br />
generate an immense workload. Figure 1 shows the<br />
response times in seconds for the query “select<br />
count(t1.co_sentenceid) from tb_token t1, (select<br />
co_id, co_sentenceid from tb_token where<br />
co_token=token1) t3, (select co_id, co_sentenceid<br />
from tb_token where co_token = token2) t2 where<br />
co_token = token3 and t1.co_sentenceid =<br />
t2.co_sentenceid and t1.co_sentenceid =<br />
t3.co_sentenceid and t1.co_id > t2.co_id and<br />
t2.co_id > t3.co_id;”, using three search keys on<br />
identical metadata types and a single-column index.<br />
This query simply counts the number of sentences<br />
that contain three specified tokens (token1, token2,<br />
token3) in a fixed order. Compared to a similar<br />
query on the 4000 million word corpus with one search key<br />
(5s for a top-level search) or two search keys (56s),<br />
the increase in response time is clearly<br />
disproportionate (301s). Searches on different metadata<br />
types (token, lemma, part-of-speech, etc.) using<br />
multi-column indexes perform markedly worse still.<br />
Furthermore, with additional text-specific<br />
metadata restrictions like text type or<br />
publication year, this querying strategy produces<br />
response times of several hours and thereby<br />
becomes fully unacceptable for real-time<br />
applications.<br />
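The three-key query quoted above can be tried end-to-end on a toy corpus. The sketch below rebuilds the tb_token layout in SQLite; the paper does not name its RDBMS or full schema, so this is an illustrative reconstruction, not the reference setup.<br />

```python
import sqlite3

# Toy reconstruction of tb_token: one row per token, with a
# corpus-wide position (co_id) and a sentence id (co_sentenceid).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tb_token (co_id INTEGER, co_sentenceid INTEGER, co_token TEXT)"
)
conn.execute("CREATE INDEX ix_token ON tb_token (co_token)")

corpus = ["a b c", "a c b", "x a b c"]
co_id = 0
for sid, sentence in enumerate(corpus):
    for tok in sentence.split():
        conn.execute("INSERT INTO tb_token VALUES (?, ?, ?)", (co_id, sid, tok))
        co_id += 1

# The nested three-key query from the text, with named parameters:
# count occurrences of token1 before token2 before token3 in a sentence.
query = """
    SELECT count(t1.co_sentenceid)
    FROM tb_token t1,
         (SELECT co_id, co_sentenceid FROM tb_token
           WHERE co_token = :token1) t3,
         (SELECT co_id, co_sentenceid FROM tb_token
           WHERE co_token = :token2) t2
    WHERE t1.co_token = :token3
      AND t1.co_sentenceid = t2.co_sentenceid
      AND t1.co_sentenceid = t3.co_sentenceid
      AND t1.co_id > t2.co_id
      AND t2.co_id > t3.co_id
"""
# "a ... b ... c" occurs in sentences 0 and 2 of the toy corpus.
n = conn.execute(query, {"token1": "a", "token2": "b", "token3": "c"}).fetchone()[0]
```
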
4. Design and Implementation<br />
As our evaluation shows, existing approaches do not<br />
handle queries with complex metadata on very large<br />
datasets adequately. To overcome these bottlenecks,<br />
we propose a strategy that allows the distribution of data<br />
and processor-intensive computation over several<br />
processor cores – or even clusters of machines – and<br />
facilitates the partition of complex queries at runtime<br />
into independent single queries that can be executed in<br />
parallel. It is based on two presuppositions:<br />
i. Mature relational DBMS can be used effectively to<br />
maintain parsed texts and linguistic metadata. We<br />
intensively evaluated different types of tables (heap<br />
tables, partitioned tables, index organized tables) as<br />
well as different index types (B-tree, bitmap,<br />
concatenated, functional) for the distributed storing<br />
and retrieval of linguistic data.<br />
ii. The MapReduce programming model supports<br />
distributed programming and tackles large-data<br />
problems. Though MapReduce is already in use in a<br />
wide range of data-intensive applications (Lin &<br />
Dyer, 2010), its principle of “divide and conquer”<br />
has not been employed for corpus retrieval yet.<br />
In order to prove the feasibility of our approach, we<br />
implemented our corpus storage and retrieval framework<br />
on a commodity low-end server (quad-core<br />
microprocessor with 2.67 GHz clock rate, 16GB RAM).<br />
For the reliable measurement of query execution times,<br />
and especially to avoid caching effects, we always used<br />
a cold-started 64-bit database engine.<br />
Figure 2 illustrates the map/reduce processes for a<br />
complex query, using eight distinct search keys on<br />
Figure 3: Web-based retrieval form with our sample query<br />
different metadata types: Find all sentences containing a<br />
determiner immediately followed by a proper noun<br />
ending on “er”, immediately followed by a noun,<br />
immediately followed by the lemma “oder”, followed by<br />
a determiner (any distance), immediately followed by a<br />
plural noun, followed by the lemma “sein” (any<br />
distance). Within a “map” step, the original query is<br />
partitioned into eight separate key-value pairs. Keys<br />
represent linguistic units (position, token, lemma, part-of-speech,<br />
etc.), values may be the actual content. Thus,<br />
we can simulate regular expressions (a feature that is<br />
often demanded for advanced corpus retrieval systems,<br />
but difficult to implement for very large datasets).<br />
The queries can be processed in parallel and pass their<br />
results (sentence/position) to temporary tables. The
subsequent “reduce” processes filter out inappropriate<br />
results step by step. Usually, this cannot be executed in<br />
parallel, because each reduction produces the basis for<br />
the next step. But our framework, implemented with the<br />
help of stored procedures within the RDBMS,<br />
overcomes this restriction by dividing the process tree<br />
into multiple sub-trees. The reduce processes for each<br />
sub-tree are scheduled simultaneously, and aggregate<br />
their results after they are finished. So the seven reduce<br />
steps of our example can be executed within only four<br />
parallel stages.<br />
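A schematic (and much simplified) reconstruction of this flow: each search key becomes an independent lookup (map), and partial results are intersected pairwise in parallel stages (reduce). The toy index and all names are invented; the real framework runs inside the RDBMS via stored procedures and also enforces position and distance constraints, which this sketch omits.<br />

```python
from concurrent.futures import ThreadPoolExecutor

# Toy index: for each (layer, value) search key, the set of matching
# (sentence, position) pairs.  In the real system, each lookup is an
# independent SQL query writing to a temporary table.
INDEX = {
    ("pos", "DET"): {(0, 0), (1, 0), (2, 5)},
    ("lemma", "Haus"): {(0, 1), (2, 6)},
    ("token", "oder"): {(0, 2), (1, 2)},
}

def lookup(key):
    """Map step: one independent single-key query."""
    return {sent for (sent, _pos) in INDEX.get(key, set())}

def reduce_pairwise(sets, pool):
    """Reduce steps: intersect partial results in parallel stages."""
    while len(sets) > 1:
        pairs = [(sets[i], sets[i + 1]) for i in range(0, len(sets) - 1, 2)]
        merged = list(pool.map(lambda p: p[0] & p[1], pairs))
        if len(sets) % 2:  # odd set is carried into the next stage
            merged.append(sets[-1])
        sets = merged
    return sets[0]

keys = [("pos", "DET"), ("lemma", "Haus"), ("token", "oder")]
with ThreadPoolExecutor() as pool:
    partial = list(pool.map(lookup, keys))   # map: run in parallel
    result = reduce_pairwise(partial, pool)  # reduce: merge sub-trees
```
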
Our concatenated sample query with eight multi-type<br />
search keys on a four billion word corpus took less than<br />
four minutes, compared with several hours when<br />
employing SQL joins as in 3 (iii). The parallel<br />
MapReduce framework is invoked by an extensible<br />
web-based retrieval form (see figure 3) and stores the<br />
search results within the RDBMS, thus making it easy to<br />
reuse them for further statistical processing. Additional<br />
metadata restrictions (genre, topic, location, date) are<br />
translated into separate map processes and<br />
reduced/merged in parallel to the main search.<br />
5. Summary<br />
The results of our study demonstrate that the joining of<br />
relational DBMS technology with a functional/parallel<br />
computing framework like MapReduce combines the<br />
best of both worlds for linguistically motivated large-scale<br />
corpus retrieval. On our reference server, it clearly<br />
outperforms other existing approaches. For the future,<br />
we plan some scheduling refinements of our parallel<br />
framework, as well as support for additional levels of<br />
linguistic description and metadata types.<br />
6. References<br />
Church, K., Mercer, R. (1993): Introduction to the<br />
Special Issue on Computational Linguistics Using<br />
Large Corpora. Computational Linguistics 19:1,<br />
pp. 1-24.<br />
Rehm, G., Schonefeld, O., Witt, A., Chiarcos, C.,<br />
Lehmberg, T. (2008): A Web-Platform for Preserving,<br />
Exploring, Visualising and Querying Linguistic<br />
Corpora and other Resources. Procesamiento del<br />
Lenguaje Natural 41, pp. 155-162.<br />
Zeldes, A., Ritz, J., Lüdeling, A., Chiarcos, C. (2009):<br />
ANNIS: A Search Tool for Multi-Layer Annotated<br />
Corpora. Proceedings of Corpus Linguistics 2009.<br />
July 20-23, Liverpool, UK.<br />
Kepser, S., Mönnich, U., Morawietz, F. (2010): Regular<br />
Query Techniques for XML-Documents. Metzing, D.,<br />
Witt, A. (Eds): Linguistic modeling of information<br />
and Markup Languages, Springer, pp. 249-266.<br />
Pomikálek, J., Rychlý, P., Kilgarriff, A. (2009): Scaling<br />
to Billion-plus Word Corpora. Advances in<br />
Computational Linguistics 41, pp. 3-13.<br />
Davies, M. (2005): The advantage of using relational<br />
databases for large corpora. International Journal of<br />
Corpus Linguistics 10 (3), pp. 307-334.<br />
Chiarcos, C., Dipper, S., Götze, M., Leser, U., Lüdeling,<br />
A., Ritz, J., Stede, M. (2008): A Flexible<br />
Framework for Integrating Annotations from<br />
Different Tools and Tag Sets. Traitement Automatique<br />
des Langues 49(2), pp. 271-293.<br />
Bird, S., Chen, Y., Davidson, S., Lee, H., Zhen, Y.<br />
(2005): Extending XPath to Support Linguistic<br />
Queries. Workshop on Programming Language<br />
Technologies for XML (Plan-X).<br />
Lin, J., Dyer, C. (2010): Data-Intensive Text Processing<br />
with MapReduce. Morgan & Claypool Synthesis<br />
Lectures on Human Language Technologies.<br />
Hybrid Machine Translation for German in taraXÜ:<br />
Can translation costs be decreased without degrading quality?<br />
Aljoscha Burchardt, Christian Federmann, Hans Uszkoreit<br />
DFKI Language Technology Lab<br />
Saarbrücken & Berlin, Germany<br />
E-mail: {burchardt,cfedermann,uszkoreit}@dfki.de<br />
Abstract<br />
A breakthrough in Machine Translation is only possible if human translators are brought into the loop. While mechanisms for automatic<br />
evaluation and scoring such as BLEU have enabled fast development of systems, these systems have to be used in practice to get<br />
feedback for improvement and fine-tuning. However, it is not clear if and how systems can meet quality requirements in real-world,<br />
industrial translation scenarios. taraXÜ paves the way for wide usage of hybrid machine translation for German. In a joint consortium<br />
of research and industry partners, taraXÜ integrates human translators into the development process from the very beginning in a<br />
post-editing scenario collecting feedback for improvement of its core translation engines and selection mechanism. taraXÜ also<br />
performs pioneering work by integrating languages like Czech, Chinese, or Russian that are not well studied to date.<br />
Keywords: Hybrid Machine Translation, Human Evaluation, Post-Editing<br />
1. Introduction<br />
Machine Translation (MT) is a prime application of<br />
Language Technology. Research on Rule-Based MT<br />
(RBMT) goes back to the early days of Artificial Intelligence<br />
in the 1960s, and some systems have reached a high<br />
level of sophistication (e.g. Schwall & Thurmair, 1997;<br />
Alonso & Thurmair, 2003). Since the mid-1990s, Statistical<br />
MT (SMT) has become the prevalent paradigm in<br />
the research community (e.g. Koehn et al., 2007; Li et al.,<br />
2010). In the translation and localization industry,<br />
Translation Memory Systems (TMS) are used to support<br />
human translators by making informed suggestions for<br />
recurrent material that has to be translated.<br />
As human translators can no longer satisfy the constantly<br />
rising demand for translation, important questions that need to<br />
be investigated are:<br />
1) How good is MT quality today, especially for translation<br />
from and to German?<br />
2) Which paradigm is the most promising one?<br />
3) Can MT aid human translators and can it help to<br />
reduce translation costs without sacrificing quality?<br />
These questions are not easy to answer and it is clear that<br />
research on the matter is needed. The quality of MT<br />
output cannot be objectively assessed in a<br />
once-and-for-all measure (see e.g. Callison-Burch et al.,<br />
2006) and it also strongly depends on the nature of the<br />
input material. Various MT paradigms have different<br />
strengths and shortcomings, not only regarding quality.<br />
For example, RBMT allows for a good control of the<br />
overall translation process, but setting up and maintaining<br />
such a system is very costly as it requires trained<br />
specialists. SMT is cheap, but it requires huge amounts of<br />
compute power and training data, which can make it<br />
difficult to include new languages and domains. TMS can<br />
produce human quality, but are limited in coverage due to<br />
their underlying design. Finally, the question of how<br />
human translators can optimally be supported in their<br />
translation workflow has largely been untouched.<br />
Machine Translation for German The number of<br />
available mono- and bi-lingual resources for German is<br />
quite high. In the “EuroMatrix”1, which collects resources,<br />
corpora, and systems for a large number of language pairs,<br />
German ranks third behind English and<br />
French. Still, little research has focused on MT<br />
for language pairs including German, especially for<br />
translation tasks to and from languages other than English.<br />
1 http://www.euromatrixplus.net/matrix/<br />
This paper reports on taraXÜ 2 , which aims to address<br />
the aforementioned questions in a consortium consisting<br />
of partners from both research and industry. taraXÜ takes<br />
the selection from hybrid MT results including RBMT,<br />
TMS, and SMT as the first part of its analytic process.<br />
Then a self-calibration3 component is applied, extended by<br />
controlled language technology and human<br />
post-processing to match real-world translation concerns.<br />
A novelty in this project is that human translators are<br />
integrated into the development process from the very<br />
beginning: Within several human evaluation rounds, the<br />
automatic selection and calibration mechanisms will be<br />
refined and iteratively improved. This paper focuses on<br />
hybrid translation (Section 2) and the large-scale human<br />
evaluation rounds in taraXÜ (Section 3). In the conclusion<br />
and outlook (Section 4), ongoing and future research<br />
is sketched.<br />
2. Hybrid Machine Translation<br />
Hybrid MT is a recent trend (e.g. Federmann et al., 2009;<br />
Chen et al., 2009) for leveraging the quality of MT. Based<br />
on the observation that different MT systems often have<br />
complementary strengths and weaknesses, different methods<br />
for hybridization are investigated that aim to “fuse”<br />
an improved translation out of the good parts of several<br />
translation candidates.<br />
2 http://taraxu.dfki.de/<br />
3 Due to limited space, this won’t be discussed herein.<br />
Figure 1: Error classification interface used within taraXÜ.<br />
Complementary Errors Typical difficulties for SMT<br />
are morphology, sentence structure, long-range<br />
re-ordering, and missing words, while strengths are<br />
disambiguation and lexical choice.<br />
RBMT systems are typically strong in morphology, sentence<br />
structure, have the ability to handle long-range<br />
phenomena, and also ensure completeness of the resulting<br />
translation. Weaknesses arise from parsing errors and<br />
wrong lexical choice. The following examples illustrate<br />
the complementary nature of such systems’ errors.<br />
1) Source: Then, in the afternoon, the visit will<br />
culminate in a grand ceremony, at which Obama will<br />
receive the prestigious award.<br />
2) RBMT4: Dann wird der Besuch am Nachmittag in<br />
einer großartigen Zeremonie gipfeln, an der Obama<br />
die berühmte Belohnung bekommen wird.<br />
3) SMT5: Dann am Nachmittag des Besuchs in<br />
beeindruckende Zeremonie mündet, wo Obama den<br />
angesehenen Preis erhalten werden.<br />
As you can see in the translation of Example 1), the<br />
RBMT system generated a complete sentence, yet with a<br />
wrong lexical choice for award. The SMT system on the<br />
other hand generated the right reading, but made morphological<br />
errors and did not generate a complete German<br />
sentence. In the translation of Example 4), a parsing<br />
error in the analysis phase of the RBMT system led to an<br />
almost unreadable result while the SMT decoder gener-<br />
4 System used: Lucy MT (Alonso & Thurmair, 2003)<br />
5 System used: phrase-based Moses (Koehn et al., 2007)
Multilingual Resources and Multilingual Applications - Regular Papers<br />
ated a generally intelligible translation, yet with stylistic<br />
and formal deficits.<br />
4) Source: Right after hearing about it, he described it<br />
as a “challenge to take action.”<br />
5) RBMT: Nachdem er richtig davon gehört hatte,<br />
bezeichnete er es als eine “Herausforderung, um<br />
Aktion auszuführen.”<br />
6) SMT: Gleich nach Anhörung darüber, beschrieb er<br />
es als eine “Herausforderung, Maßnahmen zu<br />
ergreifen.”<br />
Hybrid combination can hence lead to better overall<br />
translations.<br />
A Human-centric Hybrid Approach In contrast to<br />
other hybrid approaches, taraXÜ is primarily<br />
designed to support human post-editing, e.g., in a translation<br />
agency. Two different modes have to be handled by<br />
the project’s selection mechanism:<br />
• Human post-editing: Select the sentence that is<br />
easiest to post-edit and have the user edit it.<br />
• Standalone MT: Select the overall best translation<br />
and present it to the user.<br />
For the translation of 4), the best selection in Standalone<br />
MT mode would probably be 6), which is a useful<br />
translation, e.g., for information gisting. In Human<br />
post-editing mode, 5) would be a better selection as it can<br />
relatively quickly be transformed into 7), which is a<br />
human-quality translation.<br />
7) Human edit of 5): Gleich, nachdem er davon gehört<br />
hatte, bezeichnete er es als eine “Herausforderung,<br />
zu handeln.”<br />
One goal of taraXÜ is the design and implementation of<br />
such a novel selection mechanism; however, this is still<br />
work in progress and will be described elsewhere. Apart<br />
from properties of the source sentence (domain, complexity,<br />
etc.) and the different translations (grammatical<br />
Figure 2: Post-editing interface used within taraXÜ.<br />
correctness, sentence length, etc.), the selection mechanism<br />
will also take into account “metadata” of the<br />
various systems involved such as runtime, number of<br />
out-of-vocabulary warnings, number of different readings<br />
generated, etc.<br />
One industry partner in the project consortium provides<br />
modules for language checking that will not only be used<br />
in the selection mechanism, but also in pre-processing of<br />
the input. Starting from the observation that many translation<br />
problems arise from problematic input, another<br />
goal of taraXÜ is to develop automatic methods for<br />
pre-processing input before it is sent to MT translation<br />
engines.<br />
3. Large-Scale Human Evaluation<br />
Several large-scale human evaluation rounds are planned<br />
within the duration of taraXÜ, mainly for the calibration<br />
of both the selection mechanism as well as the<br />
pre-editing steps, but also for measuring the time needed<br />
for post-editing, and for getting a detailed error classification<br />
on the translation output from the various MT<br />
systems under investigation. The evaluation rounds are<br />
performed by external Language Service Providers that<br />
usually offer human translation services and hence are<br />
considered to act as unbiased experts.<br />
Evaluation Procedure The language pairs that will be<br />
implemented and tested during the runtime of taraXÜ are<br />
listed in Table 1.<br />
German ⇔ English, French, Japanese, Russian, Spanish<br />
English ⇔ Chinese, Czech<br />
Table 1: Language pairs treated in taraXÜ.<br />
We use an extended version of the browser-based<br />
evaluation tool Appraise (Federmann, 2010) to collect<br />
human judgments on the translation quality of the<br />
various systems under investigation in taraXÜ. A<br />
screenshot of the<br />
error classification interface can be seen in Figure 1, the<br />
post-editing view is presented in Figure 2.<br />
Pilot Evaluation Round The first (pilot) evaluation<br />
round of taraXÜ includes the language pairs EN→DE,<br />
DE→EN, and ES→DE. The corpus size per language<br />
pair is about 2,000 sentences, with the data taken mainly from<br />
previous WMT shared tasks, but also extracted from<br />
freely available technical documentation. Two evaluation<br />
tasks will be performed by the human annotators, mirroring<br />
the two modes of our selection mechanism:<br />
1) In the first task, the annotators have to rank the<br />
output of four different MT systems depending on<br />
their translation quality. In a subsequent step, they<br />
are asked to classify the two main types of errors (if<br />
any) of the chosen best translation. We use a subset<br />
of the error types suggested by Vilar et al. (2006), as<br />
shown in Figure 1.<br />
2) The second task for the human annotators in the first<br />
evaluation round is selecting the translation that is<br />
easiest to post-edit and to perform the editing. Only a<br />
minimal post-editing should be performed.<br />
First results of the ongoing analysis of the<br />
first human evaluation round are shown in Table 2. The<br />
top of the table shows the overall ranking among the four<br />
listed systems, bold face indicates the best system. Below<br />
are the results for translation from Spanish and English<br />
into German, respectively. At the bottom of the table,<br />
overall results on selected corpora are shown from the<br />
news domain (1,030 sentences from the WMT-2010<br />
news test set of Callison-Burch et al. (2010), sub-sampled<br />
proportionally to each one of its documents) and from the<br />
technical documentation of the OpenOffice project.<br />
One observation is that the systems’ ranks are comparatively<br />
close except for Trados, which is not a proper MT<br />
system. The very good result of Trados on the news<br />
corpora requires further investigation. A noticeable result<br />
is that Google performs worst on the WMT corpus although<br />
the data should—in principle—have been available<br />
online for training; this will also require some more<br />
detailed inspection. The latter might, however, explain<br />
the good performance of the web-based system on the<br />
OpenOffice corpus.<br />
Lucy Moses Trados Google<br />
Overall 2.00 2.38 3.74 1.86<br />
DE-EN 2.01 2.46 3.80 1.73<br />
ES-DE 1.85 2.42 3.72 1.99<br />
EN-DE 2.12 2.28 3.71 1.89<br />
WMT10 2.52 2.59 2.21 2.69<br />
OpenOffice 1.72 2.77 3.95 1.56<br />
Table 2: First human ranking results, as the average<br />
rank of each system in each task.<br />
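A figure of the kind reported in Table 2 is simply the mean of the per-sentence ranks; a minimal sketch with invented judgments (not the project's actual data):<br />

```python
# Each judgment assigns ranks 1 (best) to 4 to the four systems'
# outputs for one sentence; the values below are invented mini-data.
judgments = [
    {"Lucy": 2, "Moses": 3, "Trados": 4, "Google": 1},
    {"Lucy": 1, "Moses": 2, "Trados": 4, "Google": 3},
    {"Lucy": 3, "Moses": 2, "Trados": 4, "Google": 1},
]

systems = judgments[0].keys()
avg_rank = {s: sum(j[s] for j in judgments) / len(judgments) for s in systems}
best = min(avg_rank, key=avg_rank.get)  # lower average rank = better
```
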
4. Conclusions and Outlook<br />
In this paper, we have argued and shown evidence that a<br />
human-centric hybrid approach to Machine Translation is<br />
a promising way of integrating this technology into industrial<br />
translation workflows. Even in this early stage,<br />
taraXÜ has generated positive feedback and raised interest,<br />
especially on the side of the industry partners. We<br />
reported early results from the first (pilot) evaluation of<br />
taraXÜ, including language pairs EN→DE, DE→EN,<br />
and ES→DE. After analyzing the results of this pilot,<br />
further evaluation rounds will iteratively extend the<br />
numbers of languages covered and include questions<br />
related to topics such as controlled language, error types,<br />
and the effect of different subject domains. In the presentation<br />
of this paper, we will include a more detailed<br />
discussion of the first evaluation results.<br />
5. Acknowledgements<br />
This work has partly been developed within the taraXÜ<br />
project financed by TSB Technologiestiftung Berlin –<br />
Zukunftsfonds Berlin, co-financed by the European Union<br />
– European fund for regional development. This work<br />
was also supported by the EuroMatrixPlus project<br />
(IST-231720) that is funded by the European Community<br />
under the Seventh Framework Programme for Research<br />
and Technological Development.
6. References<br />
Alonso, J. A., Thurmair, G. (2003): The comprendium<br />
translator system. In Proceedings of the Ninth Machine<br />
Translation Summit.<br />
Callison-Burch, C., Koehn, P., Monz, C., Peterson, K.,<br />
Przybocki, M., Zaidan, O. (2010): Findings of the 2010<br />
joint workshop on statistical machine translation and<br />
metrics for machine translation. In Proceedings of the<br />
Joint Fifth Workshop on Statistical Machine Translation<br />
and MetricsMATR, pp. 17–53, Uppsala, Sweden.<br />
Association for Computational Linguistics. Revised<br />
August 2010.<br />
Callison-Burch, C., Osborne, M., Koehn, P. (2006):<br />
Re-evaluating the role of bleu in machine translation<br />
research. In Proceedings of the 11th Conference of the<br />
European Chapter of the Association for Computational<br />
Linguistics, pp. 249–256.<br />
Chen, Y., Jellinghaus, M., Eisele, A., Zhang, Y., Hunsicker,<br />
S., Theison, S., Federmann, C., Uszkoreit, H.<br />
(2009): Combining multi-engine translations with<br />
Moses. In Proceedings of the Fourth Workshop on<br />
Statistical Machine Translation, pp. 42–46, Athens,<br />
Greece. Association for Computational Linguistics.<br />
Federmann, C. (2010): Appraise: An open-source toolkit<br />
for manual phrase-based evaluation of translations. In<br />
Proceedings of the Seventh conference on International<br />
Language Resources and Evaluation. European<br />
Language Resources Association (ELRA).<br />
Federmann, C., Theison, S., Eisele, A., Uszkoreit, H.,<br />
Chen, Y., Jellinghaus, M., Hunsicker, S. (2009):<br />
Translation combination using factored word substitution.<br />
In Proceedings of the Fourth Workshop on<br />
Statistical Machine Translation, pp. 70–74, Athens,<br />
Greece. Association for Computational Linguistics.<br />
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C.,<br />
Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran,<br />
C., Zens, R., Dyer, C. J., Bojar, O., Constantin, A.,<br />
Herbst, E. (2007): Moses: Open source toolkit for statistical<br />
machine translation. In Proceedings of the 45th<br />
Annual Meeting of the Association for Computational<br />
Linguistics Companion Volume Proceedings of the<br />
Demo and Poster Sessions, pp. 177–180, Prague,<br />
Czech Republic. Association for Computational Linguistics.<br />
Li, Z., Callison-Burch, C., Dyer, C., Ganitkevitch, J.,<br />
Irvine, A., Khudanpur, S., Schwartz, L., Thornton, W.,<br />
Wang, Z., Weese, J., Zaidan, O. (2010): Joshua 2.0: A<br />
toolkit for parsing-based machine translation with<br />
syntax, semirings, discriminative training and other<br />
goodies. In Proceedings of the Joint Fifth Workshop on<br />
Statistical Machine Translation and MetricsMATR,<br />
pp. 133–137, Uppsala, Sweden. Association for<br />
Computational Linguistics.<br />
Papineni, K., Roukos, S., Ward, T., Zhu, W.-J. (2001):<br />
Bleu: a method for automatic evaluation of machine<br />
translation. IBM Research Report RC22176<br />
(W0109-022), IBM.<br />
Schwall, U., Thurmair, G. (1997): From metal to t1:<br />
systems and components for machine translation applications.<br />
In Proceedings of the Sixth Machine<br />
Translation Summit, pp. 180– 190.<br />
Vilar, D., Xu, J., D’Haro, L. F., and Ney, H. (2006): Error<br />
Analysis of Machine Translation Output. In International<br />
Conference on Language Resources and Evaluation,<br />
pp. 697– 702, Genoa, Italy.<br />
97
Multilingual Resources and Multilingual Applications - Regular Papers<br />
Annotation of Explicit and Implicit Discourse Relations<br />
in the TüBa-D/Z Treebank<br />
Anna Gastel, Sabrina Schulze, Yannick Versley, Erhard Hinrichs<br />
SFB 833, Universität Tübingen<br />
E-mail: (yannick.versley|erhard.hinrichs|sabrina.schulze)@uni-tuebingen.de, anna.gastel@student.uni-tuebingen.de<br />
Abstract<br />
We report on an effort to add annotation for discourse relations, discourse structure, and topic segmentation to a<br />
subset of the texts of the Tübingen Treebank of Written German (TüBa-D/Z), which will allow the study of discourse<br />
relations and discourse structure in the context of the other information currently present in the corpus (including<br />
syntax, referential annotation, and named entities). This paper motivates the design decisions taken in the context of<br />
existing annotation schemes for RST, SDRT or the Penn Discourse Treebank, provides an overview of the<br />
annotation scheme, and presents the results of an agreement study. In the agreement study, we use the notion of<br />
inter-adjudicator agreement to show that the task of discourse annotation, while challenging in principle, can be<br />
successfully solved when using appropriate heuristics.<br />
Keywords: discourse, annotation, text segmentation, agreement<br />
1. Introduction<br />
Discourse information has proven useful for a<br />
number of tasks, including summarization (Schilder,<br />
2002) and information extraction (Somasundaran et al.,<br />
2009). While coreference corpora exist for many<br />
languages, and in large and very large sizes (frequently<br />
over one million words), the annotation of discourse<br />
structure and discourse relations has only recently<br />
gained the interest of the community at large.<br />
Many of the existing corpora containing discourse<br />
structure and/or discourse relations are tightly bound to<br />
existing discourse theories such as Rhetorical Structure<br />
Theory (RST, Mann & Thompson, 1988) or Segmented<br />
Discourse Representation Theory (Asher, 1993), or<br />
assume a basic inventory of coherence relations while<br />
avoiding assumptions about discourse structure (Hobbs,<br />
1985; Wolf & Gibson, 2005).<br />
While annotation guidelines for corpora such as the RST<br />
Discourse Treebank (Carlson et al., 2003; see Stede<br />
2004, and van der Vliet et al., 2011 for German and<br />
Dutch corpora, respectively, following these guidelines),<br />
an SDRT corpus (Hunter et al., 2007), or the Penn<br />
Discourse Treebank (PDTB, Prasad et al., 2007; see Al-<br />
Saif & Markert, 2010 for an effort towards an Arabic<br />
counterpart) generally agree on the idea of discourse<br />
relations between discourse segments, they do differ in<br />
other important aspects: RST (in particular, Carlson &<br />
Marcu, 2001) and the SDRT guidelines of Reese et al.<br />
(2007) start from elementary discourse units (EDUs)<br />
that form the lowest level of a hierarchical structure; the<br />
PDTB's guidelines avoid the notion of discourse units,<br />
elementary or not, by asking annotators to mark<br />
connective arguments which may, but do not have to,<br />
coincide with syntactic or larger units, and do not need<br />
to form a hierarchy.<br />
In terms of the relation inventory, the most important<br />
desideratum consists in reconciling descriptive adequacy<br />
for the linguistic phenomena involved with an inventory<br />
size that can still be annotated reliably. This problem is<br />
solved in different ways: The RST guidelines contain a<br />
coarse level of 16 relation classes, which are further<br />
specified into 78 relations which are organized by<br />
nuclearity (where mononuclear relations put greater<br />
weight on one of the units, the nucleus, whereas<br />
CONTINGENCY [28.8%]<br />
Causal [20.5%]<br />
(c)Result-Cause (5.9%)<br />
(c)Result-Enable (4.7%)<br />
(c)Result-Epistemic (0.4%)<br />
(c)Result-Speechact (0.4%)<br />
(s)Explanation-Cause (6.6%)<br />
(s)Explanation-Enable (1.2%)<br />
(s)Explanation-Epistemic (1.1%)<br />
(s)Explanation-Speechact (0.6%)<br />
Conditional [3.0%]<br />
(c)Consequence (2.1%)<br />
(c)Alternation (0.5%)<br />
(c)Condition (0.5%)<br />
Denial [5.6%]<br />
(c)ConcessionC (4.0%)<br />
(s)Concession (2.0%)<br />
(s)Anti-Explanation (0.5%)<br />
multinuclear relations connect units that are equally<br />
important); Reese et al.'s guidelines for SDRT annotation<br />
do not posit any larger categories among their 14<br />
relations, but organize them by a distinction between<br />
coordinating and subordinating relations (cf. Asher &<br />
Vieu, 2005; this distinction vaguely corresponds to<br />
RST's notion of nuclearity), as well as by veridicality<br />
(where a relation is veridical if the larger unit containing<br />
it cannot be asserted without also asserting the truth of<br />
the relation arguments). The PDTB, in contrast, contains<br />
30 relations which are organized into a taxonomy with<br />
16 relations at the middle level and 4 relatively coarse<br />
top-level classes (Temporal, Contingency, Comparison,<br />
Expansion).<br />
For someone aiming to annotate a corpus with discourse<br />
structure, the choice is not easy: The Penn Discourse<br />
Treebank carefully avoids any strong commitments to<br />
the ideas it uses as a backdrop (such as Webber 2004;<br />
Knott et al., 2001), treating the annotation more like a<br />
collection of examples that can be mined to verify<br />
aspects of the theory; Al-Saif and Markert (2010), for<br />
their work on PDTB-style annotation of Arabic<br />
discourse, found it necessary to drastically simplify the<br />
annotation scheme (from 30 to 12 relations) in order to<br />
yield a feasible scheme for their annotation of explicit<br />
discourse connectives.<br />
Rhetorical Structure Theory, the most mature of the<br />
models for an annotation scheme, has also drawn a<br />
commensurate amount of (oftentimes valid) criticism:<br />
EXPANSION [43.6%]<br />
Elaboration [23.6%]<br />
(s)Restatement (10.9%)<br />
(s)Instance (3.4%)<br />
(s)InstanceV (1.0%)<br />
(s)Background (9.1%)<br />
Interpretation [4.2%]<br />
(s)Summary (1.0%)<br />
(s)Commentary (3.3%)<br />
Continuation [6.8%]<br />
(c)Continuation (6.4%)<br />
TEMPORAL [14.35%]<br />
(c)Narration (9.3%)<br />
(s)Precondition (2.4%)<br />
COMPARISON [11.4%]<br />
(c)Parallel (3.3%)<br />
(c)ParallelV (1.1%)<br />
(c)Contrast (7.0%)<br />
REPORTING [9.5%]<br />
(s)Attribution (4.2%)<br />
(s)Source (6.0%)<br />
Table 1: Taxonomy of discourse relations with corpus frequencies<br />
The most important one is that RST defines its relations<br />
in terms of speaker intentions, which yields good<br />
descriptive adequacy (given an appropriate inventory of<br />
relations), but fares less well for cognitive plausibility<br />
(cf. the overview of critiques in Taboada & Mann,<br />
2006), with Sanders and Spooren (1999) claiming that<br />
RST lacks a separation between intentions, which are<br />
defined in terms of speaker and hearer, and their goals<br />
(as is customary in RST), and coherence relations,<br />
which connect two propositions. In a similar vein, Stede<br />
(2008) puts forward the claim that RST's notion of<br />
nuclearity encompasses criteria on different linguistic<br />
levels that are not always in agreement with each other.<br />
Despite SDRT's focus on coherence relations and its<br />
strong theoretical commitment on coherence relations<br />
and their role in structuring the text, attempts to realize<br />
these principles in a general scheme for the discourse<br />
annotation of text have been few and far between,<br />
with the unpublished corpus of Hunter et al. (2007) being<br />
the most notable example.<br />
Hierarchical structuring of discourse is a well-established<br />
concept, not only because it reflects the<br />
principles that have been successful in structural<br />
accounts of syntax (see Polanyi & Scha, 1983; Grosz &<br />
Sidner, 1986, or Webber, 1991, inter alia), but also<br />
because it allows us to formulate well-formedness<br />
(coherence) constraints, as well as accessibility (Webber,<br />
1991) in terms of local configurations.
While such a tree structure is classically motivated<br />
through intentional notions (the discourse segment<br />
purposes of Grosz & Sidner, 1986), the notion of<br />
question under discussion has been used in information<br />
structure to explain intonational focus in terms of (a<br />
hierarchy of) question under discussion (van Kuppevelt,<br />
1995; Roberts, 1996; Büring, 2003; also Polanyi et al.,<br />
2003 for a related proposal). It also allows us to couch<br />
well-formedness in terms of valid sub-questions (for<br />
subordination) or being (non-exhaustive) answers to a<br />
common question (for coordination; cf. Txurruka, 2003).<br />
Hence, we have, in addition to object-level relations<br />
(part-of, causality), an additional level of relations such<br />
as Contrast which are explainable in terms of<br />
information-structural notions, and which yet fulfill the<br />
intuition (made explicit by Roberts, 1996) that at any<br />
given point in discourse, interlocutors have a common<br />
notion of the discourse structure. This level is distinct<br />
from the upper-level structure that is the result of<br />
conscious structuring of the writer (possibly following<br />
genre-specific rules). As an example, some of the very<br />
general RST relations such as Motivation or<br />
Preparation are only explainable in terms of writer<br />
intentions and conscious text structuring, which may or<br />
may not be transparent to the average recipient.<br />
Our own annotation scheme reflects van Kuppevelt's<br />
and Roberts' intuitions about a shared structure in<br />
discourse: We found it important to keep a backbone of<br />
explicit hierarchical structure, as in RST's annotation<br />
scheme, but also to avoid vague relations between large<br />
text segments, which are often genre-specific or the<br />
(sometimes idiosyncratic) result of intentional text<br />
structuring by the author. The PDTB successfully uses<br />
the metaphor of implicit connectives to limit discourse<br />
relations to connective-argument-sized pieces; in our<br />
case, we reconcile an explicit notion of (shallow)<br />
hierarchy with a focus on coherence relations by<br />
dividing the text into topically coherent stretches (as<br />
discussed, e.g., by Hearst, 1997), which we call topic<br />
segments, and annotate hierarchical discourse structure<br />
(using SDRT's notion of co- and subordinating discourse<br />
relations) inside these topic segments.<br />
In the following text, section 2 gives more details on the<br />
corpus and on the annotation scheme, whereas section 3<br />
presents an experiment to establish the reliability of our<br />
scheme using an inter-annotator agreement study.<br />
Section 4 presents and summarizes our findings.<br />
2. Corpus and Annotation Scheme<br />
As a textual basis for the corpus, we selected newspaper<br />
articles from the syntactically and referentially<br />
annotated TüBa-D/Z corpus (Telljohann et al., 2009),<br />
with the current version totalling 919 sentences in 31<br />
articles, or about 29.6 sentences/article (against 20.6<br />
sentences/article on average in the complete TüBa-D/Z,<br />
which also includes very brief newswire-style reports),<br />
and altogether 1159 discourse relations and 103 topic<br />
segments (or about 9 sentences per topic segment).<br />
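The per-article and per-segment averages quoted above follow directly from the raw corpus counts; a quick arithmetic check (Python used purely for illustration):<br />

```python
# Corpus figures reported for the discourse-annotated TüBa-D/Z subset.
sentences = 919
articles = 31
topic_segments = 103

sentences_per_article = sentences / articles        # reported as 29.6
sentences_per_segment = sentences / topic_segments  # reported as "about 9"

print(round(sentences_per_article, 1))  # 29.6
print(round(sentences_per_segment, 1))  # 8.9
```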
The relation inventory, and the distribution of different<br />
relation types, is presented in Table 1. From the starting<br />
point of the coordinating and subordinating discourse<br />
relations in Reese et al., we found it necessary to<br />
introduce finer distinctions in some places to ensure<br />
both consistency with a related effort on annotating<br />
explicit connectives (adding new relations such as<br />
Result-Enable, which corresponds to the Weak-Result<br />
relation proposed by Bras et al., 2006, for SDRT) and<br />
the distinction between Contrast and Concession,<br />
which is found in both the Penn Discourse Treebank and<br />
the RST annotation guidelines, but not Reese et al.'s<br />
proposal.<br />
The resulting 28 relations can be grouped into 8<br />
medium-level and 5 upper-level relation types by<br />
considering properties such as basic operation (causal<br />
vs. additive vs. temporal, with referential as a new group<br />
to account for elaborative relations) and symmetry as<br />
proposed by Sanders et al. (1992); the resulting higher-level<br />
types of discourse relations have much in common<br />
with the top-level taxonomic categories of the Penn<br />
Discourse Treebank with a small number of exceptions<br />
(the PDTB subsumes the non-symmetrical Concession<br />
relation under the label Comparison whereas we follow<br />
Sanders et al. in assuming a causal source of coherence<br />
for Concession and an additive source of coherence for<br />
the symmetrical Contrast relation; our Reporting group<br />
includes the Attribution and Source relations that Hunter<br />
et al. use in accounting for reported facts, whereas the<br />
Penn Discourse Treebank, unlike RST and SDRT, treats<br />
attribution as an issue that is orthogonal to discourse<br />
structure).<br />
The hierarchical organization of relations according to<br />
basic operation does not differentiate between additional<br />
properties such as coordination/subordination or<br />
veridicality. Examples (1) and (2) serve to illustrate this<br />
distinction:¹<br />
(1) a) Private Unternehmen dürfen die Telefonbücher<br />
der Telekom-Tochter DeTeMedien nicht ohne<br />
deren Erlaubnis zur Herstellung einer<br />
Telefonauskunfts-CDs verwenden.<br />
b) Die beklagten Unternehmen müssen den Vertrieb<br />
der Info-CDs sofort einstellen.<br />
Result-Cause(1a,1b)<br />
(2) a) Taxifahrer sind als Kolumnenthema eigentlich<br />
tabu,<br />
b) weil sie als "weiche Angriffsziele" gelten.<br />
Explanation-Cause(2a,2b)<br />
When the situation specified in Arg1 (1a) is interpreted<br />
as the cause of the situation specified in Arg2 (1b), the<br />
relation between those two arguments is labeled Result-<br />
Cause. Both arguments are necessary for coherence, so<br />
they are coordinated. The second example is labeled<br />
Explanation-Cause, because the situation specified in<br />
Arg1 (2a) is interpreted as the result of the situation<br />
specified in Arg2 (2b). The situation in (2a) contains the<br />
main information while the situation in (2b) contributes<br />
background information. With subordinating relations,<br />
Arg2 ('further information') is always subordinated to<br />
Arg1 ('main information'), independently of surface<br />
order, as can be seen in the following two examples:<br />
(3) a) Zwei Ex-Mafiosi behaupten zudem,<br />
b) von dem Mordauftrag Andreottis gewußt zu<br />
haben.<br />
Attribution(3a,3b)<br />
(4) a) Nach Angaben von Polizeipräsident Hagen<br />
Saberschinsky<br />
b) haben Polizeibeamte einen ihrer Kollegen<br />
angezeigt.<br />
Source(4b,4a)<br />
In example (3) the main information is situated in Arg1:<br />
It is relevant for the coherence of the text to know that<br />
two mobsters testified knowing about the murder<br />
contract of Andreotti, which makes them important<br />
witnesses in the murder charges against Andreotti.<br />
¹ TüBa-D/Z sentences 2563/2564, 7482/7483<br />
Therefore Arg2 is subordinated to Arg1. In example (4)<br />
the main information, namely that police officers press<br />
charges against one of their colleagues, is given by (4b).<br />
Therefore, 4b is the Arg1 of a Source relation, as it is<br />
more important to know about the complaint itself than<br />
to know where the information came from, and 4a is<br />
subordinated under 4b (cf. Hunter et al., 2007).<br />
Table 1 contains all discourse relations. Numbers in<br />
square brackets represent the distribution of the overall<br />
class. Numbers in parentheses represent the distribution<br />
of the single relation.<br />
In the table, coordinating relations are marked with a<br />
small 'c' in front of the relation and subordinating<br />
relations are marked with a small 's'.<br />
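The c/s marking and the grouping into top-level classes amount to a simple lookup structure. The following sketch covers only a few relations from Table 1 and is illustrative, not part of the actual annotation tooling:<br />

```python
# Partial sketch of the Table 1 inventory: each relation maps to its
# top-level class and to 'c' (coordinating) or 's' (subordinating).
TAXONOMY = {
    "Result-Cause":      ("Contingency", "c"),
    "Explanation-Cause": ("Contingency", "s"),
    "Restatement":       ("Expansion",   "s"),
    "Narration":         ("Temporal",    "c"),
    "Contrast":          ("Comparison",  "c"),
    "Attribution":       ("Reporting",   "s"),
    "Source":            ("Reporting",   "s"),
}

def is_subordinating(relation: str) -> bool:
    """True if Arg2 is subordinated to Arg1 for this relation."""
    return TAXONOMY[relation][1] == "s"

print(is_subordinating("Attribution"))  # True
print(is_subordinating("Narration"))   # False
```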
3. An experiment on inter-annotator and<br />
inter-adjudicator agreement<br />
For any annotation scheme that ventures into the domain<br />
of semantic and/or pragmatic distinctions, reliability is<br />
an issue that needs to be addressed explicitly in order to<br />
maintain the predictability of the annotated data (or,<br />
equivalently, the predictive power of conclusions from<br />
that data).<br />
Regarding the agreement on discourse relations, Marcu<br />
et al. (1999) determined κ values between κ=0.54<br />
(Brown corpus) and κ=0.62 (MUC) for fine-grained<br />
RST relations and between κ=0.59 (Brown) and κ=0.66<br />
(MUC) for coarser-grained relations. In their reliability<br />
study with the Penn Discourse Treebank, Prasad<br />
et al. (2008) determined agreement values between 80%<br />
(finest level) and 94% (coarsest level with 4 relation<br />
types), but did not report any chance-corrected values.<br />
Al-Saif and Markert (2010) report values of κ=0.57 for<br />
their PDTB-inspired connective scheme, saying that<br />
most disagreements are due to highly ambiguous<br />
connectives such as w/and, which can receive one of<br />
several relations. In a study on their Dutch RST corpus,<br />
van der Vliet et al. (2011) found an inter-annotator<br />
agreement of κ=0.57. To the best of our knowledge, no<br />
agreement figures have been published on the RST-based<br />
Potsdam Commentary Corpus (Stede, 2004) or<br />
any other German corpus with discourse relation<br />
annotation.<br />
In the regular annotation process of our corpus, two<br />
annotators create EDU segmentation, topic segments,
and discourse relations independently from each other;<br />
in a second step, the results from both annotators are<br />
compared and a coherent gold-standard annotation is<br />
created after discussing the goodness-of-fit of respective<br />
partial analyses to the text and the applicability of<br />
linguistic tests. In order to account for the complete<br />
annotation process including the revision step, we<br />
follow Burchardt et al. (2006) and separately report<br />
inter-annotator agreement, which is determined after the<br />
initial annotation, and inter-adjudicator agreement,<br />
which is determined after an additional adjudication<br />
step. The adjudication step is carried out by two<br />
adjudicators based on the original set of annotations, but<br />
is performed by each adjudicator independently from the<br />
other.<br />
In the case where multiple relations were annotated<br />
between the same EDU ranges (for example, a temporal<br />
Narration relation in addition to a Result-Cause relation<br />
from the Contingency group), we counted the<br />
annotations as matching whenever the complete set of<br />
relations (i.e. {Narration, Result-Cause} in the example)<br />
is the same across annotators.<br />
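The matching criterion for multiply-annotated spans described above reduces to comparing label sets, which order-insensitively treats {Narration, Result-Cause} and {Result-Cause, Narration} as identical; a minimal sketch:<br />

```python
def spans_match(relations_a, relations_b):
    """Two annotations of the same EDU range count as matching only
    if the complete sets of relation labels are identical."""
    return set(relations_a) == set(relations_b)

print(spans_match({"Narration", "Result-Cause"},
                  {"Result-Cause", "Narration"}))  # True
print(spans_match({"Narration", "Result-Cause"},
                  {"Narration"}))                  # False
```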
In a sample of three documents that we used for our<br />
agreement study, we found that annotators agreed on 49<br />
relation spans, with the comparison yielding an<br />
agreement value of κ=0.55 for individual relations, and<br />
κ=0.65 for the middle level of the taxonomy (eight<br />
relation types).<br />
For the inter-adjudicator task, we found an agreement on<br />
82 relation spans, among which relation agreement was<br />
at κ=0.83 for individual relations, and κ=0.85 for the<br />
middle level of the taxonomy, or a reduction of<br />
disagreements of about 57%.<br />
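The κ values above are chance-corrected agreement figures in the usual (Cohen) sense. A minimal sketch of the computation over paired relation labels follows; the label sequences are toy data, not the actual annotations:<br />

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' parallel label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of spans with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

annotator_1 = ["Contrast", "Narration", "Contrast", "Attribution"]
annotator_2 = ["Contrast", "Narration", "Attribution", "Attribution"]
print(round(cohens_kappa(annotator_1, annotator_2), 2))  # 0.64
```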
4. Discussion and Conclusion<br />
In this article, we have presented the annotation scheme<br />
we use to annotate discourse relations of complete texts<br />
in a subset of the TüBa-D/Z corpus, and reported the<br />
results of an agreement study using these guidelines and<br />
relation inventory. While the raw inter-annotator<br />
agreement is on a similar level as other annotation<br />
efforts with a similar scope, we found that a subsequent<br />
adjudication step introduces a rather substantial<br />
reduction in disagreements (between adjudicated<br />
versions that were obtained independently of each<br />
other), which suggests that a large part of the (raw)<br />
disagreement is due to the sheer complexity of the task<br />
and should not be taken as indicating the infeasibility of<br />
discourse structure (and discourse relation) annotation in<br />
general.<br />
The public availability of a corpus with discourse<br />
relation annotation in combination with the syntactic<br />
and referential annotation from the main TüBa-D/Z<br />
corpus will also make it possible to provide an empirical<br />
evaluation of theories concerning the interface between<br />
syntax and discourse, such as D-LTAG (Webber, 2004)<br />
or D-STAG (Danlos, 2009) as well as those that predict<br />
interactions between referential and discourse structure<br />
(Grosz & Sidner 1986; Cristea et al., 1998; Webber,<br />
1991; Chiarcos & Krasavina, 2005, inter alia).<br />
5. References<br />
Al-Saif, A., Markert, K. (2010): Annotating discourse<br />
connectives for Arabic. In Proc. LREC 2010.<br />
Asher, N. (1993): Reference to Abstract Objects in Discourse.<br />
Kluwer, Dordrecht.<br />
Asher, N., Lascarides, A. (2003): Logics of Conversation.<br />
Cambridge University Press, Cambridge.<br />
Asher, N., Vieu, L. (2005): Subordinating and coordinating<br />
discourse relations. Lingua 115, 591-610.<br />
Bras, M., Le Draoulec, A., Asher, N. (2006): Evidence for a<br />
Scalar Analysis of Result in SDRT from a Study of the<br />
French Temporal Connective 'alors'. In: SPRIK<br />
Conference ”Explicit and Implicit Information in Text -<br />
Information Structure across Languages”.<br />
Burchardt, A., Erk, K., Frank, A., Kowalski, A., Padó, S.,<br />
Pinkal, M. (2006): The SALSA Corpus: a German<br />
Corpus Resource for Lexical Semantics. In Proceedings<br />
of LREC 2006.<br />
Büring, D. (2003): On D-Trees, Beans, and B-Accents.<br />
Linguistics and Philosophy 26(5), pp. 511-545.<br />
Carlson, L., Marcu, D. (2001): Discourse Tagging Manual.<br />
ISI Tech Report ISI-TR-545.<br />
Carlson, L., Marcu, D., Okurowski, M. E. (2003): Building<br />
a Discourse-Tagged Corpus in the Framework of<br />
Rhetorical Structure Theory. In: Current Directions in<br />
Discourse and Dialogue, Kluwer.<br />
Chiarcos, C., Krasavina, O. (2005): Rhetorical Distance<br />
Revisited: A Parametrized Approach. In Workshop on<br />
Constraints in Discourse (CID 2005).<br />
Cristea, D., Ide, N., Romary, L. (1998): Veins Theory: A<br />
Model of Global Discourse Cohesion and Coherence. In<br />
Proc. CoLing 1998.<br />
Danlos, L. (2009): D-STAG: un formalisme d'analyse<br />
automatique de discours basé sur les TAG synchrones.<br />
Revue TAL 50 (1), pp. 111-143.<br />
Grosz, B., Sidner, C. (1986): Attention, Intentions, and the<br />
structure of discourse. Computational Linguistics 12(3),<br />
pp. 175-204.<br />
Hearst, M. (1997): TextTiling: Segmenting Text into Multi-<br />
Paragraph Subtopic Passages, Computational<br />
Linguistics, 23 (1), pp. 33-64.<br />
Hobbs, J. (1985): On the Coherence and Structure of<br />
Discourse, Report No. CSLI-85-37, Center for the Study<br />
of Language and Information, Stanford University.<br />
Hunter, J., Baldridge, J., Asher, N. (2007): Annotation for<br />
and Robust Parsing of Discourse Structure on<br />
Unrestricted Texts. Zeitschrift für Sprachwissenschaft<br />
26, pp. 213-239.<br />
Knott, A., Oberlander, J., O'Donnell, M., Mellish, C.<br />
(2001): Beyond Elaboration: The interaction of relations<br />
and focus in coherent text. In: Sanders, Schilperoord,<br />
Spooren (eds.), Text representation: linguistic and<br />
psycholinguistic aspects. John Benjamins.<br />
Mann, W. C., Thompson, S. A. (1988): Rhetorical Structure<br />
Theory: Toward a functional theory of text organization.<br />
Text 8, pp. 243-281.<br />
Marcu, D., Amorrortu, E., Romera, M. (1999):<br />
Experiments in Constructing a Corpus of Discourse<br />
Trees. ACL Workshop on Standards and Tools for<br />
Discourse Tagging.<br />
Polanyi, L., Scha, R. (1983): On the Recursive Structure of<br />
Discourse. In K. Ehlich & H. Van Riemsdijk (Eds.),<br />
Connectedness in sentence, discourse and text,<br />
pp. 141–178. Tilburg: Tilburg University<br />
Prasad, R., Miltsakaki, E., Dinesh, N., Lee, A., Joshi, A.,<br />
Robaldo, L., Webber, B. (2007): The Penn Discourse<br />
Treebank 2.0 Annotation Manual. Technical Report,<br />
University of Pennsylvania.<br />
Reese, B., Denis, P., Asher, N., Baldridge, J., Hunter, J.<br />
(2007): Reference Manual for the Analysis and<br />
Annotation of Rhetorical Structure. Technical Report,<br />
University of Texas at Austin.<br />
Roberts, C. (1996): Information Structure in Discourse:<br />
Towards an Integrated Formal Theory of Pragmatics. In<br />
Yoon, Kathol (eds.), OSU Working Papers in Linguistics<br />
49: Papers in Semantics, pp. 91-136.<br />
Sanders, T. J. M., Spooren, W. P. M., Noordman, L. G. M.<br />
(1992): Toward a Taxonomy of Coherence Relations.<br />
Discourse Processes 15, pp. 1-35.<br />
Sanders, T. J. M., Spooren, W. P. M. (1999):<br />
Communicative intentions and coherence relations. In<br />
Bublitz, Lenk, Ventola (eds.) Coherence in Text and<br />
Discourse, pp. 235-250. John Benjamins, Amsterdam.<br />
Schilder, F. (2002): Robust discourse parsing via discourse<br />
markers, topicality and position. Natural Language<br />
Engineering 8(2), pp. 235-255.<br />
Somasundaran, S., Namata, G., Wiebe, J., Getoor, L.<br />
(2009): Supervised and Unsupervised Methods in<br />
Employing Discourse Relations for Improving Opinion<br />
Polarity Classification. In Proc. EMNLP 2009.<br />
Stede, M. (2004): The Potsdam Commentary Corpus. In<br />
Proc. ACL Workshop on Discourse Annotation.<br />
Telljohann, H., Hinrichs, E. W., Kübler, S., Zinsmeister, H.,<br />
Beck, K. (2009): Stylebook for the Tübingen Treebank<br />
of Written German (TüBa-D/Z). Technical Report,<br />
Seminar für Sprachwissenschaft, Universität Tübingen.<br />
Txurruka, I. G. (2003): The Natural Language Conjunction<br />
And. Linguistics and Philosophy 26(3), pp. 255-285.<br />
van der Vliet, N., Berzlanovich, I., Bouma, G., Egg, M.,<br />
Redeker, G. (2011): Building a Discourse-Annotated<br />
Dutch Text Corpus. In Proceedings of the DGfS<br />
Workshop “Beyond Semantics”, Bochumer<br />
Linguistische Arbeitsberichte 3.<br />
van Kuppevelt, J. (1995): Discourse Structure, Topicality<br />
and Questioning. Journal of Linguistics 31, pp. 109-147.<br />
Webber, B. (1991): Structure and Ostension in the<br />
Interpretation of Discourse Deixis. Language<br />
and Cognitive Processes 6(2), pp. 107-135.<br />
Webber, B. (2004): D-LTAG: Extending Lexicalized TAG<br />
to Discourse. Cognitive Science 28, pp. 751-779.
Devil’s Advocate on Metadata in Science<br />
Christina Hoppermann, Thorsten Trippel, Claus Zinn<br />
General and Computational Linguistics, University of Tübingen<br />
Wilhelmstraße 19, D-72074 Tübingen<br />
E-mail: christina.hoppermann@uni-tuebingen.de, thorsten.trippel@uni-tuebingen.de, claus.zinn@uni-tuebingen.de<br />
Abstract<br />
This paper uses a devil’s advocate position to highlight the benefits of metadata creation for linguistic resources. It provides an<br />
overview of the required metadata infrastructure and shows that this infrastructure has in the meantime been developed by various projects<br />
and hence can be deployed by those working with linguistic resources and archiving. Possible caveats of metadata creation are<br />
discussed, starting with user requirements and backgrounds, the contribution to researchers' academic merits, and standardisation.<br />
These are answered with existing technologies and procedures, referring to the Component Metadata Infrastructure (CMDI). CMDI<br />
provides an infrastructure and methods for adapting metadata to the requirements of specific classes of resources, using central<br />
registries for data categories, and metadata schemas. These registries allow for the definition of metadata schemas per resource type<br />
while reusing groups of data categories also used by other schemas. In summary, rules of best practice for the creation of metadata are<br />
given.<br />
Keywords: metadata, Component Metadata Infrastructure (CMDI), infrastructure, sustainable archives<br />
1. Introduction<br />
The creation and analysis of primary research data<br />
account for a large share of a researcher’s workload. In linguistics,<br />
research data comprises many different types: there are<br />
resources such as corpora, lexicons, and grammars; there<br />
are various kinds of experimental data resulting, for<br />
example, from perception and production studies with<br />
sensor data originating from eye-tracking and MRI<br />
(magnetic resonance imaging) devices. There is data in<br />
the form of speech recordings, written text, and videotaped<br />
gestures, which, in part, is annotated or transcribed along<br />
many different layers; there is audio and video data of<br />
other forms of human-human communication such as<br />
cultural or religious songs or dances; and there is also a<br />
large variety of software tools for the manipulation,<br />
analysis and interpretation of all these types of data<br />
sources.<br />
Once a study of research data yields statistically and<br />
scientifically significant results, it is documented and<br />
published, usually complementing a description of<br />
research methodology, interpretations of results, etc.,<br />
with a depiction of the underlying research data.<br />
Reputable journals are archived so that their articles<br />
remain accessible for a long time. Access to articles is<br />
usually facilitated via Dublin Core (DC) metadata<br />
categories such as “author”, “title”, “journal”,<br />
“publisher” or “publication year”. In general, however,<br />
there is no infrastructure in place to access the research<br />
data underlying a reported study, although some<br />
researchers make such data available via their webpage<br />
or institution, and some conferences or journals ask<br />
authors to supplement their article with primary data,<br />
which is then also made public (for example, Interspeech 2011 invited authors to submit supporting data files to be included on the proceedings CD-ROM upon paper acceptance). So far, it is not the general rule for researchers to describe their research data with metadata for indexing or cataloguing, whether by themselves or by others. In part,
this is due to caveats for the provision of metadata held<br />
by large parts of the scientific community. In this paper,<br />
the Devil’s Advocate (DA) will articulate some of these<br />
caveats. We will aim at rebutting each of them, given the<br />
recent advances for metadata management, in particular,<br />
in the area of linguistics.<br />
2. Playing Devil’s Advocate<br />
DA: There is little if any scientific merit to be gained<br />
from resource and metadata provision.<br />
This is a view mentioned in a recent statement by the Wissenschaftsrat (a joint council of German Research Foundation officials and researchers appointed by the government to advise it on research-related issues), which says that infrastructure does
hardly provide for an increased scholarly reputation (Wissenschaftsrat, 2011:23). Though this might be true for a restricted notion of scientific merit, that is, merit defined by the number of published journal articles and books, it is not true in a less restricted sense.
Furthermore, the Wissenschaftsrat (2011:23) points out that infrastructural projects offer the opportunity for methodological innovations, generate new research questions, and help attract new researchers. If new researchers, methods and research questions are part of scientific merit, the claim that there is no scientific merit in metadata provision is thus not true.
There are even more reasons for arguing that additional scientific merit is gained, in at least three overlapping areas: (1) by providing a complete overview of the field, (2) by fostering interoperability and providing reproducible, non-arbitrary results, and (3) by increasing the pace of gaining research results.
First of all, in an ideal case, a metadata-driven resource inventory gives an accurate picture of a scientific landscape by covering all resource types, such as corpora, lexical databases, or experiments. Access to the resources alone gains little in principle, because it is too time-consuming to analyse the data and reproduce research questions from it. But as soon as resources are described by metadata, it is possible to classify and sort them and to provide an overview using the descriptions as such. Though descriptions contain generalisations, they are still sufficient to provide an outline of the resources. This also provides essential background for steering research activities and funding projects, as well as for discovering trends and gaps, all of which helps to increase the researcher's reputation and merit.
Second, metadata-based publicity fosters communication between researchers, for example because contact information is required to gain access to resources, because comparable data structures are needed for reuse by other methods, or because selections of resources (e.g. subcorpora) have to be created. Resources can be merged and cross-evaluated to discover which results are reproducible. This helps to avoid fraud and plagiarism. At the same time, the investigation of
research questions different from the original ones can be<br />
applied to existing resources. In all cases, good scientific practice will credit the resource creator, and thus add to his or her reputation whenever a publication makes reference to the underlying research data, which is possible on the basis of appropriate metadata. The references pointing to the resources can be indexed by others and are consequently added to the scientific map.
Third, more results can be obtained faster. By providing metadata, researchers new to a discipline gain a faster overview of the research questions and activities of the discipline, as well as easier access to existing linguistic resources and tools. Moreover, accurate metadata descriptions can help avoid the duplication of research work by providing insights into, and access to, existing work. Hence, researchers who are applying new methods do not always have to recreate resources but can rely on existing ones, providing a jumpstart. At the same time, the resources themselves provide added benefit by being more widely used, thereby also increasing the reputation of the creator.
DA: Expert knowledge on metadata is required to<br />
properly describe research data. Thus metadata<br />
experts rather than researchers are called for duty.<br />
The library sciences, with their long tradition and<br />
expertise in metadata, have many different classification<br />
systems in place to organise collections. But is it realistic<br />
to ask researchers, such as linguists, to properly describe<br />
language resources and tools with metadata, given their<br />
lack of knowledge in metadata provision, the variety and<br />
complexity of research data, and the missing dominant<br />
metadata schemes in the field? On the other hand, it<br />
seems clear that metadata provision cannot be done<br />
properly without the researchers’ involvement. It is<br />
unrealistic to assume that research data can simply be handed to a librarian with expertise in linguistics (or a linguist with expertise in archiving methodology) with the task of assigning proper metadata to it. There needs to be
considerable involvement of the resource creator in<br />
describing the resource in formal (where possible) and<br />
informal terms (possibly by filling out a questionnaire).<br />
The “librarian” can then enter the provided information<br />
into a formal schema, ensuring that, at least, obligatory<br />
descriptors are properly provided. In sum, to put a proper<br />
metadata-based infrastructure in place, some minimal<br />
researcher training in metadata provision is needed. This
needs to be complemented with infrastructure personnel,<br />
or, if possible, with user-friendly metadata editors that<br />
trained researchers can learn to use.<br />
DA: There is little if any consensus on the set of metadata descriptors or metadata schemes to be used in describing language resources and tools.
It is clear that a common vocabulary for metadata<br />
provision is required. Otherwise it will be hard to offer<br />
effective metadata-based search and retrieval services. It<br />
is also evident that established metadata standards such<br />
as Dublin Core are insufficient, as they do not include<br />
every data category (DatCat) needed for describing<br />
specific types of resources. However, given the complexity of the research field in linguistics, with its many different resource types, it is naïve to assume that established metadata schemas can be reused without losing descriptive power. For example, resource types need to be indicated, and for different resource types additional descriptive categories need to be defined. For lexical resources it is common to describe the lexical structures; for annotations, the annotation tag sets; for experiments, the size of the samples and the free and bound variables. Each of these data categories is relevant only for one type of resource, but for that type it can be more essential than categories such as "title" and "author". As this list of data categories may require additions whenever new resource types become available, it needs to be treated as an open list.
In recent times, some consensus on the procedure of<br />
creating elementary field names for the description of<br />
linguistic research data has been achieved in order to<br />
allow for a standardisation of data categories. It is<br />
formally captured by the ISOcat data category registry<br />
for the description of language resources and tools (ISO<br />
12620; International Organization of Standardization,<br />
2009; http://www.isocat.org). ISOcat (Figure 1) is an open web-based registry of data categories into which anybody can insert their own data categories together with (human-readable) definitions of their intended use. This is done in a private space with limited access, which researchers can use for new data categories that are not yet intended, or not yet ready, for standardisation. For private use, these data categories can already be referenced via persistent identifiers (PIDs), but they can also be stored in a public space with unrestricted access and be proposed as standard data categories. If data categories are submitted for standardisation, a standardisation process involving domain experts is initiated, with community consensus building, quality assurance, voting and maintenance cycles.
Figure 1: Relation between ISOcat, Component<br />
Registry and metadata instances<br />
The registry provides a solid base to start from, but the<br />
sheer size of available DatCats may overwhelm untrained<br />
users. Additional structures are needed to minimise cases<br />
where different users may apply different descriptors to<br />
provide similar resources with metadata. For this purpose,<br />
the Component Registry for metadata (Figure 1; Broeder<br />
et al., 2010; http://catalog.clarin.eu/ds/ComponentRegistry/#)<br />
contains a rich set of prefabricated metadata building<br />
blocks that aggregate elementary blocks of data<br />
categories into larger compounds. Researchers can select<br />
and combine existing building blocks – or define new
ones – in a schema, which can then be instantiated to<br />
describe a given resource with the help of a so-called<br />
metadata instance (Figure 1). The concept of reusing<br />
building blocks is part of the Component Metadata<br />
Infrastructure (CMDI, http://www.clarin.eu/cmdi). For<br />
many resource types the registry already contains<br />
prefabricated schemas that can be re-used by researchers.<br />
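The component idea can be pictured with a toy model: components bundle data categories, each element pointing back to a registry entry via its PID, and profiles combine components. This is only an illustrative sketch; the component name, element names, and PID strings below are invented placeholders, not actual ISOcat or Component Registry entries.

```python
# Toy model of CMDI-style components: elements reference data-category PIDs.
actor = {
    "component": "Actor",
    "elements": {
        "actorName": "pid:example-0001",  # placeholder PID, not a real entry
        "actorRole": "pid:example-0002",
    },
}

# A profile reuses the component and adds elements of its own.
corpus_profile = {
    "profile": "TextCorpus",
    "components": [actor],
    "elements": {"corpusSize": "pid:example-0003"},
}

def all_pids(profile):
    """Collect every data-category PID referenced by a profile."""
    pids = list(profile["elements"].values())
    for comp in profile.get("components", []):
        pids.extend(comp["elements"].values())
    return pids

assert "pid:example-0001" in all_pids(corpus_profile)
```

Because the `Actor` component is a self-contained building block, a second profile (say, for a multimodal corpus) could list the same component and thereby share its data-category references, which is the reuse effect the Component Registry aims at.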
Moreover, there exists at least one fully functional<br />
metadata editor (http://www.lat-mpi.eu/tools/arbil/) with<br />
interfaces to both ISOcat and the Component Registry. It<br />
is freely available, and its developers provide support to help non-expert users who might otherwise be overwhelmed by the full range of functions the editor offers. There are
also other XML editors that support the schemas. Once a schema has been defined with these tools, such off-the-shelf XML editors can be used to describe resources with metadata according to the schema. The schemas can then be used to validate the metadata instances with syntactic parsers, ensuring adherence to the prescribed structure and controlled vocabularies.
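The kind of syntactic check meant here can be sketched in a few lines. The sketch below uses only the Python standard library and deliberately invented element names and an invented controlled vocabulary; a real setup would validate against the XSD generated from the CMDI profile.

```python
import xml.etree.ElementTree as ET

REQUIRED = ["resourceName", "resourceType"]     # obligatory descriptors (hypothetical)
RESOURCE_TYPES = {"corpus", "lexicon", "tool"}  # controlled vocabulary (hypothetical)

def check_instance(xml_text):
    """Return a list of problems found in a metadata instance; empty list = valid."""
    problems = []
    root = ET.fromstring(xml_text)  # raises ParseError on ill-formed XML
    for tag in REQUIRED:
        if root.find(tag) is None:
            problems.append(f"missing obligatory descriptor: {tag}")
    rtype = root.findtext("resourceType")
    if rtype is not None and rtype not in RESOURCE_TYPES:
        problems.append(f"value not in controlled vocabulary: {rtype}")
    return problems

instance = ("<resource><resourceName>demo corpus</resourceName>"
            "<resourceType>corpus</resourceType></resource>")
assert check_instance(instance) == []
```

An instance missing an obligatory descriptor or using a value outside the controlled vocabulary would come back with a non-empty problem list, which is exactly what harvesters rely on being caught before publication.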
DA: There is rarely a right time to make a resource<br />
public (via metadata description).<br />
Research rarely follows a fully planned path. A resource<br />
such as a corpus or a lexicon is adjusted, additional layers<br />
of annotation or transcription are added, data may get<br />
re-annotated with different coders, lexical entries may get<br />
revised or extended to reflect new insights, etc.<br />
Nevertheless, the moment publications are created and project reports are written, it should be good scientific practice to archive the underlying research data and to assign and publish metadata about the resource. Here, the current status of the resource can be recorded with metadata about, for instance, the resource's life cycle or versioning information.
There has also been a policy change among funding agencies. The German Research Foundation (DFG), for instance, stipulates that resources ought to be maintained by the originating institution, that researchers are responsible for the proper documentation of resources, and that procedures need to be defined for the case that researchers leave an institution (Deutsche Forschungsgemeinschaft, 1998:13). A proper
documentation of resources has to include their<br />
description in terms of metadata to facilitate their<br />
archival and future retrieval.<br />
Therefore, metadata should be provided (or revised) at the latest at the end of a research project, preferably by the researchers who created the resource. Ideally, the
life cycle stage at archiving time is already defined in the<br />
project work plan. Even if the desired final state was not<br />
accomplished, the primary data needs to be archived by<br />
the end of the project with proper metadata assigned to it.<br />
DA: Without a central metadata agency, all the added<br />
values advertised will not materialise.<br />
Added values such as searchability and citation of<br />
resources require some point of access to the metadata. It is correct that there is no single central metadata agency, but there are various interconnected agencies providing metadata services to the community. For instance, the German NaLiDa project
(http://www.sfs.uni-tuebingen.de/nalida/) serves as a<br />
metadata centre for resources and tools created in<br />
Germany. The project as such does not claim exclusive<br />
representation, but aims at cooperating with other<br />
archives in providing a service to the community for<br />
accessing metadata in the form of catalogues and<br />
allowing easy access to resources. It harvests metadata<br />
from participating institutions and also provides<br />
metadata management support for German research<br />
institutions (Barkey et al., 2011). Within the project, a faceted search interface, complemented by a full-text search engine, was developed (http://www.sfs.uni-tuebingen.de/nalida/katalog); it currently provides access to more than 10,000 metadata records of language resources and tools. Though the NaLiDa project could be seen as a central metadata agency, its implementation has a rather decentralised flavour.
Metadata is harvested from various sources and then<br />
aggregated and indexed into a single database. To<br />
kick-start or increase the inflow of data, participating institutions receive help both in setting up an OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) based data provision service and in other aspects of metadata creation and maintenance. Once the
local metadata providers – the primary research data<br />
remains with the institutions – are set up, other parties<br />
than NaLiDa are free to crawl their data sets and to<br />
provide services in terms of all data.<br />
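The harvesting step described above can be sketched as follows. For a self-contained example, a canned OAI-PMH ListRecords response is parsed instead of being fetched over HTTP, and only record identifiers are extracted; a real harvester would also read the metadata payloads and page through `resumptionToken`s.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# A minimal ListRecords response, as an OAI-PMH data provider would return it.
response = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:resource-1</identifier></header>
    </record>
    <record>
      <header><identifier>oai:example.org:resource-2</identifier></header>
    </record>
  </ListRecords>
</OAI-PMH>"""

def harvested_identifiers(xml_text):
    """Extract the record identifiers from a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(OAI + "identifier")]

print(harvested_identifiers(response))
# → ['oai:example.org:resource-1', 'oai:example.org:resource-2']
```

Aggregating such identifier lists (plus the records behind them) from many local providers into one index is, in essence, what the NaLiDa and CLARIN crawlers do.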
At the European level, the CLARIN project<br />
(http://www.clarin.eu) has also devised such a crawler,<br />
and is likewise offering a faceted search interface for<br />
language resources and tools (CLARIN Virtual Language<br />
Observatory, http://www.clarin.eu/vlo/). Since both (and<br />
other) parties work towards the realisation of a common<br />
infrastructure, with different foci but similar goals, there<br />
is much to be gained from a healthy competition and<br />
exchange of ideas for the scientific community to profit<br />
from.<br />
3. Summary<br />
Given the recent advances in linguistics with regard to metadata provision for linguistic resources and tools, there are few excuses left for not using the existing infrastructure. In general, this results in the
following rules of best practice for the documentation of<br />
resources:<br />
1) One of the best strategies for preserving research<br />
data is by publishing it into repositories and<br />
networks. This way, multiple archives serve as<br />
backup. Additionally, it allows for an easier sharing<br />
and spreading of resources, contributing to the<br />
academic merits of resource providers.<br />
2) Archived data is more easily accessible if it is sufficiently described. As flexible metadata schemas can adapt to various types of resources, it is possible to create such descriptions as required by the type of a resource. Metadata can then be used to make resources public, so that others can use (harvest) them.
3) Data categories are best defined in central (standardised) registries, such as ISOcat, that allow for references via persistent identifiers. No data categories should be used that are not centrally defined, to avoid fragmentation of the resource community.
4) For interoperability purposes, components as collections of data categories should be reused where adequate, or defined as new entries in the Component Registry for reuse by others.
5) The flexibility of the framework helps to avoid tag<br />
abuse if data providers adhere to data category<br />
definitions or, if not available, define their own<br />
modified categories. This will contribute to the<br />
consistency and reusability of data.<br />
6) Syntactic evaluation of metadata should always be performed to ensure harvestability, the usability of applications, and consistency. By checking content models, tag abuse can be avoided further.
7) When using research data, it should be referred to as stated in its metadata.
8) Resource creators might need some training and<br />
assistance, which is provided by various projects.<br />
Some time for this work should be included.<br />
4. Acknowledgements<br />
Work on this paper was conducted within the Centre for Sustainability of Linguistic Data (Zentrum für Nachhaltigkeit Linguistischer Daten, NaLiDa), which is
funded by the German Research Foundation (DFG) in the<br />
Scientific Library Services and Information Systems<br />
(LIS) framework, and within the infrastructure project<br />
Heterogeneous Primary Research Data: Representation<br />
and Processing of the Collaborative Research Centre The<br />
Construction of Meaning: the Dynamics and Adaptivity<br />
of Linguistic Structures (SFB 833), which is also funded<br />
by the DFG.<br />
5. References<br />
Barkey, R., Hinrichs, E., Hoppermann, C., Trippel, T., Zinn, C. (2011): Komponenten-basierte Metadatenschemata und Facetten-basierte Suche – Ein flexibler und universeller Ansatz. In J. Griesbaum, T. Mandl & C. Womser-Hacker (eds.), Information und Wissen: global, sozial und frei? Internationales Symposium der Informationswissenschaft (Hildesheim). Boizenburg: Verlag Werner Hülsbusch (vwh), pp. 62-73.
Broeder, D., Kemps-Snijders, M., Van Uytvanck, D.,<br />
Windhouwer, M., Withers, P., Wittenburg, P., Zinn, C.<br />
(2010): A Data Category Registry- and<br />
Component-based Metadata Framework. In<br />
Proceedings of the 7th Conference on International<br />
Language Resources and Evaluation, 19-21 May 2010,<br />
European Language Resources Association.<br />
Deutsche Forschungsgemeinschaft (1998): Vorschläge zur<br />
Sicherung guter wissenschaftlicher Praxis:<br />
Empfehlungen der Kommission „Selbstkontrolle in der<br />
Wissenschaft“, Denkschrift. Weinheim: Wiley-VCH.<br />
See<br />
http://www.dfg.de/download/pdf/dfg_im_profil/reden_<br />
stellungnahmen/download/empfehlung_wiss_praxis_0<br />
198.pdf (retrieved March 31, 2011).
International Organization of Standardization (2009):<br />
Terminology and other language and content resources -<br />
Specification of data categories and management of a<br />
Data Category Registry for language resources<br />
(ISO-12620-2009), Geneva. Go to www.isocat.org to<br />
access the registry.<br />
Wissenschaftsrat (2011): Empfehlung zu Forschungsinfrastrukturen in den Geistes- und Sozialwissenschaften. Berlin: 28/01/2011. See http://www.wissenschaftsrat.de/download/archiv/10465-11.pdf (retrieved March 31, 2011).
Multilingual Resources and Multilingual Applications - Regular Papers<br />
Improving an Existing RBMT System by Stochastic Analysis<br />
Christian Federmann, Sabine Hunsicker<br />
DFKI – Language Technology Lab<br />
Stuhlsatzenhausweg 3, D-66123 Saarbrücken, GERMANY<br />
E-mail: {cfedermann,sabine.hunsicker}@dfki.de<br />
Abstract<br />
In this paper we describe how an existing, rule-based machine translation (RBMT) system that follows a transfer-based translation<br />
approach can be improved by integrating stochastic knowledge into its analysis phase. First, we investigate how often the rule-based<br />
system selects the wrong analysis tree to determine the potential benefit from an improved selection method. Afterwards we describe<br />
an extended architecture that allows integrating an external stochastic parser into the analysis phase of the RBMT system. We report<br />
on the results of both automatic metrics and human evaluation and also give some examples that show the improvements that can be<br />
obtained by such a hybrid machine translation setup. While the work reported on in this paper is a dedicated extension of a specific<br />
rule-based machine translation system, the overall approach can be used with any transfer-based RBMT system. The addition of<br />
stochastic knowledge to an existing rule-based machine translation system represents an example of a successful, hybrid<br />
combination of different MT paradigms into a joint system.<br />
Keywords: Machine Translation, Hybrid Machine Translation, Stochastic Parsing, System Combination<br />
1. Introduction<br />
Rule-based machine translation (RBMT) systems that employ a transfer-based translation approach depend highly on the quality of their analysis phase, as it provides the basis for the later processing phases, namely transfer and generation. Any parse failures encountered in the initial analysis phase will proliferate and cause further errors in the following phases. Very often, bad
translation results can be traced back to incorrect analysis<br />
trees that have been computed for the respective input<br />
sentences. Consequently, any improvements that can be<br />
achieved for the analysis phase of some RBMT system<br />
lead to improved translation output, which makes this an<br />
interesting topic in the context of hybrid machine<br />
translation.<br />
In this paper we describe how a stochastic parser can<br />
supplement the rule-based analysis phase of a<br />
commercial RBMT system. The system in question is the<br />
rule-based engine Lucy LT. This engine uses a<br />
sophisticated RBMT transfer approach with a long<br />
research history, as explained in detail in (Wolf et al.,<br />
2010). The output of its analysis phase is a forest<br />
containing a small number of tree structures. For this<br />
study we investigated if the existing rule base of the Lucy<br />
LT system chooses the best tree from the analysis forest<br />
and how the selection of this best tree out of the set of<br />
candidates can be improved by adding stochastic<br />
knowledge to the RBMT system.<br />
The paper is structured in the following way: in Section 2<br />
we describe the Lucy RBMT system and its<br />
transfer-based architecture. Afterwards, in Section 3, we<br />
provide details on the integration of a stochastic parser<br />
into the Lucy analysis phase of this rule-based system.<br />
Section 4 describes the experiments we performed and<br />
reports the results of both automated metrics and human<br />
evaluation efforts before Section 5 discusses some<br />
examples that show how the proposed approach has<br />
improved or degraded machine translation quality.<br />
Finally, in Section 6, we conclude and provide an outlook<br />
on future work in this area.<br />
2. Lucy System Architecture<br />
The Lucy LT engine is a renowned RBMT system that
follows a classical, transfer-based translation approach.<br />
The system first analyses the given source sentence<br />
resulting in a forest of several analysis trees. One of these<br />
trees is then selected (as “best” analysis) and transformed<br />
in the transfer phase into a tree structure from which the<br />
target text can be generated.<br />
It is clear that any errors that occur during the initial<br />
analysis phase proliferate and cause negative side effects<br />
on the quality of the resulting translation. As the analysis<br />
phase is of special importance, we describe it in more<br />
detail. The Lucy LT analysis consists of several phases:<br />
1) The input is tokenised with regard to the source language lexicon.
2) The resulting tokens then undergo a morphological<br />
analysis, which identifies possible combinations of<br />
allomorphs for a token.<br />
3) This leads to a chart which forms the basis for the<br />
actual parsing, using a head-driven strategy. Special<br />
treatment is performed for the analysis of multi-word<br />
expressions and also for verbal framing.<br />
At the end of the analysis, there is an extra phase named<br />
phrasal analysis that is called whenever the grammar was<br />
not able to construct a legal constituent from all the<br />
elements of the input. This happens in several different<br />
scenarios:<br />
• The input is ungrammatical according to the LT analysis grammar.
• The category of the derived constituent is not one of the allowed categories.
• A grammatical phenomenon in the source sentence is not covered.
• There are missing lexical entries for the input sentence.
During the phrasal analysis, the LT engine collects all<br />
partial trees and greedily constructs an overall<br />
interpretation of the chart. Based on our findings from<br />
experiments with the Lucy LT engine, phrasal analyses<br />
are performed for more than 40% of the sentences from<br />
our test sets and very often result in bad translations.<br />
Each resulting analysis tree, independent of whether it is<br />
a grammatical or phrasal analysis, is also assigned an<br />
integer score by the grammar. The tree with the highest<br />
score is then handed over to the transfer phase, thus<br />
pre-defining the final translation output.<br />
3. Adding Stochastic Analysis<br />
An initial, manual evaluation of the translation quality<br />
based on the tree selection of the analysis phase showed<br />
that there is potential for improvement. For this, we<br />
changed the RBMT system to produce translations for all<br />
its analysis trees and ranked them according to their<br />
quality. In many cases, one of the alternative trees would have led to a better translation.
In addition to the assigned score, we examined the significance of two other features:
1) The size of the analysis trees themselves, and<br />
2) The tree edit distance of each analysis candidate to a<br />
stochastic parse tree.<br />
An advantage of stochastic parsing lies in the fact that parsers of this class can deal very well even with ungrammatical or unknown input, which, as we have seen, is problematic for a rule-based parser. We decided to make use of the Stanford Parser as described in
(Klein & Manning, 2003), which uses an unlexicalised,<br />
probabilistic context-free grammar trained on the Penn<br />
Treebank. We parse the original source sentence with this<br />
PCFG grammar to get a stochastic parse tree that can be<br />
compared to the trees from the Lucy analysis forest.<br />
In our experiments, we compare the stochastic parse tree<br />
with the alternatives given by Lucy LT. Tree comparison<br />
is implemented based on the Tree Edit Distance, as<br />
originally defined in (Zhang & Shasha, 1989). In
analogy to the Word Edit or Levenshtein Distance, the<br />
distance between two trees is the number of editing<br />
actions that are required to transform the first tree into the<br />
second tree. The Tree Edit Distance defines three actions:
• Insertion
• Deletion
• Renaming (substitution in the Levenshtein Distance)
We use a normalised version of the Tree Edit Distance to<br />
estimate the quality of the trees from the Lucy analysis<br />
forest. The integration of the stochastic selection has<br />
been possible by using an adapted version of the<br />
rule-based system, which allowed performing the<br />
selection of the analysis tree from an external process.<br />
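The distance itself can be sketched with the classic recursive forest decomposition over ordered labelled trees. This is an illustrative formulation with memoisation, fine for the small parse trees shown here; an efficient implementation would use the Zhang–Shasha dynamic programme, and the normalisation shown is one plausible choice, since the paper does not spell out its formula.

```python
from functools import lru_cache

# A tree is (label, (child, child, ...)); a forest is a tuple of trees.

def forest_size(forest):
    return sum(1 + forest_size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_dist(f, g):
    if not f and not g:
        return 0
    if not f:
        return forest_size(g)  # insert everything in g
    if not g:
        return forest_size(f)  # delete everything in f
    (fl, fc), frest = f[-1], f[:-1]
    (gl, gc), grest = g[-1], g[:-1]
    return min(
        forest_dist(frest + fc, g) + 1,  # delete the root of f's last tree
        forest_dist(f, grest + gc) + 1,  # insert the root of g's last tree
        forest_dist(frest, grest)        # match the two roots,
        + forest_dist(fc, gc)
        + (0 if fl == gl else 1),        # renaming if the labels differ
    )

def tree_edit_distance(t1, t2):
    return forest_dist((t1,), (t2,))

def normalised_ted(t1, t2):
    # One plausible normalisation: divide by the combined tree size.
    return tree_edit_distance(t1, t2) / (forest_size((t1,)) + forest_size((t2,)))

np_, vp, pp = ("NP", ()), ("VP", ()), ("PP", ())
assert tree_edit_distance(("S", (np_, vp)), ("S", (np_, vp, pp))) == 1  # one insertion
```

Applied to the Lucy setting, each candidate tree from the analysis forest would be converted into this nested-tuple form and compared against the Stanford parse of the same sentence.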
4. Experiments<br />
Two test sets were used in our experiments. The first test<br />
set was taken from the WMT shared task 2008, consisting<br />
of a section of data from Europarl (Koehn, 2005). The<br />
second test set, which was taken from the WMT shared<br />
task 2010, contained news text. Phrasal analyses caused by unknown lexical items occurred more often in the news text, as this text type tends to use colloquial expressions more often. In our experiments, we translated
from English→German; evaluation was performed using<br />
both automated metrics and human evaluation using an<br />
annotation tool similar to e.g. Appraise (Federmann,<br />
2010).
First, only the Tree Edit Distance and the internal score from the Lucy analysis phase were used, and we selected the tree with the lowest edit distance. If the lowest distance held for two or more trees, the tree with the highest LT-internal score was chosen. Later we added the size of the candidate
trees as an additional feature, with a bias to prefer larger<br />
trees as they proved to create better translations in our<br />
experiments. Results from automatic scoring using<br />
BLEU (Papineni et al., 2001) and the derived NIST score<br />
are reported in Table 1 and Table 2 for test set #1 and test<br />
set #2, respectively. The BLEU scores for the new<br />
translations are a little bit worse, but still comparable to<br />
the quality of the original translations. The difference is<br />
not statistically significant.<br />
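The staged criteria described above amount to a lexicographic ranking over the candidate trees; a minimal sketch (the field names are ours, not the system's):

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    tree_id: str
    edit_distance: float   # normalised Tree Edit Distance (lower is better)
    internal_score: float  # LT engine's internal analysis score (higher is better)
    size: int              # number of nodes in the tree (larger is preferred)

def select_tree(candidates):
    # Rank by minimal edit distance; break ties by the highest internal
    # score, then by preferring the larger tree.
    return min(candidates,
               key=lambda c: (c.edit_distance, -c.internal_score, -c.size))
```

Because Python tuples compare lexicographically, the single `min` call reproduces the three-stage selection: distance first, then score, then size.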
Test set #1            BLEU    NIST
Baseline               0.1100  4.4059
Stochastic Selection   0.1096  4.3946
Table 1: Automatic scores for test set #1.
Test set #2            BLEU    NIST
Baseline               0.1529  5.5725
Stochastic Selection   0.1514  5.5469
Selection+Size         0.1511  5.5341
Table 2: Automatic scores for test set #2.
We also manually evaluated a sample of 100 sentences.
For this, we created all possible translations for each
phrasal analysis and had human annotators judge their
quality. We then checked whether our stochastic selection
mechanism returned the tree that led to the best translation;
if it did not, we investigated the reasons. Sentences for
which all trees produced the same translation were skipped.
Table 3 shows the error rate of our stochastic analysis
component, which chose the optimal tree for 56% of the
sentences, while Table 4 lists the reasons that resulted in
the selection of a non-optimal tree. We also see that the
minimal tree edit distance seems to be a good feature for
comparison, as it holds for 71% of the trees, including
those examples where the best tree was not scored highest
by the LT engine. This also means that additional features
are required for choosing a tree out of the group of trees
with the minimal edit distance.
Best translation?    Yes (56%)    No (44%)
Minimal distance?    Yes (71%)    No (29%)
Table 3: Error rate of the stochastic analysis.
More than 50 tokens in source 36.4%<br />
Time-out before best tree is reached 29.5%<br />
Chosen tree had minimal distance 34.1%<br />
Table 4: Reasons for erroneous tree selection.<br />
Even for the 29% of sentences in which the optimal tree
was not chosen, little quality was lost: in 75.86% of those
cases, the translations did not change at all (the trees
obviously resulted in identical translation output). The
remaining cases were divided evenly between slight
degradations and equal quality.
In the cases where the best tree was not chosen, the first
tree (the default tree) was selected 70.45% of the time. This
is due to a combination of robustness factors that are
implemented in the RBMT system and were beyond our
control in the experiments. The LT engine has several
indicators that may each throw a time-out exception if, for
example, the analysis phase takes too long to produce a
result. To avoid time-out errors, only sentences with up to
50 tokens are treated by our stochastic selection
mechanism. Additionally, the component itself checks the
processing time and returns intermediate results if the limit
is reached. We are currently working on eliminating this
time-out issue, as it prevents us from driving our approach
to its full potential.
As with the internal score, we see that the Tree Edit
Distance on its own is a good indicator of the quality of
the analysis, but additional features are required to prevent
suboptimal decisions from being taken. To this end, we
included the size of the trees: bigger trees are preferred
over smaller ones, as our experimental results confirmed
that they are more likely to produce better translations.
The manual evaluation thus shows results similar to those
of the automatic metrics. We are currently investigating in
more detail what happened in the cases of degradation in
order to remedy this behaviour. It seems that additional
features may be needed to improve the rule-based machine
translation engine more broadly with our stochastic
selection mechanism.
5. Examples<br />
We now provide some examples from our experiments
that illustrate how the stochastic selection mechanism
changed the translation output of the rule-based system.
For example, the analysis of the following sentence is
now correct:
Source: “They were also protesting against bad pay<br />
conditions and alleged persecution.”<br />
Translation A: “Sie protestierten auch gegen schlechte<br />
Soldbedingungen und behaupteten Verfolgung.”<br />
Translation B: “Sie protestierten auch gegen schlechte<br />
Soldbedingungen und angebliche Verfolgung.”<br />
Translation A is the default translation. The analysis tree<br />
associated with this translation contains a node for the<br />
adjective “alleged” which is wrongly parsed as a verb.<br />
The next example shows how an incorrect word order<br />
problem is fixed:<br />
Source: “If the finance minister can't find the money<br />
elsewhere, the project will have to be aborted and<br />
sanctions will be imposed, warns Janota.”<br />
Translation A: “Wenn der Finanzminister das Geld nicht<br />
anderswo finden kann, das Projekt abgebrochen<br />
werden müssen wird und Sanktionen auferlegt<br />
werden werden, warnt Janota.”<br />
Translation B: “Wenn der Finanzminister das Geld nicht<br />
anderswo finden kann, wird das Projekt abgebrochen<br />
werden müssen und Sanktionen werden auferlegt<br />
werden, warnt Janota.”<br />
Lexical items are associated with a domain area in the<br />
lexicon of the rule-based system. Items that are contained<br />
within a different domain area than the input text are still<br />
accessible, but items in the same domain are preferred. In<br />
the following example, this leads to an incorrect<br />
disambiguation of multi-word expressions:<br />
Source: “Apparently the engine blew up in the rocket's<br />
third phase.”<br />
Translation A: “Offenbar blies der Motor hinauf die<br />
dritte Phase der Rakete in.”<br />
Translation B: “Offenbar flog der Motor in der dritten<br />
Phase der Rakete in die Luft.”<br />
Again, the stochastic selection allows choosing a better<br />
tree, which leads to the correct idiomatic translation.<br />
Something similar happens in the following case:<br />
Source: “As of January, they should be paid for by the<br />
insurance companies and not compulsory.”<br />
Translation A: “Ab Januar sollten sie für von den
Versicherungsgesellschaften und nicht obligatorisch
bezahlt werden.”
Translation B: “Ab Januar sollten sie von den<br />
Versicherungsgesellschaften und nicht obligatorisch<br />
gezahlt werden.”<br />
These changes remain at a rather local scope, but we have
also observed instances where the sentence improves
globally:
Source: “In his new book, ‘After the Ice’, Alun Anderson,<br />
a former editor of New Scientist, offers a clear and<br />
chilling account of the science of the Arctic and a<br />
gripping glimpse of how the future may turn out<br />
there.”<br />
Translation A: “In seinem neuen Buch bietet Alun<br />
Anderson, ein früherer Redakteur von Neuem<br />
Wissenschaftler, ‘Nach dem Eis’ einen klaren und<br />
kalten Bericht über die Wissenschaft der Arktis und<br />
einen spannenden Blick davon an, wie die Zukunft<br />
sich hinaus dort drehen kann.”<br />
Translation B: “In seinem neuen Buch, ‘Nach dem Eis’,<br />
bietet Alun Anderson, ein früherer Redakteur von<br />
Neuem Wissenschaftler, einen klaren und kalten<br />
Bericht über die Wissenschaft der Arktis und einen<br />
spannenden Blick davon an, wie die Zukunft sich<br />
hinaus dort drehen kann.”<br />
In translation A, the name of the book, “After the Ice”,<br />
has been moved to an entirely different place in the<br />
sentence, removing it from its original context.<br />
6. Conclusion and Outlook<br />
The analysis phase proves to be crucial for the quality of<br />
the translation in rule-based machine translation systems.<br />
In this paper, we have shown that it is possible to improve<br />
the analysis results of such a rule-based engine by<br />
introducing a better selection method for the trees created<br />
by the grammar. Our experiments show that the selection<br />
itself is not a trivial task and requires fine-grained<br />
selection criteria.<br />
While the work reported in this paper is a dedicated
extension of a specific rule-based machine translation
system, the overall approach can be applied to any
transfer-based RBMT system. Future work will
concentrate on circumventing, e.g., the time-out errors
that prevented better performance of the stochastic
selection mechanism. We will also investigate more
closely the issue of decreased translation
quality and experiment with additional decision factors<br />
that may help to alleviate the negative effects.<br />
The addition of stochastic knowledge to an existing<br />
rule-based machine translation system represents an<br />
example of a successful, hybrid combination of different<br />
MT paradigms into a joint system.<br />
7. Acknowledgements<br />
This work was also supported by the EuroMatrixPlus<br />
project (IST-231720) that is funded by the European<br />
Community under the Seventh Framework Programme<br />
for Research and Technological Development.<br />
8. References<br />
Federmann, C. (2010). Appraise: An open-source toolkit<br />
for manual phrase-based evaluation of translations. In
Proceedings of the Seventh International Conference on
Language Resources and Evaluation (LREC 2010).
European Language Resources Association (ELRA).
Klein, D., Manning, C. D. (2003). Accurate unlexicalized<br />
parsing. In Proceedings of the 41st Annual Meeting of<br />
the ACL, pp. 423–430.<br />
Koehn, P. (2005). Europarl: A parallel corpus for<br />
statistical machine translation. In Proceedings of the<br />
MT Summit 2005.<br />
Papineni, K., Roukos, S., Ward, T., Zhu, W.-J. (2001).<br />
Bleu: a method for automatic evaluation of machine<br />
translation. IBM Research Report RC22176<br />
(W0109-022), IBM.<br />
Wolf, P., Alonso, J., Bernardi, U., Llorens, A. (2010).<br />
EuroMatrixPlus WP2.2: Study of Example-Based
Modules for LT Transfer.
Zhang, K., Shasha, D. (1989). Simple fast algorithms for<br />
the editing distance between trees and related problems.<br />
SIAM J. Comput., 18, pp. 1245–1262.<br />
Terminology extraction and term variation patterns:<br />
a study of French and German data<br />
Marion Weller a , Helena Blancafort b , Anita Gojun a , Ulrich Heid a<br />
a Institut für maschinelle Sprachverarbeitung, Universität Stuttgart
b Syllabs, Paris
E-mail: {wellermn|gojunaa|heid}@ims.uni-stuttgart.de, blancafort@syllabs.com<br />
Abstract<br />
The terminology of many technical domains, especially new and evolving ones, is not fully fixed and shows considerable
variation. The purpose of the work described in this paper is to capture term variation. For term extraction, we apply hand-crafted
POS patterns to tagged corpora, and we use rules to relate morphological and syntactic variants. We discuss some French and
German variation patterns, and we present first experimental results from our tools. It is not always easy to distinguish (near)
synonyms from variants whose meaning differs slightly from that of the original term; we discuss ways of making this
distinction. Our tools are based on POS tagging and an approximation of derivation and compounding; however, we also propose a
non-symbolic, statistics-based line of development. We discuss general issues of evaluating variant detection and present a
small-scale precision evaluation.
Keywords: terminology, term variation, comparable corpora, pattern-based term extraction, compound nouns<br />
1. Introduction<br />
The objective of the EU-funded project TTC¹
(Terminology Extraction, Translation Tools and<br />
Comparable Corpora) is the extraction of terminology<br />
from comparable corpora. The tools under development<br />
within the project address the issues of compiling corpus<br />
collections, monolingual term extraction and the<br />
alignment of terms into pairs of multilingual<br />
equivalence candidates, as well as the management and<br />
the export of the resulting terminological data towards<br />
CAT and MT tools.<br />
Since parallel corpora of specialized domains are scarce<br />
and not necessarily available for a broad range of<br />
languages (TTC deals with English (EN), Spanish (ES),<br />
German (DE), French (FR), Latvian (LV), Russian<br />
(RU), Chinese (ZH)), comparable corpora are used<br />
instead: textual material from specialized domains is<br />
accessible for many languages, either on the Internet or<br />
in publications of companies.<br />
¹ http://www.ttc-project.eu.
The research leading to these results has received funding from
the European Community's Seventh Framework Programme
(FP7/2007-2013) under Grant Agreement n. 248005.
In technical domains which are rapidly evolving,<br />
documents published on the Internet are often the most<br />
recent sources of data. In such domains, terminology<br />
typically has not yet been standardized, and thus<br />
numerous variants co-exist in published documents.<br />
Tools which support the extraction, identification and<br />
interrelating of term variants are thus necessary to<br />
capture the full range of expressions used in the<br />
respective domain. End users may then decide (e.g. on<br />
the basis of variant frequency and sources of variants)<br />
which expression to prefer.<br />
A second, more technical motivation for term variant<br />
extraction is provided by the procedures for term<br />
alignment (either lexical or statistical strategies), for<br />
which data sparseness is a problem. In order to reduce<br />
the complexity of term alignment, TTC intends to gather<br />
monolingual variants into sets of related terms.<br />
Particularly for this application, we do not only allow<br />
for (quasi) synonyms, but also for variants with a slight<br />
difference in meaning as shown in 1.<br />
1) production d'électricité ↔ électricité produite<br />
(production of electricity ↔ produced electricity)<br />
Terms may be of different forms (single-word vs.
multi-word terms) in different languages: this is a challenge
for term alignment. For example, compound nouns play<br />
an important role in German terminology, but have no<br />
equivalents of the same morpho-syntactic structure in<br />
many other languages. Grouping equivalent terms of<br />
different syntactic structures can help to deal with such<br />
cases, as illustrated in 2:<br />
2) Energieproduktion ↔ Produktion von Energie ↔<br />
production d'électricité<br />
(energy production ↔ production of energy)<br />
2. Methodology<br />
The steps required for term extraction and for variant<br />
identification follow a simple pipeline architecture: first,<br />
a corpus collection is compiled, which then undergoes<br />
linguistic pre-processing. Following these steps,<br />
monolingual term candidates are extracted. As not all<br />
extracted items are domain relevant, we apply statistical<br />
filtering. Since we intend to detect term variation on a<br />
morpho-syntactic level, this last step requires<br />
morphological processing in order to model<br />
derivational relationships between word classes.<br />
2.1. Compiling a corpus and pre-processing<br />
To collect corpus data, we use the focused Web crawler
Babouk (de Groc, 2011), which has been developed
within the TTC project. Babouk starts with a set of seed<br />
terms or URLs given by the user which are combined<br />
into queries and submitted to a search engine. Babouk<br />
scores the relevance of the retrieved web pages using a<br />
weighted-lexicon-based thematic filter. Based on the<br />
content of relevant retrieved pages, the lexicon is<br />
extended and new search queries are combined.<br />
One objective of the TTC project is to rely on flat<br />
linguistic analysis that is available for all languages.<br />
One strand of research thus goes towards the<br />
development of knowledge-poor strategies, such as<br />
using a pseudo part-of-speech tagger (Clark, 2003) as a<br />
basis for probabilistic NP extraction (Guégan & Loupy,
2011). A knowledge-rich approach is term extraction
based on hand-crafted part-of-speech (POS) patterns,<br />
which is the method we chose for the present work.<br />
Pre-processing of our data collection consists of<br />
tokenizing, POS-tagging and lemmatization using<br />
TreeTagger (Schmid, 1994). For efficiency reasons, and
since German and French are morphologically rich
languages, we work with lemmas rather than inflected
forms.
2.2. Term candidate extraction and filtering<br />
Our main focus is on the extraction of nominal phrases
such as [NN NN] or [NN PRP NN] constructions (cf.
Tables 2-5), but [V NN] collocations are also of interest².
For each language, we identify term candidates by using<br />
hand-crafted POS patterns. In contrast to nominal<br />
phrases, which are relatively easy to capture by POS<br />
patterns, the identification of [V NN] collocations is<br />
more challenging, as verbs and their object nouns do not<br />
necessarily occur in adjacent positions, depending on the<br />
general structure of the sentence. This applies<br />
particularly to German where constituent order is rather<br />
flexible and allows for long distances between verbs and<br />
their objects.<br />
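Matching hand-crafted POS patterns over tagged text can be sketched as below; the tagset and the pattern inventory are simplified for illustration and are not the project's actual rule set:

```python
# Each token is a (lemma, pos) pair; patterns are sequences of POS tags.
PATTERNS = [
    ("NN", "NN"),          # e.g. German noun-noun sequences
    ("NN", "PRP", "NN"),   # e.g. French [NN de NN]
    ("ADJ", "NN"),
]

def extract_candidates(tagged_sentence, patterns=PATTERNS):
    """Return all lemma sequences whose POS tags match one of the patterns."""
    candidates = []
    for pattern in patterns:
        n = len(pattern)
        for i in range(len(tagged_sentence) - n + 1):
            window = tagged_sentence[i:i + n]
            if all(pos == p for (_, pos), p in zip(window, pattern)):
                candidates.append(" ".join(lemma for lemma, _ in window))
    return candidates
```

This sliding-window matcher captures adjacent constructions only; as noted above, non-adjacent [V NN] collocations would need a different mechanism.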
In order to reduce the extracted term candidates to a set
of domain-relevant items, we estimate their domain
specificity by comparing them with terms extracted
from general-language corpora (Ahmad et al., 1992). The
underlying assumption is that terms which occur in both
domain-specific and general-language corpora are not
domain-relevant, whereas terms occurring only or
predominantly in the domain-specific data can be
considered specialized terms. We use the quotient q of a
term's relative frequency in the specialized data and in
the general-language corpus as an indicator of its domain
relevance (see Table 1).
term candidate                 f_domain   f_general   q
Gleichstrom (direct current)   128        4           22,362.7
Jahr (year)                    2,157      221,213     1.2
Table 1: Domain-specific vs. general language
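The quotient q can be computed as below; the corpus sizes used in the usage example and the smoothing of zero counts in the general corpus are our assumptions, not values from the paper:

```python
def domain_relevance(f_domain, domain_size, f_general, general_size, alpha=1.0):
    """Quotient q of a term's relative frequency in the domain corpus and
    in the general-language corpus (filtering in the style of Ahmad et al.,
    1992). `alpha` smooths terms unseen in the general corpus (assumption)."""
    rel_domain = f_domain / domain_size
    rel_general = (f_general + alpha) / general_size
    return rel_domain / rel_general

# Hypothetical corpus sizes, purely for illustration:
DOMAIN_TOKENS, GENERAL_TOKENS = 1_290_000, 1_000_000_000
q_gleichstrom = domain_relevance(128, DOMAIN_TOKENS, 4, GENERAL_TOKENS)
q_jahr = domain_relevance(2157, DOMAIN_TOKENS, 221_213, GENERAL_TOKENS)
```

Under these assumed sizes, Gleichstrom receives a q several orders of magnitude above that of Jahr, reproducing the ranking in Table 1 (though not its exact values, which depend on the real corpus sizes).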
2.3. Term variation<br />
In TTC, we define a term variant as “an utterance which
is semantically and conceptually related to an original
term” (Daille, 2005). Thus, term variants are bound to
texts (“utterance”) and require the presence of an
“original term” identified, e.g., by means of a
morpho-syntactic term pattern.
² NN: noun, PRP: preposition, V: verb, VPART: participle
Multilingual Resources and Multilingual Applications - Regular Papers<br />
The relationship between term variant and original term<br />
is supposed to mainly be one of (quasi-) synonymy or of<br />
controlled modification (e.g. by attributive adjectives,<br />
NPs or PPs). We formalize this by explicitly classifying<br />
relationships between patterns.<br />
We distinguish the following types of variants:
• graphical: air flow ↔ airflow
• morphological (derivation, compounding):
  Energieproduktion ↔ Produktion von Energie
  (production of energy)
  solare Energie ↔ Solarenergie (solar energy)
• paradigmatic, e.g. omissions:
  les énergies renouvelables ↔ les renouvelables
  (the renewable energies ↔ the renewables)
• abbreviations, acronyms:
  Windenergieanlage ↔ WEA (wind energy plant)
• syntactic variants³: consommation d'énergie ↔
  consommation annuelle d'énergie
  (energy consumption ↔ yearly energy consumption)
Assuming that German technical texts contain many<br />
domain-specific compounds, we focus in this work on<br />
compound nouns and their variant [NN PRP NN] as<br />
illustrated above (morphological variants).<br />
For French, we choose a similar pattern [NN de NN] ↔<br />
[NN VPART]. In our current work, we restrict this<br />
pattern to nouns ending in -tion. The addition of French<br />
morphology tools is planned to widen the scope of these<br />
patterns.<br />
2.4. Morphological processing<br />
In order to identify morphological variants of German<br />
compounds, we need to split compounds into their<br />
components: in the present work, we opt for a statistical<br />
compound splitter; the implementation is based on
Koehn & Knight (2003).
Searching for the most probable split of a given word,<br />
the basic idea is that the components of a compound also<br />
appear as single words and consequently should occur in<br />
corpus data. A word frequency list serves as training<br />
data, supplemented with a hand-crafted set of rules to<br />
model transitional elements, such as the s in<br />
Produktions|kosten (production costs).<br />
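A minimal frequency-based splitter in the spirit of Koehn & Knight (2003) can be sketched as follows: among all segmentations into corpus-attested parts, optionally stripping a transitional element such as s, choose the one whose parts have the highest geometric mean of corpus frequencies. The filler inventory and the length threshold below are simplifications:

```python
import math

def best_split(word, counts, fillers=("s", "es"), min_len=3):
    """Return (parts, score): the segmentation of `word` whose parts have
    the highest geometric mean of corpus frequencies; the unsplit word
    competes with score counts[word]."""
    word = word.lower()
    best = ([word], float(counts.get(word, 0)))

    def rec(rest, parts):
        nonlocal best
        if not rest:
            freqs = [counts.get(p, 0) for p in parts]
            if 0 in freqs:
                return  # every part must be attested in the corpus
            score = math.prod(freqs) ** (1.0 / len(parts))
            if len(parts) > 1 and score > best[1]:
                best = (parts, score)
            return
        for i in range(min_len, len(rest) + 1):
            raw, tail = rest[:i], rest[i:]
            if tail and len(tail) < min_len:
                continue  # avoid tiny right-hand fragments
            variants = {raw}
            if tail:  # a transitional element may end a non-final part
                for f in fillers:
                    if raw.endswith(f) and len(raw) - len(f) >= min_len:
                        variants.add(raw[: -len(f)])
            for part in variants:
                rec(tail, parts + [part])

    rec(word, [])
    return best

# With hypothetical frequencies, Produktionskosten splits as
# produktion|kosten, modelling the transitional "s":
counts = {"produktion": 102, "kosten": 176}
parts, _ = best_split("Produktionskosten", counts)
```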
³ This last type of variant is not necessarily synonymous
with the original term.
For French, we created a set of rules to model the<br />
relationship between nouns ending in -tion and the<br />
respective verbs:<br />
• production → produire (production → produce)
• évolution → évoluer (evolution → evolve)
• condition → conditionner (condition → condition)
• protection → protéger (protection → protect)
Similar rules can be formulated for nouns ending in
-ment or -eur, e.g. chargement (nominalized action) →
charger (verb), or convertisseur (nominalized tool name)
→ convertir (verb). Similarly, terms containing adjectives
ending in -able, such as utilisable → utiliser (cf. Table 5),
or relational adjectives (prototypique → prototype) are
under study. A further type of pattern that could be added
is rules to handle prefixation (e.g. anti-corrosion →
corrosion).
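Such noun-verb correspondences can be encoded as ordered suffix-rewrite rules. The inventory below is purely illustrative: it covers only the four examples above, and real coverage would require a morphological lexicon:

```python
# Ordered list of (noun suffix, verb suffix) rewrites; first match wins.
# Illustrative only -- not the project's actual rule set.
SUFFIX_RULES = [
    ("duction", "duire"),     # production -> produire
    ("lution", "luer"),       # évolution -> évoluer
    ("dition", "ditionner"),  # condition -> conditionner
    ("tection", "téger"),     # protection -> protéger
]

def tion_noun_to_verb(noun):
    """Map a French noun in -tion to a hypothesised base verb, or None."""
    for suffix, replacement in SUFFIX_RULES:
        if noun.endswith(suffix):
            return noun[: -len(suffix)] + replacement
    return None
```

Ordering matters: more specific suffixes must precede more general ones, since the first matching rule is applied.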
2.5. Processing formally related items<br />
A very common form of graphic variation is
hyphenation, e.g. Luftwärmepumpe vs. Luft-Wärmepumpe
(air-source heat pump). This type of variation is
dealt with by the splitting program, which uses
hyphens as splitting points. Hyphenated and
non-hyphenated forms are treated as one term.
To a certain extent, our variant detection tools also deal<br />
with alternating transitional elements (Kraftwerkbetrieb<br />
vs. Kraftwerksbetrieb). This is modeled by hand-crafted<br />
rules which allow for several realizations. Additionally,<br />
there are relatively regular forms of spelling variation,<br />
e.g. the new/old orthography in German, resulting in<br />
e.g. ph/f variation. This can be dealt with either by rules<br />
or using a method based on string-distance.<br />
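For cases like these, hyphen removal plus a string-distance threshold is often sufficient; a sketch using difflib's similarity ratio (the threshold value is our assumption):

```python
import difflib

def normalise(term):
    # treat hyphenated and non-hyphenated spellings as one form
    return term.lower().replace("-", "")

def is_spelling_variant(a, b, threshold=0.85):
    """True if two terms are likely graphic/spelling variants of each other."""
    ratio = difflib.SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return ratio >= threshold
```

This catches both hyphenation variants (which become identical after normalisation) and regular spelling alternations such as the German ph/f variation.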
3. Experiments and examples of results<br />
Our experiments are based on comparable corpora<br />
crawled from the Web. While they are generally easy to<br />
obtain with a focused crawler, such corpora might be<br />
inhomogeneous with respect to domain coverage or<br />
types of sources. When working with several languages,<br />
the degree of comparability may also vary.<br />
We use a collection of 1000 documents each for French<br />
and German, with a total size of 1.55 M tokens (FR) and<br />
1.29 M tokens (DE) of the domain of wind energy.<br />
When looking at the extracted German data, we find that<br />
the realization of a term as a compound is often more
frequent than the alternative structures [NN PRP NN]
or [NN ARTgen NNgen], as illustrated in Table 2. This
does not only apply to common words like Stromerzeugung
(power generation), but also to comparatively long and
more complex words like Fernwärmeversorgung
(lit. long-distance heat supply: district heating).
We consider this as evidence that the respective
compound nouns are established as terms in the domain
or even in general language. The degree of preference
varies, up to the point where no alternative realization
exists at all, as is the case with Windgeschwindigkeit
(wind speed, freq=149), for which one could imagine a
construction like *Geschwindigkeit des Windes (speed
of the wind), which does not occur in our corpus.

Abgabe von Wärme (1)            Wärmeabgabe (18)              release of warmth
Beleuchtung von Straße (1)      Straßenbeleuchtung (49)       street lighting
Erzeugung von Strom (32)        Stromerzeugung (569)          power generation
Produktion von Strom (4)        Stromproduktion (72)          power production
Speicherung von Energie (7)     Energiespeicherung (37)       energy storage
Verbrauch an Primärenergie (1)  Primärenergieverbrauch (114)  primary energy consumption
Versorgung mit Fernwärme (2)    Fernwärmeversorgung (13)      district heating
Nutzung von Biomasse (8)        Biomassenutzung (7)           biomass utilization
Table 2: Prepositional phrases vs. compound nouns

consommation d'électricité (electricity consumption, 28)  électricité consommée (consumed electricity, 15)
consommation d'énergie (energy consumption, 66)           énergie consommée (consumed energy, 22)
importation de pétrole (import of petroleum, 9)           pétrole importé (imported petroleum, 1)
production d'électricité (electricity production, 225)    électricité produite (produced electricity, 95)
production de chaleur (heat production, 26)               chaleur produite (produced heat, 21)
installation d'éolienne (wind turbine installation, 5)    éolienne installée (installed wind turbine, 16)
installation de puissance (installation of power, 1)      puissance installée (installed power, 69)
utilisation d'énergie (use of energy, 5)                  énergie utilisée (used energy, 19)
Table 3: Related French terms: prepositional phrases vs. noun-participle constructions.

Nutzenergie (useful energy)                       89
nutzbar Energie (usable energy)                   24
genutzt Energie (used energy)                      5
nutzbar Energieform (usable energy form)           9
genutzt Energieform (used energy form)             4
nutzbar Energiegehalt (usable energy content)      3
Nutzenergie-Anteil (proportion of useful energy)   1
nutzbar Energiemenge (usable amount of energy)     1
Table 4: Variants of the compound Nutzenergie.

énergie utilisée (used energy)                        19
énergie utile (useful energy)                         14
énergie utilisable (usable energy)                    14
forme d'énergie utile (useful energy form)             2
forme d'énergie utilisable (form of usable energy)     2
source d'énergie utilisable (source of usable energy)  1
Table 5: Different combinations of the components énergie and utile.

In contrast to the German structures, the French terms
of the pattern pair⁴ [NN de NN] ↔ [NN VPART] in
Table 3 are not (near) synonyms, but could rather be
considered as related. While some terms seem to prefer
one of the two patterns, the overall tendency is less clear
than for the German examples. The difference in meaning
(i.e. action vs. situation) does not allow for full
interchangeability of related terms, and the use of the
different forms of realization is context-dependent. Some
terms from the pairs contained in Table 3 have different
meanings, as is the case with puissance installée vs.
installations de puissance élevée in example (3).
3) Par contre, le coût et la complexité des installations les réservent
le plus souvent à des installations de puissance élevée pour
bénéficier d'économies d'échelle.
(However, due to the cost and complexity of the installations, they
are mostly restricted to installations of high power in order to
benefit from economies of scale.)

⁴ Note that the extracted lemma of the participle is its infinitive;
we show the inflected form for better readability, i.e.
consommée instead of consommer.
In other cases, grammatical and/or stylistic constraints
may lead authors to use one variant rather than another.
For example, compounds in enumerations tend to be
split in order to facilitate the combination with other
nouns, e.g. Meeresboden vs. Boden von Meeren in
example (4).
4) Methanhydrat bildet sich am Boden von Meeren bzw. tiefen Seen.
(Methane hydrate develops at the bottom of seas or deep lakes.)
In Table 4, we show examples of variants in a wider
sense: starting with the compound Nutzenergie (useful
energy), we find the synonym nutzbare Energie (usable
energy) and the related form genutzte Energie (used
energy). In the entries in the lower part of the table (grey
background in the original layout), the component Energie
is part of a compound noun while still preserving the
(basic) meaning of the term Nutzenergie (useful energy).
The French examples in Table 5 correspond to the
German ones (Table 4), with related terms consisting of
the basic components in the upper part of the table, and
terms expanded by an additional component in the lower
part. The forms nutzbar and utilisable (usable) in
Tables 4 and 5 illustrate one of the above-mentioned
variation patterns for adjectives.
4. Evaluation and discussion<br />
4.1. Issues in measuring precision and recall<br />
While it is relatively easy to measure the precision of
identified (near) synonyms (such as the compound ↔
[NN PRP NN] pairs), it is much harder to determine the
precision of related terms like the ones in Tables 4 and 5,
as deciding on the degree of relatedness is often difficult.
Even more difficult is the evaluation of recall, which<br />
largely depends on the set of term variation patterns, but<br />
also on the patterns used for term candidate extraction.<br />
In order to avoid noise, term candidate extraction is
restricted to productive patterns; this implies that not all
term variants are extracted and, consequently, that
some may not be available for variant grouping. The
same applies to the set of rules used to group variants.<br />
For example, the French pattern [NN PRP NN] is<br />
restricted to the prepositions de and à. While there might<br />
be valid terms containing other prepositions, they are<br />
excluded from being extracted. Similarly, the large<br />
number of potential paraphrases of German compounds<br />
cannot be captured.<br />
The examples in Tables 4 and 5 illustrate the wide range
of possible types of variation, and thus the difficulty of
capturing and relating them. In
addition to the problem of pattern coverage, another<br />
factor is the quality of the morphological tools used to<br />
model the relationship between word classes.<br />
4.2. Evaluation of precision<br />
In a small experiment, we measured the precision of the
variants proposed for the 100 most frequent German
compound nouns: 74 of the variants are valid. Most of
the 26 invalid variants are due to bad PP-attachment, as
illustrated by the following example:
5) Stromkunde (energy customer) → *Kunde mit<br />
Strom (customer with energy)<br />
which is part of the verbal phrase Kunden mit Strom<br />
versorgen (supply costumers with energy). This kind of<br />
error can rather be considered a problem of the<br />
extraction step than of the variant detection.<br />
However, in the examined set of 100 items, there was one term-variant pair whose derivation is technically correct but whose meanings are not related:
6) Grundwasser (ground water) → Wasser am Grund eines Sees (water at the bottom of a lake)
4.3. Symbolic vs. non-symbolic approach
By relying on a fixed set of rules for extraction, we clearly favour precision at the cost of recall.
To extract terms without a set of patterns, we present a knowledge-poor approach to term extraction using a probabilistic NP extractor and string-level term variation detection. First, we apply a probabilistic NP extractor trained on a small corpus manually annotated with NPs (300 to 600 sentences): this tool, described in Guégan & de Loupy (2011) for the extraction of NP chunks, uses a pseudo part-of-speech tagger (Clark, 2003).
A further non-symbolic procedure consists in relating extracted terms without relying on a predefined set of variation patterns. We experimented with comparing NPs at string level (using the Levenshtein distance ratio) and grouping terms by similarity. The resulting term groups also provide a basis for the automatic derivation of term variation patterns, which can be used as input to the symbolic method.
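Such string-level grouping can be sketched in a few lines. The following is only a minimal illustration of the idea, not the authors' implementation; the 0.8 threshold and the greedy grouping strategy are arbitrary assumptions:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def ratio(a, b):
    """Similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def group_terms(terms, threshold=0.8):
    """Greedily attach each term to the first group whose representative
    (first member) is similar enough; otherwise start a new group."""
    groups = []
    for term in terms:
        for group in groups:
            if ratio(term, group[0]) >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])
    return groups
```

With this sketch, inflectional variants such as Solarenergie/Solarenergien end up in one group, while unrelated terms start new groups; the resulting groups could then feed the derivation of variation patterns described above.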
4.4. Relatedness of term candidates
Using a predefined set of term variation patterns facilitates deciding whether terms are (near) synonyms or merely related. As synonyms, we consider, for example, the type [compound noun] ↔ [NN PRP NN]. Structures involving relational adjectives ([ADJ NN] (DE), [NN ADJ] (FR)) can be expressed by prepositional phrases, e.g. production énergétique ↔ production d'énergie (energy production ↔ production of energy).
Similarly, patterns can also help to specify the degree of relatedness: by explicitly formulating term variation rules, we can differentiate between merely related terms (e.g. consumption vs. annual consumption) and term variants for which we assume quasi-synonymy (cf. compound nouns in table 2).
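A rule inventory of this kind can be modelled as a simple lookup from pattern pairs to relation labels. The sketch below is illustrative only; the pattern names and relation labels are our own shorthand, not notation from the paper:

```python
# Hypothetical rule inventory: the pattern names and relation labels are
# our own shorthand, not notation used in the paper.
RULES = {
    ("compound", "NN PRP NN"): "near-synonym",  # compound noun vs. PP paraphrase
    ("ADJ NN", "NN PRP NN"): "near-synonym",    # relational adjective vs. PP
    ("NN", "ADJ NN"): "related",                # consumption vs. annual consumption
}

def relation(pattern_a, pattern_b):
    """Look the pattern pair up in both orders; unknown pairs stay unlabeled."""
    return (RULES.get((pattern_a, pattern_b))
            or RULES.get((pattern_b, pattern_a))
            or "unknown")
```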
A difficult task is the identification of (neoclassical) synonyms: without additional information (e.g. a dictionary), it is impossible to relate terms like Sonnenenergie ↔ Solarenergie (solar energy), as the relation between Sonne and solar is not known to the system and cannot be derived by morphological means. While the terms in this example are synonyms, there can be slight differences in meaning between neoclassical compounds and their native forms: the term hydroélectricité (hydroelectricity) is more precise than énergie de l'eau (water energy) and not necessarily a synonym.
5. Conclusion and next steps
We presented a method for terminology extraction and for the identification of a certain type of term variation. Preliminary results show that there are preferences for certain types of realization, especially for German compound nouns.
Since our current work covers only a small part of the possible variation, we intend to enlarge our inventory by exploring more variation patterns. In particular, we plan to include high-quality morphological tools, e.g. SMOR (Schmid et al., 2004) for German and DériF (Namer, 2009) for French; SMOR has proven to outperform our statistical splitter.
Another strand of research is the exploration of term variation across languages, e.g. relations between term variants that are similar within different language pairs.
References
Ahmad, K., Davies, A., Fulford, H., Rogers, M. (1992): What is a Term? The semi-automatic extraction of terms from text. In Translation Studies - an Interdiscipline. John Benjamins Publishing Company.
Clark, A. (2003): Combining distributional and morphological information for part of speech induction. In Proceedings of the 10th conference of the European chapter of the Association for Computational Linguistics. Budapest, Hungary.
Daille, B. (2005): Variants and application-oriented terminology engineering. In Terminology, vol. 1.
Guégan, M., de Loupy, C. (2011): Knowledge-Poor Approach to Shallow Parsing: Contribution of Unsupervised Part-of-Speech Induction. RANLP 2011 - Recent Advances in Natural Language Processing.
de Groc, C. (2011): Babouk: Focused web crawling for corpus compilation and automatic terminology extraction. In Proceedings of the IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology. Lyon, France.
Koehn, P., Knight, K. (2003): Empirical Methods for Compound Splitting. In Proceedings of the 10th conference of the European chapter of the Association for Computational Linguistics. Budapest, Hungary.
Namer, F. (2009): Morphologie, Lexique et Traitement Automatique des Langues - Le système DériF. Hermès – Lavoisier Publishers.
Schmid, H. (1994): Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing. Manchester, UK.
Schmid, H., Fitschen, A., Heid, U. (2004): SMOR: A German computational morphology covering derivation, composition and inflection. In Proceedings of LREC '04. Lisbon, Portugal.
Multilingual Resources and Multilingual Applications - Regular Papers

Approaches to Improving the Retrieval Performance of Commercial Translation Memory Systems
Dino Azzano (a), Uwe Reinke (b), Melanie Sauer (b)
(a) itl AG, Elsenheimerstr. 65, 80687 München
(b) Fachhochschule Köln, Gustav-Heinemann-Ufer 54, 50968 Köln
E-mail: dino.azzano@gmail.com, uwe.reinke@fh-koeln.de, melanie.sauer@fh-koeln.de
Abstract
Translation memory (TM) systems are without doubt among the most important and most widely used tools in computer-assisted translation. Commercial systems have now been on the market for more than two decades, yet their retrieval performance has not been decisively improved with regard to the recognition of semantic similarity. Computational linguistics, by contrast, has long provided robust methods that could usefully be applied for this purpose. Starting from the current limits of the retrieval performance of commercial TM systems, this paper outlines possible approaches to retrieval optimization, distinguishing between approaches with and without the use of linguistic knowledge. So-called placeable and localizable elements can be handled efficiently without linguistic knowledge: unlike ordinary running text, these elements are in principle unambiguously recognizable and either remain unchanged in the translation or are adapted according to fixed rules. Some of these elements can be recognized with regular expressions and, building on this recognition, an optimized similarity computation improves the retrieval of the segments in which they occur. Optimizing the retrieval of paraphrases and sub-segments (phrases, clauses) and improving terminology recognition, on the other hand, require linguistic methods. A research project at Fachhochschule Köln is currently attempting to integrate existing computational linguistic methods into commercial TM systems.
Keywords: computer-assisted translation, translation memory systems, retrieval optimization, fuzzy matching, placeable and localizable elements
1. Translation Memory Systems
TM systems are software applications that support the translation process and have for years been an important computer-assisted tool for everyone involved in translation. Their main purpose is the reuse of previously translated text material (Trujillo, 1999; Reinke, 2005). The majority of professional translators regularly work with one or more TM systems (Massion, 2005; Lagoudaki, 2006). The best-known commercial products include Across, Déjà Vu, memoQ, MultiTrans, SDL Trados, Similis, Transit and Wordfast; OmegaT deserves mention as a non-commercial product.
The core of a TM system is the translation memory (TM), a database or a collection of files containing individual segments – usually corresponding to a sentence – in the source language and in at least one target language. There is a fixed mapping between source-language and target-language entries. TMs thus constitute aligned parallel text corpora that contain metadata (such as creation date, creator, etc.) but are not linguistically annotated (Kenning, 2010; Zinsmeister, 2010). Further components of TM systems, such as a terminology database, an editor, filters for converting file formats and project management tools, are mentioned here only for the sake of completeness.
TM systems do not generate texts of their own. They must therefore be clearly distinguished from machine translation (MT) systems, although hybrid solutions exist that integrate TM and MT. The core task of a TM system is looking up and retrieving matches in the TM (Reinke, 2004; Jekat & Volk, 2010). A TM system is thus primarily a (monolingual) information retrieval system. The search initially takes place at segment level: during translation, the source-language text to be processed is compared with the TM segment by segment. If a source-language match is found, the associated target-language equivalent can be used for further processing. The search is fuzzy, so that similar segments (fuzzy matches) can also be found (Sikes, 2007). The similarity between query and match is quantified as a percentage. As translation proceeds, the newly created segment pairs are added to the TM, so that it grows continuously.
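The percentage-based fuzzy matching just described can be approximated as a token-level edit distance normalized by segment length. This is a minimal sketch of the general idea, not the scoring formula of any particular commercial system:

```python
def similarity(query, candidate):
    """Token-level edit distance between two segments, normalized by the
    longer segment's token count and returned as the usual TM percentage."""
    q, c = query.split(), candidate.split()
    if not q and not c:
        return 100
    prev = list(range(len(c) + 1))
    for i, qt in enumerate(q, 1):
        curr = [i]
        for j, ct in enumerate(c, 1):
            curr.append(min(prev[j] + 1,                  # token deleted
                            curr[j - 1] + 1,              # token inserted
                            prev[j - 1] + (qt != ct)))    # token replaced
        prev = curr
    return round(100 * (1 - prev[-1] / max(len(q), len(c))))
```

Under this sketch, a single changed token in a four-token segment yields a 75% match; real systems differ in how they weight and normalize changes, which is precisely what the evaluation below probes.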
Modern TM systems additionally offer functions to extend the search to the sub-segment level if required. Here, source- and target-language sub-segments are mapped to each other by means of statistical methods (Macken, 2009; Chama, 2010) and suggested during translation (e.g. via an auto-complete function).
Although TM systems have now been on the market for more than two decades, their retrieval performance at segment level has not been decisively improved, either qualitatively or quantitatively. Even approaches that manage without linguistic knowledge, and could thus achieve improvements quite easily, have so far received little attention in commercial TM systems. These are addressed first, in Section 2 of this paper; Section 3 then discusses options for linguistically optimizing retrieval performance.
2. Retrieval Optimization without Linguistic Knowledge
Previous evaluations of the retrieval performance of TM systems have focused on running text (Reinke, 2004; Sikes, 2007; Baldwin, 2010). This is justified, but it risks neglecting other text elements, so-called placeable and localizable elements, which play a considerable role in the translation process.
2.1. Placeable and Localizable Elements
Placeable elements such as tags, inline graphics and fields consist wholly or partly of non-textual material and can often be carried over into the target text unchanged. Tags are markup elements in HTML and XML files. XML formats have gained considerable importance in technical documentation in recent years, also as exchange formats (Reinke, 2008; Anastasiou, 2010; Pelster, 2011), and therefore also play an important role in the translation process. Inline graphics and fields are typical elements of desktop publishing formats and of MS Word formats (1), which are of central importance in most translators' everyday work (Lagoudaki, 2006:12).
Localizable elements such as numbers, dates, proper names with an unambiguous surface structure, URLs and e-mail addresses, by contrast, are plain-text elements that can usually be recognized without linguistic knowledge and whose localization, unlike that of ordinary running text, follows fixed rules and frequently has no effect on the rest of the text.
2.2. Study
A doctoral thesis (Azzano, 2011) investigated the influence of placeable and localizable elements on the retrieval of commercial TM systems. For this purpose, eight commercial TM systems were compared: Across, Déjà Vu, Heartsome, memoQ, MultiTrans, SDL Trados, Transit and Wordfast. Segments containing placeable and localizable elements were extracted from various corpora, taking as many variation patterns as possible into account. These segments were then processed with the TM systems in order to check the recognition of the elements and the similarity values proposed. (2) The main results of the comparative analysis are summarized below.

(1) Although MS Word uses an XML-based format with DOCX, the way it is viewed and processed in the translation workflow differs from that of ordinary XML documents.
(2) A more detailed description of the test methods and test data is not possible here for reasons of space; see Azzano (2011) for further details.
2.2.1. Recall
In principle, source-language segments stored in a TM should also be found when they differ from the source-language segment currently being translated only in the elements described above. (3) The tests showed, however, that retrieval can fail in such cases, or that, as in the following example, a minimal difference leads to very heavy deductions:

• Amstrong stepped off Eagle's footpad […]
• Amstrong stepped off Eagle's footpad […]

Most TM systems offer similarity values between 91% and 99%; one, however, offers 85% and one only 46%.
Owing to the commercial nature of the TM systems examined and the black-box evaluation this entails, the causes of these errors cannot be identified unambiguously. Nevertheless, some conclusions can be drawn from the test results.
With placeable elements, the cause of heavy deductions sometimes lies in the fact that, when segment length is computed, these elements are weighted like ordinary running-text words and are thus overrated. (4) A fixed deduction for differences in placeable elements, by contrast, is a good solution, and four of the tested TM systems do generally apply one. Unlike with running text, the type of change (addition, deletion, replacement or reordering) can be ignored when weighting deductions. A fixed deduction, independent of the type of change, should not require overly large adaptations of the algorithms commercial TM systems use to compute similarity values.
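The recommended fixed deduction can be sketched as follows. The tag pattern, the penalty value of 2 points and the base score via difflib are all illustrative assumptions, not taken from any of the systems tested:

```python
import re
from difflib import SequenceMatcher

TAG = re.compile(r"<[^>]+>")   # crude recognizer for tag-like placeables
FLAT_PENALTY = 2               # assumed fixed deduction in percentage points

def match_value(query, candidate):
    """Score the running text with tags removed, then charge one flat
    deduction if the tag sequences differ -- regardless of whether a tag
    was added, deleted, replaced or moved."""
    q_text = TAG.sub("", query)
    c_text = TAG.sub("", candidate)
    base = round(100 * SequenceMatcher(None, q_text, c_text).ratio())
    if TAG.findall(query) != TAG.findall(candidate):
        base -= FLAT_PENALTY
    return max(0, base)
```

Because the placeables are stripped before the text comparison, a differing tag no longer dilutes the length normalization of the running text; it only triggers the flat deduction.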
With localizable elements, too, retrieval performance can be improved if they are recognized as "special elements" rather than as ordinary running text. In principle, the same strategies as for placeable elements can be applied (e.g. a fixed deduction). Regular expressions have proven effective for recognizing such elements, which follow certain patterns. At present, however, commercial TM systems still show weaknesses here. On the one hand, recognition mechanisms do exist in principle, but they often fail, for example with numbers. Table 1 lists the recognition rates of the TM systems examined for number tokens. (5)

(3) A possible exception are placeable elements that contain longer translatable fragments as attribute or field values, e.g. .
(4) The segment length, i.e. the number of tokens (words) in the segment, is used as a normalization factor to relate the extent of a change correctly to the size of the segment (Trujillo, 1999; Manning, Raghavan & Schütze, 2008).
TM system       Recognition rate
Wordfast        0.99
memoQ           0.99
Across          0.96
Transit         0.90
Déjà Vu         0.89
SDL Trados      0.71

Table 1: Recognition rates for numbers
On the other hand, some localizable elements are ignored entirely. Reliable regular expressions for recognizing URLs in plain text, for example, are readily available (Goyvaerts & Levithan, 2009), but only one TM system implements them. Regular expressions for recognizing the respective localizable elements were therefore presented or newly developed.
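A sketch of such recognition patterns is shown below. These regexes are deliberately simplified; robust, production-grade patterns of the kind collected by Goyvaerts & Levithan (2009) are considerably more involved:

```python
import re

# Deliberately simplified patterns for two kinds of localizable elements;
# treat these as illustrations only, not production-grade recognizers.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")     # e.g. 1.234,56 (but not 22:55 as one token)
URL = re.compile(r"\bhttps?://[^\s<>\"]+")  # stops at whitespace or markup

def localizable_tokens(segment):
    """Return (kind, surface) pairs a TM system could treat as
    localizable elements rather than ordinary running text."""
    return ([("number", m.group()) for m in NUMBER.finditer(segment)]
            + [("url", m.group()) for m in URL.finditer(segment)])
```

Once such spans are recognized, they can be excluded from the text comparison and handled with the fixed-deduction strategy described for placeable elements.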
The benefits of handling placeable and localizable elements appropriately go beyond retrieval itself. Such elements can often be replaced or deleted automatically while the rest of the running text stays the same. These automatic adaptations, some of which TM systems already apply, can save translation time on the one hand and raise the similarity value on the other.
(5) A total of 79 number tokens were tested, each token exhibiting a unique pattern. For further information on the individual tests and the versions tested, see Azzano (2011).

2.2.2. Precision
Section 2.2.1 presented examples in which the similarity values turn out too low. The opposite case occurs as well, however.
Differences between the segment to be translated and the segment found in the TM frequently go undetected when the position or order of the placeable elements differs. In the following example, three TM systems offer a 100% match for the second segment, although the position of the tags has changed and would consequently also have to be adapted in the target language.

• This statement is true only when […]
• This statement is true only when […]

The error presumably lies in the fact that these elements are considered only as an unordered collection, or are replaced by empty, non-positional placeholders before evaluation. Moreover, the number of changes is sometimes not taken into account, so that the similarity values turn out too favourable. In the following example, four TM systems offer the same similarity value for both variations of the first segment, although in the third the tag has been added twice.

• Last transmission February 6, 1966, 22:55 UTC.
• Last transmission February 6, 1966, 22:55 UTC.
• Last transmission February 6, 1966, 22:55 UTC.

All these shortcomings could probably be eliminated with only minor interventions in the retrieval algorithms of the TM systems.
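The gap between the observed and the desired behaviour can be made concrete by comparing tag sequences as unordered sets versus ordered sequences. Since the internals of the tested systems are not known, this is a hypothetical sketch of the two behaviours, not a reconstruction of any product:

```python
def naive_diff(a_tags, b_tags):
    """Set comparison ignores both order and repetition -- roughly the
    behaviour the black-box tests suggest for several systems."""
    return set(a_tags) != set(b_tags)

def positional_diff(a_tags, b_tags):
    """Sequence comparison: a moved or duplicated tag counts as a change."""
    return a_tags != b_tags
```

A moved tag pair or a duplicated tag leaves the set unchanged, so `naive_diff` reports no difference where `positional_diff` correctly does.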
3. Retrieval Optimization with Linguistic Knowledge
3.1. Current Approaches
In optimizing the retrieval results of TM systems, two basic objectives can be distinguished:
1) improving the recall and precision of (monolingual) retrieval (optimizing the match set and the ranking of the matches)
   a. at segment level
   b. at sub-segment level (retrieval of chunks, (complex) phrases, clauses)
2) adapting the matches found in order to optimize their reusability.
Current research efforts are primarily directed at increasing the reusability of fuzzy matches by means of statistical machine translation (Biçici & Dymetman, 2008; Zhechev & van Genabith, 2010; Koehn & Senellart, 2010). Here, the fragments that constitute the difference between a segment to be translated and a fuzzy match found in the TM database are processed with statistical translation methods in such a way that adapting the translation found in the translation memory to the current context ideally requires no additional post-editing effort from the translator. What effect such a 'fusion' of human and machine translation at segment level actually has on the post-editing of fuzzy matches, and thus on productivity and text quality, would in any case have to be investigated empirically.
From the perspective of efficiently integrating existing linguistic methods into commercial TM systems, it seems well worthwhile to pursue an optimization of recall and precision. One of the few commercial TM systems that applies not only string-based but also (simple) computational linguistic methods to optimize retrieval performance is Similis, by the French company Lingua et Machina, which uses morphosyntactic analysis and shallow parsing to identify fragments below segment level (Planas, 2005).
Besides the identification of sub-segments, what is needed above all is better retrieval of source-language segments that are merely paraphrases of previously translated sentences and thus frequently require no change at all on the target-language side. Paraphrases characterized by morphosyntactic, lexical and/or syntactic variation make up a share not to be underestimated of those specialized texts that are constantly updated, modified and reused.
3.2. Objectives and Approaches of the iMEM Project
Ways of integrating existing computational linguistic methods into commercial TM systems are currently being investigated at Fachhochschule Köln within the research project "Intelligent Translation Memories through Computational Linguistic Optimization (iMEM)". iMEM aims to optimize the retrieval performance of TM systems both with regard to better recognition of fragments below segment level and with regard to improved methods for terminology recognition and checking. Robust methods for morphosyntactic analysis and for rule-based sentence segmentation are to be employed. The goal is to develop interface models and prototypical interfaces between commercial TM systems and 'lingware'. The TM system SDL Trados Studio 2009 and the morphosyntactic analysis tool MPRO (Maas, Rösener & Theofilidis, 2009) are used as exemplary components. Starting from German and English, the project is to yield experience for developing further language modules and for transferring the results to other TM systems.
To integrate morphosyntactic information into the TM system, a dedicated SQL database was designed, which is built from the data of the TM database as a parallel 'linguistic TM' and linked to the TM database via corresponding IDs. Besides the surface tokens, the 'linguistic TM' currently contains mainly the results of the compound analysis performed with MPRO; to speed up retrieval, the data are held as suffix arrays (Aluru, 2004).
In the retrieval phase, the segment currently being translated is first analysed and annotated linguistically in the same way as the 'linguistic TM', so that a comparison can take place. The TM is queried in two independent sub-processes, which compare, on the one hand, the token sequences and, on the other hand, the compound-splitting results of the segment to be translated with the corresponding data of the source-language segments stored in the 'linguistic TM'. For all results of the two queries, generalized suffix arrays (GSA) (Rieck, Laskov & Sonnenburg, 2007) are used to determine the longest common substrings (LCS) of the segment to be translated and the segment found in the 'linguistic TM'. For ranking the matches found, a formula remains to be developed that combines and weights the results of the two sub-searches, taking into account, among other things, the number and length of the LCS as well as their position in the segments being compared (cf. also Hawkins & Giraud-Carrier, 2009).
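The longest common token substring at the heart of this comparison can be illustrated with a simple dynamic program. The project itself uses generalized suffix arrays for efficiency, so this quadratic-time version is only a functional sketch of what is being computed:

```python
def longest_common_substring(a, b):
    """Longest contiguous common token run between two segments,
    found by O(len(a) * len(b)) dynamic programming."""
    a, b = a.split(), b.split()
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # extend the common run ending at a[i-1] / b[j-1]
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return " ".join(a[best_end - best_len:best_end])
```

Number, length and position of such runs are exactly the quantities the ranking formula mentioned above would combine and weight.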
In the further course of the project, we will investigate to what extent the current procedure can be extended with sentence-level syntactic analysis so that, below segment level, not only comparatively simple phrases are found, but above all translation units such as clauses and complex phrases that are relevant to computer-assisted human translation with TM systems.
4. Conclusion and Outlook
In summary, proven computational linguistic methods are available for solving the retrieval problems of commercial TM systems, but so far they have found their way into TM systems hardly at all, or only sporadically. A language-independent, purely string-based approach without linguistic knowledge, as currently pursued by almost all commercial TM systems, does not deliver optimal precision and recall values, its obvious advantages in language coverage notwithstanding. It therefore suggests itself to pursue a differentiated approach and, for the languages that are 'large' in terms of translation volume, to begin integrating existing robust methods of linguistic processing into commercial TM systems. For languages for which such methods are not available or not sufficiently robust, the conventional retrieval mechanisms can be retained for the time being, although improvements in the handling of placeable and localizable elements are possible.
5. Acknowledgements
The project "iMEM – Intelligent Translation Memories through Computational Linguistic Optimization" is funded by the German Federal Ministry of Education and Research under the programme "Forschung an Fachhochschulen".
6. Literatur<br />
Aluru, S. (2004): „Suffix Trees and Suffix Arrays“. In<br />
Mehta, D. P. und Sahni, S. (Eds.), Handbook of Data<br />
Structures and Applications. Boca Rayton: Chapman &<br />
Hall/CRC.<br />
Azzano, D. (<strong>2011</strong>): Placeable and localizable elements in<br />
translation memory systems. Dissertation. Ludwig-<br />
Maximilians-<strong>Universität</strong> München.<br />
Anastasiou, D. (2010): Survey on the Use of XLIFF in<br />
Localisation Industry and Academia. In Proceedings of<br />
the 7th International Conference on Language Re-<br />
127
Multilingual Resources and Multilingual Applications - Regular Papers<br />
WikiWarsDE: A German Corpus of Narratives Annotated with Temporal Expressions

Jannik Strötgen, Michael Gertz
Institute of Computer Science, Heidelberg University
Im Neuenheimer Feld 348, 69120 Heidelberg, Germany
E-mail: stroetgen@uni-hd.de, gertz@uni-hd.de
Abstract
Temporal information plays an important role in many natural language processing and understanding tasks. The extraction and normalization of temporal expressions from documents are therefore crucial preprocessing steps in these research areas, and several temporal taggers have been developed in the past. The quality of such temporal taggers is usually evaluated using annotated corpora as gold standards. However, existing annotated corpora only contain documents from the news domain, i.e., short documents with only few temporal expressions. A remarkable exception is the recently published corpus WikiWars, the first temporally annotated English corpus containing long narratives that are rich in temporal expressions. Following this example, we describe in this paper the development and characteristics of WikiWarsDE, a new temporally annotated corpus for German. Additionally, we present evaluation results of our temporal tagger HeidelTime on WikiWarsDE and compare them with results achieved on other corpora. Both WikiWarsDE and our temporal tagger HeidelTime are publicly available.

Keywords: temporal expression, TIMEX2, corpus annotation, temporal information extraction
1. Introduction and Related Work

In recent decades, the extraction and normalization of temporal expressions have become hot topics in computational linguistics. Temporal information plays an important role in many research areas, e.g., in information extraction, document summarization, and question answering (Mani et al., 2005). In addition, temporal information is valuable in information retrieval and can be used to improve search and exploration tasks (Alonso et al., 2011). However, the tasks of extracting and normalizing temporal expressions are challenging, since there are many different ways to express temporal information in documents and since temporal expressions may be ambiguous.
Besides explicit expressions (e.g., "April 10, 2005") that can directly be normalized to some standard format, relative and underspecified expressions are very common in many types of documents. To determine the semantics of such expressions, context information is required. For example, to normalize the expression "Monday" in phrases like "on Monday", a reference time and the relation to the reference time have to be identified. Depending on the domain of the documents that are to be processed, this reference time can either be the document creation time or another temporal expression in the document. While the document creation time plays an important role in news documents, it is almost irrelevant in narrative-style documents, e.g., documents about history or biographies. Despite these challenges, all applications using temporal information mentioned in documents rely on high-quality temporal taggers, which correctly extract and normalize temporal expressions from documents.
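To make the reference-time reasoning concrete, the following sketch (our own illustration, not HeidelTime's actual implementation; the function name and the "before"/"after" relation encoding are invented) resolves an underspecified weekday expression such as "Monday" against a given reference date:

```python
from datetime import date, timedelta

def resolve_weekday(weekday, reference, relation):
    """Resolve an underspecified expression like "Monday".

    weekday:   0 = Monday ... 6 = Sunday
    reference: the reference time (e.g. the document creation time for
               news, or the last mentioned date in a narrative)
    relation:  whether the context places the day "before" or "after"
               the reference time
    """
    if relation == "after":
        # next occurrence strictly after the reference day
        delta = (weekday - reference.weekday()) % 7
        return reference + timedelta(days=delta or 7)
    # previous occurrence strictly before the reference day
    delta = (reference.weekday() - weekday) % 7
    return reference - timedelta(days=delta or 7)

# "on Monday" with reference time November 12, 2001 (itself a Monday):
next_monday = resolve_weekday(0, date(2001, 11, 12), "after")   # 2001-11-19
prev_monday = resolve_weekday(0, date(2001, 11, 12), "before")  # 2001-11-05
```

The example shows why both pieces of context matter: the same surface string "Monday" normalizes to two different dates depending on the relation to the reference time.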
Due to the importance of temporal tagging, there have been significant efforts in the area of temporal annotation of text documents. Annotation standards such as TIDES TIMEX2 (Ferro et al., 2005) and TimeML (Pustejovsky et al., 2003b; Pustejovsky et al., 2005) were defined, and temporally annotated corpora like TimeBank (Pustejovsky et al., 2003a) were developed, although most of these corpora contain English documents
only. Furthermore, research challenges were organized in which temporal taggers were evaluated. The ACE (Automatic Content Extraction) time expression recognition and normalization (TERN) challenges were organized in 2004, 2005, and 2007.¹ In 2010, temporal tagging was one task in the TempEval-2 challenge (Verhagen et al., 2010). However, so far, research has been limited to the news domain, i.e., the documents of the annotated corpora are short with only a few temporal expressions. The temporal discourse structure is thus usually easy to follow. Only recently, a first corpus containing narratives was developed (Mazur & Dale, 2010). This corpus, called WikiWars, consists of Wikipedia articles about famous wars in history. The documents are much longer than news documents and contain many temporal expressions. As the developers point out, normalizing the temporal expressions in such documents is more challenging due to the rich temporal discourse structure of the documents.

Motivated by this observation and by the fact that no temporally annotated corpus for German was publicly available so far, we created the WikiWarsDE corpus,² which we present in this paper. WikiWarsDE contains the German articles corresponding to the documents of the English WikiWars corpus. For the annotation process, we followed the suggestions of the WikiWars developers, i.e., we annotated the temporal expressions according to the TIDES TIMEX2 annotation standard using the annotation tool Callisto.³ To be able to use publicly available evaluation scripts, the format of the ACE TERN corpus was selected. Thus, evaluating a temporal tagger on the WikiWarsDE corpus is straightforward and evaluation results of different taggers can be compared easily.

The remainder of the paper is structured as follows. In Section 2, we describe the annotation schema and the corpus creation process. Then, in Section 3, we present detailed information about the corpus, such as statistics on the length of the documents and the number of temporal expressions. In addition, evaluation results of
our own temporal tagger on the WikiWarsDE corpus are presented. Finally, we conclude our paper in Section 4.

¹ The 2004 and 2005 training sets and the 2004 evaluation set are released by the LDC, as is the TimeBank corpus; see http://www.ldc.upenn.edu/
² WikiWarsDE is publicly available at http://dbs.ifi.uni-heidelberg.de/temporal_tagging/
³ http://callisto.mitre.org/
Temporal Expression        Value of the VAL attribute
November 12, 2001          2001-11-12
9:30 p.m.                  2001-11-12T21:30⁴
24 months                  P24M
daily                      XXXX-XX-XX

Table 1: Normalization examples (VAL) of temporal expressions of the types date, time, duration, and set.
2. Annotation Schema and Corpus Creation

In Section 2.1, we describe the annotation schema which we used for the annotation of temporal expressions in our newly created corpus. Furthermore, we explain the task of normalizing temporal expressions using some examples. Then, in Section 2.2, we detail the corpus creation process and explain the format in which WikiWarsDE is publicly available.

2.1. Annotation Schema

Following the approach of Mazur and Dale (2010), we use TIDES TIMEX2 as the annotation schema to annotate the temporal expressions in our corpus. The TIDES TIMEX2 annotation guidelines (Ferro et al., 2005) describe how to determine the extents of temporal expressions and their normalizations. In addition to date and time expressions, such as "November 12, 2001" and "9:30 p.m.", temporal expressions describing durations and sets are to be annotated as well. Examples of expressions of the types duration and set are "24 months" and "daily", respectively.

The normalization of temporal expressions is based on the ISO 8601 standard for temporal information with some extensions. The following five features can be used to normalize a temporal expression:

• VAL (value)
• MOD (modifier)
• ANCHOR_VAL (anchor value)
• ANCHOR_DIR (anchor direction)
• SET
The most important feature of a TIMEX2 annotation is the "VAL" (value) feature. For the four examples above, the values of VAL are given in Table 1. Furthermore, "MOD" (modifier) is used, for instance, for expressions such as "the end of November 2001", where MOD is set to "END", i.e., to capture additional specifications not captured by VAL. ANCHOR_VAL and ANCHOR_DIR are used to anchor a duration to a specific date, using the value information of the date and specifying whether the duration starts or ends on this date. Finally, SET is used to identify set expressions.

⁴ Assuming that "9:30 p.m." refers to 9:30 p.m. on November 12, 2001.
Often, for example in the TempEval-2 challenge, the normalization quality of temporal taggers is evaluated based only on the VAL (value) feature. This underlines the importance of this feature and motivated us to evaluate the normalization quality of our temporal tagger based on this feature, as described in Section 3.
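As an illustration of how the five features combine, the following hand-constructed annotations use the attribute names of TIDES TIMEX2; the phrases and the concrete attribute values (e.g. the anchor direction) are our own illustrative assumptions, not examples from the corpus:

```python
# Hand-written TIMEX2-style annotations for the four expression types.
annotations = [
    # explicit date: VAL alone suffices
    {"text": "November 12, 2001", "VAL": "2001-11-12"},
    # fuzzy reference to part of a month: MOD captures what VAL cannot
    {"text": "the end of November 2001", "VAL": "2001-11", "MOD": "END"},
    # duration anchored to the date on which it starts
    {"text": "24 months", "VAL": "P24M",
     "ANCHOR_VAL": "2001-11-12", "ANCHOR_DIR": "STARTING"},
    # recurring (set) expression
    {"text": "daily", "VAL": "XXXX-XX-XX", "SET": "YES"},
]

# e.g. collect all durations that are anchored to a date
anchored = [a for a in annotations if "ANCHOR_VAL" in a]
```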
2.2. Corpus Creation

For the creation of the corpus, we followed Mazur and Dale (2010), the developers of the English WikiWars corpus. We selected the 22 corresponding German Wikipedia articles and manually copied the sections describing the course of the wars.⁵ All pictures, cross-page references, and citations were removed. All text files were then converted into SGML files, the format of the ACE TERN corpora, containing "DOC", "DOCID", "DOCTYPE", "DATETIME", and "TEXT" tags. The document creation time was set to the time of downloading the articles from Wikipedia. The "TEXT" tag surrounds the text that is to be annotated.
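A minimal sketch of this conversion step follows. The tag inventory is taken from the description above, but the exact whitespace and layout, the helper function, and the example values are our assumptions rather than the ACE TERN specification:

```python
def to_tern_sgml(doc_id, doc_type, datetime_str, body):
    """Wrap a cleaned article body in an ACE-TERN-style SGML skeleton."""
    return (
        "<DOC>\n"
        f"<DOCID> {doc_id} </DOCID>\n"
        f"<DOCTYPE> {doc_type} </DOCTYPE>\n"
        f"<DATETIME> {datetime_str} </DATETIME>\n"
        "<TEXT>\n"
        f"{body}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )

# document creation time = download time of the Wikipedia article
sgml = to_tern_sgml("wikiwarsDE_01", "NARRATIVE",
                    "2010-08-13", "Der Krieg begann im Jahr 1939 ...")
```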
Similar to Mazur and Dale (2010), we used our own temporal tagger, which is described in Section 3.2 and contains a rule set for German, as a first-pass annotation tool. The output of the tagger can then be imported into the annotation tool Callisto for manual correction of the annotations. Although this fact has to be taken into account when comparing the evaluation results of our temporal tagger on WikiWarsDE with those of other taggers, this procedure is motivated by the fact that "annotator blindness", i.e., the risk that annotators miss temporal expressions, is reduced to a minimum. Furthermore, the annotation effort is reduced significantly, since one does not have to create a TIMEX2 tag for the expressions already identified by the tagger.

⁵ Due to the shortness of the Wikipedia article about the Punic Wars in general, we used sections of three separate articles about the 1st, 2nd, and 3rd Punic Wars.
At the second annotation stage, the documents were examined for temporal expressions missed by the temporal tagger, and annotations created by the tagger were manually corrected. This task was performed by two annotators, although Annotator 2 only annotated the extents of temporal expressions. The more difficult task of normalizing the temporal expressions was performed by Annotator 1 only, since this task requires a lot of experience in temporal annotation. At the third annotation stage, the results of both annotators were merged, and in cases of disagreement the extents and normalizations were rechecked and corrected by Annotator 1.
To compare our inter-annotator agreement for the determination of the extents of temporal expressions to that of others, we calculated the same measures as the developers of the TimeBank 1.2 corpus. They calculated the average of precision and recall with one annotator's data as the key and the other's as the response. Using a subset of ten documents, they report inter-annotator agreement of 96% and 83% for partial match (lenient) and exact match (strict), respectively.⁶ Our scores for lenient and exact match on the whole corpus are 96.7% and 81.3%, respectively.
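The agreement computation can be sketched as follows. This is a simplified reimplementation under our own assumptions (extents as character-offset pairs, naive independent matching of each span); the TimeBank scorer may differ in details:

```python
def overlaps(a, b):
    # lenient match: the character spans intersect
    return a[0] < b[1] and b[0] < a[1]

def agreement(key, response, strict=False):
    """Average of precision and recall, with one annotator's spans as
    the key and the other's as the response; spans are (start, end)."""
    match = (lambda a, b: a == b) if strict else overlaps
    precision = sum(any(match(r, k) for k in key) for r in response) / len(response)
    recall = sum(any(match(k, r) for r in response) for k in key) / len(key)
    return (precision + recall) / 2

# Annotator 1 as key, Annotator 2 as response (invented offsets)
key = [(0, 10), (20, 28), (40, 45)]
response = [(0, 10), (21, 28)]
lenient_score = agreement(key, response)               # partial match
strict_score = agreement(key, response, strict=True)   # exact match
```

In the example, the response span (21, 28) overlaps the key span (20, 28) but does not match it exactly, so the lenient score is higher than the strict one, mirroring the gap between the 96.7% and 81.3% figures above.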
Finally, the annotated files, which contain inline annotations, were transformed into the ACE APF XML format, a stand-off markup format used by the ACE evaluations. Thus, the WikiWarsDE corpus is available in the same two formats as the WikiWars corpus, and the evaluation tools of the ACE TERN evaluations can be used with this German corpus as well.
3. Corpus Statistics and Evaluation Results

In this section, we first present some statistical information about the WikiWarsDE corpus, such as the length of the documents and the number of temporal expressions in the documents (Section 3.1). Then, in Section 3.2, we briefly introduce our own temporal tagger HeidelTime, present its evaluation results on WikiWarsDE, and compare them with results achieved on other corpora.
⁶ For more information on TimeBank, see http://timeml.org/site/timebank/documentation-1.2.html
Corpus               Docs   Tokens    Timex   Tokens/Timex   Timex/Document
ACE 04 en train       863   306,463   8,938   34.3            10.4
TimeBank 1.2          183    78,444   1,414   55.5             7.7
TempEval2 en train    162    53,450   1,052   50.8             6.5
TempEval2 en eval       9     4,849      81   59.9             9.0
WikiWars               22   119,468   2,671   44.7           121.4
WikiWarsDE             22    95,604   2,240   42.7           101.8

Table 2: Statistics of the WikiWarsDE corpus and other publicly available or released corpora.
3.1. Corpus Statistics

The WikiWarsDE corpus contains 22 documents with a total of more than 95,000 tokens and 2,240 temporal expressions. Note that the fact that the WikiWars corpus contains almost 25,000 tokens more than WikiWarsDE can partly be explained by differences between the two languages: compounds are very frequent in German, e.g., the three English tokens "course of war" correspond to just one token in German ("Kriegsverlauf").
In Table 2, we present some statistics of the corpus in comparison to other publicly available corpora. On the one hand, the density of temporal expressions (Tokens/Timex) is similar among the documents of all the corpora. In WikiWarsDE, one temporal expression occurs every 42.7 tokens on average.

On the other hand, one can easily see that the documents of the WikiWarsDE and WikiWars corpora are much longer and contain many more temporal expressions than the documents of the news corpora. While WikiWars and WikiWarsDE contain 121.4 and 101.8 temporal expressions per document on average, the number of temporal expressions per document in the news corpora ranges between 6.5 and 10.4 only. Thus, the temporal discourse structure is much more complex for the narrative-style documents in WikiWars and WikiWarsDE. Further statistics on the single documents of WikiWarsDE are published with the corpus.
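The density column of Table 2 follows directly from the raw token and TIMEX counts, e.g.:

```python
# Reproducing the Tokens/Timex column of Table 2 for the two
# narratives corpora from their raw counts
corpora = {
    "WikiWars":   (119_468, 2_671),
    "WikiWarsDE": (95_604, 2_240),
}
density = {name: round(tokens / timex, 1)
           for name, (tokens, timex) in corpora.items()}
# density == {"WikiWars": 44.7, "WikiWarsDE": 42.7}
```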
3.2. Evaluation Results

After the development of the corpus, we evaluated our temporal tagger HeidelTime on it. HeidelTime is a multilingual, rule-based temporal tagger. Currently, two languages are supported (English and German), but due to the strict separation between the source code and the resources (rules, extraction patterns, normalization information), HeidelTime can easily be adapted to further languages. In the TempEval-2 challenge, HeidelTime achieved the best results for the extraction and normalization of temporal expressions from English documents (Strötgen & Gertz, 2010; Verhagen et al., 2010). Since HeidelTime uses different normalization strategies depending on the type of the documents that are to be processed (news-style or narrative-style documents), we were able to show that HeidelTime achieves high-quality results on both kinds of documents for English.⁷
With the development of WikiWarsDE, we are now able to evaluate HeidelTime on a German corpus as well. For this, we use the well-known evaluation measures of precision, recall, and f-score. In addition, we distinguish between lenient (overlapping match) and strict (exact match) measures. For the normalization, one can calculate the measures for all expressions that were correctly extracted by the system (value). This approach is used by the ACE TERN evaluations. However, similar to Ahn et al. (2005) and Mazur and Dale (2010), we argue that it is more meaningful to combine the extraction with the normalization task, i.e., to calculate the measures for all expressions in the corpus (lenient + value and strict + value).
⁷ More information on HeidelTime, its evaluation results on several corpora, as well as download links and an online demo can be found at http://dbs.ifi.uni-heidelberg.de/heideltime/
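The measure combinations can be sketched as follows. This is a simplified scorer under our own assumptions (expressions as (start, end, value) triples, naive independent matching); the actual ACE TERN tools handle span alignment and multiple overlaps more carefully:

```python
def prf(tp, n_sys, n_gold):
    """Precision, recall, and f-score from a true-positive count."""
    p = tp / n_sys if n_sys else 0.0
    r = tp / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def evaluate(gold, system, strict=False):
    """Return (extraction P/R/F, extraction-plus-value P/R/F)."""
    def match(s, g):
        if strict:
            return s[:2] == g[:2]           # exact extents
        return s[0] < g[1] and g[0] < s[1]  # overlapping extents
    tp_extent = sum(any(match(s, g) for g in gold) for s in system)
    tp_value = sum(any(match(s, g) and s[2] == g[2] for g in gold)
                   for s in system)
    return (prf(tp_extent, len(system), len(gold)),
            prf(tp_value, len(system), len(gold)))

# Invented toy data: the second system span overlaps the gold extent
# but gets the normalization wrong.
gold = [(0, 5, "2001-11-12"), (10, 15, "P24M")]
system = [(0, 5, "2001-11-12"), (10, 14, "P20M")]
extraction, combined = evaluate(gold, system)  # lenient, lenient + value
```

In this toy case, the lenient extraction scores are perfect, while the combined lenient + value scores are penalized for the wrong normalization, which is exactly why the combined measures are more meaningful.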
Corpus         lenient           strict            value             lenient + value   strict + value
               P    R    F       P    R    F       P    R    F       P    R    F        P    R    F
TimeBank 1.2   90.5 91.4 90.9    83.5 84.3 83.9    86.2 86.2 86.2    78.0 78.8 78.4     73.2 74.0 73.6
WikiWars       93.9 82.4 87.8    86.0 75.4 80.4    89.5 90.1 89.8    84.1 73.8 78.6     79.6 69.8 74.4
WikiWarsDE     98.5 85.0 91.3    92.6 79.9 85.8    87.0 87.0 87.0    85.7 74.0 79.4     82.5 71.2 76.5

Table 3: Evaluation results of our temporal tagger on an English news corpus (TimeBank 1.2), an English narratives corpus (WikiWars), and our newly created German narratives corpus WikiWarsDE.
On WikiWarsDE, HeidelTime achieves f-scores of 91.3 and 85.8 for the extraction (lenient and strict, respectively) and 79.4 and 76.5 for the normalization (lenient + value and strict + value, respectively).

For comparison, we present the results of HeidelTime on some English corpora. As shown in Table 3, our temporal tagger achieves equally good results on both the narratives corpora (WikiWars and WikiWarsDE) and the news corpus (TimeBank). Note that our temporal tagger uses different normalization strategies depending on the type of the corpus that is to be processed. This might be the main reason why HeidelTime clearly outperforms the temporal tagger of the WikiWars developers: for the WikiWars corpus, Mazur and Dale (2010) report f-scores for the normalization of only 59.0 and 58.0 (lenient + value and strict + value, respectively). Compared to these values, HeidelTime achieves much higher f-scores, namely 78.6 and 74.4, respectively.
4. Conclusions

In this paper, we described WikiWarsDE, a temporally annotated corpus containing German narrative-style documents. After presenting the creation process and statistics of WikiWarsDE, we used the corpus to evaluate our temporal tagger HeidelTime. While Mazur and Dale (2010) report lower evaluation results of their temporal tagger on narratives than on news documents, HeidelTime achieves similar results on both types of corpora. Nevertheless, we share their opinion that the normalization of temporal expressions in narratives is challenging. However, this problem can be tackled by using different normalization strategies for different types of documents (news-style and narrative-style documents).

By making WikiWarsDE and HeidelTime available, we provide useful contributions to the community in support of developing and evaluating temporal taggers and of improving temporal information extraction.

5. Acknowledgements

We thank the anonymous reviewers for their valuable suggestions to improve the paper.
6. References

Ahn, D., Adafre, S.F., de Rijke, M. (2005): Towards Task-Based Temporal Extraction and Recognition. In G. Katz, J. Pustejovsky, F. Schilder (Eds.), Extracting and Reasoning about Time and Events. Dagstuhl, Germany: Dagstuhl Seminar Proceedings.
Alonso, O., Strötgen, J., Baeza-Yates, R., Gertz, M. (2011): Temporal Information Retrieval: Challenges and Opportunities. In Proceedings of the 1st International Temporal Web Analytics Workshop (TWAW), pp. 1–8.
Ferro, L., Gerber, L., Mani, I., Sundheim, B., Wilson, G. (2005): TIDES 2005 Standard for the Annotation of Temporal Expressions. Technical report, The MITRE Corporation.
Mani, I., Pustejovsky, J., Gaizauskas, R.J. (2005): The Language of Time: A Reader. Oxford University Press.
Mazur, P., Dale, R. (2010): WikiWars: A New Corpus for Research on Temporal Expressions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 913–922.
Pustejovsky, J., Hanks, P., Sauri, R., See, A., Gaizauskas, R.J., Setzer, A., Radev, D., Sundheim, B., Day, D., Ferro, L., Lazo, M. (2003a): The TIMEBANK Corpus. In Proceedings of Corpus Linguistics 2003, pp. 647–656.
Pustejovsky, J., Castaño, J.M., Ingria, R., Sauri, R., Gaizauskas, R.J., Setzer, A., Katz, G. (2003b): TimeML: Robust Specification of Event and Temporal Expressions in Text. In New Directions in Question Answering, pp. 28–34.
Pustejovsky, J., Knippen, R., Littman, J., Sauri, R. (2005): Temporal and Event Information in Natural Language Text. Language Resources and Evaluation, 39(2–3):123–164.
Strötgen, J., Gertz, M. (2010): HeidelTime: High Quality Rule-based Extraction and Normalization of Temporal Expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval), pp. 321–324.
Verhagen, M., Sauri, R., Caselli, T., Pustejovsky, J. (2010): SemEval-2010 Task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval), pp. 57–62.
Translation and Language Change with Reference to Popular Science Articles: The Interplay of Diachronic and Synchronic Corpus-Based Studies

Sofia Malamatidou
University of Manchester
Oxford Road, M13 9PL
E-mail: sofia.malamatidou@manchester.ac.uk

Abstract
Although a number of scholars have adopted a corpus-based approach to the investigation of translation as a form of language contact and its impact on the target language (Steiner, 2008; House, 2004; 2008; Baumgarten et al., 2004), no sustained corpus-based study of translation involving Modern Greek has so far been attempted, and very few diachronic corpus-based studies (Amouzadeh & House, 2010) have been undertaken in the field of translation. This study aims to combine synchronic and diachronic corpus-based approaches, as well as parallel and comparable corpora, for the analysis of the linguistic features of translated texts and their impact on non-translated ones. The corpus created captures a twenty-year period (1990–2010) and is divided into three sections, including non-translated and translated Modern Greek popular science articles published in different years, as well as the source texts of the translations. Unlike most studies employing comparable corpora, which focus on revealing recurrent features of translated language independently of the source and target language, this study approaches texts with the intention of revealing features that are dependent on the specific language pair involved in the translation process.

Keywords: corpus-based translation studies, language change, diachronic corpora, Modern Greek, passive voice
1. Introduction

Translation as a language contact phenomenon has been addressed in depth by neither linguistics nor translation studies. However, in the era of the information society, the translation of popular science texts tends to be very much a unidirectional process from the dominant lingua franca, which is English, into less widely spoken languages such as Modern Greek. This process is likely to encourage changes in the communicative conventions of the target language. Given the fact that the genre of popular science was developed in Greece mainly through translations from Anglophone sources in the last two decades, it is interesting to examine whether and how the translations from English encouraged the dissemination of particular linguistic features in the target language in the discourse of this particular genre. A number of scholars, mostly within the English-German context, have taken an interest in investigating translation as a form of language contact and its effects on the target language. Steiner (2008) has investigated grammatical and syntactic features of explicitness as a result of the contact between English and German, which however did not involve diachronic analyses of corpora. Most importantly, House and a group of scholars have investigated how translation from English affects German, but also Spanish and French (House, 2004; 2008; Baumgarten et al., 2004; Becher et al., 2009). However, these studies mainly involved manual analyses of texts, that is, they were not corpus-based studies as they are understood by Baker (1995), i.e., they did not involve the automatic or semi-automatic analysis of machine-readable texts. Diachronic corpus-based approaches to translation are limited (Amouzadeh & House, 2010), and in terms of Modern Greek, no similar study has ever been conducted.

This study aims to examine whether and how translation can encourage linguistic changes in the target language by investigating a diachronic corpus of non-translated and translated Modern Greek popular science articles, along with their source texts, in order to examine how translation can be understood as a language contact phenomenon. The linguistic change that is examined is the frequency of the passive voice, since it has been claimed to be found more frequently in translated Modern Greek texts (Apostolou-Panara, 1991), especially those translated from English.

This paper first presents the theoretical model that informs the study, namely the Code-Copying Framework (Johanson, 1993; 1999; 2002). Then the research methodology is presented in detail and data analysis techniques are described. Finally, some preliminary findings are discussed. It must be mentioned that this is still an ongoing project and for that reason the results are limited to a number of small sample studies.
2. The Code-Copying Framework<br />
The Code-Copying Framework is a widely applicable<br />
linguistic model that is suitable for the description of<br />
phenomena that have consistently been neglected, such<br />
as translation as a form of language contact and a<br />
propagator of change. Some of its concepts have recently<br />
been used by translation scholars to describe similar<br />
phenomena (Steiner, 2008), suggesting that it is a<br />
conceptual model suitable for analysing diverse cases of<br />
language contact, in particular cases where translation<br />
plays a central role in the dissemination of linguistic<br />
features.<br />
The Code-Copying Framework was developed by Johanson (1993; 1999; 2002), who is critical of the terminology used in the field of language change studies, especially that of borrowing. It is this critique that serves as a point of departure for developing a new explanatory framework of language contact, in which ‘copying’ replaces the traditional terms and provides a different vantage point from which to analyse the phenomenon. Johanson (1999:39) argues that in any
situation of code-interaction, that is, in a situation where<br />
two or more codes interact, two linguistic systems, i.e.<br />
two codes are employed. The Model Code is the source<br />
code, whereas the Basic Code is the recipient code which<br />
also provides the necessary morphosyntactic and other<br />
information for inserting and adapting the copied<br />
material (Johanson, 2008:62). Although there are
different directions of copying, this study focuses on the<br />
case of ‘adoption’ which involves elements being<br />
inserted from the Model Code into the Basic Code and<br />
views translation as a language contact situation where<br />
translators are likely to copy elements from the source<br />
language, i.e. the Model Code, when translating into the<br />
target language, which is the Basic Code.<br />
Two types of copying are possible within this model:<br />
global and selective copying. The linguistic properties<br />
that can be copied are material (i.e. phonic), semantic,<br />
combinational (i.e. collocations and syntax) and<br />
frequential properties, namely the frequency of particular<br />
linguistic units. In the case of global copying, a linguistic<br />
item is copied along with all its aforementioned<br />
properties. In the case of selective copying, one or more<br />
properties are copied resulting in distinct types of<br />
copying. Thus, there is material (M), semantic (S),<br />
combinational (C) and frequential (F) copying.<br />
Figure 1: The Code-Copying Framework<br />
(Johanson, 2006:5)<br />
During the process of translation, selective copying is<br />
more probable than global copying (Verschik, 2008:133).<br />
For that reason, the type of copying that is dealt with in<br />
this study is selective copying and in particular<br />
frequential copying, which results in a change in the<br />
frequency patterns of an existing linguistic unit.<br />
Apostolou-Panara (1991) notes that passive constructions are used more frequently in Modern Greek than they once were. Traditionally, it has been argued that passive voice structures are used in Modern Greek, though not as often as in English (Warburton, 1975:576), where the passive voice is quite frequent, especially in informative texts such as popular science articles.
As far as translation is concerned, different frequencies
and proportionalities of native patterns often result in<br />
texts having a ‘non-native’ feeling (Steiner, 2008:322).<br />
The frequent translation of source text patterns with<br />
grammatical, yet marginal, target language linguistic<br />
patterns may ultimately override prevailing patterns and<br />
result in new communicative preferences in the target<br />
language (Baumgarten & Özçetin 2008:294).<br />
Copies usually begin as momentary code-copies, that is,<br />
occasional instances of copying. When copies start being<br />
frequently and regularly used by a group of individuals or<br />
by a particular speech community, they become<br />
habitualised code-copies. Copies may also become conventionalised, that is, integrated and widely accepted by a speech community. The final stage is for
copies to become monolingual, i.e. when copies are used<br />
by monolinguals and do not presuppose any bilingual<br />
ability (Johanson, 1999:48). Since momentary copies are<br />
difficult to trace (Csató, 2002:326), emphasis in this<br />
study is placed on habitualised code-copies. Translators<br />
are considered as part of a particular speech community<br />
and copies are regarded as habitualised when they are<br />
frequently and regularly used by translators.<br />
Conventionalised copies are not examined in this study,<br />
since they presuppose measuring social evaluation that is<br />
outside the scope of this research. However, it is safe to assume that if a copy is monolingualised, that is, if it is used in non-translated texts, it is also in general terms socially approved.
Translation in this study is understood as a social<br />
circumstance facilitating copying. It is not considered as<br />
a cause of change, but rather as an instance of contact<br />
during which copying may occur and change may<br />
proliferate through language, since translated texts,<br />
especially newspaper and magazine articles, are widely<br />
circulating texts that are likely to exert a powerful<br />
linguistic impact on a large audience. The main factors of<br />
copying are considered to be extra-linguistic, especially<br />
the cultural dominance of English in relation to Modern<br />
Greek, as far as the production of scientific texts is<br />
concerned, and the prestige that English enjoys as a<br />
prominent language and culture, both in the general sense<br />
of a lingua franca and in terms of scientific research.<br />
3. Data and Methodology<br />
3.1. Corpus design<br />
Based on the availability of data and the research aims of<br />
this thesis, an approximately 500,000-word corpus of Modern
Greek non-translated and translated popular science<br />
articles, along with their source texts was created. The<br />
corpus is named TROY (TRanslation Over the Years) and<br />
covers a 20-year period (1990-2010), which is considered an adequate time span for language change to occur and to be systematically observed.
Newspapers and magazines dedicated to scientific issues are the two main sources of popular science
articles. The corpus is specialised in terms of both genre<br />
and domain, i.e. it involves popular science articles from<br />
the domain of technology and life sciences. These<br />
domains were chosen because the majority of articles, especially translations, seem to belong to one of the two. This in turn indicates that the general public takes an interest in these domains, which suggests that a large number of people read articles on technology and the life sciences, a fact that is likely to result in a powerful linguistic impact on a large audience.
The TROY corpus is divided into three subcorpora. The<br />
first subcorpus consists of non-translated Modern Greek<br />
popular science articles published in 1990-1991. The<br />
second subcorpus consists of non-translated and<br />
translated Modern Greek popular science articles<br />
published in 2003-2004, as well as the source texts of the<br />
translations. The years 2003-2004 were selected because<br />
translations of popular science texts started circulating<br />
more widely in Greece during that period than in<br />
previous years. The third subcorpus includes<br />
non-translated as well as translated texts and their source<br />
texts, all published in 2009-2010. The subcorpora are<br />
evenly balanced, both in terms of their overall size and<br />
between the two domains.<br />
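The corpus design just described can be summarised in a small sketch; the labels below are invented for illustration and do not reflect the project's actual data layout:

```python
# Hypothetical summary of the TROY corpus design; the period labels and
# component names are illustrative, not the project's actual file layout.
TROY = {
    "1990-1991": ["non_translated"],
    "2003-2004": ["non_translated", "translated", "source_texts"],
    "2009-2010": ["non_translated", "translated", "source_texts"],
}

def components(period):
    """Return the text categories contained in a subcorpus."""
    return TROY[period]

print(components("1990-1991"))  # ['non_translated']
```

Each later subcorpus adds translated articles and their English source texts, which is what makes the comparable and parallel analyses possible.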
3.2. Corpus Methodology<br />
The corpus methodology employed in this study has three<br />
aims. Firstly, it aims to investigate whether certain<br />
features have changed over time in Modern Greek.<br />
Secondly, it aims to examine whether this change is<br />
related to or mirrored in the process of translation. Finally, it
aims to investigate whether influence can be traced back<br />
to the English source texts. Ultimately, this methodology<br />
aims at combining most corpus-based methodologies<br />
under one research aim. Thus, synchronic and diachronic<br />
corpus-based approaches, as well as parallel and<br />
comparable corpora are employed in order to illustrate<br />
the way in which combined methodologies can assist in<br />
the analysis of the linguistic features of translated texts<br />
and their impact on non-translated ones.<br />
Firstly, the corpus methodology aims at examining language change in Modern Greek and, in particular, at investigating whether the frequency of the passive voice
has changed over time. This involves a longitudinal<br />
corpus-based study, during which a comparable corpus is<br />
analysed diachronically. For the purposes of this study,<br />
the non-translated articles published in 1990-1991 will be<br />
compared to the non-translated articles published in<br />
2009-2010.<br />
The second aim of this corpus-based methodology is to<br />
examine the role of translation in this language change<br />
phenomenon. This involves a comparable corpus-based<br />
analysis where translated and non-translated Modern<br />
Greek popular science articles are analysed<br />
synchronically. First, the non-translated articles<br />
published in 2003-2004 will be compared to the<br />
translated articles published during the same years. Then,<br />
the same type of analysis will be conducted for articles<br />
published in 2009-2010. Two separate phases of analysis<br />
are included in order to investigate the extent to which<br />
the linguistic features in the translated texts differ from<br />
those of the non-translated ones at different time periods.<br />
More particularly, the first phase of analysis focuses on a<br />
period of time when the influence from English<br />
translations of popular science articles was at its initial<br />
stage. The second phase of analysis focuses on a later<br />
stage of the contact between English and Modern Greek<br />
through translation, as far as the particular genre of popular science is concerned.
Finally, this corpus-based methodology aims to<br />
investigate the role of the source texts in this language<br />
contact situation. This involves the synchronic analysis<br />
of a parallel corpus of translated articles and their<br />
originals, which consists of two phases of analysis, i.e.<br />
the translated popular science articles that were published<br />
in 2003-2004 will be compared to their source texts and<br />
the same analysis will be conducted for the articles<br />
published in 2009-2010.<br />
The analyses will be conducted with the help of the<br />
Concordance tool of WordSmith Tools 5.0 and will be<br />
based on semi-automatic methods: at points where a closer examination of the texts is required, they are analysed manually. The verb form is considered to be the
unit of analysis and auxiliary verbs are excluded from the<br />
counts, since they do not provide any lexical information.<br />
For the sample studies discussed below, a part-of-speech (POS) tagger is not used, because available Modern Greek POS taggers score relatively low
on accuracy and Modern Greek verbs can be quite<br />
accurately identified from their suffixes with the use of<br />
wildcards.<br />
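The suffix-and-wildcard approach described above can be approximated with a regular expression; the suffix list below is a small illustrative subset of Modern Greek mediopassive endings, not the authors' actual WordSmith queries:

```python
import re

# Illustrative subset of Modern Greek mediopassive endings (present
# -ται/-νται, past -θηκε/-θηκαν); the authors' actual wildcard queries
# are not given in the paper, and real queries would need more suffixes.
PASSIVE_SUFFIX = re.compile(r"\w+(ται|νται|θηκε|θηκαν)$")

def looks_passive(token):
    """Heuristically flag a token as a passive verb form by its suffix."""
    return bool(PASSIVE_SUFFIX.match(token))

print(looks_passive("γράφεται"))      # True  ('is written')
print(looks_passive("γράφει"))        # False ('writes')
print(looks_passive("μεταφράστηκε"))  # False: a miss ('was translated'),
                                      # showing why manual checks remain needed
```

The deliberate miss in the last example illustrates why the paper combines such pattern matching with manual analysis.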
4. Preliminary Results<br />
Although this is still an ongoing project, a number of<br />
sample studies indicate that a corpus-based methodology<br />
that combines synchronic and diachronic corpus-based<br />
approaches, as well as parallel and comparable corpora<br />
can considerably assist in the analysis of the linguistic<br />
features of translated texts and their impact on<br />
non-translated ones. Articles for the sample studies are<br />
taken from the newspaper Βήμα (The Tribune), which<br />
includes a section dedicated to scientific issues.<br />
4.1. Language Change in Modern Greek<br />
In terms of the first aim of this corpus-based<br />
methodology, that is, the examination of language change<br />
in Modern Greek, a sample study of popular science<br />
articles published in 1991 and 2010 involving 4,000<br />
words was conducted in order to examine changes in the<br />
frequency of the passive voice. Although this is a very<br />
small sample study, it was found that the passive voice<br />
has become more frequent in Modern Greek in the last 20<br />
years, at least in terms of the specific genre of popular<br />
science articles. In particular, in the articles published in<br />
1991, 273 verb forms were found, 42 of which involved<br />
passive verb forms. In the articles published in 2010, 217<br />
instances of verb forms were identified, 42 of which were<br />
passive. This amounts to an increase of roughly four to five percentage points in the frequency of passive voice constructions in Modern Greek. However, this increase may be attributed to a number of factors unrelated to contact-induced language change, i.e. it may be a result of internal language changes. An analysis
of translated texts is necessary in order to establish the<br />
extent to which contact through translation has<br />
encouraged a frequential copying of passive voice<br />
structures from English.<br />
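The proportions behind these raw counts can be checked directly:

```python
def passive_share(passive, total):
    """Percentage of passive forms among all verb forms, one decimal place."""
    return round(100 * passive / total, 1)

share_1991 = passive_share(42, 273)  # 1991 articles: 42 of 273 verb forms
share_2010 = passive_share(42, 217)  # 2010 articles: 42 of 217 verb forms
print(share_1991, share_2010)  # 15.4 19.4, i.e. an increase of roughly
                               # four to five percentage points
```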
Figure 2: Change in the frequency of the passive voice in<br />
Modern Greek (1991-2010)<br />
4.2. The Role of the Translations<br />
A second sample study was conducted in order to<br />
examine the role of the translation in this language<br />
change situation. In particular, a small corpus of 20,000<br />
words taken from translated and non-translated Modern<br />
Greek popular science articles published in 2010 was<br />
analysed. The analysis revealed that the frequency of the<br />
passive voice in the translated and non-translated articles<br />
is very similar, i.e. approximately 20%. In the<br />
non-translated articles, 1,081 verb instances were<br />
identified, 215 of which were passive, whereas the<br />
translated articles included 1,234 verb forms, out of<br />
which 243 involved passive voice occurrences.<br />
Figure 3: Frequency of the passive voice in translated and<br />
non-translated articles published in 2010<br />
This similarity in terms of the proportions of the passive<br />
voice suggests that the translated texts at least mirror the changes in the frequency of the passive voice that are attested in Modern Greek. This sample study focuses on a
later stage of contact between English and Modern Greek<br />
in terms of popular science publications and it is assumed<br />
that this later stage indicates more established instances<br />
of copying, if we accept that some kind of copying has<br />
taken place. Although a comparable analysis of articles<br />
published in 2003-2004, when the influence from<br />
Anglophone source texts was at its initial stage, has not<br />
yet been attempted, such an analysis is likely to reveal a<br />
different patterning than the one discussed above, i.e. that<br />
the frequency of the passive voice is higher in translated<br />
texts than in non-translated ones. This would indicate that the frequential copying of the passive voice gradually became habitualised in the context of translation.
4.3. The Role of the Source Texts<br />
Finally, in terms of the last aim of this corpus-based<br />
methodology, namely the investigation of the role of the<br />
English source texts in this language change<br />
phenomenon, it should be mentioned that although a<br />
sample study is not available at the moment for this type<br />
of analysis, it can be predicted based on the previous<br />
sample study that translated texts are likely to follow the<br />
patterns of the source texts. Corpus studies (Biber et al.<br />
1999:476) suggest that the English passives account for<br />
approximately 25% of all finite verbs in academic prose<br />
and for 15% in news. Popular science articles are<br />
considered to be somewhere in between these two genres,<br />
since they present scientific issues using a journalistic<br />
language. Thus, the frequency of the passive voice in<br />
English popular science articles can be expected to be<br />
somewhere between these two percentages, i.e. 20%. The<br />
distribution of the frequency of the passive voice in the<br />
previous sample study represents exactly this proportion.<br />
If this prediction is confirmed, it will suggest that the<br />
translation of popular science articles from Anglophone<br />
sources tends to encourage the frequential copying of the<br />
passive voice in Modern Greek. In that case, Modern Greek, being the Basic Code, copied the frequency of the passive voice patterns from the Model Code, English. The copies first became habitualised in the discourse of translation and then spread into the general linguistic community, becoming monolingual copies.
5. Conclusion<br />
Although the results are only preliminary, this corpus-based study is important in several respects.
Firstly, it is one of the first diachronic corpus-based<br />
studies ever to be attempted within the field of translation<br />
studies and it raises collective awareness of how<br />
translation can encourage the dissemination of particular<br />
source language linguistic features. If this scholarly<br />
strand is to be consolidated, more research across a wider<br />
range of language pairs and linguistic features has to be<br />
conducted. Secondly, it is one of the first sustained<br />
corpus-based studies ever to be conducted in the Modern<br />
Greek context within the field of translation studies,<br />
which aims at analysing systematically and in depth the<br />
Modern Greek linguistic features of translated texts.<br />
Finally, this study combines all corpus-based<br />
methodologies, i.e. diachronic, synchronic, comparable<br />
and parallel, under one research aim: the investigation of<br />
translation as a language contact phenomenon. This is<br />
probably the most important aspect of this study since it<br />
stresses the numerous advantages of collaborative<br />
techniques and engages them in a mutually profitable<br />
dialogue.<br />
6. References<br />
Amouzadeh, M., House, J. (2010): Translation and<br />
Language Contact: The case of English and Persian.<br />
Languages in Contrast, 10(1), pp. 54-75.<br />
Apostolou-Panara, A. (1991): English Loanwords in<br />
Modern Greek: An overview. Terminologie et<br />
Traduction, 1(1), pp. 45-60.<br />
Baker, M. (1995): Corpora in Translation Studies: An<br />
overview and some suggestions for future research.<br />
Target, 7(2), pp. 223-243.<br />
Baumgarten, N., House, J., Probst, J. (2004): English as a<br />
Lingua Franca in Covert Translation Processes. The<br />
Translator, 10(1), pp. 83-108.<br />
Baumgarten, N., Özçetin, D. (2008): Linguistic Variation<br />
through Language Contact in Translation. In E.<br />
Siemund & N. Kintana (Eds.), Language Contact and<br />
Contact Languages. Amsterdam: John Benjamins, pp.<br />
293-316.<br />
Becher, V., House, J., Kranich, S. (2001): Convergence<br />
and Divergence of Communicative Norms through<br />
Language Contact in Translation. In K. Braunmüller &<br />
J. House (Eds.), Convergence and Divergence in<br />
Language Contact Situations. Amsterdam: John<br />
Benjamins, pp. 125-152.<br />
Biber, D., Johansson, S., Leech, G., Conrad, S., Finegan,<br />
E. (1999): Longman Grammar of Spoken and Written<br />
English. Harlow: Longman.<br />
Csató, É.Á. (2002): Karaim: A high-copying language. In<br />
M.C. Jones & E. Esch (Eds.), Language Change: The<br />
interplay of internal, external and extra-linguistic<br />
factors. Berlin: Mouton de Gruyter, pp. 315-327.<br />
House, J. (2004): English as Lingua Franca and its Influence on Other European Languages. In J.M. Bravo (Ed.), A New Spectrum of Translation Studies. Valladolid: Universidad de Valladolid, pp. 49-62.
House, J. (2008): Global English and the Destruction of Identity? In P. Nikolaou & M.V. Kyritsi (Eds.), Translating Selves: Experience and identity between languages and literatures. London and New York: Continuum, pp. 87-107.
Johanson, L. (1993): Code-Copying in Immigrant Turkish. In G. Extra & L. Verhoeven (Eds.), Immigrant Languages in Europe. Clevedon, Philadelphia and Adelaide: Multilingual Matters, pp. 197-221.
Johanson, L. (1999): The Dynamics of Code-Copying in Language Encounters. In B. Brendemoen, E. Lanza & E. Ryen (Eds.), Language Encounters across Time and Space. Oslo: Novus Press, pp. 37-62.
Johanson, L. (2002): Structural Factors in Turkic Language Contacts. London: Curzon.
Johanson, L. (2008): Remodelling Grammar: Copying, conventionalisation, grammaticalisation. In E. Siemund & N. Kintana (Eds.), Language Contact and Contact Languages. Amsterdam: John Benjamins, pp. 61-79.
Steiner, E. (2008): Empirical Studies of Translations as a<br />
Mode of Language Contact: ‘Explicitness’ of<br />
lexicogrammatical encoding as a relevant dimension.<br />
In E. Siemund & N. Kintana (Eds.), Language Contact<br />
and Contact Languages. Amsterdam: John Benjamins,<br />
pp. 317-346.<br />
Verschik, A. (2008): Emerging Bilingual Speech: From<br />
Monolingualism to Code-Copying. London:<br />
Continuum.<br />
Warburton, I. (1975): The Passive in English and Greek.<br />
Foundations of Language, 13(4), pp. 563-578.
A Comparable Wikipedia Corpus: From Wiki Syntax to POS Tagged XML<br />
Noah Bubenhofer, Stefanie Haupt, Horst Schwinn<br />
Institut <strong>für</strong> Deutsche Sprache IDS<br />
Mannheim<br />
E-mail: bubenhofer@ids-mannheim.de, st.haupt@gmail.com, schwinn@ids-mannheim.de<br />
Abstract<br />
To build a comparable Wikipedia corpus of German, French, Italian, Norwegian, Polish and Hungarian for contrastive grammar<br />
research, we used a set of XSLT stylesheets to transform the mediawiki annotations to XML. Furthermore, the data has been<br />
annotated with word class information using different taggers. The outcome is a corpus with rich meta data and linguistic annotation<br />
that can be used for multilingual research in various linguistic topics.<br />
Keywords: Wikipedia, Comparable Corpus, Multilingual Corpus, POS-Tagging, XSLT<br />
1. Background<br />
The project EuroGr@mm 1 aims at describing German grammar from a multilingual perspective. Therefore, an
international research team consisting of members from Germany, France, Italy, Norway, Poland and Hungary collaborates, bringing their respective language knowledge to bear on a contrastive description of German. The
grammatical topics that have been tackled so far are<br />
morphology, word classes, tense, word order and phrases.<br />
A corpus-based approach is used to compare the<br />
grammatical means of the languages in focus. So far, however, no comparable corpus of the chosen languages has been at the project's disposal. Of course, large corpora are available for all the languages, but they consist of different text types and are in different states of preparation regarding linguistic markup.
Hence we wanted to build our own corpus of comparable<br />
data in the different languages. The Wikipedia is a<br />
suitable source for building such a corpus. The<br />
disadvantage of the Wikipedia is its limitations regarding<br />
text types: The articles are (or are at least intended to be)<br />
very uniform in their linguistic structure. To overcome<br />
this problem, we decided to also include the discussions of the articles in our corpus, which broadens the text type diversity at least slightly.
In this paper we describe how the Wikipedia was converted to an XML format and part-of-speech tagged.
1 See<br />
http://www.ids-mannheim.de/gra/eurogr@mm.html.<br />
2. Wikipedia conversion to XCES<br />
To be able to integrate the linguistically annotated version of the Wikipedia into our existing corpus repository, the data has to be in the XML format XCES. 2 There are
already some attempts to convert the Wikipedia into a data source usable for corpus linguistics (Fuchs, 2010:136). But they either offer only the data of a specific language version of the Wikipedia in an XML format (Wikipedia XML Corpus, Denoyer & Gallinari, 2006; SW1, Atserias et al., 2008), use a format that is not suitable for our needs (WikiPrep, Gabrilovich & Markovitch, 2006; WikIDF, Krizhanovsky, 2008; Java Wikipedia Library, Zesch et al., 2008), or rely on a conversion tool that no longer works with the current mediawiki engine (WikiXML Collection; Wiki2TEI, Desgraupes & Loiseau, 2007). To be a lasting solution, the conversion routines need to remain usable in the future, which would allow us to obtain a new version of the Wikipedia from time to time.
Therefore we developed our own solution of XSLT<br />
transformations to get an XCES version of the data.<br />
All Wikipedia articles and their discussions are available<br />
as mediawiki database dumps in XML (Extensible<br />
Markup Language, Bray et al., 1998). These database<br />
dumps contain different kinds of annotation: article metadata are represented in XML, while the article text itself is written in the mediawiki markup language. We convert these documents into the XCES format using XSLT 2.0 transformations to ease research.
2 http://www.xces.org/<br />
This process is divided into 2 sections:<br />
1) The conversion from mediawiki language to XML<br />
2) The conversion from the generated XML to XCES<br />
format<br />
The mediawiki language uses a variety of special characters for its annotations. E.g., a level 2 header consists of text wrapped in two equals signs on each side, like this:
== head ==
Likewise, lists appear as a chain of hash or asterisk signs, according to the level, e.g. a level 3 list entry:
### list entry
During the first conversion we process the paragraphs according to their type and detect headers, lists, tables and ordinary paragraphs. We convert these signs into clean XML, so that
== head ==
is turned into a corresponding XML header element and
### list entry
into a list-item element that records its nesting level.
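The project performs this conversion with XSLT stylesheets; purely as an illustration of the same idea, the two line types could be handled as follows in Python (the XML element names here are invented and need not match the project's output format):

```python
import re

def convert_line(line):
    """Convert one line of mediawiki markup to flat XML.
    The element names (<h2>, <item>, <p>) are invented for this sketch;
    the project's actual XSLT output format may differ."""
    m = re.match(r"^==\s*(.*?)\s*==$", line)      # level-2 header
    if m:
        return f"<h2>{m.group(1)}</h2>"
    m = re.match(r"^(#+|\*+)\s*(.*)$", line)      # list entry, any level
    if m:
        return f'<item level="{len(m.group(1))}">{m.group(2)}</item>'
    return f"<p>{line}</p>"                        # ordinary paragraph

print(convert_line("== head =="))      # <h2>head</h2>
print(convert_line("### list entry"))  # <item level="3">list entry</item>
```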
Of course inside the paragraphs there may be<br />
text-highlighting markup. We access the paragraphs and<br />
convert these wikimedia annotations to XML, too. Here<br />
we follow a certain pattern to detect text-highlighting<br />
signs.<br />
At this point the document’s hierarchy is still flat. In the next step we
add structure to the lists. We group the list items<br />
according to their level to highlight the structure. In a<br />
later step we group all articles into sections depending on<br />
the occurrence of head elements. Whenever we add<br />
structure we need to take care of possible errors in the<br />
mediawiki syntax.<br />
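The grouping of list items by level, tolerant of level jumps caused by broken mediawiki syntax, can be sketched with a simple stack (again an illustration, not the project's XSLT grouping code):

```python
def nest(items):
    """Group a flat sequence of (level, text) list items into a nested
    structure: each deeper item becomes a child of the nearest preceding
    shallower item. Tolerates level jumps, as broken mediawiki syntax
    requires."""
    root = {"level": 0, "text": None, "children": []}
    stack = [root]
    for level, text in items:
        node = {"level": level, "text": text, "children": []}
        # pop back to the nearest ancestor with a smaller level
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]

flat = [(1, "a"), (2, "a1"), (2, "a2"), (1, "b")]
tree = nest(flat)  # "a" (with children "a1", "a2") followed by "b"
```

In XSLT 2.0 the same effect is typically achieved with xsl:for-each-group, grouping items that start a new level.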
Now the articles need to be transformed into the XCES<br />
structure. Here we sort the articles into alphanumerical<br />
sections. We transform the corpus and enrich every<br />
article with meta data. We provide a unique id for every<br />
article and discussion so that they can easily be<br />
referenced. Also the actual article text can be<br />
distinguished from the discussion part of the article,<br />
which is important because they are different text types.<br />
These conversion routines should work for all the<br />
language versions of the Wikipedia, but have so far only<br />
been tested with the languages necessary for the project:<br />
German, French, Italian, Norwegian (Bokmål), Polish<br />
and Hungarian.<br />
3. POS-Tagging<br />
To enable searching for word class information in the corpus, the data needs to be part-of-speech tagged. This task has not been finished yet, but preliminary tests have already been done. Not having any additional resources, we have to rely on ready-to-use taggers and cannot make any improvements or adjustments to them. 3 We are using the following taggers:
- German: TreeTagger (Schmid, 1994) with the available training library for German (STTS-Tagset, Schiller et al., 1995)
- French: TreeTagger with the available training library for French
- Italian: TreeTagger with the available training library for Italian
- Polish: TaKIPI (Piasecki, 2007), based on Morfeusz SIaT (Saloni et al., 2010)
- Hungarian: system developed by the Hungarian National Corpus team (Váradi, 2002), based on TnT (Brants, 2000)
- Norwegian (Bokmål): Oslo-Bergen Tagger 4 (Hagen et al., 2000)
The input for the taggers consists of raw text files without any XML mark-up, containing only those parts of the Wikipedia which need to be tagged; all meta information is ignored.
A Perl script is used to send the input data in manageable<br />
chunks to the tagger. The script also transfers the output<br />
of the tagger to an XML file that contains, for each token, a character position reference to the original data file.
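A minimal sketch of such stand-off annotation, with each token stored alongside its character span rather than as inline mark-up (the Perl script itself is not shown in the paper; the toy tagger and tag values below are invented):

```python
import re

def standoff(text, tag_fn):
    """Produce stand-off POS annotations: each token is recorded with its
    character span in the original text instead of inline mark-up."""
    records = []
    for m in re.finditer(r"\S+", text):
        records.append({"start": m.start(), "end": m.end(),
                        "token": m.group(), "pos": tag_fn(m.group())})
    return records

# Toy "tagger" for illustration; real tags would come from TreeTagger etc.
toy_tag = lambda tok: "NN" if tok[0].isupper() else "XX"
ann = standoff("Die Wikipedia wächst", toy_tag)
# ann[1] == {"start": 4, "end": 13, "token": "Wikipedia", "pos": "NN"}
```

Keeping only offsets in the annotation file is what allows the 15.4 GB source corpus and the much larger linguistic mark-up to be stored separately.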
Because of the size of the Wikipedia, the tagging process<br />
is very time consuming. E.g. the XCES file of the<br />
German Wikipedia holds about 15.4 GB of data (785,791,766 tokens). The size of the stand-off file
containing the linguistic mark-up produced by the<br />
3 Nevertheless we get support of the developers of the taggers,<br />
which we greatly appreciate.<br />
4 See http://tekstlab.uio.no/obt-ny/english/<br />
history.html for the newest developments of the tagger.
TreeTagger (POS information for each token) is about 157.9 GB. It took about 30 hours on a standard dual-core PC to process this file.
4. Corpus Query System<br />
Our existing corpus management software COSMAS II 5 is used as the corpus query system. COSMAS II is currently
used to manage the DeReKo (German Reference Corpus,<br />
see Kupietz et al., 2010), which contains about 4 billion<br />
tokens. Therefore COSMAS II is also able to cope with<br />
the Wikipedia data.<br />
To be able to build new versions of our corpus from time to time, based on the latest Wikipedia, we can rely on the same version control mechanisms as DeReKo.
For technical reasons, COSMAS II cannot handle UTF-8 encoding. Therefore the encoding of the XCES files has to be changed to ISO-8859-1, and characters outside this range have to be converted to numeric character references referring to the Unicode code point.
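In Python, for example, this re-encoding step corresponds to the 'xmlcharrefreplace' error handler (a sketch of the idea, not the project's actual tooling):

```python
def to_latin1_xml(text):
    """Re-encode text as ISO-8859-1, replacing characters outside the
    Latin-1 range with numeric character references (&#NNNN;) that refer
    to the Unicode code point."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(to_latin1_xml("Bokmål – Wikipédia"))
# b'Bokm\xe5l &#8211; Wikip\xe9dia'
```

Characters such as å and é fit into Latin-1 directly, while e.g. the en dash (U+2013) becomes the reference &#8211;.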
At the end of this process, the Wikipedias in the XCES<br />
and the tagged format will be made publicly available to<br />
the scientific community.<br />
5. Conclusion<br />
While the Wikipedia is an often-used and attractive source for various NLP and corpus linguistic tasks, it is not easy to find a durable XML conversion routine that produces proper XML versions of the data. Our attempt has been to provide such a solution using XSLT stylesheets.
After the part-of-speech tagging of the six language<br />
versions of the Wikipedia (German, French, Italian,<br />
Polish, Hungarian, Norwegian) we are able to build a<br />
multilingual comparable corpus for contrastive grammar<br />
research in our project.<br />
For future investigations, the advantage of an XML version of the Wikipedia is clearly visible: the XML structure holds all the meta information available in the mediawiki code and can therefore be used to differentiate findings of grammatical structures: Are there variants of specific constructions in different text types (lexicon entry vs. user discussion)? Or does the usage of the constructions depend on topic domains? And how do these observations change in the light of inter-lingual comparisons?
5 See http://www.ids-mannheim.de/cosmas2/.
6. References<br />
Atserias, J., Zaragoza, H., Ciaramita, M., Attardi, G.<br />
(2008): Semantically Annotated Snapshot of the<br />
English Wikipedia. In Proceedings of the Sixth<br />
International Language Resources and Evaluation<br />
(LREC 08), Marrakech, pp. 2313–2316.<br />
Brants, T. (2000): TnT – A Statistical Part-of-Speech<br />
Tagger. In Proceedings of the Sixth Conference on<br />
Applied Natural Language Processing (ANLP),<br />
Seattle, WA.<br />
Bray, T., Paoli, J., Sperberg-McQueen, C. M. (1998):<br />
Extensible Markup Language (XML) 1.0. W3C<br />
Recommendation<br />
.<br />
Denoyer, L., Gallinari, P. (2006): The Wikipedia XML<br />
Corpus. In SIGIR Forum.<br />
Desgraupes, B., Loiseau, S. (2007): Wiki to TEI 1.0<br />
project .<br />
Fuchs, M. (2010): Aufbau eines linguistischen Korpus<br />
aus den Daten der englischen Wikipedia. In Semantic<br />
Approaches in Natural Language Processing.<br />
Proceedings of the Conference on Natural Language<br />
Processing 2010 (KONVENS 10), Saarbrücken:<br />
Universitätsverlag des Saarlandes, pp. 135–139.
Gabrilovich, E., Markovitch, S. (2006): Overcoming the<br />
Brittleness Bottleneck using Wikipedia: Enhancing<br />
Text Categorization with Encyclopedic Knowledge.<br />
In Proceedings of The 21st National Conference<br />
on Artificial Intelligence (AAAI), Boston,<br />
pp. 1301–1306.<br />
Hagen, K., Johannessen, J. B., Nøklestad, A. (2000): A<br />
Constraint-based Tagger for Norwegian. In 17th<br />
Scandinavian Conference of Linguistics, Lund,<br />
Odense: University of Southern Denmark, 19,<br />
pp. 31–48 (Odense Working Papers in Language and<br />
Communication).<br />
Krizhanovsky, A. A. (2008): Index wiki database: design<br />
and experiments. In CoRR abs/0808.1753.<br />
Kupietz, M., Belica, C., Keibel, H., Witt, A. (2010): The<br />
German Reference Corpus DeReKo: A primordial<br />
sample for linguistic research. In Proceedings of the<br />
7th conference on International Language Resources<br />
and Evaluation, Valletta, Malta: European Language<br />
Resources Association (ELRA), pp. 1848-1854.<br />
Piasecki, M. (2007): Polish Tagger TaKIPI: Rule Based<br />
Construction and Optimisation. In Task Quarterly<br />
11(1–2), pp. 151–167.<br />
Saloni, Z., Gruszczyński, W., Woliński, M., Wołosz, R.<br />
(2010): Analizator morfologiczny Morfeusz<br />
.<br />
Schiller, A., Teufel, S., Thielen, C. (1995): Guidelines für das Tagging deutscher Textcorpora mit STTS. Universität Stuttgart, Institut für maschinelle Sprachverarbeitung; Universität Tübingen, Seminar für Sprachwissenschaft, Stuttgart.
Schmid, H. (1994): Probabilistic Part-of-Speech Tagging<br />
Using Decision Trees<br />
.<br />
Váradi, T. (2002): The Hungarian National Corpus. In<br />
Proceedings of the 3rd LREC Conference, Las Palmas,<br />
Spain, pp. 385–389
.<br />
Zesch, T., Müller, C., Gurevych, I. (2008): Extracting<br />
Lexical Semantic Knowledge from Wikipedia and<br />
Wiktionary. In Proceedings of the Sixth International<br />
Language Resources and Evaluation (LREC 08),<br />
Marrakech, pp. 1646–1652<br />
.<br />
A German Grammar for Generation in OpenCCG<br />
Jean Vancoppenolle * Eric Tabbert * Gerlof Bouma + Manfred Stede *<br />
* Dept of Linguistics, University of Potsdam, + Dept of Swedish, University of Gothenburg<br />
E-mail: * {vancoppenolle,tabbert,stede}@uni-potsdam.de + gerlof.bouma@gu.se<br />
Abstract<br />
We present a freely available CCG fragment for German that is being developed for natural language generation tasks in the<br />
domain of share price statistics. It is implemented in OpenCCG, an open source Java implementation of the computationally<br />
attractive CCG formalism. Since generation requires lexical categories to have semantic representations, so that possible<br />
realizations can be produced, the underlying grammar needs to define semantics. Hybrid Logic Dependency Semantics, a logic<br />
calculus especially suited for encoding linguistic meaning, is used to declare the semantics layer. To our knowledge, related work<br />
on German CCG development has not yet focused on the semantics layer. In terms of syntax, we concentrate on aspects of German<br />
as a partially free constituent order language. Special attention is paid to scrambling, where we employ CCG's type-changing mechanism in a manner that is somewhat unusual, but that allows us to a) minimize the number of syntactic categories that are needed to model scrambling, compared to providing categories for all possible argument orders, and b) retain enough control to
impose restrictions on scrambling.<br />
Keywords: CCG, Generation, Scrambling, German<br />
Introduction<br />
“Der Kurs der Post ist vom 13. September bis 29.<br />
Oktober stetig gefallen und dann bis zum 15. November<br />
wieder leicht angestiegen.<br />
Zwischen dem 13. und dem 29. September schwankte<br />
der Kurs leicht zwischen 15 und 16 Euro. Anschließend<br />
fiel er um mehr als die Hälfte ab und erreichte am 29.<br />
Oktober seinen Tiefststand bei 7 Euro. Bis zum 15.<br />
November stieg der Kurs nach einigen Schwankungen<br />
auf seinen Schlusswert von 10 Euro.”<br />
Consider the graph depicting the development of a share<br />
price. Undoubtedly, a human could interpret the<br />
mathematical properties of that graph and quite easily<br />
describe this information in prose. They would probably produce a text more or less similar to the one presented above. In computational linguistics (or, more generally, artificial intelligence), people attempt to go one step further and let the computer do that work for us.
Basically, it will have to perform the same steps that a<br />
human would need to in order to accomplish this task:<br />
determine the mathematical properties of interest and<br />
generate a text that is faithful to the input and easy to<br />
read. The present paper addresses the latter subtask, i.e., the text generation.
Our goal is to develop a freely available fragment of a<br />
German grammar in OpenCCG that is suitable for<br />
natural language generation tasks in the domain of share<br />
prices. Related work on German in OpenCCG includes<br />
Hockenmaier (2006) and Hockenmaier and Young<br />
(2008), who employ grammar induction algorithms to<br />
induce CCG grammars automatically from treebanks<br />
(e.g. the TIGER Corpus). To our knowledge, however, very few resources are actually freely available. In
particular, the coverage of a part of the semantic layer is<br />
a novel contribution of the grammar that we present<br />
here.<br />
1. CCG<br />
CCG (Combinatory Categorial Grammar, Steedman 2000, Steedman & Baldridge 2011) is a lexicalized grammar formalism in which all constituents, lexical ones included, are assigned a syntactic category that describes their combinatory possibilities. These categories
may be atomic or complex. Complex categories are<br />
functions from one category into another, with<br />
specification of the relative position of the function and<br />
its argument. For instance, the notation s∖np describes a complex category that can be combined with an np on its left (the direction of the slash) to yield an s.
Category combination always applies to adjacent<br />
constituents and is governed by a set of combinatory<br />
rules, of which the simplest is function application. In<br />
the example in Fig. 1, we build a sentence (category s) around a transitive verb ((s∖np)/np). There are two versions of function application used in the derivation: backward (<) and forward (>), depending on which constituent is the argument and which is the function.
An overview of other derivation rules is given in Table<br />
1.<br />
Figure 1: A basic CCG derivation.<br />
The atomic categories in CCG come from a very<br />
restricted set. They may be enriched with features to<br />
handle case, agreement, clause types, etc. In addition, a<br />
grammar writer may choose to handle language-specific<br />
phenomena with unary type-changing rules. Finally, the<br />
grammar presented uses multi-modal CCG (henceforth<br />
MMCCG), which gives extended lexical control over<br />
derivation possibilities by adding modalities to the<br />
slashes in complex categories (see Baldridge 2002; Steedman & Baldridge 2011, for an introduction and overview).
In its basic form, CCG has mildly context-sensitive generative power and is thus able to account for non-context-free natural language phenomena (Steedman, 2000). Its attractiveness is due to its linguistic expressiveness on the one hand and the fact that it is efficiently parsable in theory (Vijay-Shanker & Weir, 1990) as well as in practice (Clark & Curran, 2007) on the other.
Table 1: Overview of CCG combinatory rules.
(>) α/β β ⇒ α (forward application)
(<) β α∖β ⇒ α (backward application)
(T) α ⇒ β/(β∖α) (forward type-raising)
objects. Finally, the satisfaction operator @ states that<br />
the formula p in @ i p holds at world i.<br />
OpenCCG implements a flexible surface realizer that, when given a logical form (LF) like the one in Fig. 2, returns one or more realizations of it, based on the underlying grammar. Both the number of realizations and their surface forms depend on how much information an LF specifies, making it possible to either increase or restrict the non-determinism of the realization process. For
example, given that the LF in Fig. 2 does not specify<br />
which of the two arguments is fronted, the following<br />
two surface forms are possible:<br />
2) Der Kurs erreicht seinen Höchststand.<br />
the share-price reaches its peak<br />
3) Seinen Höchststand erreicht der Kurs.
its peak reaches the share-price
'The share price reaches its peak'
2. Coverage<br />
Our current work focuses on different aspects of<br />
German as a partially free constituent order language,<br />
including basic constituent order and scrambling in<br />
particular, but also on complex nominal phrases, clausal<br />
subordination, and coordination. In the next two sections, we first give a brief overview of how topicalization is modeled in our grammar, followed by an approach to scrambling that we are currently investigating and that, as far as we know, is new in CCG.
Figure 3: NP fronting
2.1. Topicalization<br />
The finite verb can occupy three different positions that<br />
depend on the clause type and determine the sentence<br />
mood: matrix clauses are either verb-initial (declarative<br />
or yes/no-interrogative), or verb-second (declarative or<br />
wh-interrogative), and subordinate clauses are always<br />
verb-final (declarative or interrogative).<br />
Following Steedman (2000), Hockenmaier (2006) and<br />
Hockenmaier and Young (2008), we implemented a<br />
topicalization rule that systematically derives verb-second order from verb-initial order by fronting an argument of the verb, e.g. an NP, a PP, or a clause. This also covers partial fronting (see Fig. 3 for examples):
4) T ⇒ sv2/(sv1/T), T = {np, pp, sto-inf$∖np, ...}
Sentence modifiers (e.g. heute in heute fällt der Kurs<br />
'today, the share price is falling') are analyzed as sv2/sv1 and can thus form verb-second clauses on their own.
2.2. Scrambling<br />
Much of the constituent order freedom in German is due to the fact that it allows permutation of verbal arguments within a clause (local scrambling, 5) and 'extraction' of arguments of an arbitrarily deeply embedded non-finite clause (long-distance scrambling, 6):
5) dass [dem Unternehmen]2 [das Richtige]3
that the enterprise the right-thing
[der Berater]1 empfiehlt.
the counselor advises
'that the counselor recommends the right thing to the enterprise'
6) dass [dem Unternehmen]2 [das Richtige]3<br />
that the enterprise the right-thing<br />
[der Berater]1 [_ _ zu empfehlen hofft].<br />
the counselor to advise hopes<br />
'that the counselor hopes to recommend the right thing to the enterprise'
Different proposals have been made in MMCCG to<br />
account for constituent order freedom in general. To our<br />
knowledge, the two most common approaches are to<br />
provide separate categories for each possible order<br />
(Hockenmaier, 2006; Hockenmaier & Young, 2008) or<br />
to allow lexical underspecification of argument order<br />
through multi-sets (Hoffman, 1992; Steedman &<br />
Baldridge, 2003).<br />
We are investigating an approach to local scrambling<br />
that aims at combining the advantages of both methods,<br />
namely having fine-grained control over argument<br />
permutation on the one hand, and requiring as few<br />
categories as possible on the other. It is based on a set of<br />
type-changing rules that change categories 'on the fly'.<br />
(7) shows a simplified rule that allows plural NPs to be derived from plural nouns, reflecting the optionality of determiners in German plural NPs (e.g. sie isst Kartoffeln 'she eats potatoes'):
7) npl ⇒ nppl
Type-changing rules can also be used to swap two consecutive argument NPs (i and j denote indexes):
8) s/np〈i〉+base/np〈j〉-pron ⇒ s/np〈j〉/np〈~i〉-base
9) s$∖np〈i〉-pron∖np〈j〉+base,-pron ⇒ s$∖np〈~j〉-base,-pron∖np〈i〉
This essentially emulates the behavior of multi-sets and<br />
at the same time reduces the number of categories to a<br />
minimum, thereby enhancing the maintainability of the<br />
grammar. The advantage over multi-sets is that<br />
restrictions on scrambling can be formulated<br />
straightforwardly, such as that full NPs should not<br />
scramble over pronouns (i.e. NPs having the -pron(oun)<br />
feature) (see Uszkoreit (1987) for an overview of<br />
scrambling regularities in German).<br />
Rules like (8) and (9) require special caution, though.<br />
Type-changing rules are supposed to actually change the<br />
type of the argument category as they could otherwise<br />
apply over and over again, causing an infinite recursion.<br />
This is where the ±base feature comes into play. It<br />
indicates whether an NP occupies its base position or<br />
has already been scrambled, restricting the application<br />
of (8) and (9) to the former case and thereby preventing<br />
infinite recursion. The so-called dollar variable $ in<br />
(9) ranges over complex categories that have the same<br />
functor (here: s ), such as s ∖np . It is not crucial to our<br />
scrambling rules but generalizes (9) to apply to both<br />
transitive and ditransitive verbs.<br />
Four more rules are sufficient to capture all possible local permutations and also some of the long-distance permutations, such as the one in (6).
Figure 4: Parse of a non-finite clause.
The derivation in Fig. 4 contains the derivation of the complex verb cluster of example (6). The composed category s∖np∖np∖np corresponds to that of an ordinary ditransitive verb, so although (6) is an instance of long-distance scrambling, it can be derived by means of our local scrambling rules (8) and (9).
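The role of the ±base feature as a recursion guard can be illustrated with a toy model. The list-of-pairs representation below is our own simplification for illustration, not the grammar's actual category machinery:

```python
# Arguments are (case, base) pairs, outermost first. A scrambling step
# moves the second argument outward over the first; it is licensed only
# if that argument is still in its base position (+base), and the moved
# argument is marked -base afterwards, so the rule cannot loop forever.

def scramble(args):
    (c1, b1), (c2, b2) = args[0], args[1]
    if not b2:
        return None      # already scrambled: the rule does not apply
    return [(c2, False), (c1, b1)] + args[2:]

step1 = scramble([('nom', True), ('acc', True)])
print(step1)             # [('acc', False), ('nom', True)]
step2 = scramble(step1)  # 'nom' is still +base and may scramble once
print(step2)             # [('nom', False), ('acc', False)]
print(scramble(step2))   # None: every argument is -base, recursion stops
```

Each argument can be moved at most once, so any permutation is reachable while derivations are guaranteed to terminate, which is exactly the effect the ±base feature has on rules (8) and (9).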
3. Lexicon<br />
The grammar is intended for use in the domain of the<br />
stock market, thus providing the means to describe the<br />
development of share prices. Since the expansion and proper implementation of a lexical database is a full-fledged task of its own and the focus of our current work is to extend the grammar, our current lexicon is still quite limited in scope.
At a later point one might consider making use of the CCGbank lexicon (Hockenmaier, 2006).
3.1. Nouns<br />
Our lexicon currently contains approximately 125<br />
nouns. For the different inflectional paradigms we made<br />
use of inflection tables presented on the free online<br />
service canoo.net. 2 For each of these paradigms we<br />
wrote an 'expansion'. OpenCCG's expansions provide a<br />
means to define inflectional paradigms as an applicable<br />
rule and link lexical information to them, so that<br />
OpenCCG generates the different tokens of a word and<br />
its syntactic and semantic properties as interpretable<br />
lexical entries. Thus a typical noun entry is a one-liner<br />
like this:<br />
10) # Höchststand<br />
noun_infl_1(Höchststand, Höchstständ, masc, peak, graph_point_definition)
The first two arguments contain the singular and plural<br />
stem, to which the inflection endings will be attached by<br />
the expansion. The following arguments are gender (for<br />
agreement), a predicate (as semantic reference) and a<br />
semantic type from the ontology. While seemingly plain<br />
English, these semantic predicates should be thought of as a grossly simplified metalanguage that guarantees a unique and unambiguous semantic representation.
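The effect of such an expansion can be approximated in a few lines. The sketch below is hypothetical: the paradigm endings and the entry layout are illustrative and do not reproduce OpenCCG's actual expansion mechanism:

```python
# One German masculine paradigm, keyed by (number, case);
# the endings shown here are illustrative only.
NOUN_INFL_1 = {
    ('sg', 'nom'): '', ('sg', 'gen'): 'es',
    ('sg', 'dat'): '', ('sg', 'acc'): '',
    ('pl', 'nom'): 'e', ('pl', 'gen'): 'e',
    ('pl', 'dat'): 'en', ('pl', 'acc'): 'e',
}

def noun_infl_1(sg_stem, pl_stem, gender, pred, sem_type):
    """Expand a one-line lexicon entry into full inflected entries,
    attaching the endings to the singular or plural stem."""
    return [{'form': (sg_stem if num == 'sg' else pl_stem) + ending,
             'num': num, 'case': case, 'gender': gender,
             'pred': pred, 'type': sem_type}
            for (num, case), ending in NOUN_INFL_1.items()]

entries = noun_infl_1('Höchststand', 'Höchstständ', 'masc',
                      'peak', 'graph_point_definition')
print(sorted({e['form'] for e in entries}))
```

The one-line entry thus fans out into a full set of inflected forms, each carrying the shared gender, predicate, and semantic type.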
3.2. Verbs<br />
For the verbs we followed a similar approach, with three<br />
expansions. The first two actually cover the same<br />
inflection paradigm, with the difference that for verbs<br />
ending in -ern like klettern (to climb) we duplicated the<br />
paradigm and made slight adjustments to circumvent the<br />
concatenation of the word stem kletter and certain<br />
inflectional morphs like -en to ungrammatical forms like<br />
*(wir) kletteren (instead of klettern). The third expansion covers several modal verbs like können ('can') or müssen ('must').
Each of those rules sets the features of the respective<br />
inflection (e.g. fin, 1st, sg, pres) and those for past tense.<br />
Sample entries:<br />
11) regular-vv(schwanken, schwank, schwankte,<br />
fluctuate)<br />
12) regular-vv-ern(klettern, kletter, kletterte, climb)<br />
2 http://www.canoo.net/<br />
4. Generation<br />
We would like to conclude with a brief outline of how<br />
our grammar fits into the generation scenario presented<br />
in the introduction.<br />
The idea is to generate text automatically from share<br />
price graphs, i.e., from collections of data points. Graphs<br />
are analyzed in terms of different mathematical<br />
properties (e.g. extremes and inflection points). These<br />
properties, together with user-provided realization<br />
parameters that allow fine-grained control over the<br />
'specificity' of LFs (and thus over the number of surface<br />
realizations), are input to static LF templates. The filled<br />
LF templates are then fed to the OpenCCG realizer<br />
where our grammar is used to compute the appropriate<br />
surface forms. In the last step, orthographic postprocessing,<br />
the surface forms are normalized with<br />
respect to language-specific orthographic standards (e.g.<br />
number or date formats, etc.). The figure below<br />
illustrates this procedure:<br />
Figure 5: Procedure of the generation process.<br />
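A rough sketch of the pipeline stages described above follows; every function and parameter name here is illustrative, and the realization step itself (the OpenCCG realizer plus our grammar) is only indicated by a comment:

```python
# Hypothetical sketch of the generation pipeline; names are ours.

def analyze(points):
    """Derive mathematical properties of interest from data points."""
    return {'min': min(points, key=lambda p: p[1]),
            'max': max(points, key=lambda p: p[1])}

def fill_template(props, specify_date=True):
    """Fill a static LF template. Realization parameters control how
    specific the LF is, and thus how many surface forms it licenses."""
    date, value = props['min']
    lf = {'pred': 'erreichen', 'arg0': 'kurs',
          'arg1': {'pred': 'tiefststand', 'bei': value}}
    if specify_date:
        lf['am'] = date          # a more specific LF -> fewer realizations
    return lf

def postprocess(surface):
    """Orthographic normalization, e.g. German number formats."""
    return surface.replace('7.0 Euro', '7 Euro')

points = [('13.09.', 15.0), ('29.10.', 7.0), ('15.11.', 10.0)]
lf = fill_template(analyze(points))
# 'lf' would now be fed to the OpenCCG realizer together with the
# grammar; the resulting surface forms are then postprocessed.
print(lf['pred'], lf['am'])  # erreichen 29.10.
```

Omitting the date (or other modifiers) from the LF leaves the realizer free to produce several surface variants, which is how the user-provided parameters steer the output.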
5. Summary<br />
We have presented a freely available CCG fragment of a<br />
generation grammar for German that is equipped with a<br />
semantic layer implemented in Hybrid Logic<br />
Dependency Semantics. In terms of syntax, we have<br />
focused on aspects of German as a partially free<br />
constituent order language and investigated an approach<br />
to scrambling by employing OpenCCG's type-changing<br />
rules in a somewhat unconventional manner. In doing<br />
so, we aimed at minimizing the amount of categories<br />
needed to allow different argument orders while<br />
retaining a certain degree of flexibility regarding<br />
argument order restrictions. Future work will<br />
concentrate more on the lexicon, for instance by refining<br />
and extending our expansions for inflectional paradigms<br />
of various word classes. We also hope to use<br />
OpenCCG's interesting regular expression facilities for<br />
derivational morphology.<br />
Our grammar can be downloaded from www.ling.uni-potsdam.de/~stede/AGacl/ressourcen/GerGenGram.
6. Acknowledgements<br />
We would like to thank the other participants of the<br />
course Automatische Textgenerierung (Winter 2010/11)<br />
at the University of Potsdam, and also the GSCL<br />
reviewers for their comments.<br />
7. References<br />
Baldridge, J. (2002): Lexically Specified Derivational<br />
Control in Combinatory Categorial Grammar. PhD<br />
thesis, School of Informatics, University of<br />
Edinburgh.<br />
Baldridge, J., Kruijff, G.-J.M. (2002): Coupling CCG<br />
and Hybrid Logic Dependency Semantics. In<br />
Proceedings of the 40th Annual Meeting of the<br />
Association for Computational Linguistics (ACL<br />
2002).<br />
Baldridge, J., Kruijff, G.-J.M. (2003): Multi-Modal<br />
Combinatory Categorial Grammar. In Proceedings of<br />
the 10th Conference of the European Chapter of the<br />
Association for Computational Linguistics (EACL<br />
2003).<br />
Blackburn, P. (1993): Modal Logic and Attribute Value Structures. In M. de Rijke (ed.), Diamonds and Defaults, Synthese Language Library, pp. 19–65. Kluwer Academic Publishers, Dordrecht.
Blackburn, P. (2000): Representation, Reasoning, and<br />
Relational Structures: a Hybrid Logic Manifesto.<br />
Logic Journal of the IGPL, 8(3), pp. 339–365.
Bozsahin, C., Kruijff, G.-J.M., White, M. (2008):<br />
Specifying Grammars for OpenCCG: A Rough Guide.<br />
http://openccg.sourceforge.net/<br />
Clark, S., Curran, J.R. (2007): Wide-coverage Efficient
Statistical Parsing with CCG and Log-linear Models.<br />
Computational Linguistics, 33(4), pp. 493-552.<br />
Drach, E. (1937): Grundgedanken der deutschen<br />
Satzlehre. Diesterweg.<br />
Hockenmaier, J. (2006): Creating a CCGbank and a<br />
wide-coverage CCG lexicon for German. In<br />
Proceedings of the 21st International Conference on<br />
Computational Linguistics and 44th Annual Meeting<br />
of the ACL.<br />
Hockenmaier, J., Young, P. (2008): Non-local<br />
scrambling: the equivalence of TAG and CCG<br />
revisited. Proceedings of The Ninth International<br />
Workshop on Tree Adjoining Grammars and Related<br />
Formalisms, pp. 41–48, Tübingen, Germany.<br />
Hoffman, B. (1992): A CCG Approach to Free Word<br />
Order Languages. Proceedings of the 30th Annual<br />
Meeting of ACL, pp. 300-302.<br />
Müller, S. (2010): Grammatiktheorie. Stauffenburg<br />
Verlag.<br />
Steedman, M. (2000): The Syntactic Process. MIT Press.<br />
Steedman, M., Baldridge, J. (2011): Combinatory Categorial Grammar. In Borsley and Börjars (eds), Non-transformational Syntax: Formal and explicit models of grammar, Wiley-Blackwell.
Uszkoreit, H. (1987): Word Order and Constituent<br />
Structure in German. CSLI.<br />
Vijay-Shanker, K., Weir, D.J. (1990): Polynomial Time<br />
Parsing of Combinatory Categorial Grammars. Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, pp. 1-8, Pittsburgh, PA, June 1990.
White, M.: OpenCCG Realizer Manual. Documentation<br />
of the OpenCCG Realizer.<br />
White, M. (2004): Efficient Realization of Coordinate<br />
Structures in Combinatory Categorial Grammar.<br />
Research on Language & Computation, 4(1), pp. 39-<br />
75.<br />
White, M., Rajkumar R., Martin, S. (2007): Towards<br />
Broad Coverage Surface Realization with CCG. In<br />
Proceedings of the Workshop on Using Corpora for<br />
NLG: Language Generation and Machine Translation<br />
(UCNLG+MT).
Multilingualism in Ancient Texts: Language Detection by Example of Old High<br />
German and Old Saxon<br />
Zahurul Islam 1 , Roland Mittmann 2 , Alexander Mehler 1<br />
1 AG Texttechnology, Institut für Informatik, Goethe-Universität Frankfurt
2 Institut für Empirische Sprachwissenschaft, Goethe-Universität Frankfurt
E-mail: zahurul, mittmann, mehler@em.uni-frankfurt.de<br />
Abstract<br />
In this paper, we present an approach to language detection in streams of multilingual ancient texts. We introduce a supervised<br />
classifier that detects, amongst others, Old High German (OHG) and Old Saxon (OS). We evaluate our model by means of three<br />
experiments that show that language detection is possible even for dead languages. Finally, we present an experiment in unsupervised<br />
language detection as a tertium comparationis for our supervised classifier.<br />
Keywords: Language identification, Ancient text, n-gram, classification, clustering<br />
1. Introduction<br />
With the rise of the web, we face more and more on-line<br />
resources that mix different languages. This multilingualism of textual resources poses a challenge for many
tasks in Natural Language Processing (NLP). As a consequence,<br />
Language Identification (LI) is now an indispensable<br />
step of preprocessing for many NLP applications.<br />
This includes machine translation, automatic<br />
speech recognition, text-to-speech systems as well as text<br />
classification in multilingual scenarios.<br />
Obviously, LI is a well-established field of application of<br />
NLP. However, if one looks at documents that were<br />
written in low-density languages or documents that mix<br />
several dead languages, adequate models of language<br />
detection are rarely found. In any event, ancient languages are becoming more and more central to the computational humanities, historical semantics and studies of language evolution. Thus, we need models of language detection for dead languages.
In this paper, we present such a model. We introduce a<br />
supervised classifier that detects, amongst others, OHG
and OS. To do so, we extend the model of Waltinger and<br />
Mehler (2009) so that it also accounts for dead languages.<br />
For any segment of the logical document structure of a<br />
text, our task is to detect the corresponding language in<br />
which it was written. This detection at the segment level<br />
rather than at the level of whole texts allows us to make<br />
explicit the multilingualism of ancient documents starting<br />
from the level of words via the level of sentences up<br />
to the level of texts. As a result, language-specific preprocessing<br />
tools can be used in such a way that they focus<br />
on those segments that provide relevant input for them. In<br />
this way, our approach is a first step towards building a<br />
preprocessor of multilingual ancient texts.<br />
The paper is organized as follows: Section 2 discusses related work. Section 3 describes the corpus of texts that we have used for our experiments. Section 4 briefly introduces our approach to supervised language detection, which is evaluated in Section 5. Section 6 describes our unsupervised language classifier. Finally, a conclusion is given in Section 7.
2. Related Work<br />
As we present a model of n-gram-based language detection,<br />
we briefly discuss work in this area.<br />
Cavnar and Trenkle (1994) describe a system of n-gram<br />
based text and language categorization. Basically, they<br />
calculate n-gram profiles for each target category. Categorization occurs by measuring the distances between the profiles of input documents and those of the target categories. Regarding language classification, the accuracy
of this system is 99.8%.<br />
The same technique has been applied by Mansur et al.<br />
(2006) for text categorization. In this approach, a corpus<br />
of newspaper articles has been used as input to categorization.<br />
Mansur et al. (2006) show that n-grams of<br />
length 2 and 3 are most efficiently used as features for<br />
text categorization.<br />
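The profile method can be re-implemented in a few lines for illustration; the sketch below uses the out-of-place ranking idea, and the training snippets are toy data, not the profiles used by Cavnar and Trenkle:

```python
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    """Character n-grams (length 1..n_max) ranked by frequency,
    as in the out-of-place profile method."""
    counts = Counter()
    padded = ' ' + text.lower() + ' '
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(top))}

def out_of_place(doc, cat):
    """Sum of rank differences; unseen n-grams get the maximum penalty."""
    max_dist = len(cat)
    return sum(abs(r - cat.get(g, max_dist)) for g, r in doc.items())

def classify(text, category_profiles):
    """Pick the category whose profile is closest to the document's."""
    doc = ngram_profile(text)
    return min(category_profiles,
               key=lambda c: out_of_place(doc, category_profiles[c]))

profiles = {
    'ohg': ngram_profile('uuanta thaz uuort uuas in theru giburti'),
    'nhg': ngram_profile('denn das Wort war am Anfang bei Gott'),
}
print(classify('thaz uuort', profiles))
```

Even with such tiny training texts, characteristic OHG sequences like 'uu' and 'th' are absent from the modern German profile and dominate the distance, which is why n-gram profiles work well for language identification.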
Kanaris and Stamatatos (2007) used character level<br />
n-grams to categorize web genres. Their approach is<br />
based on character n-grams of variable length, combined with information about the most frequently used HTML tags.
Note that the language detection toolkit of Google Translate may also be considered related work.
However, at present, this system does not recognize<br />
sentences in OHG. We have tested 10 example sentences.<br />
The toolkit categorized only one of these input sentences<br />
as modern German; other sentences were categorized as<br />
different languages (e.g., Italian, French, English and<br />
Danish).<br />
These approaches basically explore n-grams as features<br />
of language classification. However, they do that for<br />
modern languages. In this paper, we present an approach that fills this gap for ancient languages.
3. The Corpus<br />
The corpus used consists of 160 complete texts in six<br />
diachronically and diatopically diverging stages of the<br />
German language plus the OS glosses, all collected from<br />
the TITUS 1<br />
online database. High German is the language<br />
variety spoken historically south of a bundle of<br />
isogloss lines stretching from Aachen through Düsseldorf,<br />
Siegen, Kassel and Halle to Frankfurt (Oder) and has<br />
developed into what today constitutes standard German.<br />
Low German was spoken historically north of this line<br />
but has undergone a decline in native speakers to the<br />
point that it is now considered a regional vernacular of<br />
and alongside standard German, despite the fact that Low<br />
German and High German were once distinct languages.<br />
Table 1 shows the historical and geographical varieties of<br />
older German.<br />
New discoveries of texts in the various historical forms<br />
and varieties of German are being made continually. Due<br />
to the steadily increasing number of transmitted texts<br />
from throughout the history of the German language, the<br />
focus of the TITUS corpus is on the older stages: it<br />
comprises the whole OHG corpus (apart from the glosses)<br />
as well as the entire OS corpus, including one mixed<br />
OHG and OS text. Of the younger language stages, only<br />
unrepresentative amounts of text are included: several<br />
dozen Middle High German (MHG) texts, some Middle<br />
Low German (MLG) texts, a sample of Early New High<br />
German (ENHG) texts and one mixed ENHG and Early<br />
New Low German (ENLG) text, all of them varying<br />
considerably in length, from a few words to several tens<br />
of thousands of tokens per text.<br />
Language Stage Period of Time<br />
OHG ca. 750 – 1050 CE<br />
MHG ca. 1050 – 1350 CE<br />
ENHG ca. 1350 – 1650 CE<br />
OS ca. 800 – 1200 CE<br />
MLG ca. 1200 – 1600 CE<br />
ENLG ca. 1600 – 1750 CE<br />
Table 1: Historical and geographical varieties<br />
Among the oldest transmissions are interlinear translations<br />
of Latin texts, but also free translations and adaptations<br />
as well as mixed German-Latin texts. Translations<br />
consist mainly of religious literature, prayers and hymns, but<br />
also of works by ancient authors and of scientific writings. These are<br />
later complemented by epic and lyrical poetry (minnesongs),<br />
prose literature, sermons and other religious<br />
works, specialist books, chronicles, legislative texts and<br />
philosophical treatises. The latest texts of the corpus<br />
cover a biographical and a historical work, a collection of<br />
legal texts for a prince, an experimental re-narration of a<br />
parodistic novel as well as the German parts of two<br />
bilingual texts, a High German-Old Prussian enchiridion<br />
and a mixed High and Low German textbook for learning<br />
Russian.<br />
Language Stage #Texts #Tokens<br />
OHG 101 437,390<br />
MHG 31 1,776,900<br />
ENHG 6 237,432<br />
OS 17 62,706<br />
MLG 4 133,584<br />
ENLG 1 26,679<br />
Total 160 2,674,691<br />
Table 2: Composition of the corpus<br />
Multilingual Resources and Multilingual Applications - Regular Papers<br />
The corpus was generated by entering plain text, either<br />
completely by hand or by scanning, followed by OCR<br />
and manual correction. The texts were<br />
then indexed and provided with information on languages<br />
and subdivisions using the WordCruncher software<br />
(http://wordcruncher.byu.edu) developed by Brigham Young<br />
University in Provo, Utah. They were then converted into<br />
HTML format and simultaneously transferred into<br />
several SQL database files, classified by the words’<br />
language family, to enable the set-up of an online search.<br />
4. Approach<br />
In this section, we describe our language detection approach.<br />
We start by describing how we prepared the<br />
corpus from the TITUS database to obtain input for our classifier<br />
(Section 4.1), then introduce our model (Section 4.2)<br />
and describe the system design (Section 4.3).<br />
4.1. Corpus Preparation<br />
The training and test corpora that we used in our experiments<br />
were extracted from the database dump of TITUS<br />
(see Section 3). Each word in this extraction has been<br />
annotated with its corresponding language name (example:<br />
German), sub-language name (example: Old High<br />
German), document number, division number and its<br />
position within the underlying HTML corpus files.<br />
TITUS only annotates the boundaries of divisions, so that<br />
any division may contain one or more sentences. For any<br />
sub-language (i.e., OHG, OS, MHG, MLG, ENLG and<br />
ENHG), we extracted text as reported in Table 2.<br />
4.2. Language Detection Toolkit<br />
Our approach to language detection is based on Cavnar<br />
and Trenkle (1994) and Waltinger and Mehler (2009). As<br />
in these studies, for every target category we learn a list<br />
of its most frequent n-grams, ordered by descending<br />
frequency. The same is done for any input text, so that<br />
categorization amounts to measuring the distance between the<br />
n-gram profiles of the target categories and the n-gram<br />
profile of the test data.<br />
The idea behind this approach is that the more similar<br />
two texts are, the more features they share in the<br />
same order.<br />
In general, classification is done by using a range of<br />
corpus features as listed in Waltinger and Mehler<br />
(2009). Predefined information is extracted from the<br />
corpus to build sub-models based on those features. Each<br />
sub-model consists of a ranked frequency distribution of a<br />
subset of corpus features. The corresponding n-gram information<br />
is extracted for n = 1 to 5. Each n-gram gets its<br />
own frequency counter. The normalized frequency distribution<br />
of relevant features is calculated according to<br />
$\tilde{f}(a_i) = \frac{f(a_i)}{\max_{a_k \in L(D_j)} f(a_k)} \in (0,1]$<br />
where $\tilde{f}(a_i)$ is the frequency of feature $a_i$ in $D_j$, divided by the<br />
frequency of the most frequent feature $a_k$ in the feature<br />
representation $L(D_j)$ of document $D_j$ (see Waltinger &<br />
Mehler, 2009). To categorize any document $D_m$, it is<br />
compared to each category $C_n$ using the distance $d$ between the<br />
rank $r_{mk}$ of feature $a_k$ in the sub-model of $D_m$ and the<br />
corresponding rank of that feature in the representation<br />
of $C_n$:<br />
$d(D_m, C_n, a_k) = \begin{cases} |r_{mk} - r_{nk}| & \text{if } a_k \in L(D_m) \cap L(C_n) \\ \max & \text{otherwise} \end{cases}$<br />
Here $d(D_m, C_n, a_k)$ equals $\max$ if feature $a_k$ does not belong to<br />
the representation of $D_m$ or to that of category $C_n$, where $\max$<br />
is the maximum value that the term $|r_{mk} - r_{nk}|$ can assume.<br />
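The ranked-profile comparison described above can be made concrete with a short sketch in the spirit of Cavnar and Trenkle's out-of-place measure; the function names, the profile size of 300 and the flat maximum penalty are our own illustrative choices, not the toolkit's actual code:

```python
from collections import Counter

def ngram_profile(text, n_max=5, top=300):
    """Rank the most frequent character n-grams (n = 1..n_max).

    Returns a dict mapping each n-gram to its rank (0 = most
    frequent), i.e. the ordered list described in Section 4.2.
    """
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place_distance(doc_profile, cat_profile, max_penalty=300):
    """Sum of rank differences; a feature missing from the category
    profile contributes the maximum penalty, as in the distance d."""
    dist = 0
    for gram, rank in doc_profile.items():
        if gram in cat_profile:
            dist += abs(rank - cat_profile[gram])
        else:
            dist += max_penalty
    return dist

def classify(text, category_profiles):
    """Assign the category whose n-gram profile is closest."""
    profile = ngram_profile(text)
    return min(category_profiles,
               key=lambda c: out_of_place_distance(profile,
                                                   category_profiles[c]))
```

A category model for, say, OHG would then be obtained by running `ngram_profile` over the concatenated OHG training texts.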
4.3. System Design<br />
The language detection toolkit (Waltinger & Mehler,<br />
2009) is used to build training models. It creates several<br />
n-gram models for each language which are used by the<br />
same tool for detection. Figure 1 shows the basic system<br />
diagram.<br />
To detect the language of a document, the toolkit traverses<br />
the document sentence by sentence and detects the<br />
language of each sentence. If the document is homogeneous<br />
(i.e., all sentences belong to the same language),<br />
then sentence-level detection suffices to trigger the<br />
tools for further processing (e.g., parsing, tagging and<br />
morpho-syntactic analysis) of that document, for which<br />
language detection is a necessary preprocessing step.<br />
In the case that the sentences belong to more than one<br />
language (i.e., in the case of a heterogeneous document),<br />
the toolkit processes the document word by word and detects<br />
the language of each token separately. This step is necessary<br />
for multilingual documents that contain<br />
words from different languages within a single sentence.<br />
For example, in a scenario of lemmatization or morphological<br />
analysis of a multilingual document, it is necessary<br />
to trigger language-specific tools to avoid errors. Just<br />
one tool needs to be triggered for further processing of a<br />
homogeneous document, whereas for a heterogeneous<br />
document the same kind of tool has to be triggered at<br />
the word level.<br />
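The two-stage procedure (sentence level first, word level only for heterogeneous documents) can be outlined as follows; `detect` stands in for the toolkit's classifier, and the dictionary output format is an assumption made for illustration:

```python
def annotate_document(sentences, detect):
    """Two-stage language annotation.

    `sentences` is a list of token lists; `detect(text)` returns a
    language label for a sentence or for a single token (an assumed
    stand-in for the toolkit's classifier).
    """
    # Stage 1: sentence-level detection.
    sentence_langs = [detect(" ".join(tokens)) for tokens in sentences]
    if len(set(sentence_langs)) == 1:
        # Homogeneous document: sentence-level detection suffices and
        # a single language-specific pipeline can be triggered.
        return {"homogeneous": True, "language": sentence_langs[0]}
    # Stage 2: heterogeneous document, fall back to word-level
    # detection so language-specific tools can be triggered per token.
    token_langs = [[detect(tok) for tok in tokens] for tokens in sentences]
    return {"homogeneous": False, "token_languages": token_langs}
```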
Figure 1: Basic system diagram<br />
Language Accuracy F-score<br />
OHG 100% 1<br />
OS 100% 1<br />
Table 3: Sentence level evaluation<br />
5. Evaluation<br />
In order to evaluate the language detection system, we<br />
extracted 200 sentences from the OHG corpus and 200<br />
sentences from the OS corpus. These evaluation sets had<br />
not been used for training. Among the many evaluation<br />
metrics used to evaluate NLP tools, we decided to use<br />
accuracy and F-score (Hotho et al., 2005). Table 3 shows<br />
the evaluation result of the sentence level language<br />
detection, where we obtained 100% accuracy for both<br />
test sets. Table 4 shows the evaluation result of the word<br />
level language detection. 153 out of 1,259 words in the<br />
OHG test set were detected as OS and 33 out of 799<br />
words in the OS test set were classified as OHG. The<br />
accuracy on these test sets was 79.95% and 91.36%, respectively.<br />
The evaluation result suggests that the OHG test set<br />
contains words from other languages, which is indeed<br />
the case: Petrova et al. (2009) show that the OHG<br />
diachronic corpus contains many Latin words. The<br />
evaluation becomes more meaningful when the result is<br />
compared with a gold-standard reference set. We therefore<br />
compiled a list of 1,548 words (818 types) in which each token<br />
is manually annotated with the name of the language to<br />
which the word belongs. Of these 1,548 words, 564 overlapped<br />
with the training data. Each word in the gold-standard test set<br />
was classified by the toolkit and the result compared<br />
with the reference set. We obtained 91.66% accuracy and<br />
an F-score of 95%.<br />
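The reported scores rest on the standard definitions of accuracy and F-score (cf. Hotho et al., 2005), sketched below; this is a generic illustration, not the authors' evaluation script:

```python
def accuracy(gold, pred):
    """Fraction of tokens whose predicted language matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f_score(gold, pred, positive):
    """Harmonic mean of precision and recall for one target language."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```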
Language Accuracy F-score<br />
OHG 79.95% 0.88<br />
OS 91.36% 0.96<br />
Table 4: Word level evaluation<br />
6. Unsupervised Language Classification<br />
In addition to the classifier presented above, we experimented<br />
with an unsupervised classifier. The reason was<br />
twofold: on the one hand, we wanted to assess the<br />
added value of an unsupervised classifier in comparison<br />
to its supervised counterpart. On the other hand, we<br />
aimed at extending the number of target languages to be<br />
detected. We collected several documents per target<br />
language, where each document was represented by a<br />
separate feature vector that counts the frequencies of a<br />
selected set of lexical features. As target classes we<br />
considered six languages (whose cardinalities are displayed<br />
in Table 6): Early New High German (ENHG),<br />
Early New Low German (ENLG), Middle High German<br />
(MHG), Middle Low German (MLG), Old High German<br />
(OHG), and Old Saxon (OS). In order to implement an<br />
unsupervised language classifier, we followed the approach<br />
described in Mehler (2008). That is, we performed<br />
a hierarchical agglomerative clustering together<br />
with a subsequent partitioning that is informed about the<br />
number of target classes. However, unlike in Mehler<br />
(2008), we did not perform a genetic search for the<br />
best-performing subset of features, as in the present case their<br />
number is too large. Table 5 shows the classification<br />
results. Performing a hierarchical-agglomerative clustering<br />
based on the cosine measure as the operative<br />
measure of object distance, we get an F-score of around<br />
78%. This is a promising result, as it is accompanied by a<br />
remarkably high accuracy. However, as seen in Table 6,<br />
the target classes perform quite differently: while we fail<br />
to separate ENHG and ENLG (certainly due to the small<br />
number of respective target documents), we separate<br />
MHG, MLG, OHG and OS to a reasonable degree. In this<br />
sense, the unsupervised classifier makes even higher F-scores<br />
expectable, provided that we look for better-performing<br />
features in conjunction with well-trained supervised<br />
classifiers. At least, the present study provides a<br />
baseline that can be referred to in future experiments in<br />
this area.<br />
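The clustering setup behind Table 5 can be reproduced in outline with SciPy's hierarchical clustering; the feature matrix `X` (one lexical-feature frequency vector per document) and the helper name are assumptions of this sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_documents(X, n_classes=6, method="complete"):
    """Hierarchical agglomerative clustering with cosine distances,
    partitioned into the known number of target classes."""
    # linkage() computes pairwise cosine distances from the
    # observation matrix itself; "complete" is the best row of Table 5.
    Z = linkage(X, method=method, metric="cosine")
    # Cut the dendrogram so that exactly n_classes clusters remain.
    return fcluster(Z, t=n_classes, criterion="maxclust")
```

Evaluating the resulting partition against the known language labels (e.g. with per-class F-scores as in Table 6) would complete the experiment.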
Approach Object Distance F-Score Accuracy<br />
hierarchical/complete cosine 0.78098 0.91134<br />
hierarchical/weighted cosine 0.69325 0.86934<br />
hierarchical/average cosine 0.61763 0.8307<br />
hierarchical/single cosine 0.56675 0.7926<br />
Table 5: F-scores and accuracies of classifying historical<br />
language data in a semi-supervised environment<br />
Language #Texts F-score Recall Precision<br />
ENHG 6 0 0 0<br />
ENLG 1 0 0 0<br />
MHG 31 0.895 1 0.810<br />
MLG 4 0.8 0.8 0.8<br />
OHG 101 0.762 0.615 1<br />
OS 17 0.889 0.889 0.889<br />
Table 6: F-scores, recalls, and precisions differentiated by the target classes<br />
7. Conclusion<br />
Language detection plays an important role in processing<br />
multilingual documents. This is true especially for ancient<br />
documents that, due to their genealogy, mix different<br />
ancient languages. Here, documents need to be<br />
annotated in such a way that preprocessors can activate<br />
language specific routines on a segment by segment basis.<br />
In this paper, we presented an extended version of the<br />
language detection toolkit that allows us to decide when to<br />
activate language-specific analyses. Notwithstanding the<br />
low density of training material that is available for these<br />
languages, our classification results are very promising.<br />
At this point one may object that corpora of ancient texts<br />
are essentially so small that language detection can be<br />
done by hand. Actually, this objection is wrong if one<br />
considers corpora like the Patrologia Latina (Jordan,<br />
1995), which mixes classical Latin with medieval Latin<br />
as well as with French and other Romance languages that<br />
are used in commentaries. From the size of this corpus<br />
alone (more than 120 million tokens), it is evident that a<br />
reliable means of automatizing segment-based language<br />
detection needs to be a viable option. We also described<br />
an unsupervised language detector that was evaluated<br />
simultaneously on OHG, OS, MHG, MLG,<br />
ENLG and ENHG. Although this unsupervised classifier<br />
does not outperform its supervised counterpart, it shows<br />
that language detection in text streams of ancient<br />
languages is coming into reach.<br />
8. Acknowledgements<br />
We would like to thank Ulli Waltinger, Armin Hoenen,<br />
Andy Lücking and Timothy Price for fruitful suggestions<br />
and comments. We also acknowledge funding by the<br />
LOEWE Digital-Humanities project at the<br />
Goethe-Universität Frankfurt.<br />
9. References<br />
Cavnar, W. B., Trenkle, J. M. (1994): N-gram-based text<br />
categorization. In Proceedings of SDAIR-94, 3rd<br />
Annual Symposium on Document Analysis and Information<br />
Retrieval, pp. 161–175.<br />
Hotho, A., Nürnberger, A., Paaß, G. (2005): A Brief<br />
Survey of Text Mining. Journal for Language Technology<br />
and Computational Linguistics (JLCL), 20(1),<br />
pp. 19–62.<br />
Jordan, M. D., editor (1995): Patrologia Latina database.<br />
Chadwyck-Healey, Cambridge.<br />
Kanaris, I., Stamatatos, E. (2007): Webpage genre identification<br />
using variable-length character n-grams. In<br />
Proc. of the 19th IEEE Int. Conf. on Tools with Ar-<br />
tificial Intelligence (ICTAI’07), Washington, DC,<br />
USA. IEEE Computer Society.<br />
Mansur, M., UzZaman, N., Khan, M. (2006): Analysis of<br />
n-gram based text categorization for Bangla in a<br />
newspaper corpus. In Proceedings of the 9th International<br />
Conference on Computer and Information<br />
Technology (ICCIT 2006).<br />
Mehler, A. (2008): Structural similarities of complex<br />
networks: A computational model by example of wiki<br />
graphs. Applied Artificial Intelligence, 22(7&8),<br />
pp. 619–683.<br />
Petrova, S., Solf, M., Ritz, J., Chiarcos, C., Zeldes, A.<br />
(2009): Building and using a richly annotated interlinear<br />
diachronic corpus: The case of Old High German<br />
Tatian. Traitement automatique des langues<br />
(TAL), 50(2), pp. 47–71.<br />
Waltinger, U., Mehler, A. (2009): The feature difference<br />
coefficient: Classification by means of feature distributions.<br />
In Proceedings of the Conference on Text<br />
Mining Services (TMS 2009), Leipziger Beiträge zur<br />
Informatik: Band XIV, pp. 159–168. Leipzig University,<br />
Leipzig.
Multilingual Phrase Extraction Using a Lexicon-Independent Analysis Component,<br />
Exemplified by Patent Documents and User-Generated Content<br />
Daniela Becks, Julia Maria Schulz, Christa Womser-Hacker, Thomas Mandl<br />
Universität Hildesheim, Institut für Informationswissenschaft und Sprachtechnologie<br />
Marienburger Platz 22, 31141 Hildesheim<br />
E-mail: {daniela.becks, julia-maria.schulz, womser, mandl}@uni-hildesheim.de<br />
Abstract<br />
The extraction of meaning-bearing phrases from corpora usually presupposes a comparatively deep linguistic analysis of the texts.<br />
In addition, it frequently requires adapting the knowledge bases used as well as the underlying models, which usually proves time-<br />
and labour-intensive. This article describes a new language- and domain-independent approach that combines aspects of shallow<br />
and deep parsing. One advantage of the presented method is that it can be realized with little effort and without complex lexicons,<br />
and that it can be transferred to other languages and domains. English and German documents from two very different corpora<br />
serve as examples: customer reviews (user-generated content) and patent documents.<br />
Keywords: Shallow Parsing, Multilingual Phrase Extraction<br />
1. Introduction<br />
Against the background of a globalized world, information is often contained in documents that are not written in the users'<br />
native language. Special methods are needed to enable users to find such documents nevertheless. This is the subject of<br />
cross-lingual information retrieval (CLIR), in which the languages of the query and of the result documents need not coincide;<br />
a German query, for example, can also retrieve English-language documents. The challenges arising in this context are<br />
investigated, among others, in evaluation initiatives such as CLEF (Cross Language Evaluation Forum: http://clef2011.org,<br />
http://www.clef-campaign.org) and NTCIR (National Institute of Informatics Test Collection for IR Systems:<br />
http://research.nii.ac.jp/ntcir/index-en.html).<br />
A further development that has been emerging in information retrieval for some time is the gradual replacement of the classical<br />
bag-of-words approach, which has so far been used both in the indexing process and in query formulation. The literature<br />
increasingly points to the advantages of phrases over simple terms (cf. e.g. Tseng et al., 2007:1222). A simple search example<br />
illustrates this: a query on the topic Züge der DB ('trains of the DB') also returns documents about databases, because DB is an<br />
ambiguous abbreviation whose meaning only becomes clear in context. If, however, the individual terms are understood as a<br />
phrase that belongs together, this ambiguity is resolved and only those documents are returned in which the combination of<br />
terms occurs.<br />
Extracting suitable phrases in a multilingual setting is a demanding task, however, since every corpus exhibits its own<br />
peculiarities that must be taken into account. Moreover, the morphology of the individual languages plays a decisive role<br />
(e.g. the separated particles of German particle verbs).<br />
This article presents an approach that combines shallow and deep parsing and that, with only minor adaptations, can be used<br />
across languages and domains for the extraction of meaning-bearing phrases. Patent documents and customer reviews in German<br />
and English serve as application examples. In the future, we plan to examine documents in Spanish and French as well.<br />
In the following, we first introduce the two application areas and the underlying corpora (2.1). We then sketch some methods of<br />
phrase extraction (3), followed by a description of the language- and domain-independent approach (4). The article closes with<br />
a description of the evaluation approach used and with first results.<br />
2. Research Context<br />
2.1. Application Areas<br />
In this article, two projects are presented as application areas for the developed phrase extraction component. The first project<br />
is carried out in cooperation with FIZ Karlsruhe and focuses on the patent domain. It aims at evaluating the added value of<br />
phrases for patent retrieval (cf. Becks, 2010:423). The underlying corpus contains about 105,000 documents of the CLEF-IP<br />
(Cross Language Evaluation Forum, Intellectual Property Track) test collection 2009, which is composed of about 1.6 million<br />
patents and patent applications of the European Patent Office. The collection comprises documents in English as well as patents<br />
in German and French (cf. Roda et al., 2010:388).<br />
The customer reviews that serve as an example of user-generated content come from a project concerned with cross-lingual<br />
opinion mining, which likewise deals with the extraction of phrases. The goal of this project is to extract phrases that contain<br />
opinions about the reviewed products and their properties. A corpus of customer reviews serves as its basis (cf. Hu & Liu, 2004;<br />
Ding et al., 2008; Schulz et al., 2010).<br />
The two corpora differ significantly, particularly with respect to document length, since patent documents are very long and<br />
complex (cf. e.g. Iwayama et al., 2003). A central challenge is therefore that the cross-lingual phrase extraction should work<br />
equally effectively for very different text types and for phrases of differing complexity.<br />
A phrase is understood as a combination of terms that stand in a head-modifier relation to each other. This relation can occur<br />
in various forms (e.g. adjective-noun relation, noun-prepositional-phrase relation). Phrases differ, however, from chunks, which<br />
according to Abney (1991) typically consist of a single content word surrounded by a constellation of function words and<br />
pre-modifiers and follow a fixed template (cf. Abney, 1991:257). The following example clearly shows that a phrase can extend<br />
beyond the boundaries of a chunk. Owing to the targeted application areas of information retrieval and opinion mining, the<br />
notion of phrase used here differs from the classical linguistic definition: it also covers multi-word groups (e.g. information<br />
retrieval system) and combinations of subject and predicate, which in German may also be discontinuous. A list of the covered<br />
phrase types can be found in Section 5.<br />
Example:<br />
[a system] [for information retrieval] (chunks)<br />
vs.<br />
a [system for information retrieval] (phrase)<br />
2.2. Problem Statement and Requirements for Phrase Extraction<br />
Within the project context, the development of a suitable extraction component is governed by two main objectives:<br />
• Phrase extraction should be realizable for various European languages with little adaptation effort (resource-poor<br />
extraction approach).<br />
• Although linguistic resources are not yet available across the board, phrase extraction should be possible for several<br />
languages.<br />
Phrase extraction must counteract the "unknown words problem". The system should therefore be able to process words that<br />
have so far been recorded neither in the corpora used by the system nor in dictionaries (cf. Uchimoto et al., 2001:91). This<br />
plays a particularly important role in the patent domain.<br />
3. Related Approaches<br />
Traditional methods of phrase extraction include rule-based techniques such as the delimiter method of Jaene and Seelbach<br />
(cf. Jaene & Seelbach, 1975). For the purpose of content analysis, the authors set out to identify phrases in English technical<br />
texts in the form of multi-word groups, which they define as several words forming a syntactic-semantic unit (cf. Jaene &<br />
Seelbach, 1975:9). To this end, Jaene and Seelbach define so-called delimiter pairs that enclose the noun phrases to be<br />
extracted (cf. Jaene & Seelbach, 1975:7). Bourigault & Jacquemin (1999) describe a similar method for the extraction of<br />
noun phrases of maximal length, which are extracted from French documents of three domains with the aim of identifying<br />
technical terms. Here, the noun phrases are decomposed into their constituents (head and modifier) in a second step. Both<br />
delimiter pairs and the grammatical structure of the phrases are used for the extraction process. Tseng et al. (2007) describe<br />
comparable approaches for the patent domain: phrases or keywords are extracted on the basis of a stopword list, with the<br />
longest repeating phrases proving particularly suitable (cf. Tseng et al., 2007:1223). In the field of opinion mining, Guo et al.<br />
(2009) likewise use stopwords, supplemented by opinion-bearing words (e.g. adjectives), for the extraction of product<br />
properties from sentence segments in the semi-structured part of customer reviews. This short overview already shows that<br />
phrase extraction has so far concentrated mainly on the identification of simple nominal structures. In this context,<br />
dictionary-dependent methods such as dependency parsing are employed alongside the rule-based approaches. In information<br />
retrieval, dependency relations are frequently used in the form of head/modifier pairs, which consist of a head and a modifier,<br />
the latter specifying the head (cf. Koster, 2004:423). Head/modifier pairs offer the advantage of containing semantic as well<br />
as syntactic information (cf. e.g. Ruge, 1989:9). They are therefore used above all within the indexing process (cf. Koster,<br />
2004; Ruge, 1995) and, in the form of triples (term-relation-term), prove advantageous for classification tasks (cf. Koster &<br />
Beney, 2009).<br />
4. Domänen- und sprachübergreifende<br />
Phrasenextraktion<br />
Dieser Artikel beschreibt eine neue Methode <strong>für</strong> die<br />
Extraktion von Phrasen, der die beiden zuvor genannten<br />
Kategorien vereinigt. Das Ziel des dargestellten Extraktionsverfahrens<br />
besteht im Wesentlichen darin, ein<br />
Werkzeug <strong>für</strong> die Identifikation von Phrasen zur Verfügung<br />
zu stellen, das sich mit geringem Aufwand <strong>für</strong> unterschiedliche<br />
Domänen und Sprachen adaptieren lässt<br />
(z.B. Anpassung einzelner Begrenzerpaare oder der zulässigen<br />
Präpositionen bei Nomen-Genitiv-Phrasen (NG)<br />
bzw. Nomen-Präpositionalphrasen (NP)). Dabei wird auf<br />
den Einsatz von domänenspezifischen Wissensbasen<br />
verzichtet, um die Domänenunabhängigkeit zu gewährleisten.<br />
Die Semantik der extrahierten Phrasen darf dabei<br />
nicht außer Acht gelassen werden. Infolgedessen handelt<br />
es sich um ein Mischverfahren, das die Funktionalität<br />
eines Shallow Parsers aufweist, aber eine flache semantische<br />
Klassifikation aufgrund linguistischer Regeln<br />
gewährleistet (vgl. Becks & Schulz, <strong>2011</strong>).<br />
Innerhalb der Phrasenextraktionskomponente findet ein<br />
regelbasiertes Verfahren Anwendung, das das Begrenzerverfahren<br />
(vgl. Jaene & Seelbach, 1975, Bourigault &<br />
Jacquemin, 1999) und die Grundzüge des Dependenzparsings<br />
(vgl. z.B. Ruge, 1995) aufgreift. Die Extraktion<br />
der Phrasen erfolgt in diesem Fall mit Hilfe verschiede-<br />
159
Multilingual Resources and Multilingual Applications - Regular Papers<br />
ner Regeln, in denen jeweils Paare von Begrenzern, de-<br />
finiert sind. Die Begrenzer sind, anders als in bisherigen<br />
Ansätzen, nicht Wörter, sondern morphosyntaktische<br />
Wortklassen (Pos-Tags). An dieser Stelle zeigt sich be-<br />
reits, dass das entwickelte System lediglich auf die Im-<br />
plementierung entsprechender Regeln sowie einen<br />
Part-of-Speech-Tagger angewiesen ist. Es handelt sich<br />
somit um einen ressourcenarmen Ansatz.<br />
Die implementierten Regeln variieren je nach Phrasentyp.<br />
Im Falle einer Adjektiv-Nomen-Relation (AN-R) wird<br />
die Phrase häufig von der Klasse Artikel und einem<br />
Interpunktionszeichen oder einer Präposition eingeschlossen<br />
(siehe Abb. 1). Darüber hinaus muss diese<br />
mindestens ein Adjektiv und ein Nomen enthalten, damit<br />
es sich um eine gültige AN-R handelt. Da die Kategorie<br />
Artikel sowohl die deutschen Artikel der, die, das als<br />
auch das englische Pendant the umfasst, kann diese Regel<br />
auch auf andere Sprachen angewendet werden. Diese<br />
abstrahierte Version des Begrenzerverfahrens ist demnach<br />
generalisierbar. Eine Einbindung komplexer Wortlisten<br />
erübrigt sich.<br />
Wie bereits erwähnt, wurde zudem auf Grundzüge des<br />
Dependenzparsings zurückgegriffen. Daher verfügt jede<br />
der extrahierten Phrasen sowohl über einen Head als auch<br />
einen Modifier, deren Ermittlung ebenfalls regelbasiert<br />
erfolgt. Im Falle der in Abbildung 1 dargestellten Beispiele<br />
befindet sich der Head am Ende der Phrase („stud“;<br />
„front panel button layout“). Der Modifier ist diesem<br />
vorangestellt.<br />
The examples make clear that the extracted phrases need not be single head/modifier pairs; longer phrases containing several head/modifier relations can also be captured by this method.
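A minimal sketch of the rule-based head/modifier split, simplified here to a single-token head (the paper's own examples show the head may also be a whole noun group, so this is an illustrative assumption):

```python
def head_modifier(phrase_tokens):
    # Rule-based split as described in the text: the head stands at the
    # end of the phrase, the modifier is everything preceding it.
    *modifier, head = phrase_tokens
    return head, " ".join(modifier)

print(head_modifier(["shank-like", "stud"]))  # ('stud', 'shank-like')
```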
5. Evaluation
The quality of the obtained output is usually assessed against a defined gold standard. This approach was chosen, for example, by Verberne and colleagues (cf. Verberne et al., 2010): a manually annotated sample of 100 sentences serves as the basis for evaluation, and precision is computed by comparing the extracted phrases with the annotated sample (cf. Becks & Schulz 2011: 391).
To create the gold standard used here, sentences totalling approximately 2,000 tokens per language were first selected at random from the two German and English corpora described in Section 2.1. The token counts are based on the output generated by the POS tagger; consequently, punctuation marks also count as one token each. These sentences were then annotated manually and independently by two annotators (the first and second author of this paper) with respect to the following phrase types:
• subject-predicate (e.g. he thinks)
• predicate-object (e.g. extract phrases)
• verb-adverb (e.g. extract easily)
• multi-word units (e.g. information retrieval system)
• adjective-noun (e.g. linguistic phrases)
• noun-prepositional phrase (e.g. system for retrieval)
• noun-genitive (e.g. rules of extraction)
• noun-relative clause / noun-participle (e.g. phrases extracted by the system)
Figure 1: Examples of extracted adjective-noun phrases. Left, from a patent specification (EP-1120530-B1): "a shank-like stud with", left delimiter: a (DT), right delimiter: with (IN). Right, from a customer review (Hu & Liu 2004): "a very good front panel button layout ,", left delimiter: a (DT), right delimiter: the comma.
In total, 688 phrases were annotated for the English sentences selected from the customer reviews and 639 for the German ones. For the patent domain, 619 phrases are available in English and 499 in German. Of the 2,445 phrases in the gold standard overall, about 51% are uncontroversial, i.e. for these phrases both the phrase boundaries identified by the two annotators and the annotated relations coincide. A further 27% of the phrases show an identical syntactic relation but differ with respect to the annotated phrase boundaries; this error category includes, for example, coordinated phrases. Where the phrase boundaries did not match, or matched only partially, agreement was reached through discussion or by consulting a third, independent opinion. The percentages given above already indicate that the phrases identified and classified by the annotators coincide very frequently. The exact agreement rate can be read off the computed kappa, which is based on the following formula:

κ = (p_o − p_e) / (1 − p_e)  (from Cohen, 1960: 40)

where p_o is the observed proportion of agreement and p_e the proportion of agreement expected by chance. According to this equation, the kappa value is 0.61. Considering that the two domains examined are highly divergent fields of application and that very heterogeneous, partly discontinuous phrases had to be annotated, this value can be regarded as good.
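Cohen's kappa can be computed directly from the two annotators' label sequences. The label data below is illustrative, not the gold-standard annotation:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa (1960): kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    # observed agreement: proportion of identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # chance agreement: sum over categories of the product of the
    # two annotators' marginal proportions
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

a = ["AN", "AN", "NG", "NN", "AN", "NG"]
b = ["AN", "NG", "NG", "NN", "AN", "AN"]
print(round(cohens_kappa(a, b), 3))  # 0.455
```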
For the evaluation, the compiled samples are annotated automatically using the phrase extraction component. The resulting output is then evaluated against the gold standard, which contains, in addition to the phrase itself, the syntactic relation and the relative frequency within the sample. The evaluation therefore goes beyond a comparison of character strings and additionally takes the following criteria into account:
• syntactic relation
• frequency
The evaluation is thus based on three factors, which are weighted equally. Both exact matches and partial matches with a deviation of one term are taken into account. Phrases that agree with the phrase stored in the gold standard with respect to phrase boundary, identified relation and frequency count as exact matches in the evaluation.
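One possible reading of these matching criteria, sketched as a comparison function. The paper does not specify the exact matching procedure, so the field names and the containment check for partial matches are assumptions:

```python
def classify_match(extracted, gold):
    """Compare an extracted phrase against a gold entry on the three
    equally weighted criteria: boundary (tokens), relation, frequency."""
    same_rel = extracted["relation"] == gold["relation"]
    same_freq = extracted["freq"] == gold["freq"]
    if extracted["tokens"] == gold["tokens"] and same_rel and same_freq:
        return "exact"
    # partial match: phrase boundaries deviate by at most one term
    diff = abs(len(extracted["tokens"]) - len(gold["tokens"]))
    shorter, longer = sorted((extracted["tokens"], gold["tokens"]), key=len)
    if diff <= 1 and same_rel and all(t in longer for t in shorter):
        return "partial"
    return "miss"

gold = {"tokens": ["front", "panel", "button", "layout"],
        "relation": "AN", "freq": 1}
ext = {"tokens": ["panel", "button", "layout"], "relation": "AN", "freq": 1}
print(classify_match(ext, gold))  # 'partial'
```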
For English, a precision of about 52% is achieved across domains, with considerably better values for some phrase types (AN: 86.5%; NG: 76.7%; NN: 76%; VA: 71.8%). The lower precision for the remaining phrase types is due on the one hand to erroneous POS tags (this holds in particular for the patent domain) and on the other hand to the discontinuity of the phrases, which makes formalization considerably harder.
As a rule, both the precision and the recall values for user-generated content lie on average 13 and 18 percentage points, respectively, above those for the patent domain. This underlines the difficulty of that context and suggests that certain adaptations are needed within the patent domain. To verify this, some slight modifications were made for the patent domain, e.g. extending the maximum phrase length and taking English gerunds into account. Even a small adaptation of the verb phrases raises overall precision to 60.5% (+8.5 percentage points).
For German, a similar picture emerges for the noun phrases realized so far: across domains, a precision of 63% is reached, and again individual phrase types perform considerably better (e.g. AN: 89.4%).
Overall, it is striking that recall is not very high for either language (about 38%). This, too, can be attributed to the application context, since for phrase extraction in this setting precision is considerably more important than recall.
6. Conclusion
This article confirms that a resource-lean, cross-lingual and cross-domain approach can achieve partly good precision values, which are of primary importance for phrase extraction in a retrieval context. However, the results indicate that certain modifications (e.g. within the patent domain) can lead to a further improvement of the results.
In future work, the approach is to be tested on further languages (French, Spanish) and the influence of the POS tagging model is to be examined in order to further improve the accuracy of the algorithm.

7. References
Abney, S. P. (1991): Parsing by Chunks. In: Berwick, R. C.; Abney, S. P.; Tenny, C. (eds.): Principle-based parsing. Computation and psycholinguistics. Dordrecht: Kluwer (Studies in linguistics and philosophy, 44), pp. 257-278.
Becks, D. (2010): Begriffliche Optimierung von Patentanfragen. In: Information - Wissenschaft & Praxis, vol. 61, no. 6-7, p. 423.
Becks, D.; Schulz, J. M. (2011): Domänenübergreifende Phrasenextraktion mithilfe einer lexikonunabhängigen Analysekomponente. In: Griesbaum, J.; Mandl, Th.; Womser-Hacker, Ch. (eds.): Information und Wissen: global, sozial und frei? Boizenburg: Werner Hülsbusch (Schriften zur Informationswissenschaft, vol. 58), pp. 388-392.
Bourigault, D.; Jacquemin, Ch. (1999): Term extraction + term clustering: an integrated platform for computer-aided terminology. In: Proceedings of the EACL'99. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 15-22.
Cohen, J. (1960): A Coefficient of Agreement for Nominal Scales. In: Educational and Psychological Measurement 20 (1), pp. 37-46.
Ding, X.; Liu, B.; Yu, P. S. (2008): A holistic lexicon-based approach to opinion mining. In: Proceedings of the WSDM'08. Palo Alto, California, USA: ACM, pp. 231-240.
Guo, H.; Zhu, H.; Guo, Z.; Zhang, X. X.; Su, Z. (2009): Product feature categorization with multilevel latent semantic association. In: Proceedings of the CIKM'09. Hong Kong, China: ACM, pp. 1087-1096.
Hu, M.; Liu, B. (2004): Mining Opinion Features in Customer Reviews. In: Mcguinness, D. L.; Ferguson, G. (eds.): AAAI: AAAI Press/The MIT Press, pp. 755-760.
Iwayama, M.; Fujii, A.; Kando, N.; Marukawa, Y. (2003): An Empirical Study on Retrieval Models for Different Document Genres: Patents and Newspaper Articles. In: Proceedings of the SIGIR'03. New York, NY, USA: ACM, pp. 251-258.
Jaene, H.; Seelbach, D. (1975): Maschinelle Extraktion von zusammengesetzten Ausdrücken aus englischen Fachtexten. Berlin, Köln, Frankfurt (Main): Beuth.
Koster, C. H. A. (2004): Head/Modifier Frames for Information Retrieval. In: Proceedings of the CICLing'04. Seoul, Korea: Springer (LNCS 2945), pp. 420-432.
Koster, C. H. A.; Beney, J. G. (2009): Phrase-based document categorization revisited. In: Proceedings of PaIR'09. New York, NY, USA: ACM, pp. 49-56.
Roda, G.; Tait, J.; Piroi, F.; Zenz, V. (2010): CLEF-IP 2009: Retrieval Experiments in the Intellectual Property Domain. In: Peters, C.; Di Nunzio, G.; Kurimo, M.; Mandl, Th.; Mostefa, D.; Peñas, A.; Roda, G. (eds.): Multilingual Information Access Evaluation I. Text Retrieval Experiments. Proceedings of CLEF '09. Berlin, Heidelberg: Springer (Lecture Notes in Computer Science, vol. 6241), pp. 385-409.
Ruge, G. (1989): Generierung semantischer Felder auf der Basis von Frei-Texten. In: LDV Forum 6, no. 2, pp. 3-17.
Ruge, G. (1995): Wortbedeutung und Termassoziation. Methoden zur automatischen semantischen Klassifikation. Hildesheim, Zürich, New York: Olms.
Schulz, J. M.; Womser-Hacker, Ch.; Mandl, Th. (2010): Multilingual Corpus Development for Opinion Mining. In: Calzolari, N.; Choukri, K.; Maegaard, B.; Mariani, J.; Odijk, J.; Piperidis, S.; Rosner, M.; Tapias, D. (eds.): Proceedings of the LREC'10. Valletta, Malta: European Language Resources Association (ELRA), pp. 3409-3412.
Tseng, Y.-H.; Lin, C.-J.; Lin, Y.-I (2007): Text mining techniques for patent analysis. In: Information Processing and Management, vol. 43, no. 5, pp. 1216-1247.
Uchimoto, K.; Sekine, S.; Isahara, H. (2001): The Unknown Word Problem: a Morphological Analysis of Japanese Using Maximum Entropy Aided by a Dictionary. In: Lee, L.; Harman, D. (eds.): Proceedings of the EMNLP '01: ACL, pp. 91-99.
Verberne, S.; D'hondt, E.; Oostdijk, N. (2010): Quantifying the Challenges in Parsing Patent Claims. In: Proceedings of AsPIRe'10. Milton Keynes, pp. 14-21.
The Digital Romansh Chrestomathy – Tools and Methods for Corpus Construction through Collaborative Full-Text Capture
Claes Neuefeind, Jürgen Rolshoven, Fabian Steeg
Institut für Linguistik, Sprachliche Informationsverarbeitung
Universität zu Köln
Albertus-Magnus-Platz
50923 Köln
E-mail: c.neuefeind@uni-koeln.de, rols@spinfo.uni-koeln.de, fabian.steeg@uni-koeln.de
Abstract
This paper describes the development and use of tools and methods for collaborative correction in the construction of a Romansh text corpus through deep digital capture. The textual basis is the "Rätoromanische Chrestomathie" by Caspar Decurtins, published 1891-1919 in the journal "Romanische Forschungen". The approach presented here combines manual and automatic correction in a collaborative working environment that involves members of and people interested in the Romansh language community. In the web-based editor we developed, the automatically recognized texts are displayed alongside the digitized images; corrections, comments and cross-references can be proposed and contributed following wiki principles. For the first time, the language community of a small language is thus actively involved in the process of documenting and preserving its own linguistic and cultural heritage. This paper describes the concrete realization of the collaborative working environment, from the architectural foundations and the current technological implementation through to further developments and potentials. From the very beginning, development has been open source at http://github.com/spinfo/drc.
Keywords: full-text capture, corpus construction, collaborative correction
1. Introduction
For the digitization of texts there is a multitude of initiatives, programmes and projects on the part of national and international funding institutions. Beyond mere mass digitization, these measures also aim at the deep digital capture of texts. This enables, on the one hand, access via full-text search; on the other hand, it can be used to build specialized corpora, for instance on the basis of historical text collections.
A major problem of automatic full-text capture is reading errors in optical character recognition (OCR). Especially with older texts, the different writing traditions and varying typographies make error-free OCR effectively impossible. In the course of the digitization of the "Rätoromanische Chrestomathie" described here, we therefore rely, for the correction of OCR errors, on involving members of and people interested in the Romansh language community via a web-based working environment in which the OCR-recognized texts are displayed alongside the underlying digitized images.
2. Related Work
The idea of collaborative correction of OCR results is increasingly receiving attention in the context of larger strategic digitization programmes as well, for instance in the IMPACT project [1] of the European Commission. The assessment that involving volunteer correctors is a realistic option is supported, among other things, by the positive experiences of the Australian Newspapers Digitisation Program (ANDP) [2] of the National Library of Australia, which, in the course of the full-text capture of the newspapers published in Australia between 1803 and 1954, has been running community-based error correction successfully since 2008 (Holley, 2009).
The Deutsches Textarchiv (DTA) [3] plans a comparable approach. In the correction environment used there, so far only internally, however, errors cannot be edited directly by the user; they can only be marked according to a differentiated error typology and reported to the DTA staff, who then correct them offline. The concept of linking facsimile and text in one editor is also taken up by the text-image link editor developed within the Textgrid project [4], which does allow controlled metadata annotation of image elements by suitably qualified users, but which, owing to its lack of user management and versioning and its exclusively manual text-image linking, is not designed for web-based, collaborative correction of OCR results by interested laypeople. Since some of these approaches did not yet exist when we began our work on the Digital Romansh Chrestomathy (IMPACT, DTA), or differ strongly in their source material and hence in the digitization workflow (large-format newspaper pages in the ANDP), we opted for an in-house development, which also lets us address the special requirements of a multilingual text basis and the absence of correction lexica. While the DTA and the Textgrid project put the emphasis on exact metadata, the approach presented here aims at a faithful reproduction of the text according to the original, which is why elaborate correction guidelines were dispensed with.
1 Improving Access To Text; http://www.impact-project.eu.
2 See http://www.nla.gov.au/ndp/.
3. The Digital Romansh Chrestomathy
The "Rätoromanische Chrestomathie" (RC) by Caspar Decurtins, published 1891-1919 in the journal "Romanische Forschungen" (Erlangen: Junge), is regarded as the most important text collection of Romansh (Egloff & Mathieu, 1986:7). It is therefore an excellent basis for building a Romansh text corpus. With its roughly 8,000 pages of text from four centuries, its thematic diversity, its different text types and genres, and its coverage of the five main idioms of Grisons Romansh, it is of extraordinary interest for nearly all linguistic and cultural-studies disciplines. It stimulates lexicographic and lexicological, morphological and syntactic, semantic and text-linguistic, literary, folkloristic and historical work. It enables data-based investigations of structures and text types and, owing to its wealth of varieties, is of high value for diachronic (spanning four centuries) and diatopic (covering five main idioms) studies, for instance of language contact, language relatedness and language change.
3 See http://www.deutschestextarchiv.de/.
4 See http://www.textgrid.de/.
3.1. Digitization and OCR
The starting point of the corpus construction are the digitized images of the RC from the journal "Romanische Forschungen", which were digitized by the Göttingen State and University Library within the Digizeitschriften project [5] and made available together with metadata created in a METS-based format [6]. To make the digitized images accessible for textual processing, they are converted into machine-readable form by means of OCR. The great typographic and orthographic diversity of the RC poses a particular challenge for OCR here, all the more so as no adequate correction lexica for the various idioms are available to the character recognition. Especially the older texts of the chrestomathy are not orthographically standardized, because the idioms of Grisons Romansh follow different writing forms and traditions. On the basis of the OCR result, PDF files are generated in which the recognized text is positioned underneath the digitized image. The generated PDF thus contains not only the complete text but also the position coordinates of the individual words. The words together with their position coordinates were extracted using the PDFBox software library [7]. The extracted information (word and position) is stored in XML form and constitutes the basis for the highlighting feature in the correction environment.
5 See http://www.digizeitschriften.de
6 See http://gdz.sub.uni-goettingen.de/entwicklung/standardisierung/
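Storing word and position in XML might look like the following sketch. The element and attribute names are invented for illustration; the project's actual schema may differ:

```python
import xml.etree.ElementTree as ET

def words_to_xml(words):
    """words: iterable of (text, x, y, width, height) tuples extracted
    from the OCR layer of the PDF (names are illustrative assumptions)."""
    page = ET.Element("page")
    for text, x, y, w, h in words:
        # one element per word, carrying its bounding box for highlighting
        ET.SubElement(page, "word", x=str(x), y=str(y),
                      width=str(w), height=str(h)).text = text
    return ET.tostring(page, encoding="unicode")

xml = words_to_xml([("Ils", 102, 340, 28, 12), ("vegls", 134, 340, 41, 12)])
print(xml)
```

Keeping the coordinates next to each word is what later allows the editor to highlight the word on the facsimile while it is being edited.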
3.2. The DRC Editor
The core of the approach described here is the creation of a collaborative correction environment in which the digitized images and the texts obtained via OCR are brought together. With the editor, the electronically recognized texts of the RC can be searched, read and edited.
Figure 1: Screenshot of the editor (beta version)
The pages to be edited are selected via full-text search as well as via the metadata taken over from the Digizeitschriften project. The goal of the editing is an error-free digital version of the text, which is why the original is always also displayed as a digital facsimile for comparison. The image display is coupled with the text: while the text is edited word by word, the corresponding word is highlighted on the image using the position coordinates obtained during OCR (see Figure 1, area Verifitgar). The synchronization of text and image coordinates is preserved across corrections, since the changes made are assigned, like the position coordinates, to the original word form. Special characters not available on the keyboard can be added via a selection window.
As an additional aid, correction suggestions can be displayed (see Figure 1, Propostas da Correcturas); these are determined on the basis of word lists using the Levenshtein distance, an algorithm for string comparison. Since such word lists or check classes are currently available for only one of the idioms, Sursilvan, an automated creation and extension of user lexica is additionally planned, by recording the manual corrections using the versioning mechanisms of the correction environment. The result is a steadily growing list of words confirmed as correct, which on the one hand serves as a basis for computing correction suggestions and on the other hand can be used to offer the user, after a correction has been made, suggestions for improvement at other identical or similar places in the text. All edits are logged with user and time of editing. Linked to this is a simple scoring and competition system that keeps track of the corrections.
7 See http://pdfbox.apache.org/.
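The suggestion mechanism can be sketched with a standard Levenshtein implementation; the word list, distance threshold and ranking policy below are illustrative assumptions, not the editor's actual configuration:

```python
def levenshtein(a, b):
    # Dynamic-programming edit distance (insert/delete/substitute, cost 1).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(word, lexicon, max_dist=2, k=3):
    # Rank lexicon entries by edit distance; keep the k closest
    # that lie within the threshold.
    ranked = sorted(lexicon, key=lambda w: levenshtein(word, w))
    return [w for w in ranked if levenshtein(word, w) <= max_dist][:k]

lexicon = ["vegls", "veglia", "vegl", "bagn"]
print(suggest("vcgls", lexicon))  # ['vegls', 'vegl']
```

As the list of confirmed words grows through manual corrections, the same ranking can be applied against that list instead of a static lexicon.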
Experience in the ongoing project has shown that, beyond mere correction, users also want the ability to tag and comment, since this offers extended search options as well as the possibility of marking unclear or (in the sense of collaborative editing) contested cases. In the current beta version, the data can therefore be annotated at page level with freely chosen tags or by adding free-text comments. Beyond error-free documentation, the texts are enriched in this way: the text basis is in a certain sense 'updated', in that the speakers' knowledge flows back into the texts in the form of metadata (keywords, cross-references, usage contexts).
165
Multilingual Resources and Multilingual Applications - Regular Papers<br />
3.3. The DRC Portal
In addition to the editor, a multilingual portal site was created for data access, serving as the central point of contact for interested users (cf. Figure 2). Via the portal, the DRC editor can be downloaded and an account for its use can be created.
Figure 2: Portal site of the DRC (see http://www.crestomazia.ch)
Besides help and notes on the editor, the portal site offers extended search options and contains background information on the project as well as on selected aspects of the edited data.
3.4. Involving the Language Community
Of central importance for the procedure described here was the question of how speakers can be involved in a collaborative capture process. To ensure the participation of a sufficient number of speakers, we rely on cooperation with local partners who, in addition to press and public relations work, also take on user recruitment. With the help of the Swiss partners, the project was publicized via local and supra-regional media. Combined with targeted user recruitment, a considerable number of users could thus be won already for the current beta version of the DRC editor: in August 2011 about 100 users were registered, and since the DRC went live at the beginning of June 2011 about a third of the texts have been edited.
4. System Architecture
A three-tier system architecture suits the nature of the project. The overall goal is the collaborative production of annotated textual data. These data are identical for all users of the system and can therefore be stored centrally (data tier). Different users must be able to access and modify these data independently of one another, while the integrity of the data must be guaranteed (logic tier). Access takes place via a graphical user interface (presentation tier).
Figure 3: Basic system architecture
The presentation tier communicates with the logic tier, and the logic tier with the data tier. Since there is no direct connection between the presentation and the data tier, the system is loosely coupled and allows the tiers to be exchanged and reused, for instance for using the data in other contexts.
4.1. Technologies
Because of its modern programming concept, the high modularity and reusability provided by OSGi [8], its native GUI technology and its integration of web standards (e.g. CSS for styling the GUI), we chose Eclipse 4 [9] as the technology for the presentation tier. For a compact and at the same time efficient and compatible logic tier, we rely on the JVM language Scala [10]. For the data tier, the XML database eXist [11] is used. Since eXist has built-in server functionality, it was expedient to implement the logic tier as part of the client, so that no server-side components of our own had to be implemented. Via the database server, the data can be made available independently of the infrastructure described here through standardized REST interfaces [12].
8 Open Service Gateway Initiative, see http://www.osgi.org/.
9 See http://eclipse.org/eclipse4/.
10 See http://www.scala-lang.org/.
11 See http://exist.sourceforge.net/.
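Access via eXist's REST interface can be sketched as plain HTTP GET/PUT requests. The server URL, collection layout and document names below are invented for illustration; only the generic eXist REST pattern (documents addressed as paths under /exist/rest/db) is assumed:

```python
from urllib import request

# Hypothetical base URL of the project's eXist collection.
BASE = "http://localhost:8080/exist/rest/db/drc"

def page_url(volume, page):
    # Each page document is addressed directly by its path.
    return f"{BASE}/{volume}/{page}.xml"

def get_page(volume, page):
    # HTTP GET retrieves the stored XML document.
    with request.urlopen(page_url(volume, page)) as resp:
        return resp.read().decode("utf-8")

def put_page(volume, page, xml_text):
    # HTTP PUT creates or overwrites the document in the collection.
    req = request.Request(page_url(volume, page),
                          data=xml_text.encode("utf-8"), method="PUT")
    req.add_header("Content-Type", "application/xml")
    with request.urlopen(req) as resp:
        return resp.status
```

Because the documents are plain HTTP resources, other clients can reuse the data without going through the editor, which is exactly the loose coupling the architecture aims for.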
4.2. Implementations
The beta version of the editor is realized as an Eclipse-based desktop application that acts as a client of the database server. The editor is equipped with automatic updates in order to deliver new functionality and bug fixes without any effort on the users' side. This results in the following technological realization of the architecture outlined above:
Figure 4: Implementation of the architecture in the current beta version
We are currently working on alternative realizations of the GUI. The current beta version permits both a further development into an offline-capable desktop application, which can be used without network access and synchronizes its data with the server on demand, and the automatic generation of a web interface using the RAP framework [13] (cf. Figure 5).
Figure 5: Alternative realizations of the architecture
12 Representational State Transfer, cf. (Fielding, 2000).
13 Rich Ajax Platform, see http://eclipse.org/rap/.
From the very beginning, software development has been open source; the complete program code, as well as the current version of the editor, is freely available at https://github.com/spinfo/drc.
5. Extensions
The Digital Romansh Chrestomathy creates, for the first time, digital access to a larger Romansh text collection. Beyond mere documentation and archiving, a freely accessible RC can give a multitude of new impulses for scholarly, media, educational and also private use. The possibilities range from historical and genealogical searches for persons and place names, through creative engagement by adding one's own texts or translations, to lexicographic work with the RC. For uses beyond simple search queries, an annotation of the texts with linguistic features is planned as well [14]. In particular, adequate (computational) linguistic reuse requires linguistic preparation of the captured texts, since full-text capture alone can only be regarded as a first step towards providing resources that can be exploited extensively in computational and corpus linguistics.
Analogously to the procedure described here, the linguistic annotation is also to be carried out by combining automatic and manual methods. To cope with the largely missing orthographic standardization of the RC, a follow-up project will first prepare the lexical resources available for the five main idioms in digital form. Annotations made automatically on this basis can then be checked collaboratively, via the correspondingly extended editor tool, by native speakers and interested users, and corrected or supplemented where necessary. The knowledge gained from lexical and manual annotation is to be used, by means of specialized learning methods, for renewed automatic annotation of the texts.
14 Cf. e.g. the procedure in the "Text+Berg digital" project (Volk et al., 2010); see also http://www.textberg.ch.
Multilingual Resources and Multilingual Applications - Regular Papers
6. Potential
The approach presented here mitigates OCR problems, especially with older and typographically variant writing systems. The digitization and markup techniques employed in the project can subsequently be applied to further text collections of Bündnerromansh and of other small languages. Beyond the concrete material goal of creating a Romansh text corpus, this yields transferable and thus sustainable, competence-oriented procedures that are prototypical for the deep digitization of the written cultural heritage of smaller language communities. Of particular interest here is also the opportunity for members of such language communities to actively support the preservation of their own linguistic and cultural heritage via wiki technologies.
7. Acknowledgements
The Digitale Rätoromanische Chrestomathie is a joint project of the Department of Linguistic Information Processing and the University and City Library of Cologne. For the organization and execution in Switzerland, we were able to win Dr. Florentin Lutz, a distinguished linguist and very well-connected native speaker. The DRC project is funded by the Deutsche Forschungsgemeinschaft. In Switzerland, the project received additional financial support from the Legat Anton Cadonau, the Institut für Kulturforschung Graubünden, and the Kulturamt of the Canton of Grisons. The project also met with lively encouragement and further support from the Romansh associations and organizations, in particular the Lia Rumantscha15, the umbrella organization of the Grisons Romansh, and the Societad Retorumantscha16, the supporting association of the "Dicziunari Rumantsch Grischun", one of the four national dictionaries of Switzerland. We owe all of these institutions our most sincere thanks.
15 See http://www.liarumantscha.ch.
16 See http://www.drg.ch/main.php?l=r&a=srr.
8. References
Decurtins, C. (1984-1986): Rätoromanische Chrestomathie. Band 1-14. Chur: Octopus-Verlag / Società Retorumantscha.
Egloff, P., Mathieu, J. (1986): Rätoromanische Chrestomathie - Register. In: Rätoromanische Chrestomathie, Band 15. Chur: Octopus-Verlag / Società Retorumantscha.
Fielding, R. (2000): Architectural Styles and the Design of Network-based Software Architectures. Doctoral dissertation, University of California, Irvine.
Holley, R. (2009): Many Hands Make Light Work: Public Collaborative OCR Text Correction in Australian Historic Newspapers. National Library of Australia.
Volk, M., Bubenhofer, N., Althaus, A., Bangerter, M., Furrer, L., Ruef, B. (2010): Challenges in Building a Multilingual Alpine Heritage Corpus. In: Seventh International Conference on Language Resources and Evaluation (LREC), Malta 2010.
A Reversed Loanword Dictionary as an Internet Portal and Electronic Resource: Lexicographic and Technical Foundations
Peter Meyer, Stefan Engelberg
Institut für Deutsche Sprache
Mannheim
E-mail: meyer@ids-mannheim.de, engelberg@ids-mannheim.de
Abstract
This paper presents a novel type of multilingual electronic resource in which several loanword dictionaries are combined into a 'reversed loanword dictionary' for a particular donor language. Such a dictionary makes it possible to find, for an etymon of the donor language, the loanwords belonging to it in various recipient languages. The development of such a web application, and in particular of the underlying data basis, involves numerous conceptual problems at the interface between lexicographic and computational issues. The paper presents these problems against the background of desirable functionalities of such an Internet portal and discusses a possible solution: the articles of the individual dictionaries are kept as XML documents and serve as the basis for the usual online view of these dictionaries; for portal-wide queries in particular, however, basic, standardized information on the lemmas and etyma of all portal dictionaries, including their variants and word-formation products (here referred to collectively as 'portal instances'), as well as the various relations between these portal instances, is additionally stored in relational database tables that permit performant and arbitrarily complex structured search queries.
Keywords: loanwords, electronic lexicography, multilingual resource, Internet portal
1. A Loanword Dictionary Portal as a 'Reversed Loanword Dictionary'
The goal of the project presented here is an Internet dictionary portal for loanword dictionaries that document borrowings from German. This portal is characterized by the fact that, on the one hand, the dictionaries included are published as individual works and, on the other hand, complex queries can be formulated at the portal level across all integrated dictionaries, for example for the path of individual German source words via intermediary languages into the various target languages, for all loanwords in particular historical periods and geographic areas, or for all German loanwords exhibiting particular characteristics (e.g. part of speech, semantic class). The portal thus realizes, not in its individual dictionaries but in its entirety, the concept of a new dictionary type: the reversed loanword dictionary.1
While it is common in language-contact lexicography, for instance in dictionaries of foreign words, to describe borrowing processes from the perspective of the target language, the planned portal records, from the perspective of the source language, the paths that German vocabulary has taken into other languages (Engelberg, 2010). Currently, within an 18-month pilot project funded by the Federal Government Commissioner for Culture and the Media, the Institut für Deutsche Sprache (Mannheim) is developing and implementing the basic software architecture of the portal and integrating three loanword dictionaries into it: on German borrowings in Polish (Vincenz & Hentschel, 2010), on German borrowings in the Teschen dialect of Polish (Menzel & Hentschel, 2005), and on German
1 Wiegand (2001) speaks in this context of active bilateral language-contact dictionaries. Dictionaries of this type are extremely rare, cf. also (Engelberg, 2010); (Görlach, 2001) is the only noteworthy example.
borrowings in Slovene (Striedter-Temps, 1963). Since the portal is designed to be open to the integration of further resources, additional loanword dictionaries can be integrated at any time. Such dictionaries of borrowings from German exist for a relatively large number of languages (English, Japanese, Portuguese, Swedish, Serbo-Croatian, Tok Pisin, Ukrainian, Uzbek, …). Here, corresponding cooperations would have to be sought and questions of rights clarified.2
2. Uses of a Loanword Dictionary Portal
The portal is intended to be usable both by laypersons and by researchers. Lay use can proceed via simple search queries; scholarly use is oriented towards the possibility of complex search queries and towards direct interfaces (web services). Both linguistic research on language contact and historically, sociologically, or anthropologically oriented research on cultural contact will benefit from the portal.
Within the scope of scholarly use, the portal is intended not only to support philologically motivated, interpretative individual studies, but also, through the cumulation of data it realizes, to enable specific novel, partly quantitatively oriented research questions. These include investigations
- into the connection between certain types of sociocultural developments (changes of rule, migration, technological surges) and temporal patterns of the borrowing frequencies of lexemes (such as a sudden versus a rather gradual quantitative increase in borrowings),
- into factors and processes in the establishment of loanwords,3
- into whether different types of language contact produce characteristic quantitative and temporal distribution patterns of loanwords,4
- into the lifespan of loanwords (insofar as the integrated dictionaries also record the disappearance or obsolescence of borrowings), depending on onomasiological, grammatical, and other factors, cf. e.g. (Schenke, 2009; Hentschel, 2009),
- into loanword chains (e.g. German > Polish > Belarusian > Russian > Uzbek) in connection with onomasiological and quantitative factors,
- into "Germanoversals", i.e. roughly into whether certain phonological, morphological, or semantic properties of German lexemes particularly favor borrowing.
2 In some of these dictionaries, the language of description is the source language (e.g. Uzbek, Portuguese), so that corresponding translations would be required in such cases.
3 Such studies can complement, on a lexicographic and cross-linguistic basis, results from corpus-based work on the lexical entrenchment of borrowings, cf. (Chesley & Baayen, 2010).
4 Language contact types would include, for instance, (i) long-standing contact
3. Basic Considerations on the Lexicographic Data Structure of the Portal
With regard to the data organization of the loanword dictionary portal, three areas can be roughly distinguished at a conceptual level: (1) The lexicographic basis of the portal consists of individual loanword dictionaries of traditional design, lemmatized according to the foreign loanwords of a particular recipient language. (2) To enable cross-language and cross-dictionary searches in the portal, an access structure that is as thin as possible must be laid over this data basis, one that abstracts from the idiosyncrasies of the individual dictionaries. (3) For the etyma of the donor language, a 'meta-lemma list' must be compiled, whose entries are each linked, via the abstraction layer mentioned under point 2, with one another and with the corresponding articles of the individual dictionaries.
The following subsections present the lexicographic and technical requirements and problems arising in these three areas in more detail, before the final section discusses the technical implementation of their interplay.
along population borders (German-Slovene, German-Polish), (ii) contact through emigration with the formation of language islands (German-Romanian, German-Russian, German-American English), and contact through elite exchange (German-Japanese, German-Russian, German-British English, German-Tok Pisin).
3.1. The Level of the Individual Dictionaries
As a rule, the underlying loanword dictionaries will be already existing works that were not designed from the outset for a loanword dictionary portal of the type discussed here. The minimal technical requirement for use in the portal is that the dictionaries be available, in suitably digitized or retro-digitized form, as XML documents.5 Image digitization is also conceivable, provided that each article is additionally accompanied by an XML document containing the portal-relevant lexicographic data (and, where applicable, references to image coordinates in the digital facsimile). Given the enormous variety of possible macro- and microstructures in dictionaries, it is not practicable to prescribe a fixed XML schema for the portal into which the XML representations of all dictionaries would have to be converted. However, in order to enable largely automated processing, the XML schema for the individual articles of each dictionary is required to abstract as far as possible from layout and presentation aspects, roughly in the sense of the TEI dictionaries guidelines; cf. (Burnard & Bauman, 2010). There are good reasons not to enrich the XML digitizations of the source dictionaries themselves with portal-relevant information. Apart from copyright considerations and the intended preservation of the dictionaries as individual digital publications, this makes it possible for the authors of a given work to continue making changes and extensions to the individual dictionaries entirely independently of their use in the loanword dictionary portal.
As in other portals, entire dictionary articles or parts thereof (XML documents or XML fragments) can be converted into a suitable HTML presentation, for example by means of XSL transformations. This is the basis for a dictionary-specific online view of the individual dictionary articles; cf. (Engelberg & Müller-Spitzer, 2011) for a more detailed account.6 The XML representation moreover permits, in principle, arbitrarily complex searches on the individual dictionaries, since specific pieces of information can be extracted from the articles via query languages such as XPath and XQuery. However, such XML-based queries often incur high processing costs on the database side and are therefore hardly practicable for performant web applications. This is a major reason for additionally keeping the information relevant to dictionary-specific as well as portal-wide (cross-dictionary) searches in separate relational database tables. These additional tables not only allow far more performant database queries; as explained below, they also serve to abstract from the specifics of the individual dictionaries.
5 For expository reasons, XML-based data storage is assumed throughout at the level of the individual dictionaries, as it is actually used in the project itself. Technically, the microstructures of dictionaries can of course also be mapped onto relational database schemas, which may be advisable for performance reasons. On the other hand, some modern database management systems (e.g. Oracle) can in any case represent XML data with a fixed structure relationally on an internal level. Cf. e.g. (Müller-Spitzer & Schneider, 2009) on the OWID portal as a concrete example of the text-technological implementation of XML processing in a dictionary portal.
6 The OWID dictionary portal developed at the Institut für Deutsche Sprache proceeds in the manner outlined here (http://www.owid.de/index.html).
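The XPath-based extraction of portal-relevant information from article XML can be sketched as follows. The element names loosely follow TEI dictionary conventions, but the concrete article structure and identifiers are a hypothetical illustration, not taken from any of the portal's actual dictionaries:

```python
# Sketch: selecting lemma, variants, and etymon form from a
# TEI-style dictionary article via path expressions.
# The markup below is a hypothetical example article.
import xml.etree.ElementTree as ET

article_xml = """
<entry xml:id="wdlp-ratusz">
  <form type="lemma"><orth>ratusz</orth></form>
  <form type="variant"><orth>ratusze</orth></form>
  <etym><mentioned xml:lang="de">Rathaus</mentioned></etym>
</entry>
"""

entry = ET.fromstring(article_xml)

# Path expressions pick out the portal-relevant pieces of the article.
lemma = entry.find("./form[@type='lemma']/orth").text
variants = [o.text for o in entry.findall("./form[@type='variant']/orth")]
etymon = entry.find("./etym/mentioned").text

print(lemma, variants, etymon)
```

A full XQuery engine would allow far more complex conditions than this stdlib sketch, but, as noted above, at processing costs that make a separate relational layer attractive.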
3.2. Cross-Dictionary Abstraction Layer
Normally, the individual loanword dictionaries will not be fully compatible with respect to their article and lemmatization structure, nor with respect to the concepts and notation formats used for the periodization and localization of borrowings. There may also be differences in the underlying grammatical description language. The approach presented here to solving these problems provides, especially for cross-dictionary searches, a dedicated, relationally organized data layer that records the portal-relevant information on all available lexical units from the various dictionaries in a portal-uniform manner. In a cross-dictionary database table, all lemmas, all (diasystematic and, where applicable, orthographic) expression variants of the lemmas mentioned in the respective articles, and all derivatives and compounds of the lemmas listed in individual articles are therefore treated as separate entities (referred to in the following as 'portal instances'), i.e. each is listed in a separate table row. In addition to the dictionary from which the instance (i.e. the given lemma, expression variant, derivative, or compound) originates, a table row specifies, among other things, the following further information (attributes), insofar as the dictionary provides it: (a) a spatial, temporal, and diasystematic classification of the borrowing process; (b) grammatical information, in particular part of speech; (c) where applicable, a semantic/onomasiological categorization. Furthermore, it must be indicated in each case whether the instance in question is the lemma variant of the corresponding dictionary article, so that the lemma lists of the individual dictionaries can be derived from the table of instances. If a loanword dictionary distinguishes, within one article, e.g. readings for which different etymologies are discussed, these must be encoded as separate portal instances, since it is necessary to abstract from macrostructural peculiarities of the individual dictionaries.
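A minimal relational sketch of such an instance table, and of deriving a dictionary's lemma list from it, might look as follows. The column names, dictionary sigle, and example rows are our own illustration; the paper does not publish the actual portal schema:

```python
# Sketch of a cross-dictionary table of 'portal instances'.
# Column names and sample data are hypothetical illustrations.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE instance (
  id          INTEGER PRIMARY KEY,
  dictionary  TEXT NOT NULL,               -- source dictionary of the instance
  form        TEXT NOT NULL,               -- lemma, variant, derivative, compound, or etymon form
  is_lemma    INTEGER NOT NULL DEFAULT 0,  -- 1 = lemma variant of its article
  is_etymon   INTEGER NOT NULL DEFAULT 0,  -- 1 = (German) etymon form
  pos         TEXT,                        -- part of speech, if provided
  region      TEXT,                        -- spatial classification of the borrowing
  year_from   INTEGER,                     -- standardized borrowing period
  year_to     INTEGER,
  article_ref TEXT                         -- pointer to article/XML element, no duplication
);
""")

# Two hypothetical rows: a Polish lemma and its German etymon form.
con.executemany(
    "INSERT INTO instance (dictionary, form, is_lemma, is_etymon, pos,"
    " year_from, year_to, article_ref) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [("WDLP", "ratusz", 1, 0, "noun", 1300, 1500, "wdlp-ratusz"),
     ("WDLP", "Rathaus", 0, 1, "noun", None, None, "wdlp-ratusz#etym")],
)

# Deriving a dictionary's lemma list from the instance table:
lemmas = [r[0] for r in con.execute(
    "SELECT form FROM instance WHERE dictionary = 'WDLP' AND is_lemma = 1")]
print(lemmas)  # prints ['ratusz']
```

The `article_ref` column reflects the no-duplication principle described in the text: everything beyond the search-relevant attributes stays in the source article.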
Given a sufficiently complex and rigid XML encoding of a loanword dictionary, the portal instances can be extracted from the original articles in a largely automated fashion. The portal instances should not duplicate information from the loanword dictionaries; they therefore additionally contain references to the corresponding article and, where applicable, to the XML element corresponding to the relevant portion of the article, so that all further information relevant to the instance can be obtained mechanically from the source article and, for example, prepared for an HTML-based presentation.
For portal-wide, cross-dictionary searches to be possible, the source dictionaries' statements on the temporal and spatial classification of the borrowing process, as well as grammatical information, must be transferred into a uniform conceptual schema when the portal instances are created. Besides complex technologies such as spatial and temporal ontologies, simpler solutions are available for the pilot project, such as a dictionary-specific mapping of language-stage labels onto standardized year intervals. The use of georeferencing methods is also conceivable in a later stage of the project, in order to enable cartographic visualizations.
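The dictionary-specific mapping of language-stage labels onto standardized year intervals mentioned above can be as simple as a lookup table. The label set and the interval boundaries below are rough conventional datings chosen for illustration, not values prescribed by the project:

```python
# Hypothetical per-dictionary mapping of language-stage labels to
# standardized year intervals; the datings are conventional approximations.
STAGE_TO_YEARS = {
    "ahd.":  (750, 1050),    # Old High German
    "mhd.":  (1050, 1350),   # Middle High German
    "fnhd.": (1350, 1650),   # Early New High German
    "nhd.":  (1650, 2000),   # New High German
}

def stage_to_interval(label):
    """Return a (from, to) year interval for a language-stage label,
    or None if the dictionary's label is unknown."""
    return STAGE_TO_YEARS.get(label.strip().lower())

print(stage_to_interval("mhd."))  # prints (1050, 1350)
```

Unknown labels returning `None` makes the need for manual review visible during instance extraction rather than silently dropping the temporal attribute.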
Importantly, portal instances can be enriched with information that has no counterpart whatsoever in the underlying loanword dictionary. For example, each instance can be assigned to a synset of a WordNet-like resource or otherwise semantically classified in order to enable queries with an onomasiological component. This is certainly particularly difficult in vocabulary areas from which borrowing was intensive and extended into technical details (e.g. mining, chemistry, religion).
The introduction of additional portal instances can also be useful; if, for instance, a German word reached Russian via Polish, the Polish 'intermediate step', which may not be recorded at all in the portal's Polish loanword dictionary, can be added as a portal instance of its own.
3.3. Meta-Lemma List and Etymological Information
The lexicographically and linguistically most demanding data layer, which has to be created largely manually, is a meta-lemma list of the etyma of the donor language. Since loanword dictionaries frequently give several diasystematic or word-formation variants of the etyma (including merely reconstructed forms) and discuss several possible etymologizations, an approach that is as general as possible must be chosen, also in view of the problems posed by different transcriptions. In the solution we have adopted, portal instances are likewise created for the etymon forms mentioned in the individual dictionaries (as the tertia comparationis of the reversed loanword dictionary); in the database table, these are marked with a special attribute as (German) etymon forms. In the following we refer to such portal instances briefly as 'etymon instances'. If a German lexeme appears as a source word in several dictionaries, a separate etymon instance is created for each dictionary, since the identification of these instances only takes place in a subsequent lexicographic work step at the portal level. What is decisive, then, is the identification of groups of etymon instances that 'belong together'. In the data modelling we propose, a dictionary-independent etymon instance is created for each such group, one that is a particularly suitable candidate for a meta-lemma from various lexicographic points of view, i.e. prototypically a standard-language German simplex still in use today. This 'meta-etymon' can usefully be employed, in particular, in a headword list of all German etyma in the portal. All synchronic or diachronic variants, word-formation products/constituents, etc. of an etymon are then linked with their corresponding meta-etyma in the way described in the following section. It may be desirable to include additional meta-etyma, for instance so that a user looking up a German simplex also finds borrowings of compounds formed from it even if this simplex itself is not listed as a source word in any dictionary.
4. On the Architecture of the Web Application
The introduction of a table of portal instances makes it possible to cleanly decouple the construction of the portal from the level of the individual dictionaries. Typical portal-related searches generally operate exclusively on this abstraction layer.
4.1. Encoding and Managing the Links between Portal Instances
Portal instances must be linked with one another, for example to model word-formation relations. A special role is played by etymological information, which can be encoded as links from portal instances to etymon instances. The most frequent case is the linking of portal instances assigned to the same source dictionary. However, in order to model chains of borrowing processes, or to formulate 'identity relations' between etymon instances as well as between lemmas in very closely related language varieties, links between portal instances originating from different sources must also be posited.
To model the links between articles and instances, standardized representation formats such as RDF and the storage and access technologies developed for them could in principle be used; cf. (Hitzler, Krötzsch & Rudolph, 2009). Since, however, the portal's link structure is very regular, we prefer a simpler solution that represents all links in a separate relational database table as ordered pairs of a source instance and a target instance. Each link from portal instance P to portal instance Q is assigned to a particular type via an attribute; among others, the following types are provided for: (i) P is a variant of Q (variants of different types can be distinguished, e.g. orthographic/synchronic/diachronic); (ii) P is a derivative of Q; (iii) P is a compound of Q; (iv) P has Q as its etymon; (v) P is the same lexeme / the same lexeme variant as Q (if, in a borrowing chain, the loanword P has itself served as the basis of a further borrowing process, a second portal instance Q is posited for this loanword, representing the word in its role as the starting point of the further borrowing); (vi) P belongs, in the respective individual dictionary, to the lemma or meta-etymon Q.
Further attributes of links are the source of the linking information and a simple, ordinally scaled categorization of the reliability of this information as stated in the source itself.
The links form a directed acyclic graph (DAG). Typical searches require finding paths in the DAG whose length may not be known in advance, for instance in order to find borrowing chains, or to search, starting from a meta-etymon E, for derivatives/variants/… of borrowings of arbitrary derivatives/variants/… of E. In order to allow performant SQL queries on the tables, the transitive closure of the link relations, or a suitable subset of it, is materialized in the link table; that is, 'indirect' links are also stored and labelled as such, at least at the level of the meta-etyma and individual-dictionary lemmas. The management of the reference structures between the data layers must be software-supported.7
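The path searches on the link DAG described above can be sketched with a recursive SQL query; materializing the transitive closure, as proposed in the text, trades storage for avoiding exactly this recursion at query time. Table and type names, as well as the instance IDs, are our own illustration:

```python
# Sketch: walking a borrowing chain in a link table with a recursive CTE.
# Schema, link types, and data are hypothetical illustrations.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE link (
  source INTEGER NOT NULL,  -- source portal instance
  target INTEGER NOT NULL,  -- target portal instance
  type   TEXT NOT NULL      -- e.g. 'variant', 'derivative', 'etymon'
);
-- Hypothetical borrowing chain: instance 3 --etymon--> 2 --etymon--> 1
INSERT INTO link VALUES (2, 1, 'etymon'), (3, 2, 'etymon');
""")

# Recursive query: all direct and indirect etyma of instance 3.
rows = con.execute("""
WITH RECURSIVE ancestry(instance) AS (
  SELECT target FROM link WHERE source = 3 AND type = 'etymon'
  UNION
  SELECT l.target FROM link l JOIN ancestry a ON l.source = a.instance
  WHERE l.type = 'etymon'
)
SELECT instance FROM ancestry ORDER BY instance
""").fetchall()
print([r[0] for r in rows])  # prints [1, 2]
```

With the transitive closure stored, the same result is a single indexed lookup on rows labelled as 'indirect' links, which is what makes the approach viable for a performant web application.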
7 Changes to the individual dictionaries entail corresponding changes in the relational instance and link tables, which in most cases
4.2. Presentation
Users can consult the individual dictionaries, each with its own lemma list (displayed in excerpts alongside the search-form/article view) and its own search functionality. The etymon instances form the basis for a separate reversed loanword dictionary, i.e. the portal dictionary of German source words, whose lemma list is generated from the meta-etyma. Searches in this portal dictionary produce a list of references to matching articles in the individual dictionaries.
5. References
Burnard, L., Bauman, S. (2010): TEI P5: Guidelines for Electronic Text Encoding and Interchange. Text Encoding Initiative. Online: http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf.
Chesley, P., Baayen, R.H. (2010): Predicting new words from newer words: Lexical borrowings in French. Linguistics 48 (4), pp. 1343-1374.
Engelberg, S. (2010): An inverted loanword dictionary of German loanwords in the languages of the South Pacific. In A. Dykstra & T. Schoonheim (Eds.), Proceedings of the XIV EURALEX International Congress (Leeuwarden, 6-10 July 2010). Ljouwert (Leeuwarden): Fryske Akademy, pp. 639-647.
Engelberg, S., Müller-Spitzer, C. (to appear 2011): Dictionary portals. In R. Gouws, U. Heid, W. Schweickard, & H.E. Wiegand (Eds.), Wörterbücher / Dictionaries / Dictionnaires. Ein internationales Handbuch zur Lexikographie / An International Encyclopedia of Lexicography / Encyclopédie internationale de lexicographie. Bd. 4. Berlin, New York: de Gruyter.
Görlach, M. (Ed.) (2001): A Dictionary of European Anglicisms: A Usage Dictionary of Anglicisms in Sixteen European Languages. Oxford etc.: Oxford University Press.
Hentschel, G. (2009): Intensität und Extensität deutsch-polnischer Sprachkontakte von den
can be carried out automatically by database triggers. Such triggers can also perform consistency checks that detect and report the need for manual adjustment.
mittelalterlichen Anfängen bis ins 20. Jahrhundert am Beispiel deutscher Lehnwörter im Polnischen. In Stolz, Ch. (Ed.): Unsere sprachlichen Nachbarn in Europa. Die Kontaktbeziehungen zwischen Deutsch und seinen Grenznachbarn. Bochum: Brockmeyer, pp. 155-171.
Hitzler, P., Krötzsch, M., Rudolph, S. (2009): Foundations of Semantic Web Technologies. Boca Raton, FL etc.: Chapman & Hall/CRC Textbooks in Computing.
Menzel, T., Hentschel, G., in collaboration with P. Jančák and J. Balhar (2005): Wörterbuch der deutschen Lehnwörter im Teschener Dialekt des Polnischen. Studia slavica Oldenburgensia, Band 10 (2003). Oldenburg: BIS-Verlag. 2nd, supplemented and corrected electronic edition. Online: http://www.bkge.de/14451.html.
Müller-Spitzer, C., Schneider, R. (2009): Ein XML-basiertes Datenbanksystem für digitale Wörterbücher. Ein Werkstattbericht aus dem Institut für Deutsche Sprache. it - Information Technology 4/2009, pp. 197-206.
Schenke, M. (2009): Sprachliche Innovation - lokale Ursachen und globale Wirkungen. Das ‚Dynamische Sprachnetz‘. Saarbrücken: Südwestdeutscher Verlag für Hochschulschriften.
Striedter-Temps, H. (1963): Deutsche Lehnwörter im Slovenischen. Wiesbaden: Harrassowitz.
Vincenz, A. de, Hentschel, G. (2010): Wörterbuch der deutschen Lehnwörter in der polnischen Schrift- und Standardsprache. Von den Anfängen des polnischen Schrifttums bis in die Mitte des 20. Jahrhunderts. Studia slavica Oldenburgensia, Band 20. Oldenburg: BIS-Verlag. Online: http://www.bis.uni-oldenburg.de/bis-verlag/wdlp.
Wiegand, H.E. (2001): Sprachkontaktwörterbücher: Typen, Funktionen, Strukturen. In: B. Igla, P. Petkov & H.E. Wiegand (Eds.), Theoretische und praktische Probleme der Lexikographie. 1. Internationales Kolloquium zur Wörterbuchforschung am Institut Germanicum der St. Kliment-Ohridski-Universität Sofia, 7. bis 8. Juli 2000 (= Germanistische Linguistik, 161-162). Hildesheim, Zürich, New York: Georg Olms Verlag, pp. 115-224.
Multilingual Resources and Multilingual Applications - Regular Papers<br />
Localizing A Core HPSG-based Grammar for Bulgarian<br />
Petya Osenova<br />
The Sofia University and IICT-BAS<br />
25 A, Acad. G. Bonchev Str., Sofia 1113<br />
E-mail: petya@bultreebank.org<br />
Abstract<br />
The paper presents the main directions in which the localization of an HPSG-based formal core grammar (called the Grammar Matrix) has been carried out for Bulgarian. On the one hand, the adaptation process took into account the predefined theoretical schemas and their adequacy with respect to the Bulgarian language model. On the other hand, the implementation within a typological framework posed some challenges with respect to language-specific features. The grammar is being developed further, and it is envisaged to be used extensively for parsing and generation of Bulgarian texts.
Keywords: localization, core grammar, HPSG, Bulgarian<br />
1. Introduction<br />
Recently, a number of successful attempts have been made towards the design and application of wide-coverage grammars which incorporate deep linguistic knowledge and have been tested on several natural languages. Especially active in this area have been the lexicalist frameworks, such as HPSG (Head-driven Phrase Structure Grammar), LFG (Lexical-Functional Grammar) and LTAG (Lexicalized Tree Adjoining Grammar). Many NLP applications have been built on HPSG-based implementations: treebanks (the LinGO Redwoods Treebank, the Polish HPSG Treebank and the Bulgarian HPSG-based Treebank, among others), grammar development tools, parsers, etc.
Within HPSG, quite extensive formal grammars have already been implemented for English (Flickinger, 2000), German (Müller & Kasper, 2000) and Japanese (Siegel, 2000; Siegel & Bender, 2002). They provide semantic analyses in the Minimal Recursion Semantics framework (Copestake et al., 2005). HPSG is also the underlying theory of the international LinGO Grammar Matrix initiative (Bender et al., 2010; Bender et al., 2002). At the moment, precise and linguistically motivated grammars, customized on the basis of the Grammar Matrix, have been or are being developed for Norwegian, French, Korean, Italian, Modern Greek, Spanish, Portuguese, etc.1 The most recent developments in the Grammar Matrix framework also report successful implementations of grammars for endangered languages, such as Wambaya (Bender, 2008).
In addition to the HPSG framework and the Grammar Matrix architecture, there is an open-source software system which supports grammar and lexicon development: the LKB (Linguistic Knowledge Builder, http://wiki.delph-in.net/moin/LkbTop).2
Our motivation for starting the development of a Bulgarian Resource Grammar in the above-mentioned setting was as follows: there already was an HPSG-based treebank of Bulgarian (BulTreeBank), constructed in a semi-automatic way, and the knowledge within the treebank seemed sufficient for the construction of a wide-coverage, precise formal grammar that could parse and generate Bulgarian texts. Bulgarian is no longer considered an endangered or less-processed language; however, it still lacks a deep linguistic grammar. Bulgarian can be viewed as a “classic and exotic” language, because it combines Slavic features with Balkan Sprachbund peculiarities. These factors make Bulgarian a real challenge for computational modeling.
1 http://www.delph-in.net/index.php?page=3
2 The projects DELPH-IN and Deep Thought are also closely related to the Grammar Matrix initiative.
Our preliminary supporting components were the following: the HPSG theoretical framework for modeling the linguistic phenomena of Bulgarian; a suitable HPSG-based Bulgarian corpus and supporting pre-processing modules; the LinGO Matrix-based grammar software environment for encoding and integrating the suitable components; and the best practices from the work on other languages. More on the current grammar model and the implementation of the Bulgarian grammar can be found in (Osenova, 2010).
2. Grammar Matrix Architecture<br />
The Grammar Matrix (Bender et al., 2002) is intended as a typological core for initiating grammar writing for a specific language; it also provides a customization web interface (Bender et al., 2010). The purpose of such a core is, on the one hand, to ensure a common basis for comparing grammars of various languages, and thus to focus on typological similarities and differences, and, on the other hand, to speed up the process of grammar development. It supplies the skeleton of the grammar: the type hierarchy with basic types and features, as well as the basic inheritance relations. The Grammar Matrix is based on the experience with several languages (predominantly English and Japanese), and it is developed further as new languages are modeled in the framework.
While supporting all the linguistic levels of representation, the Grammar Matrix aims primarily at the semantic modeling of a language. It introduces referential entities and events; semantic relations; and the semantic encoding and contribution of linguistic phenomena (definiteness/indefiniteness, aspect and tense, among others). For example, verbs, adjectives, adverbs and prepositions are canonically viewed as introducing events, while nouns are considered to introduce referential entities. Such an approach is a challenge for a language like Bulgarian, which grammaticalizes many linguistic phenomena; the most natural level of description would therefore be the morphosyntactic level rather than the semantic one. Consequently, the balance of represented information between semantics and morphosyntax has to be determined and the information distributed in an adequate way.
Ideally, one should only inherit from Grammar Matrix types, without changing them. In practice, however, it turns out that each language challenges, and is challenged by, the Matrix model. On the one hand, the Matrix predefines some phenomena too strictly; on the other, it leaves room for generalizations. All this is inevitable, since the ideal granularity between language specificity and universality is difficult to establish.
The localization goes in several directions. First, the Grammar Matrix is implemented in accordance with a particular version of HPSG theory, and thus implies certain decisions with respect to the possible analyses. A grammar developer adapting the Grammar Matrix to a new language might, however, want to apply a different analysis within the language-specific grammar. This is the case for Portuguese, for example: instead of working with the head-specifier and head-adjunct phrases that are part of standard HPSG94, the grammar adopted the more recent head-functor approach to these phrases. A second direction is the choice of which linguistic phenomena to treat first. In Portuguese the preferences concerned agreement, modification and basic phrase structures, while in Modern Greek the phenomena to start with were cliticization, word order and politeness constructions. In this respect, only a common testset can ensure that the common linguistic phenomena are implemented; such a testset is briefly discussed in 3.1. Depending on these preferences, grammar developers might have to extend and/or change the core grammar, for example by adding types for contracted or missing determiners in Modern Greek, since this information influences the semantics.
Last but not least, it is up to the grammar developer how much information to encode within the grammar, and which steps to handle outside it. For example, the Portuguese grammar uses morphologically preprocessed input, while in the Modern Greek grammar all the analyses are handled within the system.
3. Localization in Bulgarian<br />
3.1. The Multilingual Testset
The Grammar Matrix is equipped with a testset in English, which has already been translated into a number of other languages. It comprises around 100 sentences, which in the Bulgarian translated set became 178. The grammar development started with the aim of covering this set, since it represents some very important
common phenomena. Needless to say, the translated set also incorporated a number of language-specific phenomena, which are discussed in more detail below. Some additional test sentences were therefore added to the common testset, bringing the number of positive sentences to 193. In addition, 20 ungrammatical sentences were included, checking agreement, clitic word order, definiteness, subject control, etc. The whole set thus comprises 213 sentences, which is comparable to the testset for Portuguese in the first phase of its grammar development. The common phenomena are as follows: complementation, modification, coordination, agreement, control, quantification, negation, illocutionary force, passivization, nominalization, relative clauses, light verb constructions, etc. The initial grammar contains 297 types; it is expected that they will expand considerably when the lexicon is enriched further. Let us comment on some localization specificities in the translated set which made it larger than the English testset.
First of all, Bulgarian is a pro-drop language, so the English sentences always have counterparts with null subjects; in discourse, Bulgarian can also omit complements in many cases. Second, Bulgarian verbs encode aspect lexically. The English sentences therefore often had to be translated with verbs in both aspects (perfective and imperfective), and when tense is taken into account, the number of translation counterparts grows further. For example, the sentence Abrams wondered which dog barked has two possibilities for the matrix verb (imperfect tense, imperfective; and aorist tense, perfective), while the verb in the subordinate clause normally has three possibilities (present tense, imperfective; aorist tense, perfective; and imperfect tense, imperfective).
In some sentences, several Bulgarian verb synonyms were provided for a single English verb. For example, the verb to hand in the sentence Abrams handed the cigarette to Browne can be translated by at least four Bulgarian verbs: дам (give), подам (pass), връча (deliver), предам (hand in).
Next, Bulgarian has clitic counterparts to the complements, as well as a clitic reduplication mechanism; thus, where appropriate, translations with a clitic and with a full-fledged complement were both provided for a single English sentence. Bulgarian polar questions are formed with a special question particle, which also has a focalizing role. Modification is mostly expressed by adjectives: garden dog (en) vs. градинско куче (bg, ‘garden-adjective dog’). Some alternations that are challenging for English are not relevant for Bulgarian. For example, Browne squeezed the cat in and Browne squeezed in the cat are translated in the same way: Браун вмъкна котката (Browne put-inside cat-the). The same holds for the well-known dative alternation: Abrams handed Browne the cigarette and Abrams handed the cigarette to Browne. The Bulgarian translation simply swaps the complements without changing them: Абрамс даде на Браун цигарата (Abrams gave to Browne cigarette-the) and Абрамс даде цигарата на Браун (Abrams gave cigarette-the to Browne). At the same time, the Bulgarian version of the testset provides examples of aspect/tense combinations, clitic behaviour, the verbal complex, agreement patterns, etc.
3.2. The Language Specific Phenomena<br />
Bulgarian’s rich morphology seems to conflict with the requirements of the semantic approach; thus, the information often has to be split between a semantic phenomenon and its morphological realization. For example, adjectives, participles and numerals have morphologically definite forms, while the definiteness marker is not a semantic property of these categories. For that reason, the most important design decision in the grammar was to keep syntactic and semantic features separate (for example agreement, which is divided into semantic and syntactic agreement in accordance with the ideas in Kathol, 1997). In this way, definiteness operates via the MOD(ifier) feature. The event selects for a semantically definite noun:
[SYNSEM.LOCAL.HOOK.INDEX.DEF +]
which is, however, morphologically indefinite:
[SYNSEM.LOCAL.AGR.DEF -]
As can be seen, the semantic feature ‘definiteness’ lies in the syntactic-semantic area of local features, more precisely within the feature INDEX. The morphosyntactic one follows the same path of locality, but lies within the feature AGR(eement). For example, in the phrase старото куче ‘old-the dog’, the adjective ‘old-the’ selects for a semantically definite, but morphologically indefinite noun ‘dog’. This analysis is linguistically sound, since the definiteness marker in Bulgarian is considered a phrasal affix, not a word-level affix.
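To illustrate, the selection just described could be sketched in TDL roughly as follows. This is a hypothetical fragment: the type names and the exact feature geometry are illustrative, following the two feature paths above, and are not taken from the actual grammar source.

```tdl
; A definite adjective such as старото 'old-the': morphologically
; definite itself, it selects via MOD a noun that is semantically
; definite (HOOK.INDEX.DEF +) but morphologically indefinite
; (AGR.DEF -).
def-adj-lex := adjective-lex &
  [ SYNSEM.LOCAL [ AGR.DEF +,
                   CAT.HEAD.MOD < [ LOCAL [ HOOK.INDEX.DEF +,
                                            AGR.DEF - ] ] > ] ].
```

The noun then carries no definite article of its own, while the phrase as a whole is interpreted as definite.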
Other examples are the categories of tense, aspect and mood. Tense and mood are currently encoded as features of the event (AGR.E.TENSE and AGR.E.MOOD, where E stands for event), while aspect is a feature of the head (HEAD.TAM.ASPECT, where TAM stands for the aggregate feature tense-aspect-mood). In these cases, however, there is at the moment no distinct contribution from semantics and morphosyntax. The Grammar Matrix provides several ways of encoding such semantic information: for tense and mood the aggregated encoding (AGR.E) has been chosen in the current version, while for aspect the separated encoding is used. The aggregated encoding is a better choice for a unified syntactic-semantic analysis, while the separated representation leaves open the possibility of manipulating the syntactic and semantic contributions differently. Thus, Bulgarian seems to require a systematic balance between the semantic contribution and the morphological marking of the same category within the overall architecture. This fact posed some difficulties in the initial design, since for each category it had to be decided whether or not to treat it separately on semantic and morphological grounds.
Bulgarian has a double negation mechanism (so-called negative concord), similarly to other Slavic languages and in contrast to English. Within the proposed Grammar Matrix architecture, the negation particle has been modeled as a verb, since particles were not represented in the Grammar Matrix and there was no other mechanism for introducing semantic relations. The particle scopes over the following proposition and introduces a negation relation; at the same time, the negative pronoun in the concord introduces a negation relation of its own.
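Under these assumptions, the two sources of the negation relation could be sketched as follows; the type and relation names are illustrative only, not the actual grammar source.

```tdl
; The negative particle не, modeled as a verb-like sign that scopes
; over the following proposition and contributes a negation relation.
ne_neg := neg-verb-lex &
  [ STEM < "не" >,
    SYNSEM.LKEYS.KEYREL.PRED "neg_rel" ].

; A negative pronoun such as никой 'nobody' contributes its own
; negation relation, so a concord sentence carries two such relations.
nikoj_pron := neg-pron-lex &
  [ STEM < "никой" >,
    SYNSEM.LKEYS.KEYREL.PRED "neg_rel" ].
```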
Another area in which the rich morphology plays a role is the level of generalization of types. Very often in Bulgarian, a generalization cannot be kept at a high level, because of the variety of morphosyntactic behaviour types within Bulgarian constructions. An example is the copula constructions: although adjectives, adverbs and prepositions all have an event index, they cannot share the same generalized type. Adjectives structure-share their PNG (person, number and gender) characteristics with the copula’s XARG, i.e. the subject. Adverbs have to be restricted to intersective
modifiers when taken as complements. The common behaviour is that all these heads raise their semantic index to the copula, which is itself semantically vacuous. Nouns, however, have a referential index; in this case the copula behaves like a transitive verb that selects for its complement, and no index is raised from the noun complement up to the copula. In this grammar version, eight lexical types are introduced: two for the present and past copula forms, each of which is divided into four subtypes depending on the complement (present copula + noun, adjective, adverb or PP; past copula + noun, adjective, adverb or PP). The present/past distinction was necessary because the past form can occur in sentence-initial position, while the present form cannot.
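The resulting split could be pictured as a small type hierarchy. This is a sketch with illustrative type names; the constraints distinguishing each subtype are omitted.

```tdl
; Semantically vacuous copula supertype.
copula-lex := verb-lex.

; Present vs. past: only the past form may be sentence-initial.
pres-copula-lex := copula-lex.
past-copula-lex := copula-lex.

; Four subtypes per tense, one per complement category.
pres-copula-n-lex   := pres-copula-lex.  ; + noun (referential index)
pres-copula-adj-lex := pres-copula-lex.  ; + adjective (shares PNG)
pres-copula-adv-lex := pres-copula-lex.  ; + intersective adverb
pres-copula-pp-lex  := pres-copula-lex.  ; + PP
; ...plus the four parallel past-copula-*-lex subtypes.
```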
The localization also took into account the relatively free word order of Bulgarian. Thus, most of the rules cover all the possible orders, not only the canonical readings: for example, there are rules for head-modifier and modifier-head order, for clitic-head and head-clitic order, and for swapping the head’s complements. These order combinations result in a proliferation of possible analyses, and an additional mechanism is needed to discriminate among them. For the moment, the BulTreeBank resource is used as a discriminative tool, because it contains the canonical and most preferred analysis for each sentence.
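In the Matrix setting, licensing both orders typically amounts to pairing a head-initial and a head-final instance of the same basic rule. A sketch: the supertype name follows common Matrix practice, but should be taken as illustrative rather than as the actual grammar source.

```tdl
; Both complement orders: the head before or after its first complement.
head-comp-phrase := basic-head-1st-comp-phrase & head-initial.
comp-head-phrase := basic-head-1st-comp-phrase & head-final.

; Analogous rule pairs would license modifier-head vs. head-modifier
; and clitic-head vs. head-clitic orders.
```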
By combining the application of the clitic rules, which produce lexical signs, with the complement rules, which produce phrases, the clitic doubling examples have been parsed successfully. The incorporation of the Bulgarian argument-related clitics required a new mechanism: the clitics are viewed as lexical projections of the head (i.e. handled by special rules), while the regular forms are treated as head arguments (complements), i.e. handled by the head-complement principles. A clitic does not contribute separate semantics of its own, because it is not a full-fledged complement; instead, the verb incorporates the clitic’s contribution into its own semantics. Thus, the personal pronoun clitic lexemes have an empty relations list, while the regular pronoun forms have a pronoun relation.
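The contrast could look roughly as follows in the lexicon. These are hypothetical entries: the forms го ‘him (clitic)’ and него ‘him (full form)’ are real Bulgarian pronouns, but the type names are illustrative, not taken from the actual grammar source.

```tdl
; Accusative clitic: an empty relations list; the verb incorporates
; the clitic's contribution into its own semantics.
go_clitic := pers-pron-clitic-lex &
  [ STEM < "го" >,
    SYNSEM.LOCAL.CONT.RELS <! !> ].

; Full pronoun form: a regular complement with a pronoun relation.
nego_pron := pers-pron-lex &
  [ STEM < "него" >,
    SYNSEM.LKEYS.KEYREL.PRED "pron_rel" ].
```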
Another localization issue, which concerns the modeling of the lexicon rather than the type hierarchy, is the representation of the lexical entries. Bulgarian is a richly inflected language, but in contrast to other Slavic languages, its richness lies in the verbal system rather than in the nominal one. Thus, two ways of incorporating the morphology were possible. The first would be to re-design the whole systematic and unsystematic morphology within the grammar, which would be linguistically sound but time-consuming. Since Bulgarian verbs show many alternations and irregularities across their grammatical categories (conjugation, tense, aspect, finite vs. non-finite forms, other synthetic categories such as the imperative, etc.), full paradigms per conjugation in the lexical types were abandoned as a generalization strategy. Instead, the inflection classes of the morphological dictionary of Bulgarian (Popov et al., 2003) were transferred into the grammar. Each verb type was viewed as a combination of the appropriate subparadigms from the given morphological and/or lexical categories, and the set of respective subparadigms per category was attached to each verb in the lexicon. Thus, the lexicon was also “dressed” with the morphologically specific information for the individual verbs. The transfer of the morphosyntactic paradigms resulted in over 2600 rules for personal verbs alone. Hence, detailed morphological modeling was sacrificed in favour of the syntactic and semantic modeling; moreover, in a lexicalist framework such as HPSG, a large lexicon could not operate without the complete set of morphosyntactic types and rules. Compare the morphologically poor and the morphologically rich representation of the verb in the lexicon in a. and b.:
a.
ima_v1 := v_there-is_le &
  [ STEM < "има" >,
    SYNSEM.LKEYS.KEYREL.PRED "има_v_1_rel" ].

b.
ima_v1 := v_there-is_le &
  [ STEM < "има" >,
    SYNSEM [ LKEYS.KEYREL.PRED "има_v_1_rel",
             LOCAL.CAT.HEAD.MCLASS
               [ FIN-PRESENT finite-present-101,
                 FIN-AORIST finite-aorist-080,
                 FIN-IMPERF finite-imperf-025,
                 PART-IMPERF participle-imperf-024,
                 PART-AORIST participle-aorist-095 ] ] ].
In case a., the impersonal verb има ‘there is’ introduces its type, from which it inherits the specific template (v_there-is_le); the entry then specifies the stem, i.e. the word itself, and the relation. In case b. there is, in addition, a morphological class (MCLASS), which is augmented with the respective paradigms for the relevant grammatical categories. The second representation is the one maintained in the grammar development.
The evaluation of the current grammar version was done with the [incr tsdb()] system (Oepen, 2001). The coverage of the first version of the grammar is as follows: 213 sentences, of which 193 are grammatical. The average number of distinct analyses is 3.73. The ambiguity of the analyses is mainly due to the following factors: 1. morphological homonymy of the word forms; 2. more than one possible word order; 3. more than one possible attachment; 4. competing rules in the grammar (see Osenova & Simov, 2010 for details). The first factor concerns forms like дойде, a form of the verb ‘come’, which is ambiguous between present tense and aorist, 2nd or 3rd person. The second has to do with cases like The dog chases Browne, where Browne might also be the subject in some reading. The third concerns the attachment of adjuncts at the verb level as well as at the sentence level. The last factor affects mostly the coordination rules, but also some rules for modification, where a split due to the specificities of one head allows duplicate analyses in the remaining cases. Factor 1 requires an external disambiguation filter, which is typically provided by taggers. Factor 2 also requires an additional filter to pick out the most typical reading without excluding the rest. Factor 3 concerns spurious cases and requires a linguistic decision for the various types of adjuncts. Factor 4 requires a change in the grammar architecture by the grammar writer.
MRS is the Grammar Matrix based representation through which the Bulgarian Resource Grammar remains compatible with the other grammars on the semantic level.
4. Conclusions and Future Work<br />
The existence of a core grammar proved to be very useful, and not only in the initial steps of grammar writing, since it provides a typological background to start from and to maintain compatibility with the other language descriptions. At the same time, depending on the purposes and tasks, the grammar writer has the possibility to re-model or even override some parts of the predefined structure. Such developments
would give feedback to the core grammar developers and would contribute to better generalizations over more languages.
The Bulgarian Resource Grammar, together with the English Resource Grammar, is envisaged to be used for Machine Translation within the context of the European EuroMatrixPlus project, using the infrastructure established within DELPH-IN and LOGON. For this task, the MRSes of parallel sentences parsed by both grammars have been aligned on the lexical and phrasal level, and transfer rules are being defined on the basis of this alignment. For example, in the alignment of the MRSes for the sentence No cat barked, the Bulgarian side contains an additional negation relation, coming from the negated verb; otherwise, the arguments of the first negation relation coincide, as do the argument structures of the intransitive verbs. At the same time, a set of valency frames for 3000 verbs has been extracted from BulTreeBank and will be added to the grammar lexicon. Additionally, the arguments in the valency frames have been assigned ontological classes. This step will help in selecting a single analysis in cases like John read a book (John as subject and a book as complement), while keeping both possible analyses in cases like Abrams chased a dog (Abrams as subject or complement, and likewise for a dog).
5. Acknowledgements<br />
The work in this paper has been supported by the Fulbright Foundation and the EU project EuroMatrix+. It profited greatly from the collaboration with Dan Flickinger (Stanford University). The author would like to thank Kiril Simov (IICT-BAS) for his valuable comments on earlier drafts of the paper, and also the two anonymous reviewers for their very useful critical reviews.
6. References<br />
Bender, E. M., Drellishak, S., Fokkens, A., Poulson, L., Saleem, S. (2010): Grammar Customization. In: Research on Language and Computation, vol. 8 (1), pp. 23-72.
Bender, E. M., Flickinger, D., Oepen, S. (2002): The Grammar Matrix: An Open-Source Starter-Kit for the Rapid Development of Cross-Linguistically Consistent Broad-Coverage Precision Grammars. In Proceedings of the Workshop on Grammar Engineering and Evaluation at COLING 2002, Taipei, Taiwan.
Bender, E., Flickinger, D., Good, J., Sag, I. (2004):<br />
Montage: Leveraging Advances in Grammar<br />
Engineering, Linguistic Ontologies, and Mark-up for<br />
the Documentation of Underdescribed Languages. In<br />
Proceedings of the Workshop on First Steps for<br />
Language Documentation of Minority Languages:<br />
Computational Linguistic Tools for Morphology,<br />
Lexicon and Corpus Compilation, LREC 2004, Lisbon,<br />
Portugal.<br />
Bender, E. (2008): Evaluating a Crosslinguistic Grammar<br />
Resource: A Case Study of Wambaya. In Proceedings<br />
of ACL08: HLT, Columbus, OH.<br />
Copestake, A., Flickinger, D., Pollard, C., Sag, I. (2005):<br />
Minimal Recursion Semantics: An Introduction. In<br />
Research on Language and Computation (2005) 3,<br />
pp. 281–332.<br />
Flickinger, D. (2000): On building a more efficient<br />
grammar by exploiting types. In: Natural Language<br />
Engineering, 6 (1) (Special Issue on Efficient<br />
Processing with HPSG), pp. 15-28.<br />
Kathol, A. (1997): Agreement and the Syntax-<br />
Morphology Interface in HPSG. In R. Levine and G.<br />
Green (eds.) Studies in Current Phrase Structure<br />
Grammar. Cambridge University Press. pp. 223-274.<br />
Müller, S., Kasper, W. (2000): HPSG analysis of German. In W. Wahlster (ed.), Verbmobil: Foundations of Speech-to-Speech Translation (Artificial Intelligence ed., pp. 238-253). Berlin, Germany: Springer.
Oepen, St. (2001): [incr tsdb()] – competence and performance laboratory. User manual. Technical Report, Saarland University, Saarbrücken, Germany.
Osenova, P. (2010): Bulgarian Resource Grammar. Modeling Bulgarian in HPSG. Verlag Dr. Müller. 71 pp.
Osenova, P., Simov, K. (2010): Using the linguistic<br />
knowledge in BulTreeBank for the selection of the<br />
correct parses. In: TLT Proceedings, pp. 163-174.<br />
Popov, D., Simov, K., Vidinska, Sv., Osenova, P. (2003):<br />
Spelling Dictionary of Bulgarian language. Nauka i<br />
izkustvo. Sofia, 2003. (in Bulgarian)<br />
Siegel, M. (2000): HPSG Analysis of Japanese. In: W. Wahlster (ed.): Verbmobil: Foundations of Speech-to-Speech Translation. Springer Verlag.
Siegel, M., Bender, E. (2002): Efficient deep processing<br />
of Japanese. In Proceedings of the 3rd Workshop on<br />
Asian Language Resources and International<br />
Standardization, Taipei, Taiwan.
Poster Presentations<br />
Multilingual Resources and Multilingual Applications - Posters<br />
Autorenunterstützung für die Maschinelle Übersetzung
Melanie Siegel<br />
Acrolinx GmbH<br />
Rosenstr. 2, 10178 Berlin<br />
E-mail: melanie.siegel@acrolinx.com<br />
Abstract<br />
Der Übersetzungsprozess der Technischen Dokumentation wird zunehmend mit Maschineller Übersetzung (MÜ) unterstützt. Wir<br />
blicken zunächst auf die Ausgangstexte und erstellen automatisch prüfbare Regeln, mit denen diese Texte so editiert werden können,<br />
dass sie optimale Ergebnisse in der MÜ liefern. Diese Regeln basieren auf Forschungsergebnissen zur Übersetzbarkeit, auf<br />
Forschungsergebnissen zu Translation Mismatches in der MÜ und auf Experimenten.<br />
Keywords: Machine Translation, Controlled Language<br />
1. Einleitung (Introduction)
With the internationalization of the market for technologies and technology products, the demand for translations of technical documentation is growing. Especially in the European Union, awareness is growing that it is not enough to deliver documentation in English: documentation must be translated into the customers' native languages. These translations must be quickly available, easy to update, available in several languages simultaneously, and of high quality. At the same time, machine translation has made considerable technological progress in recent years: there are rule-based systems (e.g. Systran, http://www.systran.de/, which is now also being enriched with statistical methods; Callison-Burch et al., 2009) and statistical systems (e.g. Moses (Koehn, 2009; Koehn et al., 2007) or Google Translate, translate.google.com), but also hybrid translation methods (e.g. Federmann et al., 2010).
This situation has led more and more companies to try to support their translation efforts with MT. However, a number of problems arise. Users do not know the possibilities and limits of MT well enough, so their expectations are disappointed. To test the systems, completely unsuitable texts are translated, such as prose (an example: Saarbrücker Zeitung of 6 October 2009, "Vom Leid mit der Übersetzung" by Michael Brächer, who tested MT with excerpts from Goethe's "Erlkönig").
Technical documentation sent to MT is also often of insufficient quality, just like texts sent to human translators. Human translators, however, can compensate for this lack of quality in the source document, while MT systems cannot.
Statistical MT systems must be trained on parallel data. Often, TMX files extracted from translation memory systems are used for this purpose. Since these data are often noisy and contain erroneous and inconsistent translations, the quality of the trained translation is poor as well.
We have investigated how authors of technical documentation can be supported in preparing documents optimally for MT, in order to obtain optimal translation results in this way. The goal of this work is to specify the possibilities and limits of MT more precisely, to derive recommended actions for authors from this, and to support these actions with automatic methods. We proceed in three steps:
1) We examine whether the difficulties that a human translator faces carry over to MT systems.
2) We experiment with automatically checkable authoring-support rules and translate texts with MT before and after reformulation.
3) We draw on studies of "translation mismatches" in MT to find structures that are particularly difficult to translate automatically.
2. Difficulties of Human Translators – Difficulties of MT Systems
Heizmann (1994:5) describes the translation process for human translators: "In our opinion, translation is basically a complex decision process. The translator has to base his or her decisions upon available information, which he or she can get from various sources." This statement also applies to the translation process in MT, and it already makes clear that it is necessary to burden the machine with as few complex decision processes as possible.
Starting from the assumption that an MT system resembles a rather unprofessional translator, for whom texts should be prepared so that they are easier to translate, we draw parallels between the unprofessional translator and the MT system. The source text must be adapted, for translators as for an MT system, so that the problems shared by the unprofessional translator and the MT system are avoided as far as possible:
Translating individual words, phrases and sentences, without the possibility of considering larger translation units, requires that cross-sentence references, such as anaphora, be avoided.
The impossibility of paraphrasing requires simple sentence structures without ambiguities. It is also important to avoid metaphorical language, since it often cannot be translated directly but requires paraphrasing.
Translation without world knowledge means that words with different meanings in different domains (homonyms) are translated incorrectly. Such potentially ambiguous words must be avoided.
Since the range of translation variants is potentially larger than with professional translators, systematic terminology work on the source text is helpful, eliminating terminology variants in the source text in advance.
Since MT, like the unprofessional translator, has few resources providing background knowledge about the subject matter described, the description must be as clear and comprehensible as possible. This requires simple sentence structures.
3. Relevance of Automatically Checkable Authoring-Support Rules
In an experiment, we had several technical documentation documents translated by the MT system Langenscheidt T1. We then checked the documents with a large set of automatically checkable rules from Acrolinx IQ and applied the results of these checks by reformulating the source texts. The reformulated texts were translated again with Langenscheidt T1, and the translations were compared. The goal of this experiment is to find out which authoring-support rules also have important effects for MT. Some of these rules were already presented in the previous section. Based on these experiments, we compiled a rule set, which we present in the next section.
4. First Results of the Experiments
Spelling and grammar: The rule set for the German source texts first contains the standard grammar and spelling rules. The experiments clearly showed that an MT system does not deliver meaningful results if the input text contains spelling and grammar errors. If a word is unknown because it is misspelled, no translation with the MT system is possible. However, not every spelling error leads to translation problems: the experiments showed that the MT system under investigation is tolerant of both old and new German orthography; both variants "muß" and "muss" were translated correctly.
Rules for formatting and punctuation: The use of dashes leads to complex sentences in German, which cause problems in translation.
Rules for sentence structure: For sentence structure, the first concern is to avoid complex sentence structures; the top priority is to avoid overly long sentences. Complex sentence structures arise from constructions such as parentheticals, coordination of main clauses, separated verb parts, nested relative clauses, deeply embedded clauses, brackets, accumulations of prepositional phrases, descriptions of several actions in one sentence, convoluted formulations, and conditional clauses that are not introduced by "wenn" ("if"). Another problem for MT is ambiguous structures arising from noun constructions and elliptical constructions.
Rules for word choice: Filler words and set phrases are difficult for MT because it cannot paraphrase: the MT system tries to translate these words, although a professional translator would omit or reformulate them. Colloquial and figurative language is also a major problem. Pronouns are difficult to translate when their antecedent lies outside the sentence context and is unclear. With ambiguous words, the MT system often cannot resolve the ambiguity; this happens, for example, when question words are used in contexts other than questions. Semantically weak verbs with an ambiguous range of meanings are particularly problematic. Nominal style, in which verbs are nominalized, can lead to complex and incorrect constructions in English.
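To make the notion of an "automatically checkable rule" concrete, the following is a minimal sketch of how such checks might be implemented. The word list, length threshold, and dash heuristic are illustrative assumptions for demonstration only, not the actual Acrolinx IQ rules.

```python
import re

# Illustrative, simplified authoring-support checks; the threshold and
# filler-word list are assumptions for demonstration, not Acrolinx IQ rules.
MAX_WORDS = 20                                        # flag overly long sentences
FILLERS = {"eigentlich", "halt", "ja", "irgendwie"}   # filler words to avoid

def check_sentence(sentence):
    """Return a list of rule violations for one German sentence."""
    problems = []
    words = re.findall(r"\w+", sentence)
    if len(words) > MAX_WORDS:
        problems.append("sentence too long (%d words)" % len(words))
    for w in words:
        if w.lower() in FILLERS:
            problems.append("filler word: %s" % w)
    if " - " in sentence or " – " in sentence:
        problems.append("dash creates a complex sentence")
    return problems

print(check_sentence("Das ist ja eigentlich - wie man leicht sieht - ganz einfach."))
# -> ['filler word: ja', 'filler word: eigentlich', 'dash creates a complex sentence']
```

In a real checker, each violation would be linked to a reformulation suggestion shown to the author before the text is sent to MT.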
5. Applying the Rules, Reformulations and Translations
An important part of the research question was whether applying the implemented authoring-support rules actually has an effect on the MT results. In the experiment described above, we applied the established and implemented authoring-support rules to two documents and reformulated the texts according to the rules' recommendations. We then examined which rules fired most frequently and had the largest effects on the quality of the MT output. It must be noted, however, that this experiment has so far only been carried out with two documents: instructions for removing spark plugs from a car and instructions for installing a satellite dish. One interesting result: in almost half of the cases, the sentence could be improved by lexically based rules to the point where machine translation delivered good results.
6. Studies on Translation Mismatches and Resulting Recommendations
Kameyama et al. (1991) used the term "translation mismatches" to describe a key problem of machine translation. Translation mismatches are pieces of information that are not explicitly present in one of the languages involved in the translation process but that are needed in the other. The effect is that the information is lost in one translation direction and must be added in the other. As Kameyama describes, this has two important consequences:
"First in translating a source language sentence, mismatches can force one to draw upon information not expressed in the sentence - information only inferrable from its context at best. Secondly, mismatches may necessitate making information explicit which is only implicit in the source sentence or its context." (p. 194)
Translation mismatches are a major challenge for translation because knowledge that is not directly encoded linguistically must be inferred. Which translation mismatches are relevant depends strongly on the information encoded in the languages involved. For the language pair German-English, we identified the following translation mismatches in our experiments:
Lexical mismatches. The meaning of ambiguous words in the source language must be resolved in the target language, as with "über" -> "about", "above".
Noun compounds. According to the rules of German orthography, noun compounds must be written either as one word or with a hyphen. If they are written as one word, the MT analysis must identify the parts, which is not always unambiguous in German. If, on the other hand, German follows English in putting a space between the parts of the compound, the MT analysis is often overwhelmed because the relationship between the nouns remains unclear. E.g.: "bei den heutzutage verwendeten Longlife Kerzen" - "at the nowadays used ones".
Metaphor. Figurative language cannot be transferred literally. An example from the experiments: "Man ist daher leicht geneigt" - "One is therefore slightly only still to".
Pronouns. In German, the pronoun "Sie" denotes both the third person singular and the second person singular, depending on capitalization. If "Sie" stands at the beginning of a sentence, however, it remains unclear which variant is meant. Example: "Sie haben es fast geschafft" - "her it have created almost".
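The lexical mismatches above suggest a simple pre-editing aid: flag source words whose translation depends on context, so the author can rephrase or disambiguate them before MT. The following is a minimal sketch of this idea; the ambiguity dictionary is a hypothetical example, not an actual lexical resource.

```python
# Illustrative sketch: flag source words whose English translation is
# context-dependent (lexical translation mismatches). The dictionary is a
# hypothetical toy example, not an exhaustive resource.
AMBIGUOUS = {
    "über": ["about", "above", "over"],
    "Leitung": ["cable", "pipe", "management"],
}

def flag_mismatch_candidates(tokens):
    """Return (token, possible translations) pairs the author should disambiguate."""
    return [(t, AMBIGUOUS[t]) for t in tokens if t in AMBIGUOUS]

tokens = "Informationen über die Leitung".split()
print(flag_mismatch_candidates(tokens))
# -> [('über', ['about', 'above', 'over']), ('Leitung', ['cable', 'pipe', 'management'])]
```

Such a check cannot resolve the mismatch itself, but it can point the author to the places where an MT system is likely to choose the wrong reading.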
7. Summary and Next Steps
We have established a rule set for automatic authoring support. This rule set is based on studies of the problems of human translators, on experiments with MT and reformulations, and on studies of translation mismatches in MT. In a next step, we validated the resulting rule set in experiments with several MT systems. This time the translations were evaluated by professional translators. A first analysis of the validations showed:
• Reformulations triggered by the rules had no influence on the ranking of the results of the different MT systems.
• The number of classifiable errors of the MT systems rises, while the number of non-classifiable errors falls. Translations of the reformulated texts contain fewer grammar errors.
• The number of correct translations rises strongly.
Some of the pre-editing rules can generate automatic reformulation suggestions. We are looking for a way to turn these suggestions into fully automatic pre-editing.
8. Acknowledgements
This project is funded by the TSB Technologiestiftung Berlin from the Zukunftsfonds of the State of Berlin, co-financed by the European Union (European Regional Development Fund). "Investition in Ihre Zukunft!" (An investment in your future!)
9. References
Callison-Burch, C., Koehn, P., Monz, C., Schroeder, J.<br />
(2009): Findings of the 2009 Workshop on Statistical<br />
Machine Translation. In Proceedings of the Fourth<br />
Workshop on Statistical Machine Translation<br />
(WMT09), March.<br />
Drewer, P., Ziegler, W. (2011): Technische Dokumentation. Übersetzungsgerechte Texterstellung und Content-Management. Vogel-Verlag, Würzburg.
Federmann, C., Eisele, A., Uszkoreit, H., Chen, Y., Hunsicker, S., Xu, J. (2010): Further Experiments with Shallow Hybrid MT Systems. In: Callison-Burch, C., Koehn, P., Monz, C., Peterson, K., Zaidan, O. (eds.): Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pp. 77-81, Uppsala, Sweden. Association for Computational Linguistics (ACL), July 2010.
Heizmann, S. (1994): Human Strategies in Translation and Interpreting - what MT can Learn from Translators. Verbmobil Report 43. Universität Hildesheim.
Kameyama, M., Ochitani, R., Peters, S. (1991):<br />
Resolving Translation Mismatches With Information<br />
Flow. In: Proceedings of the 29th Annual Meeting of<br />
the Association for Computational Linguistics,<br />
Berkeley: 193-200.<br />
Klausner, K. (2011): Einsatzmöglichkeiten kontrollierter Sprache zur Verbesserung maschineller Übersetzung. BA-Arbeit, Fachhochschule Potsdam, Januar 2011.
Koehn, P. (2009): A Web-Based Interactive Computer<br />
Aided Translation Tool. In Proceedings of the<br />
ACL-IJCNLP 2009 Software Demonstrations, Suntec,<br />
Singapore.<br />
Koehn, P., Hoang, H., Birch, A. (2007): ‘Moses: Open<br />
Source Toolkit for Statistical Machine Translation’.<br />
Paper presented at the Annual Meeting of the<br />
Association for Computational Linguistics (ACL),<br />
Prague, Czech Republic.<br />
Siegel, M. (1997): Die maschinelle Übersetzung<br />
aufgabenorientierter japanisch-deutscher Dialoge.<br />
Lösungen <strong>für</strong> Translation Mismatches. Berlin: Logos.
Experimenting with Corpus-Based MT Approaches<br />
Monica Gavrila<br />
University of Hamburg,<br />
Vogt-Kölln Str. 30, 22527, Hamburg, Germany<br />
E-mail: gavrila@informatik.uni-hamburg.de<br />
Abstract<br />
There is no doubt that in recent years corpus-based machine translation (CBMT) approaches have been in focus. Among them, the statistical MT (SMT) approach has been by far the more dominant, although the Workshop on Example-Based MT (EBMT) at the end of 2009 showed a revived interest in the other important CBMT approach: EBMT. In this paper, several MT experiments for English and Romanian are presented. In the experimental settings, several parameters have been varied: the MT system, the corpus type and size, and the inclusion of additional linguistic information. The results obtained by a Moses-based SMT system are compared with those given by Lin-EBMT, a linear EBMT system implemented during this research. Although the SMT system outperforms the EBMT system in all the experiments, the systems behaved differently as the parameters of the experimental settings were changed, which can be of interest for further research in the area.
Keywords: Machine Translation, SMT, EBMT, Moses, Lin-EBMT<br />
1. Introduction
There is no doubt that in recent years corpus-based machine translation (CBMT) approaches have been in focus. Among them, the statistical machine translation (SMT) approach has been by far the more dominant. However, the Workshop on Example-Based MT (EBMT) at the end of 2009 (http://computing.dcu.ie/~mforcada/ebmt3/ - last accessed in January 2011) showed a revived interest in the other important CBMT approach: EBMT.
There has always been a 'competition' between these two MT approaches. Similar and unclear definitions and a mixture of ideas make the difference between them difficult to pin down. To show the advantages of one method or the other, comparisons between SMT and EBMT (or hybrid) systems are found in the literature. The results, depending on the data type and on the systems considered, seem positive for both approaches: (Way & Gough, 2005) and (Smith & Clark, 2009). For English-Romanian as a language pair, results for both SMT and EBMT systems have been reported, although a comparison between the two approaches has not been made: SMT systems are presented in (Cristea, 2009) and (Ignat, 2009); results of an EBMT system are reported in (Irimia, 2009).
In this paper several MT experiments for English (ENG)<br />
and Romanian (RON) are presented. In the experimental<br />
settings several parameters have been changed: the MT<br />
system (approach), the type and size of the corpus, the<br />
inclusion of additional part-of-speech (POS) information.<br />
The results obtained by a Moses-based SMT system are<br />
compared with the ones given by Lin-EBMT, a linear<br />
EBMT system implemented during the research. The<br />
same training and test data have been used for both MT<br />
systems.<br />
The following section will briefly present both MT<br />
systems. The data used and the translation results will be<br />
described in Section 3. Additionally, a very brief analysis<br />
of the results will be made. The paper will end with<br />
conclusions and some ideas about further work.<br />
2. System Description<br />
In this section the two CBMT systems are briefly<br />
characterized.<br />
The SMT system used follows the description of the baseline architecture given for the Sixth Workshop on SMT (http://www.statmt.org/wmt11/baseline.html - last accessed in June 2011) and is based on Moses (Koehn et al., 2007). Moses is an SMT system that allows the user to automatically train translation models for the required language pair, provided that the user has the necessary parallel aligned corpus. In our experiments we used SRILM (Stolcke, 2002) for building the language model and GIZA++ (Och & Ney, 2003) for obtaining the word alignment. Two changes were made to the specifications of the Workshop on SMT: the tuning step was left out, and the language model (LM) order was 3 instead of 5. Leaving out the tuning step was motivated by results we obtained in experiments that are not the topic of this paper, comparing different system settings: not all tests in which tuning was involved showed an improvement. We changed the LM order due to results presented in the SMART project (www.smart-project.eu - last accessed in June 2011).
Lin-EBMT is the EBMT system developed during this research. It is mainly based on surface forms (a linear EBMT system) and uses no additional linguistic resources. For reasons of space, the main steps of the Lin-EBMT system - matching, alignment and recombination - are not described in detail in this paper; we only present the main translation steps.
The test corpus is preprocessed in the same way as specified for the Moses-based SMT system: tokenization and lowercasing. To reduce the search space, a word index is used, a method often encountered in the literature, e.g. (Sumita & Iida, 1991). (The word index is in fact a token index, as it also contains punctuation signs, numbers, etc.) The information needed for translation, such as the word index or the GIZA++ word alignments, is extracted prior to the translation process itself.
The main steps in Lin-EBMT, carried out for each input sentence of the test data, are enumerated below (a token is a word, a number or a punctuation sign):
1) The tokens in the input, excluding punctuation, are extracted: {token1, token2, ..., tokenn}.
2) Using the word index, all ids of sentences that contain at least one token from the input are collected: {sentenceId1, ..., sentenceIdm}. The list of sentence ids contains no duplicates. The word index is used to reduce the search space for the matching step; the matching procedure is run only after the search space has been reduced with this index.
3) Given the preprocessed input sentence and the list of sentence ids {sentenceId1, ..., sentenceIdm}, the input is matched against the 'reduced' source language (SL) side of the corpus. If the input sentence is found in the corpus, the translation is found and the translation procedure stops. Otherwise, the most similar sentences are extracted using a similarity measure developed during the research. This measure is based on the longest common subsequence algorithm found in (Bergroth et al., 2000).
4) After obtaining the sentences that maximally cover the input, the corresponding word alignments are extracted, considering the longest aligned target language (TL) subsequences possible.
5) Using the "bag of TL sequences" obtained from the alignment, the output is generated by means of a recombination matrix, a new approach for implementing this step.
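Steps 1-3 above (token index, candidate retrieval, longest-common-subsequence matching) can be sketched as follows. The toy corpus and the LCS-based similarity are simple stand-ins for illustration; the paper's actual similarity measure is not reproduced here.

```python
from collections import defaultdict

# Illustrative sketch of steps 1-3 (token index + LCS-based matching).
# The corpus and similarity measure are toy stand-ins, not the paper's exact ones.
corpus_sl = ["the committee approved the report",
             "the joint committee met today"]

# Build the token index: token -> set of sentence ids containing it.
index = defaultdict(set)
for sid, sent in enumerate(corpus_sl):
    for tok in sent.split():
        index[tok].add(sid)

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def best_match(input_sent):
    tokens = input_sent.split()
    # Step 2: candidate sentences sharing at least one token with the input.
    candidates = set().union(*(index[t] for t in tokens if t in index))
    # Step 3: rank the candidates by LCS-based similarity to the input.
    return max(candidates, key=lambda sid: lcs_len(tokens, corpus_sl[sid].split()))

print(corpus_sl[best_match("the committee approved the new report")])
# -> the committee approved the report
```

The index keeps the quadratic LCS computation restricted to sentences that share at least one token with the input, which is what makes the matching step tractable on larger corpora.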
More details about the Lin-EBMT system can be found in (Gavrila, 2011).
3. Evaluation<br />
For our evaluation we used two corpora. The first is a sub-part of JRC-Acquis version 2.2 (Steinberger et al., 2006), a freely available parallel corpus in 22 languages formed from European Union documents of a mostly legal nature; the second is RoGER, a small technical-manual corpus, manually created and corrected (Gavrila & Elita, 2006). The same training and test data have been used for both the SMT and the EBMT experiments. In the EBMT system, matching is done on the corpus used for the translation model in the SMT system, and recombination on the one used for the language model. Both corpora had to be stored in the format required by each of the MT systems.
The tests on the JRC-Acquis data were run on 897 sentences that were not used for training. The sentences were automatically extracted from different parts of the corpus to ensure relevant lexical, syntactic and semantic coverage. Three sets of 299 sentences each represent the data sets Test 1, Test 2 and Test 3, respectively; Test 1+2+3 comprises all 897 sentences. The test data has no sentence length restriction, in contrast to the training data (see the Moses specification).
From RoGER, 133 sentences (Test R) were randomly extracted as test data, the remaining 2,200 sentences forming the training data. When using RoGER, POS information was included for some of the experiments, giving the data set Test RwithPOS (the POS information was provided by the text-processing web services at www.racai.ro/webservices/TextProcessing.aspx - last accessed in January 2011).
The translations obtained have been evaluated with two automatic evaluation metrics: BLEU (Papineni et al., 2002) and TER (Snover et al., 2006). The choice of metrics is motivated by the available resources (software) and, for comparability, by the results reported in the literature. Due to the lack of data and further translation possibilities, we compared against only one reference translation.
The evaluation scores obtained are presented in Tables 2 and 3.
ENG-RON          SMT      Lin-EBMT
Test 1           0.5007   0.8071
Test 2           0.4898   0.6400
Test 3           0.5208   0.7770
Test 1+2+3       0.5023   0.7326
Test R           0.3784   0.5955
Test RwithPOS    0.4748   0.6402

RON-ENG          SMT      Lin-EBMT
Test 1           0.5020   0.7041
Test 2           0.3756   -
Test 3           0.4684   -
Test 1+2+3       0.4457   -
Test R           0.3465   0.5443
Test RwithPOS    0.4000   0.5490

Table 2: Evaluation Results (TER scores)
The lower the TER scores, the better the translation<br />
results. For the BLEU score the relationship between the<br />
scores and the translation quality is the opposite.<br />
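To illustrate how a BLEU-style score behaves, the following is a deliberately simplified, sentence-level sketch using only unigram and bigram precision with a brevity penalty. The official BLEU of Papineni et al. (2002) is corpus-level, uses up to 4-grams, and differs in several details; this is an illustrative toy, not the metric used in the evaluation.

```python
import math
from collections import Counter

# Highly simplified, sentence-level BLEU sketch (unigrams and bigrams only,
# single reference); illustrative only, not the official corpus-level metric.
def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in (1, 2):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())  # clipped n-gram counts
        precisions.append(overlap / max(sum(h.values()), 1))
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 2) if all(precisions) else 0.0

print(simple_bleu("Comitetul mixt al SEE", "Comitetul mixt al SEE"))  # identical -> 1.0
```

Even this toy version shows the key property relied on above: a perfect match scores 1.0, any missing n-gram overlap pushes the score toward 0, and TER behaves in the opposite direction.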
When analyzing the behavior of each MT system while changing the test data set within one corpus (i.e. JRC-Acquis), several factors with a direct influence on the results were found, such as the number of out-of-vocabulary words, the number of test sentences found verbatim in the training data, sentence length, and the way the training data was extracted: see Test 1 - Test 3. For one specific data set (Test 2), the BLEU score obtained for the EBMT system is similar to the one presented in (Irimia, 2009), where linguistic resources were used (a one-to-one comparison is not possible, as the data is not the same).
When analyzing the behavior of each MT system while changing the corpus (a larger and a smaller corpus, fitting the SMT and the EBMT framework, respectively), i.e. when comparing Test 1+2+3 and Test R, an improvement is found in both cases for the RoGER corpus, although it is usually stated that a large corpus is needed for SMT. In this specific case the result may be due to the data type; it shows the strong influence of the data on empirical approaches.
ENG-RON          SMT      Lin-EBMT
Test 1           0.3997   0.1335
Test 2           0.4179   0.3072
Test 3           0.3797   0.1476
Test 1+2+3       0.4015   0.2125
Test R           0.4396   0.2689
Test RwithPOS    0.3879   0.2942

RON-ENG          SMT      Lin-EBMT
Test 1           0.2545   0.0855
Test 2           0.5628   -
Test 3           0.4271   -
Test 1+2+3       0.4255   -
Test R           0.4765   0.2783
Test RwithPOS    0.4618   0.3624

Table 3: Evaluation Results (BLEU scores)
The results for the data with additional POS information (Test RwithPOS) are not conclusive: the TER scores are worse for both MT systems, but the BLEU score shows an improvement for the EBMT system.
In terms of overall BLEU and TER scores, the EBMT<br />
system is outperformed by the SMT one. Still, there are<br />
cases where the EBMT system provides a better<br />
translation, as in the example below:<br />
Input: The EEA Joint Committee<br />
Reference: Comitetul mixt al SEE,<br />
SMT output: SEE Comitetului mixt,<br />
(* ENG: EEA of the Joint Committee)<br />
Lin-EBMT output: Comitetului mixt SEE<br />
(* ENG: of the EEA Joint Committee)<br />
4. Conclusions and Further Work
Within this framework (system configuration and data), in a direct comparison the EBMT system was not able to match the performance of the SMT system, but there were examples where its translation was more accurate. The evaluation scores presented in this paper show how strongly training and test data influence translation results. This EBMT implementation does not yet exploit the full power of the approach, so there is room for improvement. As further work, additional information, e.g. word-order information from the TL sentences, is to be extracted and used in the recombination step.
5. References<br />
Bergroth, L., Hakonen, H., Raita, T. (2000): A survey of<br />
longest common subsequence algorithms. In Proc. of<br />
the Seventh International Symposium on String<br />
Processing and Information Retrieval - SPIRE 2000,<br />
pp. 39-48, Spain. ISBN: 0-7695-0746-8.<br />
Cristea, D. (2009): Romanian language technology and
resources go to Europe. Presented at the FP7
Language Technology Informative Days. URL:
ftp://ftp.cordis.europe.eu/pub/fp7/ict/docs/language-te
chnologies/cristea en.pdf - last accessed on April 10th,
2009.
Gavrila, M., Elita, N. (2006): Roger - un corpus paralel<br />
aliniat. In Resurse Lingvistice si Instrumente pentru<br />
Prelucrarea Limbii Romane Workshop Proceedings,<br />
pages 63-67. Workshop held in November 2006,<br />
Publisher: Ed. Univ. Alexandru Ioan Cuza, ISBN:<br />
978-973-703-208-9.<br />
Gavrila, M. (2011): Constrained recombination in an
example-based machine translation system. In Mikel L.
Forcada, Heidi Depraetere, Vincent Vandeghinste,
editors, Proceedings of the EAMT-2011 Conference,
pages 193-200, Leuven, Belgium, May 2011. ISBN:
9789081486118.
Ignat, C. (2009): Improving Statistical Alignment and<br />
Translation Using Highly Multilingual Corpora. PhD<br />
thesis, INSA - LGeco- LICIA, Strasbourg, France.<br />
URL: http://sites.google.com/site/cameliaignat/home/<br />
phd-thesis (last accessed on August 3rd, 2009).
Irimia, E. (2009): EBMT experiments for the<br />
English-Romanian language pair. In Proceedings of<br />
the Recent Advances in Intelligent Information<br />
Systems, pages 91-102. ISBN 978-83-60434-59-8.<br />
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C.,<br />
Federico, M., Bertoldi, N., Cowan, B., Shen, W.,<br />
Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin,<br />
A., Herbst, E. (2007): Moses: Open source toolkit for<br />
statistical machine translation. In Annual Meeting of<br />
the Association for Computational Linguistics (ACL),<br />
demonstration session, Prague, Czech Republic.<br />
Och, F. J., Ney, H. (2003): A systematic comparison of<br />
various statistical alignment models. Computational<br />
Linguistics, 29(1), pp. 19-51.<br />
Papineni, K., Roukos, S., Ward, T., Zhu, W.-J. (2002):<br />
Bleu: a method for automatic evaluation of machine<br />
translation. In Proceedings of the 40th Annual Meeting<br />
on Association for Computational Linguistics, Session:<br />
Machine translation and evaluation, pp. 311-318,<br />
Philadelphia, Pennsylvania. Publisher: Association for<br />
Computational Linguistics Morristown, NJ, USA.<br />
Smith, J., Clark, S. (2009): EBMT for SMT: A new<br />
EBMT-SMT hybrid. In Forcada, M. L. and Way, A.,<br />
editors, Proceedings of the 3rd International
Workshop on Example-Based Machine Translation,<br />
pp. 3-10, Dublin, Ireland.<br />
Snover, M., Dorr, B., Schwartz, R., Micciulla, L.,<br />
Makhoul, J. (2006): A study of translation edit rate<br />
with targeted human annotation. In Proceedings of<br />
Association for Machine Translation in the Americas.<br />
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C.,<br />
Erjavec, T., Tufis, D., Varga, D. (2006): The<br />
JRC-Acquis: A multilingual aligned parallel corpus<br />
with 20+ languages. In Proceedings of the 5th<br />
International Conference on Language Resources and<br />
Evaluation (LREC'2006), Genoa, Italy.<br />
Stolcke, A. (2002): SRILM - An extensible language<br />
modeling toolkit. In Proceedings of the International<br />
Conference on Spoken Language Processing,<br />
pp. 901-904, Denver, Colorado.<br />
Sumita, E., Iida, H. (1991): Experiments and prospects of<br />
example-based machine translation. In Proceedings of<br />
the 29th annual meeting on Association for<br />
Computational Linguistics, pp. 185-192, Morristown,<br />
NJ, USA. Association for Computational Linguistics.<br />
Way, A., Gough, N. (2005): Comparing example-based<br />
and statistical machine translation. Natural Language<br />
Engineering, 11, pp. 295-309. Cambridge University<br />
Press.
A Method of POS Disambiguation Using Information about Word Co-occurrence
(for Russian)
Edward Klyshinsky 1 , Natalia Kochetkova 2 , Maxim Litvinov 2 , Vadim Maximov 1<br />
1 Keldysh IAM<br />
Moscow, Russia, 125047 Miusskaya sq. 4<br />
2 Moscow State Institute of Electronics and Mathematics<br />
Moscow, Russia, 109029 B. Tryokhsvyatitelsky s. 3<br />
E-mail: klyshinsky@itas.miem.edu.ru, natalia_k_11@mail.ru, promithias@yandex.ru, vadimmax2000@mail.ru<br />
Abstract<br />
The article describes a combined method of part-of-speech disambiguation for texts in Russian. The method is based on
information about the syntactic co-occurrence of Russian words. The article also discusses the method of building such a
corpus. This project is partially funded by RFBR grant 10-01-00805.
Keywords: learning corpora, word co-occurrence database, POS disambiguation
1. Introduction<br />
Part-of-speech disambiguation is an important problem
in automatic text processing, and many systems now
address it. The earliest projects used rule-based methods
(see, for example, Tapanainen & Voutilainen, 1994). This
approach supplies the system with constraint rules which
forbid or allow certain word combinations. However,
writing such rules is a time-consuming procedure.
Moreover, although the rules give good results, they often
leave a considerable part of the text uncovered. This has
motivated various statistical methods for generating such
rules automatically (for example, Brill, 1995).
The n-gram method uses the statistical distribution of
word combinations in the text. In general, the n-gram
model can be written down as follows:

P(wi) = argmax P(wi | wi-1) * ... * P(wi | wi-N),  (1)

where P(wi) is the probability of the occurrence of an
unknown tag wi given the tags of its neighbours.
To avoid the sparse-data problem of assigning zero
probability to an unseen tag combination, a smoothed
probability can be applied in the trigram model. The
smoothed trigram model is a linear combination of
trigram, bigram and unigram probabilities:
Psmooth(wi | wi-2 wi-1) = λ3 * P(wi | wi-2 wi-1) + λ2 * P(wi | wi-1) + λ1 * P(wi),  (2)
where λ1 + λ2 + λ3 = 1 and λ1, λ2, λ3 > 0. The values of
λ1, λ2 and λ3 are obtained by solving a system of linear
equations.
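Formula (2) can be sketched as follows, assuming the component probabilities have already been estimated from a corpus and stored in dictionaries (all names and the example λ values are illustrative):

```python
def smoothed_trigram(tri, bi, uni, w, w1, w2, lambdas=(0.1, 0.3, 0.6)):
    """P_smooth(w | w2 w1) = l3*P(w | w2 w1) + l2*P(w | w1) + l1*P(w),
    with l1 + l2 + l3 = 1 (formula 2).  tri/bi/uni map tag contexts to
    estimated probabilities; unseen contexts contribute 0."""
    l1, l2, l3 = lambdas
    return (l3 * tri.get((w2, w1, w), 0.0)
            + l2 * bi.get((w1, w), 0.0)
            + l1 * uni.get(w, 0.0))
```

Because λ1 > 0, any tag that occurs in the corpus at all retains a non-zero smoothed probability even when its trigram and bigram contexts were never observed.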
In their disambiguation model, Zelenkov et al. (2005)
defined an unknown tag wi using information not only
about its left neighbours but also about its right ones. We
use a similar approach when our system works with the
trigram model. In this case the unknown tag is defined
using the left neighbours (3), the right ones (4), or both
the left and the right ones (5).
However, both the rule-based and the trigram models
require large tagged text corpora. Trigram rules that do
not contain lexeme information reflect language-specific
features, whereas the trigrams themselves (with lexemes
inside) rather reflect the lexis in use. On texts from
another knowledge domain, the trigrams may therefore
perform considerably worse than on the initial corpus.
According to Google research, their digital collection of
English texts contains 10^12 words. The British (BNC,
2011) and American (ANC, 2011) National Corpora
contain about 10^8 tagged words. According to
the information from January 2008, the Russian National
Corpus (RNC, 2011) contains about 5.8×10^6
disambiguated words. At present the process of extending
the latter corpus is largely frozen (unlike the first years of
the project, when it was being filled up intensively).
Tagging 10^12 words, even with automation, seems
economically impracticable and may even be
unnecessary. Practical applications processing 10^9
trigrams (a quantity estimate for English can be found in
Google (2006)) would require a considerable amount of
computational resources.
Trigram databases accumulated so far solve the problem
with 94-95% accuracy for Russian (Sokirko, 2004).
Additional methods increase the disambiguation quality
up to 97.3% (Lyashevskaya, 2010). It is worth noting that
applying rule-based methods requires a substantial
investment of time, while applying trigrams demands a
well-tagged corpus, which is costly too. Rule creation
also requires continuous work by linguists. The results of
such work are never in vain, as the output remains
applicable to many other projects, but they cannot
improve accuracy immediately. We therefore set
ourselves the goal of developing a new method that
combines the results previously accumulated in this field
with information from partial syntactic analysis.
2. Obtaining statistical data on<br />
co-occurrence of words<br />
It is widely acknowledged that lexical ambiguity should
be resolved before syntactic analysis, and methods like
n-grams are usually recommended for this. However, the
n-gram method requires substantial preliminary work to
prepare a tagged text corpus. We therefore decided to
develop a disambiguation method that uses syntactic
information (obtained automatically) without carrying
out full syntactic parsing. Our research focuses on
Russian.
As practice has shown, full parsing that constructs a
complete tree is not required to resolve most of the
homonymy (about 90%). It is sufficient to include rules
for word collocation in noun and verb phrases, folding of
homogeneous sentence parts, subject-predicate
agreement, preposition and case government, and a few
others, in total not exceeding 20 rules, all describable by a
context-free grammar. A more detailed look at methods
of formal language description can be found, for instance,
in Ermakov (2002).
To solve the problems mentioned above, it is necessary to
devise a method for obtaining information on syntactic
relationships between words from an untagged corpus.
Preliminary experiments have shown that in Russian
approximately 50% of words are part-of-speech
unambiguous (up to 80% in conversational texts,
compared with less than 40% for English news), i.e. such
words have no lexical homonyms. The probability of
finding a group of unambiguous words in a text is
therefore rather high.
The analysis of Russian sentence structure allows us to
determine some of its characteristic syntactic features:
1) A noun phrase (NP) that follows the sole verb in the
sentence is syntactically dependent on this verb.
2) A sole NP that opens the sentence and is followed by
a verb is syntactically subordinate to this verb.
3) Adjectives located before the first noun in the
sentence, or between a verb and a noun, are
syntactically subordinate to this noun.
4) Rules 1-3 can also be applied to adverbial
participles, and participles can be considered instead
of adjectives.
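The heuristics above can be sketched as follows for rules 1 and 3. The sentence is assumed to be a list of (token, POS) pairs whose tags are already unambiguous; the tag names and function name are illustrative, not the authors' actual implementation:

```python
def extract_pairs(sentence):
    """Extract (head, dependent) pairs by rules 1 and 3:
    1) the first noun after the sole verb depends on that verb;
    3) an adjective immediately before a noun depends on that noun."""
    pairs = []
    verbs = [i for i, (_, pos) in enumerate(sentence) if pos == "V"]
    if len(verbs) == 1:                       # rule 1 requires a sole verb
        v = verbs[0]
        nouns_after = [i for i, (_, pos) in enumerate(sentence)
                       if pos == "N" and i > v]
        if nouns_after:
            pairs.append((sentence[v][0], sentence[nouns_after[0]][0]))
    for i, (tok, pos) in enumerate(sentence[:-1]):   # rule 3
        if pos == "Adj" and sentence[i + 1][1] == "N":
            pairs.append((sentence[i + 1][0], tok))
    return pairs
```

For a fragment like "думал молодой повеса" ("thought the young rake"), this yields the verb+noun pair (думал, повеса) and the noun+adjective pair (повеса, молодой), the two pair types counted in Table 2.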
We applied our method to several untagged corpora of
Russian totalling more than 4.2 billion words. The text
sources cover various themes; the corpora used are listed
in Table 1.
The morphological tagging was performed with the
morphological analysis module of "Crosslator",
developed by our team (Yolkeen, 2003). The volume of
the databases obtained is listed in Table 2: the numerator
shows the total number of detected unambiguous words
with the given fixed type of syntactic relation, and the
denominator shows the number of unique word
combinations of that type.
The analysis of the results (Table 2) shows that the
selected pairs contain 22,200 verbs out of the 26,400
represented in the morphological dictionary, 55,200
nouns out of 83,000, and 27,600 adjectives out of 45,300.
The large proportion of verbs can be explained by their
low degree of ambiguity compared with other parts of
speech. The small number of adjectives is explained by
the fact that, of several adjectives located immediately
before a noun, only the first was entered into the database.
Notably, when the largest corpus was integrated into the
system, the number of lexemes did not change
appreciably, but the number of detected pairs increased
significantly: the number of verbs grew from 21,500 to
22,200, whereas the number of unique verb+noun
combinations grew from 8.3 million to 10.9 million.
Moreover, the number of such combinations occurring
more than twice grew from 2.3 to 4 million. Thus, once a
corpus exceeds one billion words, the lexis in use reaches
its saturation limit, while its usage continues to change.
Source               Amount, mln w/u
WebReading           3049
Moshkov's Library    680
RIA News             156
Fiction coll.        120
Nezavisimaya Gazeta  89
Lenta.ru             33
Rossiyskaya Gazeta   29
PCWeek RE            28
RBC                  21
Compulenta.ru        9
Total                4214

Table 1: Used corpora
Pair     Total, mln    >1, mln      >2, mln
V+N      243 / 10.89   237 / 5.27   235 / 4
Ger+N    40.8 / 2.76   39.3 / 1.25  38.7 / 0.91
N+Adj    67 / 2.15     66 / 1.13    65.6 / 0.9

Table 2: Obtained results
About 9% of all word occurrences in the corpus were
used to build the co-occurrence database, yet even this
percentage proved sufficient to construct a representative
sample of word co-occurrence statistics. Our estimates
show that the extracted word combinations contain no
more than 3% errors, mostly caused by improper word
order, neglect of some syntactically acceptable
collocation variants, deviations in projectivity, and
mistakes in
the text. It should be stressed that all results were
obtained quickly and without any manual tagging of the
corpus. The results could probably be more
representative if we first applied some method of
part-of-speech disambiguation. However, even the best
methods give a 3-5% error, which would affect the
accuracy of the results only marginally. On the other
hand, a sharp increase in corpus volume makes it possible
to discard false alternatives at a higher occurrence
threshold and thereby preserve the quality level.
3. Complex Method of Disambiguation<br />
Having collected a sufficiently large co-occurrence
database, we had everything necessary to address the
main problem: creating a disambiguation method for
texts in Russian based on information about the syntactic
co-occurrence of words.
Assume that the sentence being parsed contains two
words separated by at most a few words, and that these
two words can be linked by a syntactic relation. If the
alternative taggings of these words are less probable, we
may assume that the variant with such a link is the more
probable one. The most difficult part is collecting a
representative database of syntactic relations.
In this paper, a rule is understood as an ordered set
⟨vi, vi+1, vi+2⟩, where vi = ⟨pw, {pr}⟩ is a short
description of a word, pw is the word's part of speech,
and {pr} is a set of its lexical parameters. Thus, in such
rules the lexeme of a word is not taken into account, in
contrast to its lexical characteristics. A rule may be
interpreted in different ways: as an occurrence of vi with
regard to its right neighbours, as an occurrence of vi+2
with regard to its left neighbours, or as an occurrence of
vi+1 with regard to both of its neighbours. The set of
rules was obtained from the tagged corpus. Following
Zelenkov (2005), we tag a word considering both its right
and its left neighbours. In the paper mentioned above, the
tag of a word is defined only with regard to the nearest
neighbours of the current word. However, it is not
necessary to produce a result that reaches the global
maximum; an exhaustive search over word-tagging
variants is usually avoided, as it takes too much time.
As already noted above, the ratio of unambiguous tokens
in Russian is about 50%. In this
connection, there is always a fair probability of finding a
group of two unambiguous words, and the chance grows
with the length of the sentence. If no such groups are
found while searching for a global maximum, the first
word of the sentence indirectly influences even the last
word. If such groups are present, this long-range
dependency is cancelled, and the search for the global
criterion can be carried out over separate fragments of the
sentence, which substantially increases the speed of the
algorithm. For example, the sentence "Так думал
молодой повеса, / Летя в пыли на почтовых, /
Всевышней волею Зевеса / Наследник всех своих
родных." (Such were a young rake's meditations – / By
will of Zeus, the high and just, / The legatee of his
relations – / As horses whirled him through the dust.) can
be split into three independent parts: "Так думал
молодой повеса, Летя", "Летя в пыли на почтовых"
and "Всевышней волею Зевеса Наследник всех своих
родных".
Thus, we no longer consider the problem

Psent = argmax( ∏ i=1..ns P(vi | vi-1, vi-2) ),

where ns is the number of words in the sentence, but

Psent = ∏ j=1..nf argmax( ∏ i=1..nfj P(vi | vi-1, vi-2) ),

where nf is the number of fragments and nfj is the number
of words in the j-th fragment. According to formulas
(2)-(4), we consider both the left and the right neighbours
of the word.
We seek the optimum from the edges of the fragment
towards its centre. Obviously, the product of the maximal
probability values for each word can yield the global
maximum. If this is not the case, but the values obtained
from the two sides lead to one and the same
disambiguation of the word in the middle of the fragment,
we also consider the solution good enough. If the two
solutions disambiguate the middle word differently,
optimization is carried out over the accessible variants
until they reach one and the same decision. In any case,
optimization is never carried out over an entire fragment,
let alone the whole sentence.
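The fragment splitting described above can be sketched as follows. Each word is represented only by its set of candidate tags, and a sentence is cut after every group of two adjacent unambiguous words; this is a simplification of the actual criterion, with illustrative names throughout:

```python
def split_fragments(candidates):
    """Split a sentence (a list of candidate-tag sets, one per word) into
    independent fragments after each pair of adjacent unambiguous words."""
    fragments, current, run = [], [], 0
    for tags in candidates:
        current.append(tags)
        run = run + 1 if len(tags) == 1 else 0
        if run == 2:                  # unambiguous pair found: close fragment
            fragments.append(current)
            current, run = [], 0
    if current:
        fragments.append(current)
    return fragments
```

Each fragment can then be optimized independently, which is what makes the search over tagging variants tractable.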
The number of unambiguous fragments can be increased
by a preliminary disambiguation using another method.
We use the database of syntactic dependencies described
above. Let us have a set {⟨w1, w2, w3, p⟩}, where each
wi = ⟨lw, pw, {pr}⟩ is a complete description of a word,
lw is the word's lexeme, w1 is the key word of the word
group (for example, the verb in a «verb+noun» pair), w2
is a preposition (if any), w3 is a dependent word, and p is
the probability of the word combination w1 + w2 + w3.
All rules are searched for every word of the sentence.
Note that no word can participate in more than two rules.
Thus, for each word it is necessary to compute
argmax(p1 + p2), where p1 and p2 are the probabilities of
the rules containing this word in the dominant and the
dependent position, respectively.
In fact, when checking the compatibility of words with
one another, our system uses the bigram model
P(wi) = argmax P(wi | wi-l), where l is the distance (in
words) at which the unknown word may stand from the
known one. A rule containing a given word is selected as
follows. We take a floating window of 10 words to the
right and to the left. The dependent word must be located
within this window; the preposition must be located
before the dependent word, with no main word between
them; and an adjective must agree with its noun.
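A sketch of this floating-window lookup, assuming the co-occurrence database is a flat list of (head, dependent, p) entries (prepositions and the agreement check are omitted for brevity, and all names are illustrative):

```python
def best_rule_for(words, i, rules, window=10):
    """Return the highest-probability rule whose head is words[i] and whose
    dependent occurs within +/- window words, or None if nothing matches."""
    lo, hi = max(0, i - window), min(len(words), i + window + 1)
    best = None
    for head, dep, p in rules:
        if words[i] != head:
            continue
        if any(j != i and words[j] == dep for j in range(lo, hi)):
            if best is None or p > best[2]:
                best = (head, dep, p)
    return best
```

The matched rule then fixes the part of speech of both words it covers, enlarging the unambiguous fragments used by the search described above.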
4. Results of experiments and discussion<br />
As a result of our work we obtained a corpus of syntactic
combinations of Russian words. The relations were
extracted from untagged corpora of general-lexis texts
containing more than 4 billion words; tagging was carried
out "on the fly". About 6 million authentic unique word
combinations were revealed, occurring more than 340
million times in the texts. According to our estimates, the
proportion of errors in the obtained corpus does not
exceed 3%. The number of word combinations can be
enlarged by processing texts of a new domain. However,
our investigations showed that scientific texts use other
constructions, which reduces the number of sampled
combinations, for example for speech and cognition
verbs. Our method extracts about 9% of tokens from
common-lexis texts, but only about 5% from news feeds,
and for scientific texts this number drops to 3%. The
method thus shows different productivity for different
domains. Further experiments showed that the results
obtained can also be used for defining the style of texts.
The suggested method thus allows almost automatic
acquisition of information on word compatibility, which
can further be used, for instance, for parsing or at other
stages of text processing. The method is also not strictly
tied to texts of a certain domain and has a rather low cost
of extension.
The efficiency of the system with various parameters was
estimated on carefully tokenized corpora containing
about 2,300 words. Results were checked using Precision
and Accuracy measures. Using only the information on
word compatibility in Russian, the method showed
71.98% Precision and 96.75% Accuracy. This result is
comparable with the best results in the field (Lee, 2010).
The advantage of this method is its ability to be adjusted
to a new knowledge domain quickly and, most
importantly, automatically, provided a sufficiently large
text corpus is available. The method gives an acceptable
quality of disambiguation, though with a rather low
Precision.
The coverage ratio can be improved by applying trigram
rules, which can easily be obtained, for example, from
http://aot.ru, or by analysing a tagged corpus of Russian
(for example, http://ruscorpora.ru). The coverage ratio
then reached 78%, but the accuracy fell to 95.6%.
Sokirko (2004) mentions that the systems Inxight and
Trigram provide 94.5% and 94.6% accuracy respectively,
which is comparable with the results of our system. A
further improvement of the coverage ratio up to 81.3% is
possible by improving the optimal-decision search
algorithm described above, but it slightly lowers the
accuracy. In its current state the method cannot achieve
full coverage, because the part-of-speech list it uses is not
complete: it contains only verbs, verbal adverbs,
participles, nouns, adjectives, prepositions and adverbs.
Furthermore, there was no information on some types of
relations, for example «noun+noun». Finally,
compatibility information for some Russian words
fundamentally cannot be obtained because of their
inherent homonymy; for example, the word "white" can
be used both as an adjective and as a noun.
Our results are applicable to some (but not all) European
languages. Highly ambiguous English, for instance, does
not allow constructing such a word-combination
database. The method can be applied to German or
French, but the rules would have to be completely
rewritten; phenomena like separable verb prefixes in
German and different word order must be taken into
account.
5. References<br />
Tapanainen P., Voutilainen A. (1994): Tagging
accurately - don't guess if you know. In Proceedings of
the Conference on Applied Natural Language
Processing, 1994.
Brill E. (1995): Unsupervised learning of disambiguation
rules for part of speech tagging. In Proceedings of the
Third Workshop on Very Large Corpora, pp. 1-13, 1995.
Zelenkov Yu.G., Segalovich Yu.A., Titov V.A. (2005): A
probabilistic model of morphological disambiguation
based on normalizing substitutions and the positions of
neighbouring words [in Russian]. In Proceedings of the
International Conference «Dialog'2005».
British National Corpus (2011):
http://www.natcorp.ox.ac.uk/
American National Corpus (2011):
http://americannationalcorpus.org/
Russian National Corpus (2011):
http://www.ruscorpora.ru/
Google (2006): All Our N-gram are Belong to You,<br />
Google research blog,<br />
http://googleresearch.blogspot.com/2006/08/all-our-ngram-are-belong-to-you.html<br />
Sokirko A.V., Toldova S.Yu. (2004): A comparison of the
effectiveness of two techniques for lexical and
morphological disambiguation for Russian [in Russian].
In Proceedings of the conference «Corpus Linguistics 2004».
Lyashevskaya O. et al. (2010): Evaluation of methods for
automatic text analysis: morphological parsers of
Russian [in Russian]. In Proceedings of the International
Conference «Dialog'2010».
Ermakov A.E. (2002): Partial syntactic analysis of text in
information retrieval systems [in Russian]. In Proceedings
of the International Conference «Dialog'2002».
Yolkeen S.V., Klyshinsky E.S., Steklyannikov S.E.
(2003): Problems of creating a universal
morphosemantic dictionary [in Russian]. In Proceedings
of the International Conferences IEEE AIS'03 and
CAD-2003, vol. 1, 2003, pp. 159-163.
Lee Y.K., Haghighi A., Barzilay R. (2010): Simple
Type-Level Unsupervised POS Tagging. In Proceedings
of EMNLP 2010.
From TMF towards UML: In Three Steps to a Model
of the Translation-Oriented Specialized Dictionary 1
Georg Löckinger<br />
Universität Wien and Österreichische Akademie der Wissenschaften
Vienna, Austria
georg.loeckinger@univie.ac.at<br />
Abstract<br />
Specialized translators need tailor-made specialized-language information for their work. However, there is a large discrepancy
between their needs and the available specialized-language resources. In my dissertation project I pursue the central research
question of whether specialized translation can be made more efficient with an ideal translation-oriented specialized dictionary.
To answer this central research question, several theses are first put forward. From these, a model of the translation-oriented
specialized dictionary is derived at two levels of detail, which is later to be tested experimentally in practice with "ProTerm", a
tool for terminology work and text analysis. This paper gives an overview of the research work done so far. First, 15 theses are
briefly presented, based on the relevant scientific literature and my own professional experience in specialized translation and
terminology work. The main part of the paper discusses a model of the translation-oriented specialized dictionary. The model
serves as the link between the concrete requirements expressed by the 15 theses and their practical implementation with
"ProTerm". The paper closes with an outlook on the next steps in my dissertation project.
Keywords: translation-oriented specialized dictionary, translation-oriented terminography, specialized lexicography,
specialized translation
1. Introduction 1
Specialized translators have long dreamed of a
translation-oriented reference work (or tool) that takes
their needs into account to the greatest possible extent.
The historical starting point for the study of the relevant
scientific literature is Tiktin (1910), with the resonant
title "Wörterbücher der Zukunft" ("Dictionaries of the
Future"). Several other publications also give voice to
these as yet unfulfilled dreams; cf. Hartmann (1988),
Snell-Hornby (1996), de Schryver (2003). The
discrepancy between the available specialized-language
resources and what specialized translators actually need
has led to a certain dissatisfaction among the latter. As a
consequence, they began to create their own
terminological data collections and reference works (or
tools). Thus, the activity of terminology compilation was
added to that of terminology use.
1 This paper is an extended and revised German version of
Löckinger (2011).
2. Requirements for the translation-oriented specialized dictionary: 15 theses

Contrary to a widespread opinion, specialized translation is a complex process; cf. e.g. Wilss (1997). The translation-oriented specialized dictionary (TSD) therefore has to meet manifold requirements. In the following, I present these in the form of 15 theses, which draw on the scholarly literature and/or on arguments of my own. The 15 theses are derived from the empirical practice of specialized translation and from the scholarly study of that practice. [2] They are assigned to one of the categories "method-related", "content-related" and "presentation and linking of content", which, like the individual theses themselves, complement and partly overlap each other.

[2] A detailed presentation of the arguments for the individual theses, together with the respective references, would go beyond the scope of this paper. A bibliography is available from the author.
Multilingual Resources and Multilingual Applications - Posters
2.1. Method-related requirements

Thesis 1 (systematic terminology work): The TSD must have been compiled according to the principles and methods of systematic terminology work.

Thesis 2 (description of the methodology applied): The TSD must give information about the (lexicographic and/or terminographic) methodology used in its compilation.
2.2. Content-related requirements

Thesis 3 (terms and specialized phrases and their equivalents): The TSD must contain terms, specialized phrases and equivalents in the source language and target language(s).

Thesis 4 (grammatical information): The TSD must offer grammatical information on terms, specialized phrases and equivalents.

Thesis 5 (definitions): The TSD must contain definitions of the concepts it describes.

Thesis 6 (contexts): The TSD must provide authentic contexts (above all in the target language).

Thesis 7 (encyclopedic information): The TSD must contain encyclopedic information (subject-field background information, e.g. on the use of a particular object).

Thesis 8 (multimedia content): The TSD must make use of multimedia content (graphics, diagrams, sound files, etc.) where possible and needed.

Thesis 9 (notes): The TSD must be furnished with notes on the terminology it contains, e.g. warnings about translation errors.
2.3. Requirements for the presentation and linking of content

Thesis 10 (electronic form): To satisfy most of the other requirements, the TSD must be available in electronic form.

Thesis 11 (concept-systematic and alphabetical order): The TSD must be ordered both systematically by concept and alphabetically in order to offer usable solutions for the most diverse translation problems.

Thesis 12 (representation of concept relations): The TSD must show how the individual concepts of the respective terminology are interrelated (concept relations depending on the structure of the respective subject field, e.g. abstraction relations or sequential concept relations).
Thesis 13 (use of text corpora): Since authentic text corpora contain valuable specialized language information, the TSD must be based on suitable text corpora and at the same time offer access to them.

Thesis 14 (additions and adaptations by the specialized translator): The TSD must allow the specialized translator to make additions and adaptations according to need.

Thesis 15 (uniform user interface): The specialized translator must be able to access the information in the TSD from a single user interface.
3. A model of the translation-oriented specialized dictionary

The 15 theses are now to be transferred into a suitable model. Since the theses represent requirements for the translation-oriented specialized dictionary (TSD) which, without exception, also stem from the empirical practice of specialized translation, a model of the TSD is designed inductively in the following. With the exception of theses 10, 14 and 15, which concern the implementation of the model, all theses can be brought together in a model that describes the TSD with all required content. Starting from the TMF model in the international standard ISO 16642 (2003), the model of the TSD is presented at two levels of detail (cf. Budin (2002)). Overall, this corresponds to the three-level division of Budin and Melby (2000) used in the "SALT" project.
Modelling serves here in two respects "als Bindeglied zwischen Empirie und Theorie" ('as a link between empirical data and theory'; Budin, 1996:196): on the one hand, the 15 theses are reformulated inductively into the model at the two levels of detail mentioned; on the other hand, the model is in turn to be transferred deductively into empirical practice and tested there experimentally. This stepwise procedure has the advantage that, during modelling, one can concentrate entirely on the abstract, implementation-independent model to be created, without having to worry about the details of its later technical realization (cf. e.g. Sager (1990)).
The following deals with the TMF model (3.1.), the model at the first level of detail, including the model of the terminological entry (3.2.), and the model at the second level of detail (data model, 3.3.). The focus is on the model at the first level of detail, since this is already available in mature form.
3.1. The TMF model according to ISO 16642 (2003)

The international standard ISO 16642 (2003) describes a terminological markup framework (TMF) with which markup languages for terminological data can be defined; these languages can in turn be mapped onto one another with a generic mapping tool. ISO 16642 (2003) aims to promote the use and further development of computer applications for terminological data and to facilitate the interchange of terminological data. By contrast, the specification of data categories is not the subject of this standard; on this, cf. ISO 12620 (1999).
Schematically, the TMF model looks as follows:

Figure 1: Schematic representation of the TMF model from ISO 16642 (2003).

The components of the model shown above can be described as follows (from top to bottom, from left to right; cf. DIN 2330 (1993), DIN 2342 (2004), ISO 16642 (2003)):
TDC (terminological data collection): the topmost level, comprising all information belonging to a terminological data collection;

GI (global information): administrative and technical information relating to the entire terminological data collection;

CI (complementary information): information that goes beyond that in the terminological entries and is typically referenced from several terminological entries;

TE (terminological entry): the entry level, i.e. the part of a terminological data collection containing terminological data on a single concept or on several quasi-equivalent concepts;

LS (language section): the language level, i.e. the part of a terminological entry containing terminological data in one language;

TS (term section): the term level, i.e. the part of the language section comprising terminological data on one or more terms or specialized phrases;

TCS (term component section): the lowest level, describing (non-)meaningful units of terms or specialized phrases.
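The nesting just described can be sketched as plain data classes. This is an illustrative rendering of the TMF metamodel, not code from the standard; all class and field names here are our own, and real TMF serializations carry many more data categories.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TermComponentSection:          # TCS: (non-)meaningful units of a term
    component: str

@dataclass
class TermSection:                   # TS: data on one term or specialized phrase
    term: str
    grammar: str = ""
    components: List[TermComponentSection] = field(default_factory=list)

@dataclass
class LanguageSection:               # LS: all terminological data in one language
    language: str
    definition: str = ""
    terms: List[TermSection] = field(default_factory=list)

@dataclass
class TerminologicalEntry:           # TE: one concept (or quasi-equivalent concepts)
    entry_id: str
    languages: List[LanguageSection] = field(default_factory=list)

@dataclass
class TerminologicalDataCollection:  # TDC: the whole resource
    global_info: Dict[str, str]      # GI: administrative/technical data
    complementary_info: Dict[str, str]  # CI: shared data referenced by entries
    entries: List[TerminologicalEntry] = field(default_factory=list)
```

The containment chain TDC > TE > LS > TS > TCS mirrors the figure, with GI and CI attached at the collection level.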
3.2. The model of the translation-oriented specialized dictionary

The starting point is the model of a terminological entry by Mayer (1998). This is adapted and extended according to the requirements of my dissertation project so that a model of the TSD at two levels of detail results (model at the first level of detail; model at the second level of detail = data model).
3.2.1. The model of the terminological entry

According to the current state of research on terminographic modelling, the model of the terminological entry must satisfy the following five criteria: concept orientation (cf. e.g. ISO 16642 (2003)), term autonomy (cf. e.g. Schmitz (2001)), elementarity (cf. e.g. ISO/PRF 26162 (2010)), granularity (cf. e.g. Schmitz (2001)) and repeatability (cf. e.g. ISO/PRF 26162 (2010)). Also relevant here are the three levels of the TMF model mentioned above (entry level, language level and term level).
The data categories listed below are derived either from the 15 theses or from the current state of research on terminographic modelling (cf. in particular ISO 12620 (1999) and the ISO data category registry "ISOcat" at www.isocat.org). Designations marked "[+]" refer to data categories that can contain data elements on one or more of the three levels mentioned above. A marker "[W]" indicates that the respective data category must be repeatable within the level on which it is listed.

The entry level comprises the following data categories: encyclopedic information [+], multimedia content [W], note [+W], position of the concept (if only one concept), source [+W], administrative information [+W]. The language level contains the following data categories: definition (if only one concept) or definition [W] (if several quasi-equivalent concepts), encyclopedic information [+], note [+W], position of the concepts [W] (if several quasi-equivalent concepts), source [+W], administrative information [+W]. The term level, finally, consists of the data categories term/specialized phrase/equivalent [W], grammatical information [W], context [W], encyclopedic information [+], note [+W], source [+W], administrative information [+W].
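As a rough illustration, one such entry can be written down as a nested structure in which every repeatable category ("[W]") is a list. The field names below are hypothetical stand-ins, not the actual data category names of ISO 12620 or ISOcat:

```python
# Illustrative entry following the entry model: repeatable categories are
# lists, and categories marked [+] may recur on entry, language and term level.
entry = {
    "entry_level": {
        "encyclopedic_info": "background on the subject field",
        "notes": [],            # note [+W]
        "sources": [],          # source [+W]
        "admin": [],            # administrative information [+W]
    },
    "language_sections": [
        {
            "language": "de",
            "definitions": ["..."],   # repeatable with quasi-equivalent concepts
            "notes": [],
            "term_sections": [
                {
                    "terms": ["Fachwörterbuch"],  # term/phrase/equivalent [W]
                    "grammar": [],                # grammatical information [W]
                    "contexts": [],               # context [W]
                    "notes": [],
                }
            ],
        }
    ],
}

def repeatable_ok(section, keys):
    """Check that every repeatable category is represented as a list."""
    return all(isinstance(section.get(k, []), list) for k in keys)
```

A validator such as `repeatable_ok` is one simple way to enforce the repeatability criterion mechanically.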
3.2.2. The model at the first level of detail

The model at the first level of detail, whose core is the model of the terminological entry explained above, looks roughly as follows:

Figure 2: Schematic overview of the model at the first level of detail.
To the three levels already mentioned (entry level, language level, term level), the two components "global information" and "complementary information" from the TMF model in ISO 16642 (2003) are added. The required data categories are again derived either from the 15 theses or from the current state of research on terminographic modelling; cf. in particular ISO 12620 (1999), ISO 16642 (2003), ISO/PRF 26162 (2010), but also ISO 1951 (2007). Accordingly, the global information consists of administrative and technical details, while the complementary information includes concept diagrams, meta-information on the TSD, multimedia content, alphabetical extracts from the terminological database, bibliographic information, text corpora, sources and administrative information.
In detail, the model at the first level of detail can be represented as follows:

Figure 3: More detailed schematic representation of the model at the first level of detail.
3.2.3. The model at the second level of detail (data model)

From the model at the first level of detail discussed and depicted above, a data model is to be developed which will later be implemented and tested experimentally in an empirical study with "ProTerm". Here, the object-oriented modelling language Unified Modeling Language (UML) is used. It is employed in the relevant international standards (cf. ISO 16642 (2003) and ISO/PRF 26162 (2010)) and is particularly well suited when a data model is to be implemented in the form of a relational database. UML models are, however, implementation-independent and can also be realized technically in other ways.
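To make the remark about relational databases concrete, the following sketch maps the three TMF levels onto tables. It is a minimal, assumption-laden illustration, not the project's (unpublished) UML model; all table and column names are invented.

```python
import sqlite3

# One hypothetical relational rendering of entry / language section / term
# section, with foreign keys expressing the TMF containment hierarchy.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entry    (id INTEGER PRIMARY KEY, encyclopedic_info TEXT);
CREATE TABLE lang_sec (id INTEGER PRIMARY KEY,
                       entry_id INTEGER REFERENCES entry(id),
                       language TEXT, definition TEXT);
CREATE TABLE term_sec (id INTEGER PRIMARY KEY,
                       lang_sec_id INTEGER REFERENCES lang_sec(id),
                       term TEXT, grammar TEXT, context TEXT);
""")

# Populate one entry with a German language section and one term section.
conn.execute("INSERT INTO entry VALUES (1, 'background info')")
conn.execute("INSERT INTO lang_sec VALUES (1, 1, 'de', 'a definition')")
conn.execute("INSERT INTO term_sec VALUES (1, 1, 'Fachwörterbuch', 'noun, n.', 'an authentic context')")

def terms_for_entry(conn, entry_id, language):
    """All terms recorded for one concept in one language."""
    rows = conn.execute(
        "SELECT t.term FROM term_sec t JOIN lang_sec l ON t.lang_sec_id = l.id "
        "WHERE l.entry_id = ? AND l.language = ?", (entry_id, language))
    return [r[0] for r in rows]
```

Because repeatable categories become rows rather than columns, such a schema satisfies the repeatability criterion naturally.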
The UML model is still at the draft stage and therefore cannot be published here. The current draft is available on request.
4. Outlook

The next step, after a possible refinement of the model at the first level of detail, will be to design a data model in the form of a UML diagram suitable for implementation with "ProTerm". An empirical study will show whether the model can meet the needs of specialized translators to the greatest possible extent and answer the central research question. The model of the TSD is independent of any particular subject field or language combination. For the empirical study, the subject field of terrorism, counter-terrorism and the fight against terrorism will be used, in German and English. I have dealt with the terminology of this field in depth, both academically and in professional practice.
5. References

Budin, G. (1996): Wissensorganisation und Terminologie: Die Komplexität und Dynamik wissenschaftlicher Informations- und Kommunikationsprozesse. Tübingen: Narr.

Budin, G. (2002): Der Zugang zu mehrsprachigen terminologischen Ressourcen – Probleme und Lösungsmöglichkeiten. In K.-D. Schmitz, F. Mayer & J. Zeumer (eds.), eTerminology. Professionelle Terminologiearbeit im Zeitalter des Internet – Akten des Symposions, Köln, 12.–13. April 2002. Köln: Deutscher Terminologie-Tag e.V., pp. 185–200.

Budin, G., Melby, A. (2000): Accessibility of Multilingual Terminological Resources – Current Problems and Prospects for the Future. In A. Zampolli et al. (eds.), Proceedings of the Second International Conference on Language Resources and Evaluation, volume II. Athens, pp. 837–844.

DIN 2330 (1993). Begriffe und Benennungen – Allgemeine Grundsätze.

DIN 2342 (2004). Begriffe der Terminologielehre (draft).

Hartmann, R. R. K. (1988): The Learner's Dictionary: Traum oder Wirklichkeit? In K. Hyldgaard-Jensen & A. Zettersten (eds.), Symposium on Lexicography III. Proceedings of the Third International Symposium on Lexicography, May 14–16, 1986 at the University of Copenhagen. Tübingen: Niemeyer, pp. 215–235.

ISO 1951 (2007). Presentation/representation of entries in dictionaries – Requirements, recommendations and information.

ISO 12620 (1999). Computer applications in terminology – Data categories.

ISO 16642 (2003). Computer applications in terminology – Terminological markup framework.

ISO/PRF 26162 (2010). Systems to manage terminology, knowledge and content – Design, implementation and maintenance of terminology management systems.

Löckinger, G. (2011): User-Oriented Data Modelling in Terminography: State-of-the-Art Research on the Needs of Special Language Translators. In T. Gornostay & A. Vasiļjevs (eds.), NEALT Proceedings Series Vol. 12. Proceedings of the NODALIDA 2011 workshop, CHAT 2011: Creation, Harmonization and Application of Terminology Resources, May 11, 2011, Riga, Latvia. Northern European Association for Language Technology, pp. 44–47.

Mayer, F. (1998): Eintragsmodelle für terminologische Datenbanken. Ein Beitrag zur übersetzungsorientierten Terminographie. Tübingen: Narr.

Sager, J. C. (1990): A Practical Course in Terminology Processing. Amsterdam: Benjamins.

Schmitz, K.-D. (2001): Systeme zur Terminologieverwaltung. Funktionsprinzipien, Systemtypen und Auswahlkriterien (online edition). technische kommunikation, 23(2), pp. 34–39.

de Schryver, G.-M. (2003): Lexicographers' Dreams in the Electronic-Dictionary Age. International Journal of Lexicography, 16(2), pp. 143–199.

Snell-Hornby, M. (1996): The translator's dictionary – An academic dream? In M. Snell-Hornby (ed.), Translation und Text. Ausgewählte Vorträge. Wien: WUV-Universitätsverlag, pp. 90–96.

Tiktin, H. (1910): Wörterbücher der Zukunft. Germanisch-romanische Monatsschrift, II, pp. 243–253.

Wilss, W. (1997): Übersetzen als wissensbasierte Tätigkeit. In G. Budin & E. Oeser (eds.), Beiträge zur Terminologie und Wissenstechnik. Wien: TermNet, pp. 151–168.
Annotating for Precision and Recall in Speech Act Variation: The Case of Directives in the Spoken Turkish Corpus

Şükriye Ruhi a, Thomas Schmidt b, Kai Wörner b, Kerem Eryılmaz c
a, c Middle East Technical University, b Hamburg University
a Dept. of Foreign Language Education, Faculty of Education, 06800 Ankara
b SFB 538 'Mehrsprachigkeit', Max Brauer-Allee 60, D-22765 Hamburg
c Dept. of Cognitive Science, Graduate School of Informatics, 06800 Ankara
E-mail: sukruh@metu.edu.tr, thomas.schmidt@uni-hamburg.de, kai.woerner@uni-hamburg.de, keryilmaz@gmail.com
Abstract

Speech act realizations pose special difficulties for search during annotation and for corpus-based pragmatics research, despite the fact that their various forms may be relatively formulaic. Focusing on spoken corpora, this paper concerns the generation of discourse-analytical annotation schemes that can address not only variation in speech act annotation but also variation in dialog and interaction structure coding. The major arguments of the paper are that (1) enriching the metadata features of corpus design can act as a useful aid in speech act annotation; and that (2) sociopragmatic annotation and corpus-oriented pragmatics research can be enhanced by incorporating (semi-)automated linguistic annotations that rely both on bottom-up discovery procedures and on the more top-down linguistic categorizations based on the literature in traditional approaches to pragmatics research. The paper illustrates implementations of enriched metadata and pragmatic annotation with examples drawn from directives in the demo version of the Spoken Turkish Corpus, and presents a qualitative assessment of the annotation procedures.

Keywords: speech act annotation, variation, spoken Turkish, precision, metadata
1. Speech acts as a challenge for corpus annotation

Speech act realizations are notorious for the special difficulties they pose in search, both during annotation and in pragmatics research based on corpora, in spite of the fact that their various forms may be relatively formulaic, hence amenable to (semi-)automatic annotation. Sociopragmatic annotation involves significant difficulties in the very process of identifying categories and units of pragmatic phenomena, such as variation in manifestations of speech acts and the identification of conversational segments (Archer, Culpeper & Davies, 2008:635). As underscored by Schmidt and Wörner, this makes pragmatics research conducted on corpora "heuristic" in nature, in that the relationship between theory and corpus analysis is bi-directional (2009:4). This is all the more so in the identification of speech acts, as function only partially follows form.
To illustrate this with a short excerpt from a naturally occurring speech event, the utterance iki çay 'two teas' may be describing the number of cups of tea one has had. But followed by tamam hocam ('okay' + deferential address term), the noun phrase would achieve the illocutionary force of a request when uttered to a service provider. It goes without saying that the initial utterance can occur with please as a politeness marker, which would certainly increase its chance of being identified as a request. Communications, however, do not always exhibit such pre-fabricated forms. Thus their recall in corpora would require the analyst to increase the number of search expressions indefinitely. Even so, that would not guarantee full recall; neither would it filter out false cases. This situation undermines the advantage of using corpora for the study of variation and largely limits the derivation of qualitative and quantitative conclusions from corpora.
In this paper we argue that annotation for studying variation in speech act realizations can be improved (1) by enriching metadata coding during the construction stage of a corpus; and (2) by implementing (semi-)automated annotation for sociopragmatic features of communications that relies both on bottom-up discovery procedures and on top-down linguistic categorizations based on traditional approaches to pragmatics research (e.g. annotation of socially and discursively significant verbal and non-verbal phenomena and non-phonological units such as multi-word expressions and changes in tone of voice). The argumentation is based on insights from Multidimensional Analysis (Biber, 1995) and vocabulary-based identification of discourse units (Csomay, Jones & Keck, 2007), and on the fact that pragmatic phenomena in conversational management (e.g., illocutionary force indicating devices, address terms, and politeness formulae) tend to form constellations of 'traces' in discourse. Annotating such traces can add "precision" and improve "recall" (Jucker et al. 2008) in searching for variation in speech acts. The main thrust of the paper is that speech events and discourse-level units exhibit such verbal and non-verbal clusters, and that annotating such units can provide insights for further discursive coding. Below, we explain the procedures for these two approaches to annotation with illustrations from the demo version of the Spoken Turkish Corpus (STC), which currently comprises 44,962 words from a selection of recordings in conversational settings, service encounters, and radio archives. (STC employs EXMARaLDA corpus construction tools (Schmidt, 2004), along with a web-based corpus management system.)
2. Metadata construction in the transcription and annotation workflow of STC
Besides constructing a metadata system for domain, interactional goal and speaker features, we maintain that including speech acts and conversational topics among the metadata features of a corpus is a significant tool for tracing variation in speech acts in a systematic manner, as topical variation can affect their performance beyond the influence of domain and setting features. Viewed from another perspective, spoken texts are slippery resources of language in terms of domain and setting categorization, in that they are often characterized by shifts in interactional goals. A service encounter in a shop, for example, can easily turn into a chat. Thus, if a communicative event were classified only for its domain of interaction, one would risk missing subtle differences within the same domain along several dimensions. The simultaneous annotation of topics and speech acts during the compilation of the recordings and during their transcription can address the concern for achieving maximal retrieval of tokens of a speech act. It enables a bottom-up approach to searching for variation through control for topic and speech acts, as manifestations of the act may not exhibit structures noted in the literature. It also allows for a corpus-driven categorization of speech acts that may not have been investigated at all in the particular language. The stages of this procedure in the construction of STC are outlined below:
1) Noting of local and global topics and of communication-related activities by recorders (e.g. studying for an exam)
2) Checking of topics and additions during the transfer of the recording to the corpus management system
3) Stages in transcription:
   a. Initial step: basic transcription of the recording for verbal and non-verbal events; editing of topics and addition of speech act metadata
   b. First check: checking the transcription for verbal and non-verbal events; editing of topics and speech act metadata
   c. Second check: checking the transcription for verbal and non-verbal events; editing of topics and speech act metadata
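The workflow above yields, per communication, a metadata record that can later be filtered by speech act and topic. The following is a minimal sketch of such a record and of metadata-driven sub-corpus selection; the keys and values are invented for illustration and are not the actual STC metadata schema.

```python
# Hypothetical per-communication metadata record produced by the workflow.
communication = {
    "domain": "service encounter",
    "setting": "ticket office",
    "topics": ["bilet alma"],        # local/global topics, noted in Turkish
    "speech_acts": ["requests"],     # noted by recorders and transcribers
    "transcription_status": "second check",
}

def candidate_communications(corpus, speech_act, topic=None):
    """Select communications whose metadata mention a speech act (and topic)."""
    hits = [c for c in corpus if speech_act in c["speech_acts"]]
    if topic is not None:
        hits = [c for c in hits if topic in c["topics"]]
    return hits
```

Filtering on such records is what makes it economical to build pilot sub-corpora controlled for domain, topic and speech act.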
To achieve a higher level of reliability in transcription, a different transcriber is responsible for the annotation in each step in (3), and differences in transcription are handled through consultation. Stages (1) and (3a) ideally involve the same person, so that the transcriber has an intuitive grasp of the topical content and the affective tone of the communication. This procedure has the added advantage of detecting regional variation with more precision. It also makes it possible to construct sub-corpora for initial pilot annotation through control not only for domain but also for topic and speech act, thus enhancing the likelihood of retrieving a greater variety of tokens in a more economical manner. Naturally, this workflow taps into native speaker intuitions on speech act performance, but it is a viable methodological procedure in linguistics because it harnesses intuitions in a context-sensitive environment during text processing. Figure 1 displays a selection of the metadata features of one communication in STC. (Note that topics are written in Turkish, and that the term requests is used instead of directives because the former was a more transparent term for the transcribers in step (3a) above.)

Figure 1: Partial metadata for a communication in STC
3. Annotation procedure for speech acts

Speech act annotation in STC is being implemented with Sextant (Wörner, n.d.), which also allows searches to be conducted with EXAKT. The search for tokens of directives employs a snowballing technique in developing regular expressions and is similar to what Kohnen (2008:21) describes as "structural eclecticism". The annotation procedure starts off with forms that have been identified as representative of directives in Turkish. Regular expressions based on these forms have been developed, and tag sets are developed according to the syntactic and/or lexical features of the head act. But instead of tagging only the head act, the full act is further coded by placing opening and closing tags around the relevant head act (see Examples 1 and 2). This will allow further detailed tagging of the act in later stages of annotation.
The regular expressions are enriched based on tokens detected first in the sub-corpora of service encounters, both by examining the larger context of the tokens recalled in initial searches and by manually investigating specific communications that are marked for directives in the corpus metadata. However, this procedure does not allow elliptical directives and hints to be recalled automatically. Based on the idea that a directive is ideally part of an adjacency pair, the search for 'hidden' manifestations of the act is conducted through the presence of address terms and a select number of minimal responses, including lexical and non-lexical backchannels (e.g. tamam 'okay/enough/full', ha?, hm), which turned out to collocate frequently with directives. Searches were thus conducted separately for these responses, and tokens that did not collocate with directives or form the head act itself were eliminated from the annotation (as is the case with tamam).
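The snowballing search just described can be sketched as follows. The patterns are illustrative seeds only (they are not the actual STC regular expressions or tag sets), and in practice new patterns would be added as the contexts of early hits are inspected:

```python
import re

# Hypothetical seed patterns over known directive cues: a minimal response
# (tamam), a service-encounter opener (buyrun), and a politeness marker.
seed_patterns = [
    r"\btamam\b",
    r"\bbuyrun\b",
    r"\bl[uü]tfen\b",   # 'please'
]

def recall_candidates(utterances, patterns):
    """Return (index, utterance) pairs matching any pattern, for manual review."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [(i, u) for i, u in enumerate(utterances)
            if any(c.search(u) for c in compiled)]
```

The output is a candidate list, not an annotation: each hit still has to be checked in context and either tagged or eliminated, and the reviewed contexts feed new patterns back into the seed list.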
Example (1) shows the co-occurrence of tamam with an elliptical request (tag code: RNp), which could not be recalled with a regular expression (the head act is marked with the RNp tags). It is noteworthy that the sequence manifests the presence of the discourse marker şimdi 'now', which marks the speech act boundary, and illustrates how both minimal responses (tamam) and discourse markers collocate with the head act.
(1)
XAM000066:  şimdi ((0.3)) RNp-open T.C. kimlik numarası ((0.2)) ve öncelikli olarak ((0.1)) ev adresinizi ((XXX)) RNp-close
            'now ((0.3)) your Turkish ID number ((0.2)) and, first of all, ((0.1)) your home address'
DIL000065:  tamam. ((0.3)) ((filling in a form, 10.8))
            'okay.'
Example (2) is an illustration from a service encounter. The head act has a verb in the future-in-the-past. In isolation the utterance could be a manifestation of a representative. However, the collocation of the utterance with buyrun (lit. 'command'; idiomatic equivalent: 'welcome') disambiguates it as a request.
(2)
MEH000222:  ((0.3)) buyrun.
            'welcome.'
MED000112:  iyi günler!
            'good day!'
MEH000222:  neresi olacak?
            'where is it to be?' (idiomatic equivalent: 'where to?')
MED000112:  RImpFuI-open Dikili'ye bilet alacaktım. RImpFuI-close
            'I was going to get a ticket for Dikili'
Such collocations allow us to form a list of<br />
(semi-)formulaic conversational management units,<br />
which should be tagged as pragmatic markers for<br />
directives. In the demo version of STC, tamam ‘okay’ is<br />
the item that exhibits the highest frequency. A search on<br />
the occurrence of the item was therefore conducted to<br />
check its collocation with directives. The search yielded<br />
298 tokens, 20 of which were related to directives. In 8<br />
instances, the item is a supportive move for the directive<br />
head act. In 2 recalls, the item was the head act itself, closing off<br />
a conversational topic, while the remaining tokens were<br />
responses to a verbal or non-verbal request or part of the<br />
response to questions asking for advice or an opinion.<br />
Amongst these we find the supportive function of tamam<br />
as a compliance gainer to be especially significant since<br />
the literature on directives in Turkish does not identify<br />
this function. Within these recalls, tamam collocates with<br />
6 requests of the kind illustrated in Example (2). This<br />
suggests that tamam can function to disambiguate<br />
representatives from requests and can be used to retrieve<br />
elliptical directives and hints. Although the full<br />
description of the pragmatics of tamam needs to be<br />
refined, we can say that in its semantically bleached use<br />
it appears in topic closures, functions as a backchannel<br />
to check comprehension, and serves as an agreement<br />
marker or as a pre-sequence to disagreement. In this<br />
regard, we can say that tamam is a pragmatic marker in<br />
its non-literal use and needs to be tagged accordingly.<br />
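As a quick arithmetic check, the distribution reported above can be tallied; the counts come from the text, while the category labels are our own shorthand, not the authors' tag set:

```python
from collections import Counter

# Breakdown of the 20 directive-related "tamam" tokens reported above
# (labels are our own shorthand, not the project's annotation codes).
directive_related = Counter({
    "supportive_move": 8,   # supports the directive head act
    "head_act": 2,          # closes off a conversational topic
    "response": 10,         # replies to a (non-)verbal request or advice question
})

total_tokens = 298
assert sum(directive_related.values()) == 20
print(f"{sum(directive_related.values()) / total_tokens:.1%} "
      "of all tamam tokens are directive-related")   # -> 6.7%
```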
4. Conclusions<br />
This paper touches only upon the disambiguating<br />
capacity of lexical pragmatic markers, but the<br />
distribution of tamam supports the claim that discourse<br />
segmentation and conversational structure annotation can<br />
use the clues provided by such ‘traces’. The functional<br />
description of tamam naturally raises the question as to<br />
coding principles for such items, including politeness<br />
formulae. While non-lexical backchannels may not be<br />
too problematic, the classification and coding of<br />
pragmatic markers is a fuzzy area. At this stage, we<br />
propose that a semantic-based, broad categorization be<br />
made to distinguish lexical and non-lexical markers,<br />
interjections and discourse markers, and discourse<br />
particles.<br />
Our experience in testing the effect of pragmatic markers<br />
on recall of speech acts suggests that it is possible to<br />
envision generic level schemes for speech act annotation.<br />
These would proceed first with a bottom-up approach, in<br />
which (multi-word) pragmatic markers, backchannels<br />
and non-verbal cues such as a classification of activity<br />
types (e.g., handing over money) are tagged. It is likely<br />
that such a venture will reveal commonalities between<br />
speech acts beyond what may be gleaned from the current<br />
pragmatics literature on speech act manifestations.<br />
5. Acknowledgements<br />
This paper was supported by TÜBİTAK, grant no.<br />
108K283, and METU, grant no. BAP-05-03-2011-001.<br />
6. References<br />
Archer, D., Culpeper, J., Davies, M. (2008): Pragmatic<br />
annotation. In A. Lüdeling &amp; M. Kytö (Eds.),<br />
Corpus Linguistics: An International Handbook, Vol. I.<br />
Berlin/New York: Walter de Gruyter, pp. 613-642.<br />
Biber, D. (1995): Dimensions of Register Variation. New<br />
York: Cambridge University Press.<br />
Csomay, E., Jones, J.K., Keck, C. (2007): Introduction to<br />
the identification and analysis of vocabulary-based<br />
discourse units. In D. Biber, U. Connor & T.A. Upton<br />
(Eds.), Discourse on the Move. Using Corpus Analysis to<br />
Describe Discourse Structure. Amsterdam/ Philadelphia:<br />
John Benjamins, pp. 155-173.<br />
Jucker, A., Schneider, G., Taavitsainen, I., Breustedt, B.<br />
(2008): “Fishing” for compliments. Precision and recall<br />
in corpus-linguistic compliment research. In A. Jucker &<br />
I. Taavitsainen (Eds.), Speech Acts in the History of<br />
English. Amsterdam/Philadelphia: Benjamins, pp.<br />
273-294.<br />
Kohnen, T. (2008): Historical corpus pragmatics: Focus on<br />
speech acts and texts. In A. Jucker & I. Taavitsainen<br />
(Eds.), Speech Acts in the History of English.<br />
Amsterdam/Philadelphia: Benjamins, pp. 13-36.<br />
Schmidt, T. (2004): Transcribing and Annotating Spoken<br />
Language with EXMARaLDA. In Proceedings of the<br />
LREC-Workshop on XML Based Richly Annotated<br />
Corpora, Lisbon 2004. Paris: ELRA, pp. 69-74.<br />
Schmidt, T., Wörner, K. (2009): EXMARALDA – creating,<br />
analysing and sharing spoken language corpora for<br />
pragmatic research. Pragmatics, 19(4), pp. 565-582.<br />
Spoken Turkish Corpus. http://std.metu.edu.tr/en/<br />
Wörner, K. (n.d.): Sextant tagger.<br />
http://www.exmaralda.org/sextant/sextanttagger.pdf
The SoSaBiEC Corpus:<br />
Social Structure and Bilinguality in Everyday Conversation<br />
Veronika Ries 1, Andy Lücking 2<br />
1 Universität Bielefeld, BMBF Projekt Linguistic Networks<br />
2 Goethe-Universität Frankfurt am Main<br />
E-mail: Veronika.Ries@uni-bielefeld.de, Luecking@em.uni-frankfurt.de<br />
Abstract<br />
The SoSaBiEC corpus comprises audio recordings of everyday interactions between familiar subjects. Thus, the material the<br />
corpus is based on is not gained in task-oriented dialogue under strict experimental control; rather, it is made up of spontaneous<br />
conversations. We describe the raw data and the annotations that constitute the corpus. Speech is transcribed at the level of words.<br />
Dialogue act oriented codings constitute a functional, qualitative annotation level. The corpus so far provides an empirical basis for<br />
studying social aspects of unrestricted language use in a familiar context.<br />
Keywords: bilinguality, social relationships, spontaneous dialogue, annotation<br />
1. Introduction<br />
From the point of view of the methodology of<br />
psycholinguistic research on speech production,<br />
unconstrained responding behavior of participants is<br />
problematic: it is known as “the problem of exuberant<br />
responding” and is to be avoided by means of some sort<br />
of controlled elicitation in an experimental setting (Bock,<br />
1996:407; see also Pickering &amp; Garrod, 2004:169). In<br />
addition, elicitations are usually bound up with a certain<br />
task the participants of the experimental study have to<br />
accomplish. Of course, an experimental set-up<br />
that obeys the general “avoid exuberant responding”<br />
design is appropriate for studying and testing the<br />
conditions underlying speech production in a<br />
controlled way. However, when studying human-to-human<br />
face-to-face dialogue (or multi-logue, in the case of<br />
more than two interlocutors), elicited communication<br />
behavior hinders the unfolding of spontaneous utterances<br />
and task-independent dialogue management. Task-oriented<br />
dialogue is known to be plan-based (Litman &<br />
Allen, 1987). The domain knowledge the interlocutors<br />
have of the task-domain together with the difference<br />
between their current state and the target state (defined in<br />
terms of the task to be accomplished) provides a<br />
structuring of dialogue states: the way from the current<br />
dialogue state to the target state is operationalized as a<br />
sequence of sub-tasks, each of these sub-tasks is part of a<br />
plan that has to be worked through sequentially in order to<br />
reach the target state. Plan-based accounts<br />
provide a functional view of dialogue and have been<br />
successfully applied in computational dialogue systems<br />
for, e.g., timetable enquiries (Young & Proctor, 1989). At<br />
least partly due to the neat status of task-oriented<br />
conversational settings, respective study designs have<br />
been paradigmatic in linguistic research on dialogue.<br />
Task-oriented dialogues, inter alia, pre-determine the<br />
following conversational ingredients:<br />
• they define a dialogue goal and thereby a<br />
terminal dialogue state;<br />
• they constrain the topics the interlocutors talk<br />
about to a high degree (up to move type<br />
predictability, modulo repairs etc.);<br />
• they are cooperative rather than competitive;<br />
• the dialogue goal determines the social<br />
relationship of the interlocutors (for instance,<br />
whether they have equal communicative rights<br />
or whether task-knowledge is asymmetrically<br />
distributed), and it does so regardless of the<br />
actual relationships that might obtain between<br />
the interlocutors;<br />
• they are unilingual.<br />
Each of the ingredients above is lacking in spontaneous,<br />
everyday conversation. Does this mean that spontaneous,<br />
everyday conversations also lack any structure of<br />
dialogue management? Answers to this question are in<br />
general given on the grounds of armchair theorizing or<br />
case studies. The feasibility of empirical approaches is<br />
simply hindered by the lack of respective data. The<br />
afore-given list can be extended by a further feature,<br />
namely the fact that it is easier to gather task-oriented<br />
dialogue data in experimental settings than to collect<br />
rampant spontaneous dialogue data. We have some<br />
spontaneous dialogue data that lack each of the<br />
task-based features listed above – see section 2 for a<br />
description. We focus on the latter two aspects here,<br />
namely social structure and bilingualism. The social<br />
dimension of language use, for instance, social deixis, is a<br />
well-known fact in pragmatics (Anderson & Keenan,<br />
1985; Levinson, 2008). The influence of social structure<br />
on the structure of lexica has also been reported (Mehler,<br />
2008). Yet, there is no account that scales the<br />
macroscopic level of language communities down to the<br />
microscopic level of dialogue. The data collected in<br />
SoSaBiEC aim at exactly this level of granularity of<br />
social structure and language structure: how does the<br />
social relationship between interlocutors affect the<br />
structure of their dialogue lexicon?<br />
A special characteristic of SoSaBiEC is bilingualism. The<br />
subjects recorded speak Russian as well as German, and<br />
they make use of both languages in one and the same<br />
dialogue. Which dialogical functions are performed by the<br />
two languages seems to depend at least partly on who the<br />
addressees are, that is, on the social relationship between<br />
the interlocutors (Ries, to appear). This qualitative<br />
observation will be operationalized in terms of<br />
quantitative analyses that focus on the<br />
relationship-dependent, functional use of languages<br />
(cf. the outlook given in section 4).<br />
According to the bi-partition of corpora – primary or raw<br />
data are coupled with secondary or annotation data<br />
(loosely related to Lemnitzer & Zinsmeister, 2006:40) –<br />
the following two sections describe the data material<br />
(section 2) and its annotation (section 3) in terms of<br />
functional dialogue acts. In the last section, we sketch<br />
some research questions we will address by means of<br />
SoSaBiEC in the very near future.<br />
2. Primary Data<br />
The primary data are made up of audio recordings of<br />
everyday conversations (Ries, to appear). The recorded<br />
subjects all know each other, most of them are even<br />
related. The observations focus on natural language use,<br />
and in particular on bilingual language use. The compiled<br />
corpus is authentic because the researcher who made the<br />
recordings is herself a member of the observed speech community.<br />
The speakers gave their consent to being recorded at any time<br />
and without prior notice, so the recordings could be made<br />
spontaneously and at real events, such as birthday parties.<br />
A digital recorder without an external microphone was<br />
used, so that recording did not attract too much attention.<br />
The recordings include telephone calls and face-to-face<br />
conversations. The length of the conversations varies<br />
from about three minutes up to three hours. Depending on<br />
the topic of the conversation the number of the involved<br />
speakers differs: from two up to four speakers. In sum,<br />
there are about 300 minutes of data material covering six<br />
participants. Altogether ten conversations have been<br />
recorded. Four conversations have been analysed in<br />
detail and annotated because the participant constellation<br />
is obvious and definite: the participants come under the<br />
category parent-child or sibling. The six participants<br />
come from two families, not known to each other. As<br />
working basis for the qualitative analysis the recordings<br />
were transcribed. By way of illustration, an excerpt of the<br />
transcribed data is given:<br />
01 F: NAME<br />
A: guten abend.<br />
F: hallo?<br />
A: hallo guten abend<br />
05 F: nabend (.) hallo<br />
A: na wie gehts bei euch?<br />
F: gut<br />
A: gut?<br />
F: ja.<br />
10 A: na что вы смотрели что к чему тама?<br />
F: ja а что там?<br />
This is a sequence of a telephone call between father F<br />
and his daughter A. The conversation starts in German<br />
and, initiated by daughter A, there is an alternation into<br />
Russian (line 10). The qualitative analysis showed that<br />
through this language switch the speaker introduced the first<br />
topic of the telephone call and so managed the<br />
conversation opening. Results such as this one<br />
are the main content of the annotation.<br />
3. Annotation<br />
The utterances produced by the participants have been<br />
transcribed using the Praat tool<br />
(http://www.fon.hum.uva.nl/praat/) on the level of<br />
orthographic words. This means that no phonetic<br />
features such as accent or defective pronunciation are coded.<br />
However, spoken language exhibits regularities of its<br />
own kind, regularities we accounted for in the speech<br />
transcription. Most prominently, words that are separated<br />
in written language may get fused into a single phonetic word in<br />
spoken language. A common example in German, already<br />
part of the standard language, is “zum”, a fusion of<br />
the preposition “zu” and the dative article “dem”.<br />
Fusions of this pattern are much more frequent in<br />
spoken German than acknowledged in standard German.<br />
English has hard-wired combinations<br />
like “I’m”, which is usually not resolved to the<br />
full-fledged standard form “I am”. The annotation<br />
accommodates these demands by providing respective<br />
adaptations of the annotations to spoken language. In order<br />
to reveal the dialogue-related functions performed by the<br />
utterances, we employed a dialogue act-based coding of<br />
contributions. Here, we follow the ISOCat<br />
(www.isocat.org) initiative for standardization of<br />
dialogue act annotations outlined by Bunt et al. (2010).<br />
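The handling of fused (“melted”) forms described earlier can be pictured as a small normalization table mapping a phonetic word to its standard written parts; the entries beyond “zum” and “I’m” are illustrative assumptions, not the project’s actual transcription lexicon:

```python
# Expand fused forms into their standard-orthography parts while keeping
# unfused tokens unchanged, as the word-level transcription requires.
FUSIONS = {
    "zum": ("zu", "dem"),    # German preposition + dative article
    "zur": ("zu", "der"),    # illustrative further entry (assumption)
    "I'm": ("I", "am"),      # English hard-wired contraction
}

def expand(token):
    """Return the standard-orthography parts of a token, or the token itself."""
    return FUSIONS.get(token, (token,))

print(expand("zum"))   # -> ('zu', 'dem')
print(expand("Haus"))  # -> ('Haus',)
```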
To be able to talk about dialogue-related functions and<br />
natural bilingual language use, language alternations<br />
regarding their functions and roles in the current<br />
discourse were annotated. The important factors annotated<br />
are the functions of the involved languages and the<br />
observed language alternations: that is, each language<br />
switch is annotated with its meaning on the level of the<br />
conversation, for example the conversation opening. The<br />
differentiation by speakers is crucial for the examination<br />
of a connection between language use and social<br />
structure. The functional annotation labels have been<br />
derived from qualitative, ethnomethodological analyses<br />
by an expert researcher. The annotations made by this very<br />
researcher can be regarded as having the privileged status<br />
of a “gold standard”, since part of the expert’s knowledge is<br />
not only the pre- and posthistory of the data recorded, but<br />
also familiarity with the subjects involved, a kind of<br />
knowledge rather exclusive to our expert. However, since<br />
the annotations are a compromise between the qualitative<br />
and quantitative methods and methodologies that are<br />
brought together in this kind of research, we want to<br />
assess whether the ethnomethodological, functional<br />
annotation can be reproduced to a sufficient degree by<br />
other annotators. For this reason, we applied a reliability<br />
assessment in terms of inter-rater agreement of two raters’<br />
annotations of a subset (one conversation) of the data. We<br />
use the agreement coefficient AC1 developed by Gwet<br />
(2001). The annotation of dialogue acts results in an AC1<br />
of 0.61; the rating of function results in an AC1 of 0.78.<br />
Two observations can be made: firstly, the functional<br />
dialogue annotation is reproducible -- an outcome of 0.78<br />
is regarded as "substantial" by Rietveld and van Hout<br />
(1993); secondly, the standardised dialogue act<br />
annotation scheme tailored for task-oriented dialogues<br />
can be applied only with less agreement than the functional<br />
scheme custom-built for more unconstrained everyday<br />
conversations. We take this as further evidence for the<br />
validity of the distinction of different types argued for in<br />
the introduction.<br />
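Gwet’s AC1 for two raters over nominal categories is straightforward to compute; the following is a minimal sketch with toy labels, not the project’s actual annotation data:

```python
# Gwet's (2001) AC1 agreement coefficient for two raters and nominal
# categories: AC1 = (Pa - Pe) / (1 - Pe), where Pa is observed agreement
# and Pe is chance agreement based on mean category proportions.
def gwet_ac1(rater1, rater2):
    n = len(rater1)
    cats = sorted(set(rater1) | set(rater2))
    q = len(cats)
    # Pa: proportion of items both raters labelled identically
    pa = sum(a == b for a, b in zip(rater1, rater2)) / n
    # pi_k: mean proportion of all ratings falling into category k
    pi = [(rater1.count(c) + rater2.count(c)) / (2 * n) for c in cats]
    pe = sum(p * (1 - p) for p in pi) / (q - 1)
    return (pa - pe) / (1 - pe)

r1 = ["open", "open", "close", "other", "open"]
r2 = ["open", "close", "close", "other", "open"]
print(round(gwet_ac1(r1, r2), 3))   # -> 0.71
```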
4. Outlook<br />
So far, we have finished data collection and annotation of the<br />
subset of SoSaBiEC data that interests us first, namely<br />
the data that involve parent-child and sibling dialogues.<br />
The next step is to test our undirected hypothesis by<br />
means of mapping the annotation data on a variant of the<br />
dialogue lexicon model of Mehler, Lücking, and Weiß<br />
(2010). This model provides a graph-theoretical<br />
framework for classifying dialogue networks according<br />
to their structural similarity. Applying such a quantitative<br />
measure to mostly qualitative data allows us not only to<br />
study whether social structure imprints on language<br />
structure in human dialogue, but in particular to measure<br />
whether there is a traceable influence at all.<br />
5. Acknowledgments<br />
Funding of this work by the German Federal Ministry of<br />
Education and Research (Bundesministerium für Bildung<br />
und Forschung) is gratefully acknowledged. We also<br />
want to thank Barbara Job and Alexander Mehler for<br />
discussion and support.<br />
6. References<br />
Anderson, S. R., Keenan, E. L. (1985): “Deixis”. In:<br />
Language Typology and Syntactic Description. Ed. by<br />
Timothy Shopen. Vol. III. Cambridge: Cambridge<br />
University Press. Chap. 5, pp. 259–308.<br />
Bock, K. (1996): “Language Production: Methods and<br />
Methodologies”. In: Psychonomic Bulletin & Review<br />
3.4, pp. 395–421.<br />
Bunt, H. et al. (May 21, 2010): “Towards an ISO<br />
Standard for Dialogue Act Annotation”. In:<br />
Proceedings of the Seventh conference on<br />
International Language Resources and Evaluation<br />
(LREC’10). Ed. by Nicoletta Calzolari (Conference<br />
Chair) et al. Valletta, Malta: European Language<br />
Resources Association (ELRA).<br />
Cohen, J. (1960): “A Coefficient of Agreement for<br />
Nominal Scales”. In: Educational and Psychological<br />
Measurement 20, pp. 37–46.<br />
Gwet, K. (2001): Handbook of Inter-Rater Reliability.<br />
Gaithersburg, MD: STATAXIS Publishing Company.<br />
Lemnitzer, L., Zinsmeister, H. (2006): Korpuslinguistik.<br />
Eine Einführung. Tübingen: Gunter Narr Verlag.<br />
Levinson, S. C. (2008): “Deixis”. In: The Handbook of<br />
Pragmatics. Blackwell Publishing Ltd, pp. 97–121.<br />
Litman, D. J., Allen, J. F. (1987): “A plan recognition<br />
model for subdialogues in conversations”. In:<br />
Cognitive Science 11.2, pp. 163–200.<br />
Mehler, A. (2008): “On the Impact of Community<br />
Structure on Self-Organizing Lexical Networks”. In:<br />
Proceedings of the 7th Evolution of Language<br />
Conference (Evolang 2008). Ed. by Andrew D. M.<br />
Smith, Kenny Smith, and Ramon Ferrer i Cancho.<br />
Barcelona: World Scientific, pp. 227–234.<br />
Mehler, A., Lücking, A., Weiß, P. (2010): “A Network<br />
Model of Interpersonal Alignment in Dialogue”. In:<br />
Entropy 12.6, pp. 1440–1483.<br />
doi: 10.3390/e12061440.<br />
Pickering, M. J. and Garrod, S. (2004): “Toward a<br />
Mechanistic Psychology of Dialogue”. In: Behavioral<br />
and Brain Sciences 27.2, pp. 169–190.<br />
Ries, V. (2011): “da=kommt das=so quer rein.<br />
Sprachgebrauch und Spracheinstellungen<br />
Russlanddeutscher in Deutschland”. PhD thesis.<br />
Universität Bielefeld.<br />
Rietveld, T., van Hout, R. (1993): Statistical Techniques<br />
for the Study of Language and Language Behavior.<br />
Berlin; New York: Mouton de Gruyter.<br />
Young, S. J., Proctor, C. E. (1989): “The design and<br />
implementation of dialogue control in voice operated<br />
database inquiry systems”. In: Computer Speech and<br />
Language 3.4, pp. 329–353.<br />
doi: 10.1016/0885-2308(89)90002-8.
DIL, a Bilingual Online Specialized Dictionary of Linguistics<br />
(German-Italian)<br />
Carolina Flinz<br />
University of Pisa<br />
E-mail: c.flinz@ec.unipi.it<br />
Abstract<br />
DIL is a German-Italian online specialized dictionary of linguistics. It is an open dictionary, and this paper is a plea for possible<br />
cooperation and collaboration. DIL is still under construction; at present only the section DaF (German as a foreign language) has been<br />
published in full, although other sections are in preparation. The section LEX (lexicography), which is about to be published, is<br />
presented together with the most important characteristics of the dictionary.<br />
Keywords: specialized dictionary, linguistics, bilingual, German-Italian, online dictionary<br />
1. Introduction<br />
DIL (Dizionario tedesco-italiano di terminologia<br />
linguistica / German-Italian specialized dictionary of<br />
linguistics) is an online dictionary that lists lemmas from<br />
the field of linguistics and some of its neighbouring<br />
disciplines. It is an open dictionary, modelled on<br />
Wikipedia and Glottopedia, in order to encourage the<br />
participation of experts from the various disciplines.<br />
Several German (1) and Italian (2) dictionaries of<br />
linguistics are available today in print and online, but not<br />
a single specialized dictionary for the language pair<br />
German-Italian. Yet the need for such an “instrument” in<br />
Italy, both for university teaching and for research, is<br />
very strong: at a time when the subject “German<br />
linguistics” has experienced a strong upswing as a<br />
consequence of a university reform (1999), DIL could be<br />
of great relevance for scholarly communication (3).<br />
DIL could indeed be a great help in the search<br />
(1) Cf. inter alia Bußmann, 2002; Conrad, 1985; Crystal, 1993;<br />
Ducrot &amp; Todorov, 1975; Dubois, 1979; Glück, 2000; Heupel,<br />
1973; Lewandowski, 1994; Meier &amp; Meier, 1979;<br />
Stammerjohann, 1975; Ulrich, 2002.<br />
(2) Cf. inter alia Bußmann, 2007; Cardona, 1988; Casadei, 1991;<br />
Ceppellini, 1999; Courtes &amp; Greimas, 1986; Crystal, 1993;<br />
Ducrot &amp; Todorov, 1972; Severino, 1937; Simone, 1969.<br />
(3) The relevance of specialized dictionaries for scholarly<br />
communication has been the topic of many lexicographic studies:<br />
cf. inter alia Wiegand, 1988; Pilegaard, 1994; Schaeder &amp;<br />
Bergenholtz, 1994; Bergenholtz &amp; Tarp, 1995; Hoffmann,<br />
Kalverkämper &amp; Wiegand, 1998.<br />
for equivalents of German linguistic terms.<br />
DIL is a project of the German Institute of the Faculty of<br />
Lingue e Letterature Straniere of the University of Pisa (daf,<br />
2004:37), which was published online in 2008<br />
(http://www.humnet.unipi.it/dott_linggensac/glossword)<br />
and is still being worked on. It is a<br />
monolemmatic specialized dictionary (4) (Wiegand,<br />
1996:46): the lemmas are in German, while the<br />
language of commentary is Italian.<br />
The aims of this paper are:<br />
1) to present, in a brief overview, the most important<br />
characteristics of the dictionary, such as its macro- and<br />
microstructure, its lemma inventory, and the selection criteria;<br />
2) to encourage similar projects and future<br />
collaborations on this project, in particular<br />
for the planned section on computational linguistics;<br />
3) to present the newly compiled section LEX (lexicography).<br />
2. Macro- and Microstructure<br />
The macrostructure and the microstructure of DIL were<br />
of course influenced by the function of the dictionary<br />
and the intended user group (5).<br />
(4) A bilemmatic extension of the dictionary is not ruled out.<br />
User needs were investigated by means of<br />
questionnaires, sent out both in print and in an online<br />
format, and by an analysis of possible usage situations (6).<br />
Any user can still call up the questionnaire from the<br />
homepage and answer it, so that there is constant contact<br />
with the users.<br />
DIL generally addresses a heterogeneous<br />
audience: it is intended for experts as well as for laypersons,<br />
so that the potential users may be learners<br />
and teachers in the fields of German studies,<br />
Romance studies, linguistics, or German/Italian as a<br />
foreign language, as well as textbook authors,<br />
lexicographers, or academic specialists. The online<br />
medium, thanks to its flexibility, is a great advantage<br />
in this respect.<br />
DIL can in fact be used in the following situations:<br />
1) The user is looking for specific technical<br />
information, and the dictionary, true to its<br />
nature as a tool, satisfies this need;<br />
2) The user turns to the dictionary to solve a<br />
communication problem in text production,<br />
text reception, or translation. DIL thus fulfils<br />
several functions: it can be used for<br />
active/productive as well as passive/receptive<br />
activities.<br />
a. The Italian-speaking user (the primary user)<br />
will use it as a decoding dictionary for<br />
translation into his native language, i.e. when he wants to<br />
understand a German technical term or is looking<br />
for its translation, or when he needs more specific<br />
information and wants to inform himself<br />
further and continue his training;<br />
b. The German-speaking user will use it as an<br />
encoding dictionary for production in the<br />
foreign language, i.e. when he translates into<br />
Italian and produces specialized texts in<br />
Italian.<br />
The macrostructure of DIL combines characteristics<br />
of printed dictionaries of linguistics (1.)<br />
(5) Cf. inter alia Storrer &amp; Harriehausen, 1998; Barz, 2005.<br />
(6) For an overview of possible techniques for investigating<br />
user needs cf. inter alia Barz, 2005; Ripfel &amp;<br />
Wiegand, 1988; Schaeder &amp; Bergenholtz, 1994; Wiegand,<br />
1977.<br />
with those of online dictionaries (2.):<br />
1. The structuring of the outer texts in the print medium<br />
influenced the current state. DIL offers<br />
the following outer texts, compiled according to scholarly<br />
criteria: introduction, list of abbreviations, user instructions,<br />
editorial norms, and an index of the entries (7);<br />
2. The advantages of online dictionaries have also<br />
largely been exploited:<br />
a. New entries and new sections can be<br />
published very quickly;<br />
b. DIL can be continuously updated, supplemented,<br />
and corrected;<br />
c. It has a clearly structured menu in<br />
which the most important outer texts are linked, so<br />
that the user can quickly reach the desired<br />
information;<br />
d. It uses both internal (8) and external (9)<br />
hyperlinks;<br />
e. It offers the user useful information,<br />
such as the TOP 10 (cf. inter alia the “most recently<br />
searched” or “most frequently clicked” lemmas);<br />
f. It offers important instruments, such as the<br />
search engine, the feedback page, the login<br />
field, etc.<br />
The microstructure of DIL offers both linguistic and<br />
encyclopedic information and has been structured on the<br />
premise that the primary addressee is the Italian-speaking<br />
user. Each entry is completed by the following<br />
information:<br />
1) grammatical information (gender and number);<br />
2) the Italian equivalent(s);<br />
3) a subject-field label indicating the specialized<br />
domain of the lemma;<br />
4) an encyclopedic definition;<br />
5) examples (10);<br />
6) paradigmatic information, such as synonyms and<br />
thematically related lemmas;<br />
7) bibliographic references.<br />
(7) An empirical analysis of specialized online dictionaries of<br />
linguistics showed how “unscholarly” online dictionaries often<br />
handle outer texts. Only 45% of the analysed tools had such<br />
texts, and only in rare exceptions were scholarly criteria<br />
followed (Flinz, 2010:72).<br />
(8) The user can jump from one entry to thematically<br />
related lemmas.<br />
(9) Both language dictionaries, such as Canoo.net and<br />
Grammis, and encyclopedic resources, such as Glottopedia<br />
and DLM, are linked.<br />
(10) In principle all lemmas follow the same scheme, since<br />
the standardization of the microstructure was an important<br />
prerequisite. Since examples are helpful only in certain<br />
contexts, however, they were inserted only occasionally.<br />
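The seven-part microstructure can be illustrated as a simple record; the field values below are an invented sample, not an actual DIL entry:

```python
# Hypothetical DIL entry following the seven information categories
# of the microstructure described above (sample values are invented).
entry = {
    "lemma": "Lemma",                                 # German headword
    "grammar": {"genus": "n", "numerus": "Sg."},      # 1) grammatical data
    "equivalents": ["lemma"],                         # 2) Italian equivalent(s)
    "domain": "LEX",                                  # 3) subject-field label
    "definition": "definizione enciclopedica del lemma",  # 4) definition (Italian)
    "examples": [],                                   # 5) examples, only where helpful
    "related": ["Lemmabestand"],                      # 6) synonyms / related lemmas
    "references": ["Wiegand 1996"],                   # 7) bibliographic data
}
assert len(entry) == 8   # headword plus the seven information categories
```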
3. Lemma Inventory and Criteria
The lemma inventory of DIL can only be "estimated". The reasons for this can be summarized as follows:
1) first, it is an online dictionary that is still in its project and test phase;
2) second, as its format dictates, the work should not be seen as something static and finished, but as constantly being extended and renewed. From a comparison of the existing specialized dictionaries of linguistics, however, an approximate figure of about 2,000 lemmata can be calculated, though this number may be continually extended or modified.
Primary sources were general dictionaries of linguistics (both German- and Italian-language), as well as specific German and Italian glossaries of the disciplines of lexicography and specialized lexicography. Both printed and online sources were consulted. Secondary sources were handbooks from the respective discipline (for the section LEX, for example, these were standard works on lexicography and specialized lexicography), in German as well as in Italian.
The main criteria for the selection of lemmata are frequency and relevance (Bergenholtz, 1989:775) 11:
1) An analysis of the existing lexicographical dictionaries, in both print and online format, was carried out with respect to the lemmata of the respective field listed in them;
2) A small corpus of specialized texts of the field in question was compiled. The terms contained in the final index were entered into Excel tables, and the resulting lists were compared on the basis of frequency criteria. The final list emerging from this process was additionally supplemented on the basis of the relevance criterion.
11 Corpus analyses, in the sense of automatic analyses of text corpora with subsequent corpus evaluation (frequency values), have so far been excluded. However, it would be interesting to see to what extent such an analysis, integrating the relevance criterion, would or would not mirror the results obtained.
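The two-step selection procedure just described can be sketched as follows. The term lists, the frequency threshold, and the relevance set are invented for illustration; the actual comparison was carried out in Excel tables rather than in code.

```python
from collections import Counter

def select_lemmata(term_lists, min_sources=2, relevant=()):
    """Merge per-source term registers: keep a term if it appears in at
    least `min_sources` lists (frequency criterion), then supplement the
    result on the basis of the relevance criterion."""
    counts = Counter(term for terms in term_lists for term in set(terms))
    selected = {t for t, c in counts.items() if c >= min_sources}
    return selected | set(relevant)

# Toy registers extracted from three specialized texts:
lists = [["Lemma", "Mikrostruktur", "Umtext"],
         ["Lemma", "Mikrostruktur"],
         ["Lemma", "Mediostruktur"]]
final = select_lemmata(lists, relevant=["Wörterbuchfunktion"])
# "Lemma" and "Mikrostruktur" pass the frequency threshold;
# "Wörterbuchfunktion" is added for relevance despite low frequency.
```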
The entries are arranged in strictly alphabetical order; the typical disadvantages of this arrangement can be partly overcome thanks to the online format, since the systematic conceptual interrelations are often made clear by linked cross-references.
The dictionary currently contains one complete section (DaF) with 240 entries, while other areas are in preparation:
1) historical syntax;
2) word formation;
3) text linguistics;
4) languages for special purposes.
A newly created section LEX (lexicography) has just been completed and is about to be published. It contains lemmata from the fields of lexicography and specialized lexicography as well as metalexicography and specialized metalexicography.
4. The Section LEX
The section LEX is expected to contain about 120 entries (as of June 2011), concentrating on the most important aspects of the field of lexicography. Attention is paid to the following topics:
a. lexicography;
b. specialized lexicography;
c. dictionary research;
d. dictionary typology;
e. dictionary users and dictionary use;
f. dictionary functions;
g. lexicographic criteria;
h. macrostructure;
i. outer texts;
j. mediostructure;
k. microstructure.
In the following, an example of an entry from the LEX section is shown. It can serve as a model for the preparation of new entries. Any author can send the lemmata they have produced to the editorial board of the dictionary; after editorial review, the entry is published and marked with the abbreviation of the author's name.
5. References
Abel, A. (2006): Elektronische Wörterbücher: Neue<br />
Wege und Tendenzen. In San Vicente, F. (Ed.)
Akten der Tagung “Lessicografia bilingue e<br />
Traduzione: metodi, strumenti e approcci attuali”<br />
(Forlì, 17.-18.11.2005). Polimetrica Publisher (Open<br />
Access Publications). S. 35-56.<br />
Almind, R. (2005): Designing Internet Dictionaries.<br />
Hermes, 34, S. 37-54.<br />
Barz, I., Bergenholtz, H., Korhonen, J. (2005): Schreiben,<br />
Verstehen, Übersetzen, Lernen. Zu ein- und<br />
zweisprachigen Wörterbüchern mit Deutsch.<br />
Frankfurt a. M.: Peter Lang.<br />
Bergenholtz, H. (1989): Probleme der Selektion im<br />
allgemeinen einsprachigen Wörterbuch. In Hausmann,<br />
F. J. et al. (Eds.). Wörterbücher: ein internationales
Handbuch zur Lexikographie. Band 1. Berlin & New<br />
York: de Gruyter. S. 773-779.<br />
Bergenholtz, H., Tarp, S. (1995): Manual of LSP
lexicography. Preparation of LSP dictionaries -
problems and suggested solutions. Amsterdam,<br />
Netherlands & Philadelphia: J. Benjamins.<br />
Foschi-Albert, M., Hepp, M. (2004): Zum Projekt:<br />
Bausteine zu einem deutsch-italienischen Wörterbuch<br />
der Linguistik. In daf Werkstatt, 4, S. 43-69.<br />
Hoffmann, L., Kalverkämper, H., Wiegand, H.E. (Eds.) (1999): Fachsprachen. Handbücher zur Sprach- und Kommunikationswissenschaft (HSK 14.2.). Berlin & New York: de Gruyter.
Figure 1: The lemma "Fachlexikographie"
Pilegaard, M. (1994): Bilingual LSP Dictionaries. User<br />
benefit correlates with elaborateness of „explanation“.<br />
In Bergenholtz, H. & Schaeder, B. S. 211-228.<br />
Schaeder, B., Bergenholtz, H. (1994): Fachlexikographie.<br />
Fachwissen und seine Repräsentation in<br />
Wörterbüchern. Tübingen: G. Narr.<br />
Ripfel, M., Wiegand, H.E. (1988): Wörterbuchbenutzungsforschung.
Ein kritischer Bericht. In<br />
Studien zur Neuhochdeutschen Lexikographie VI. 2.<br />
Teilb. S. 482-520.<br />
Storrer, A., Harriehausen, B. (1998): Hypermedia <strong>für</strong><br />
Lexikon und Grammatik. Tübingen: G. Narr.<br />
Wiegand, H.E. (1977): Nachdenken über Wörterbücher.<br />
Aktuelle Probleme. In Drosdowski, H., Henne, H. &<br />
Wiegand, H.E. Nachdenken über Wörterbücher.<br />
Mannheim: Bibliographisches Institut / Dudenverlag.<br />
S. 51-102.<br />
Wiegand, H.E. (Ed.) (1996): Wörterbücher in der
Diskussion II. Vorträge aus dem Heidelberger<br />
Lexikographie-Kolloquium. Tübingen: Lexicographica<br />
Series Major 70.<br />
Wiegand, H.E. (1988): Was ist eigentlich
Fachlexikographie? In Munske, H.H., Von Polenz, P.,
Reichmann, O. & Hildebrandt, R. (Eds.). Deutscher
Wortschatz. Lexikologische Studien. Berlin & New<br />
York: de Gruyter. S. 729-790.
Multilingual Resources and Multilingual Applications - Posters<br />
Knowledge Extraction and Representation: the EcoLexicon Methodology<br />
Pilar León Araúz, Arianne Reimerink<br />
Department of Translation and Interpreting, University of Granada<br />
Buensuceso 11, 18002, Granada, Spain<br />
E-mail: pleon@ugr.es, arianne@ugr.es<br />
Abstract<br />
EcoLexicon, a multilingual terminological knowledge base (TKB) on the environment, provides an internally coherent information<br />
system which aims at covering a wide range of specialized linguistic and conceptual needs. Knowledge is extracted through corpus<br />
analysis. Then it is represented and contextualized in several dynamic and interrelated information modules. This methodology<br />
solves two challenges derived from multidimensionality: 1) it offers a qualitative criterion to represent specialized concepts<br />
according to recent research on situated cognition (Barsalou, 2009), and 2) it is a quantitative and efficient solution to the problem of<br />
information overload.<br />
Keywords: knowledge extraction, knowledge representation, EcoLexicon, multidimensionality, context<br />
1. Introduction
EcoLexicon 1 is a multilingual knowledge base on the environment. So far it has 3,283 concepts and 14,695 terms in Spanish, English and German. Currently, two more languages are being added: Modern Greek and Russian. It is aimed at users such as translators, technical writers, environmental experts, etc., who can access it through a friendly visual interface with different modules devoted to conceptual, linguistic, and graphical information.
1 http://ecolexicon.ugr.es
In this paper, we will focus on some of the steps applied<br />
to extract and represent conceptual knowledge in<br />
EcoLexicon. According to Meyer et al. (1992),<br />
terminological knowledge bases (TKBs) should reflect<br />
conceptual structures in a similar way to how concepts<br />
relate in the human mind. The organization of semantic<br />
information in the brain should thus underlie any<br />
theoretical assumption concerning the retrieval and<br />
acquisition of specialized knowledge concepts as well as<br />
the design of specialized knowledge resources (Faber,<br />
2010). In Section 2, we explain how knowledge is<br />
extracted through corpus analysis. In Section 3, we show<br />
how conceptual knowledge is represented and<br />
contextualized in dynamic and interrelated networks.<br />
2. Conceptual Knowledge Extraction<br />
According to corpus-based studies, when a term is<br />
studied in its linguistic context, information about its<br />
meaning and its use can be extracted (Meyer &<br />
Mackintosh, 1996). In EcoLexicon, the corpus consists of specialized texts (e.g. scientific journal articles, theses, etc.), semi-specialized texts (textbooks, manuals, etc.)
texts for the general public, all in the multidisciplinary<br />
domain of the environment. Each language has a separate<br />
corpus and the knowledge is extracted bottom-up from<br />
each of the corpora. The underlying ontology is language<br />
independent and based on the knowledge extracted from<br />
all the corpora. The extraction of conceptual knowledge<br />
combines direct term searches and knowledge pattern<br />
(KP) analysis. According to many studies on the subject,<br />
KPs are considered one of the most reliable methods for<br />
knowledge extraction (Barrière, 2004). Normally, the<br />
most recurrent knowledge patterns (KPs) for each<br />
conceptual relation identified in previous research are<br />
used to find related term pairs (Auger & Barrière, 2008).<br />
Afterwards, these terms are used for direct term searches<br />
to find new KPs and relations. Therefore, the<br />
methodology consists of the cyclic repetition of both<br />
procedures.<br />
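The cyclic methodology of KP matching and direct term search can be sketched roughly as below. The patterns and example sentences are simplified stand-ins for the real KP inventories and concordances, and all names are hypothetical.

```python
import re

# Toy knowledge patterns (KPs) for two conceptual relations; real KP
# inventories are far richer (cf. Auger & Barriere, 2008).
KPS = {
    "caused_by": [r"{x} (?:caused by|due to) (\w+(?: \w+)?)"],
    "affects":   [r"{x} threatens (\w+(?: \w+)?)"],
}

def extract(term, sentences):
    """One half-cycle: a direct search for `term` plus KP matching;
    the related terms found could then seed the next search cycle."""
    triples = []
    for rel, patterns in KPS.items():
        for pat in patterns:
            rx = re.compile(pat.format(x=re.escape(term)), re.IGNORECASE)
            for sentence in sentences:
                for m in rx.finditer(sentence):
                    triples.append((term, rel, m.group(1)))
    return triples

corpus = ["Erosion caused by wave action is common.",
          "Coastal erosion threatens sandy beaches."]
triples = extract("erosion", corpus)
```

Feeding the extracted terms (here, e.g. "wave action") back in as new search terms yields the cyclic repetition of both procedures described above.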
When searching for the term EROSION, conceptual concordances show how different KPs convey different relations with other specialized concepts. The main relations are caused_by, affects, has_location and has_result, which highlight the procedural nature of the concept and the important role played by non-hierarchical relations.
In Figure 1, EROSION is related to its diverse kinds of agents, such as STORM SURGE (1, 7), WAVE ACTION (2, 13), RAIN (3), CONSTRUCTION PROJECTS (6) and HUMAN-INDUCED FACTORS (11). They can be retrieved thanks to all KPs expressing the relation caused_by, such as resultant (1), agent for (2, 3), due to (6, 7), and responsible for (11). This relation can also be conveyed through compound names such as flood-induced (10) or storm-caused (12) and any expression containing cause as a verb or noun: one of the causes of (9), cause (4, 5, 8) and caused by (14). EROSION is also linked to the patients it affects, such as WATER (15), SEDIMENTS (16), and BEACHES (17). However, the affected entities, or patients, are often equivalent to locations (e.g. if EROSION affects BEACHES it actually takes place at the BEACH). The difference lies in the kind of KPs linking the propositions. The affects relation is often reflected through the preposition of (10) or verbs like threatens (18), damaged by (17) or provides (19), whereas the has_location relation is conveyed through prepositions linked to directions (around, 21; along, 22; downdrift, 23) or spatial expressions such as takes place (24). In this way, EROSION appears linked to the following locations: LITTORAL BARRIERS (21), COASTS (22) and STRUCTURES (23). Result is an essential dimension in the description of any process, since it also has certain effects, which can be the creation of a new entity (SEDIMENTS, 25; MARSHES, 29; BAYS, 31) or the beginning of another process (SEAWATER INTRUSION, 31; PROFILE STEEPENING, 32).
Figure 1: Non-hierarchical relations associated with EROSION
Figure 2: Hierarchical relations associated with EROSION
All these related concepts are quite heterogeneous. They<br />
belong to different paradigms in terms of category<br />
membership or hierarchical range. For instance, some of<br />
the agents of EROSION are natural (WIND, WAVE ACTION)<br />
or artificial (JETTY, MANGROVE REMOVAL) and others are<br />
general concepts (STORM) or very specific (MEANDERING<br />
CHANNEL). This explains why knowledge extraction must<br />
still be performed manually, but it also illustrates one of<br />
the major problems in knowledge representation:<br />
multidimensionality (Rogers, 2004).<br />
This is better exemplified in the concordances in Figure<br />
2, since multidimensionality is most often codified in the<br />
is_a relation. In the scientific discourse community,
concepts are not always described in the same way<br />
because they depend on perspective and subject-fields.<br />
For instance, EROSION is described as a natural process of<br />
REMOVAL (33), a GEOMORPHOLOGICAL PROCESS (34), a<br />
COASTAL PROCESS (35) or a STORMWATER IMPACT (36).<br />
The first two cases can be considered traditional<br />
ontological hyperonyms. The choice of any of them<br />
depends on the upper-level structure of the<br />
representational system and its level of abstraction.<br />
However, COASTAL PROCESS and STORMWATER IMPACT<br />
frame the concept in more concrete subject-fields and<br />
referential settings. The same applies to subtypes, where<br />
the multidimensional nature of EROSION is clearly shown.<br />
It can thus be classified according to the dimensions of<br />
result (SHEET, RILL, GULLY, 37; DIFFERENTIAL EROSION,<br />
38), direction (LATERAL, 39; HEADWARD EROSION, 49),<br />
agent (WAVE, 41; WIND, 43) and patient (SEDIMENT, 47;<br />
DUNE, 48; SHORELINE EROSION, 49).<br />
3. Dynamic Knowledge Representation<br />
Since categorization is a dynamic context-dependent<br />
process, the representation and acquisition of specialized<br />
knowledge should certainly focus on contextual<br />
variation. Barsalou (2009: 1283) states that a concept<br />
produces a wide variety of situated conceptualizations in<br />
specific contexts. Accordingly, dynamism in the<br />
environmental domain comes from the effects of context<br />
on the way concepts are interrelated. Multidimensionality<br />
is commonly regarded as a way of enriching traditional<br />
static representations (León Araúz and Faber, 2010).<br />
However, in the environmental domain it has caused a<br />
great deal of information overload, which ends up<br />
jeopardizing knowledge acquisition. This is mainly<br />
caused by versatile concepts, such as WATER, which are<br />
usually top-level general concepts involved in a myriad<br />
of events.<br />
Our claim is that any specialized domain contains<br />
sub-domains in which conceptual dimensions become<br />
more or less salient depending on the activation of<br />
specific contexts. As a result, a more believable<br />
representational system should account for<br />
re-conceptualization according to the situated nature of<br />
concepts. In EcoLexicon, this is done by dividing the<br />
global environmental specialized field in different<br />
contextual domains: HYDROLOGY, GEOLOGY,<br />
BIOLOGY, METEOROLOGY, CHEMISTRY,<br />
Multilingual Resources and Multilingual Applications - Posters<br />
ENGINEERING, WATER TREATMENT, COASTAL<br />
PROCESSES and NAVIGATION.<br />
Figure 3: EROSION in a context-free network
Nevertheless, it is not only versatile concepts such as WATER that need to be constrained, since information overload can also affect any other concept that is somehow linked with
versatile ones. For instance, Figure 3 shows EROSION in a<br />
context-free network, which appears overloaded mainly<br />
because it is strongly linked to WATER, since this is one of<br />
its most important agents.<br />
Figure 4: EROSION in the GEOLOGY domain<br />
Contextual constraints are applied neither to individual concepts nor to individual relations; instead, they are applied to each conceptual proposition. When constraints are applied, EROSION is only linked to propositions belonging to the context of GEOLOGY (Figure 4) or HYDROLOGY (Figure 5).
Figure 5: EROSION in the HYDROLOGY domain
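Constraining whole propositions rather than individual concepts or relations can be sketched as follows. The propositions and domain tags are invented for illustration and are not EcoLexicon's actual data.

```python
# Each conceptual proposition carries its own set of contextual domains;
# filtering keeps or drops whole propositions, never isolated concepts.
PROPOSITIONS = [
    ("EROSION", "caused_by", "WATER",         {"GEOLOGY", "HYDROLOGY"}),
    ("EROSION", "caused_by", "WIND",          {"GEOLOGY"}),
    ("FLUVIAL EROSION", "type_of", "EROSION", {"GEOLOGY", "HYDROLOGY"}),
    ("WIND EROSION", "type_of", "EROSION",    {"GEOLOGY"}),
    ("ATTRITION", "type_of", "EROSION",       {"GEOLOGY"}),
]

def network(concept, domain):
    """Return the propositions displayed for `concept` under a domain constraint."""
    return [(s, r, o) for s, r, o, doms in PROPOSITIONS
            if domain in doms and concept in (s, o)]

geo = network("EROSION", "GEOLOGY")    # all five toy propositions survive
hyd = network("EROSION", "HYDROLOGY")  # only the WATER-related ones survive
```

Under this scheme the same concept yields differently sized networks per domain, mirroring the contrast between Figures 4 and 5.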
Comparing both networks and especially focusing on<br />
EROSION and WATER, the following conclusions can be<br />
drawn. The number of conceptual relations changes from<br />
one network to another, as EROSION is not equally<br />
relevant in both domains. EROSION is a prototypical concept of the GEOLOGY domain, which is why it shows
more propositions. Nevertheless, since it is also strongly<br />
linked with WATER, the HYDROLOGY domain is also<br />
essential in the representation of EROSION. Relation types<br />
do not substantially change from one network to the<br />
other, but the GEOLOGY domain shows a greater<br />
number of type_of relations. This is due to the fact that<br />
the HYDROLOGY domain only includes types of<br />
EROSION whose agent is WATER, such as FLUVIAL<br />
EROSION and GLACIER EROSION. The GEOLOGY domain<br />
includes those and others, such as WIND EROSION, SHEET<br />
EROSION, ANTHROPIC EROSION, etc. The GEOLOGY<br />
domain, on the other hand, also includes concepts that are<br />
not related to HYDROLOGY such as ATTRITION because<br />
there is no WATER involved.<br />
In contrast, WATER displays more relations in the
HYDROLOGY domain. This is caused by the fact that<br />
WATER is a much more prototypical concept in<br />
HYDROLOGY. Therefore, its first hierarchical level<br />
shows more concepts. For example, in GEOLOGY, there are fewer WATER subtypes because the network only shows
those that are related to the geological cycle (MAGMATIC<br />
WATER, METAMORPHIC WATER, etc.). In HYDROLOGY,<br />
there are more WATER subtypes related to the<br />
hydrological cycle itself (SURFACE WATER,<br />
GROUNDWATER, etc.). Even the shape of each network<br />
illustrates the prototypical effects of WATER or EROSION.<br />
In Figure 4, EROSION is displayed in a radial structure that<br />
shows it as a central concept in GEOLOGY, whereas in<br />
Figure 5, the asymmetric shape of the network implies<br />
that, more than EROSION, WATER is the prototypical<br />
concept of HYDROLOGY.<br />
4. Acknowledgements<br />
This research has been carried out within the project FFI2011-22397/FILO funded by the Spanish Ministry of
Science and Innovation.<br />
5. References<br />
Auger, A., Barrière, C. (2008): Pattern-based approaches<br />
to semantic relation extraction: A state-of-the-art.<br />
Special Issue on Pattern-Based Approaches to<br />
Semantic Relation Extraction, Terminology, 14(1),<br />
pp. 1–19.
Barrière, C. (2004): Knowledge-rich contexts discovery.<br />
In Proceedings of the 17th Canadian Conference on<br />
Artificial Intelligence (AI’2004). May 17–19, London,<br />
Ontario, Canada.<br />
Barsalou, L.W. (2009): Simulation, situated<br />
conceptualization and prediction. Philosophical<br />
Transactions of the Royal Society of London:<br />
Biological Sciences, 364, pp. 1281–1289.<br />
Faber, P. (2010): Conceptual modelling in specialized<br />
knowledge resources. In XII International Conference<br />
Cognitive Modelling in Linguistics. September,<br />
Dubrovnik.<br />
León Araúz, P., Faber, P. (2010): Natural and contextual<br />
constraints for domain-specific relations. In<br />
Proceedings of Semantic relations. Theory and<br />
Applications. 18–21 May, Valletta, Malta.
Meyer, I., Mackintosh, K. (1996): The corpus from a
terminographer’s viewpoint. International Journal of<br />
Corpus Linguistics, 1(2), pp. 257–285.<br />
Meyer, I., Bowker, L., Eck, K. (1992): COGNITERM:<br />
An experiment in building a knowledge-based term<br />
bank. In Proceedings of Euralex ’92, pp. 159–172.<br />
Rogers, M. (2004): Multidimensionality in concepts<br />
systems: a bilingual textual perspective. Terminology,<br />
10(2), pp. 215–240.
Multilingual Resources and Multilingual Applications - Posters<br />
Processing Multilingual Customer Contacts via Social Media<br />
Michaela Geierhos, Yeong Su Lee, Matthias Bargel<br />
Center for Information and Language Processing (CIS)<br />
Ludwig Maximilian University of Munich<br />
Geschwister-Scholl-Platz 1, D-80539 München, Germany<br />
E-mail: micha@cis.uni-muenchen.de, yeong@cis.uni-muenchen.de, matthias@cis.uni-muenchen.de<br />
Abstract<br />
Within this paper, we will describe a new approach to customer interaction management by integrating social networking channels<br />
into existing business processes. Until now, contact center agents still read these messages and forward them to the persons in charge of customer service in the company. But with the introduction of Web 2.0 and social networking, clients are more likely to communicate
an active communication with international clients via social media, the multilingual consumer contacts have to be categorized and<br />
then automatically assigned to the corresponding business processes (e.g. technical service, shipping, marketing, and accounting).<br />
This allows the company to follow general trends in customer opinions on the Internet, but also to record two-sided communication for customer relationship management.
Keywords: classification of multilingual customer contacts, contact center application support, social media business integration<br />
1. Introduction<br />
Considering that Facebook alone had more than 750 million active users 1 in August 2011, it becomes apparent that Facebook is currently the medium most preferred by consumers and companies alike. Since many businesses
are moving to online communities as a means of<br />
communicating directly with their customers, social<br />
media has to be explored as an additional communication<br />
channel between individuals and companies. While the<br />
English speaking consumers on Facebook are more likely<br />
to respond to communication rather than to initiate<br />
communication with an organization (Browne et al.,<br />
2009), the German speaking community in turn directly<br />
contacts the companies. Therefore, some German<br />
enterprises already have regularly updated Facebook<br />
pages for customer service and support, e.g. Telekom.<br />
For traditional communication channels such as telephone and e-mail, there are already established approaches and systems for handling incoming requests. They are
used by companies to manage all client contacts through<br />
a variety of media such as telephone, fax, letter, e-mail,
and online live chat. Contact center agents are therefore<br />
1 http://www.facebook.com/press/info.php?statistics<br />
responsible for assigning all customer requests to internal
business processes. However, social networking has not<br />
yet been integrated into customer interaction<br />
management tools.<br />
1.1. Related Work<br />
With the growth of social media, companies and<br />
customers now use sites such as Facebook and Twitter to<br />
share information and provide support. More and more<br />
integrated cross-platform campaigns are dealing with<br />
product opinion mining or providing web traffic statistics<br />
to analyze customer behavior. There are plenty of commercial solutions, of varying quality, for these tasks,
e.g. GoogleAlerts, BuzzStream, Sysomos, Alterian,<br />
Visible Technologies, and Radian6.<br />
The current trend is toward the development of virtual contact centers that integrate a company's fan profiles on social networking sites. Such a virtual contact center processes the customer contacts and forwards them to the company's service and support team. For instance, Eptica provides a
commercial tool for customer interaction management<br />
via Facebook.<br />
Other monitoring systems try to predict election results<br />
(Gryc & Moilanen, 2010) or success of movies and music<br />
(Krauss et al., 2008) by using scientific analysis of<br />
opinion polls or doing sentiment analysis on special web<br />
blogs or online forum discussions. Another relevant issue<br />
is the topic and theme identification as well as sentiment<br />
detection. Since blogs consist of news or messages<br />
dealing with various topics, blog content has to be<br />
divided into several topic clusters (Pal & Saha, 2010).<br />
1.2. Towards a Multilingual Social Media<br />
Customer Service<br />
Our proposed solution towards a web monitoring and<br />
customer interaction management system is quite simple.<br />
We focus on a modular architecture fully configurable for<br />
all components integrated in its work-flow (e.g. software,<br />
data streams, and human agents for customer service).<br />
Our first prototype, originally designed for processing<br />
customer messages posted on social networking sites<br />
about mobile-phone specific issues, can also deal with<br />
other topics and use different text types such as e-mails,<br />
blogs, RSS feeds etc. Unlike the commercial monitoring<br />
systems mentioned above, we concentrate on a linguistic,<br />
rule-based approach for message classification and<br />
product name recognition. One of its core innovations is<br />
its paraphrasing module for intra- and inter-lingual<br />
product name variations because of different national and<br />
international spelling rules or habits. By mapping product<br />
name variations to an international canonical form, our<br />
system allows for answering questions like Which<br />
statements are made about this mobile phone in which<br />
languages/in which social networks/in which countries?<br />
Its product name paraphrasing engine is designed in such<br />
a way that standard variants are assigned automatically,<br />
regular variants are assigned semi-automatically and<br />
idiosyncratic variants can be added manually. Moreover,<br />
our system can be adapted according to user’s language<br />
needs, i.e. the application can easily be extended to further natural languages. Until now, our prototype can
deal with three very different languages: German, Greek,<br />
and Korean. It therefore provides maximum flexibility to<br />
service providers by enabling multiple services with only<br />
one system.<br />
2. System Overview<br />
Since customers first share their problems with a social<br />
networking community before directly addressing the<br />
company, the social networking site will be the interface<br />
between customer and company. For instance, Facebook users post messages on the wall of a telecommunication company concerning tariffs, technical malfunctions or bugs of its products, and positive and negative feedback. The collector should download data every n seconds (e.g. 10 sec) from the monitored social networking site. Above all
it should be possible to choose the social networking site,<br />
especially the business pages, to be monitored. This can<br />
be configured by updating the collector’s settings. In<br />
order to retrieve data from Facebook, we use its graph<br />
API. Then customer messages will be stored in a database.<br />
After simplifying their structure 2, the requests have to be
categorized by the classification module. During the<br />
classification process, we assign both content and<br />
semantic tags (cf. Sect. 3.2) as features to the user posts<br />
before re-storing them in a database. According to the<br />
tags the messages are assigned to the corresponding<br />
business process. This n : 1 relationship is modeled in the<br />
contact center interface before passing these messages as<br />
e-mail requests to the customer interaction management<br />
tool used in contact centers. Finally, the pre-classified<br />
e-mails are automatically forwarded to the persons in<br />
charge of customer services. Those agents reply to the<br />
client requests and their responses will be delivered via<br />
e-mail to the contact center before being transformed into<br />
social network messages and sent back to the Facebook<br />
wall. Afterwards, the Facebook user can read his answer.<br />
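The flow just described (collect posts, classify them, route them to a business process) could be sketched as below. The tags, routing table, and messages are invented for illustration; the real system retrieves posts via the Facebook graph API and passes them on as e-mail requests to the contact center tool.

```python
# One cycle of a hypothetical collector/classifier pipeline.
ROUTING = {                 # n:1 mapping from message tags to business processes
    "tariff": "accounting",
    "malfunction": "technical service",
    "praise": "marketing",
}

def classify(message):
    """Stand-in for the rule-based classification module (cf. Sect. 3.2)."""
    tags = [tag for tag in ROUTING if tag in message.lower()]
    return tags or ["other"]

def process_posts(posts):
    """Tag each collected post and route it as an e-mail-style request."""
    outbox = []
    for post in posts:
        for tag in classify(post):
            outbox.append({"text": post, "tag": tag,
                           "process": ROUTING.get(tag, "general support")})
    return outbox

wall = ["My new tariff is billed twice!",
        "The phone shows a malfunction."]
requests = process_posts(wall)
```

In the deployed system the replies would then travel the reverse path, from the contact center back to the Facebook wall.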
3. Linguistic Processing of Customer Contacts
Within the customer requests, we try to discover<br />
relationships between clients and products, customers<br />
and technical problems, products and features that will be<br />
used for classification purposes. We are aware of the fact<br />
that many products (mobile phones, chargers, headsets,<br />
batteries, software, and operating systems) are sold in<br />
different countries under the same or under different<br />
names. Our system stores a unique international ID for<br />
each product. Product names and their paraphrases are<br />
language specific. Our prototype normalizes found<br />
product names to the international ID.<br />
2 For example, Facebook wall posts are represented as structured data that can easily be retrieved from the Facebook graph API. We simplify this data format before using it for extraction and classification purposes.
3.1. International Product Name Paraphrasing<br />
Our first approach to product name paraphrasing was to<br />
use paraphrasing classes. Much as verbs are inflected<br />
according to their inflection class, product names were<br />
inflected according to their paraphrasing class. Yet paraphrasing classes had to be assigned manually, and a large number of classes was needed. Therefore, we decided
to use a simplified system: Each product or manufacturer<br />
name is stored in a canonical form: Thus, a name of the<br />
type glofiish g500 is stored in the form glofiish-g-500,<br />
even if glofiish g-500 or glofiish g500 should be more<br />
frequent. The minus characters tell our system where a<br />
new part of the product name begins. A product or manufacturer name can have permutations: in German, o2 online tarif has the permutation tarif o2 online. Standard permutations are added automatically: a product or manufacturer name with three parts has the standard permutation 123. German tariff names of the type o2 online tarif have the standard permutations 312 and 23 von 1, as in online tarif von o2 (online tariff by o2).
Apart from their canonical name and its variants, product<br />
names can also have spelling variants. Thus, android has<br />
the spelling variants androit, antroid, antroit, andorid,<br />
adroid, andoid and andoit. (These are some of the most<br />
frequent ways android is actually spelt in the customer<br />
messages.) For each spelling variant, our system<br />
automatically generates all paraphrases that exist<br />
according to the standard and the manually added<br />
permutations of the canonical name. For example, the paraphrases of the mobile phone name e-ten glofiish-g-500 include e-ten klofisch-g-500, e-ten klofisch-g 500, e-ten klofisch g-500, etc.
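To make the mechanism concrete, the following sketch (our illustration, not the authors' implementation; function and parameter names are invented) generates paraphrases from a canonical hyphenated name: the parts are reordered according to the permitted permutations, and each junction between two parts may surface as a hyphen, a space, or nothing.

```python
from itertools import product as cartesian

def paraphrases(canonical, permutations=None, spelling_variants=()):
    """Generate surface paraphrases of a canonical hyphenated name.

    Each permutation is a tuple of part indices; every junction between
    two adjacent parts may surface as "-", " " or "" (no separator).
    """
    results = set()
    for name in (canonical, *spelling_variants):
        parts = name.split("-")
        # by default, only the identity permutation is applied
        perms = permutations or [tuple(range(len(parts)))]
        for perm in perms:
            ordered = [parts[i] for i in perm]
            for seps in cartesian(("-", " ", ""), repeat=len(ordered) - 1):
                variant = ordered[0]
                for sep, part in zip(seps, ordered[1:]):
                    variant += sep + part
                results.add(variant)
    return results
```

With this rule set, paraphrases("glofiish-g-500") contains glofiish g-500 and glofiish g500 alongside the canonical form, and spelling variants such as klofisch-g-500 are expanded in the same way.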
Apart from spelling variants, product names can also<br />
have lexical variants. The mobile phone tct mobile<br />
one-touch-v-770-a has the lexical variant playboy-phone.<br />
The regular permutation transformations are not applied<br />
to lexical variants. But lexical variants and their<br />
manufacturer-based variants (e.g. tct playboy-phone and<br />
playboy-phone) are, of course, paraphrases, too.<br />
3.2. Grammar-based classification<br />
Grammar experts can create any number of content and<br />
sentiment classifiers. A classifier’s grammar consists of a<br />
set of positive constraints and a set of negative<br />
constraints. To classify a message, our system simply<br />
applies the grammars of all its classifier objects to the<br />
message. If a content classifier’s grammar matches, its<br />
tag is added to the message’s content tags. Sentiment<br />
classification works analogously with the exception that<br />
exactly one tag is assigned.<br />
Content and sentiment classifiers are language- and URL-specific: a classifier has exactly one language and a
set of URLs. It will only be applied to messages that have<br />
the same language and that stem from one of the URLs in<br />
the classifier’s set of URLs. In general, content tags and the product list are independent of each other. But many
classifiers will have constraints that require that a product<br />
(or other entity) of a certain type be mentioned. Thus, a<br />
classifier that assigns the tag phone available? (e.g. to the<br />
message When will the new iPhone be released?) would<br />
probably include the mobile phone grammar in its<br />
constraints by using the special term \mobile_phone.<br />
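As a sketch of how such a classifier could look (our illustration; class and method names are invented, the constraints are modeled as plain regular expressions rather than the paper's grammar formalism, and only content classification is shown):

```python
import re

class Classifier:
    """A tag plus positive and negative constraints, restricted to one
    language and a set of source URLs (illustrative names)."""

    def __init__(self, tag, language, urls, positive, negative=()):
        self.tag = tag
        self.language = language
        self.urls = set(urls)
        self.positive = [re.compile(p, re.IGNORECASE) for p in positive]
        self.negative = [re.compile(n, re.IGNORECASE) for n in negative]

    def matches(self, message, language, url):
        # only applied to messages with matching language and origin
        if language != self.language or url not in self.urls:
            return False
        # every positive constraint must match ...
        if any(p.search(message) is None for p in self.positive):
            return False
        # ... and no negative constraint may match
        return not any(n.search(message) for n in self.negative)

def content_tags(message, language, url, classifiers):
    # every matching content classifier contributes its tag
    return {c.tag for c in classifiers if c.matches(message, language, url)}
```

The hotline example from section 4.1 then reduces to creating a classifier whose positive constraints contain the single pattern hotline.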
4. Discussion<br />
4.1. No statistical approach<br />
In our view, the fact that contact center agents can invent new tags and assign new or existing tags to (mis)classified messages, provided they mark the strings that justify the tag assignment, is a good reason for not using a statistical approach. If we used a
statistical approach, human work would be necessary at some point in the development process: some algorithm
would have to be trained. In our approach, the human<br />
work is done in the customer management process. This<br />
way, two things are achieved in one step: The customer’s<br />
request is answered and the classification algorithm is<br />
enhanced. The system is being enhanced while it is used.<br />
There is no need to interrupt the customer interaction in<br />
order to train it on new data that data specialists have<br />
created. Besides, manual intervention is much more straightforward and transparent with a grammar of the type described above than it would be with a statistical algorithm. Our system is flexible in the sense that it can
easily be modified in such a way that very specific<br />
requirements are met. If, e.g., a future user of our tool (a<br />
company that wants to interact with its customers) should<br />
want to assign a certain tag, such as hotline problem, to every message that contains the word hotline, then this requirement can be met by simply adding the line hotline to the positive constraints of the classifier
called hotline problem.<br />
Multilingual Resources and Multilingual Applications - Posters<br />
4.2. Applying the DRY principle<br />
Our prototype follows the DRY principle (Don't repeat<br />
yourself (Murrell, 2009:35)): Changes are only made in<br />
one place. An example: the Korean variants of the mobile<br />
phone name with the international ID google-nexus-s<br />
include google nexus s, google nexus-s, nexus s, nexus-s,<br />
구글 넥서스 에스, 구글 넥서스에스, 구글 넥서스 s, 구글 넥서스-s, 넥서스<br />
에스, 넥서스에스, 넥서스 s, 넥서스-s, 구글 nexus s, 구글 nexus-s,<br />
google 넥서스에스, 구글의 넥서스에스. This phenomenon is<br />
represented in our system as follows: The Korean<br />
producer name corresponding to the international ID<br />
google has the variants google and 구글. The Korean<br />
mobile phone name with the international ID nexus-s has<br />
the variants nexus-s, 넥서스에스 and 넥서스-s. This is the<br />
only information our users have to store in order to make<br />
the system generate these and many other variants. Our<br />
tool generates google nexus s, 구글 넥서스 s and similar<br />
variants using the general rule that in any permutation of a<br />
product name any minus character may be replaced by a<br />
space character. It generates 넥서스에스, 넥서스 s and similar<br />
variants using the general rule that the producer name may<br />
be omitted. And our system generates 구글의 넥서스에스<br />
using the two Korean variants of the producer name and<br />
the general rule that phone names can have the form [producer name]의 [product name]. (의 is a genitive affix, i.e. 구글의 넥서스에스 literally means Google's
Nexus S or Nexus S by Google.)<br />
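The three generation rules described above can be sketched as follows (our illustration, not the authors' code; function names are invented): hyphen/space substitution, optional producer name, and the genitive pattern [producer]의 [product].

```python
from itertools import product as cartesian

def spacings(name):
    """All surface forms where each hyphen stays or becomes a space."""
    parts = name.split("-")
    forms = set()
    for seps in cartesian(("-", " "), repeat=len(parts) - 1):
        form = parts[0]
        for sep, part in zip(seps, parts[1:]):
            form += sep + part
        forms.add(form)
    return forms

def korean_phone_variants(producer_variants, product_variants):
    """Combine the rules: hyphen/space substitution, optional producer
    name, and the genitive pattern [producer]의 [product]."""
    prods = set().union(*(spacings(p) for p in product_variants))
    mans = set().union(*(spacings(m) for m in producer_variants))
    variants = set(prods)                    # rule: producer may be omitted
    for man in mans:
        for prod in prods:
            variants.add(man + " " + prod)   # plain juxtaposition
            variants.add(man + "의 " + prod)  # genitive 의: "X의 Y"
    return variants
```

Storing only the producer variants google/구글 and the product variants nexus-s, 넥서스-s and 넥서스에스 then yields google nexus s, 구글 넥서스 s, 구글의 넥서스에스 and the other listed forms.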
We might, of course, add the general rule to our product<br />
name paraphrasing engine that any part of a Korean<br />
product name may be spelt either with Latin or with<br />
Hangul characters – according to several sets of<br />
transliteration conventions that are used in parallel.<br />
Any change in a producer, tariff or product name object,<br />
such as the Korean mobile phone name with the<br />
international ID nexus-s, has implications for the<br />
grammars of the message classifiers: Newly generated<br />
variants of the product name must be matched by all<br />
instances of \mobile_phone in all grammars. For<br />
efficiency reasons, we compile all product names, tariff<br />
names, producer names, message classification grammars,<br />
sentiment classification grammars, and so on, into one<br />
single function. This function is very efficient, because it<br />
doesn’t do much more than apply one very large,<br />
compiled regular expression. The compiling and<br />
reloading of this function is done in the background, so<br />
the users of our tool do not need to know anything about it.<br />
They don’t even have to understand the word compile.<br />
They just need to know that the system sometimes needs a<br />
few seconds to be able to use changed objects.<br />
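The "one very large compiled regular expression" can be pictured as follows (a minimal sketch under our own naming; the real system compiles grammars and name lists into a single function, which we approximate here with named regex groups):

```python
import re

def compile_matcher(names_by_tag):
    """Fold all name variants into one compiled alternation with named
    groups, so a match immediately identifies which entity class fired."""
    branches = []
    for tag, names in names_by_tag.items():
        # longer variants first, since regex alternation is leftmost-first
        alt = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))
        branches.append(f"(?P<{tag}>{alt})")
    pattern = re.compile("|".join(branches), re.IGNORECASE)

    def match(text):
        # return the set of entity classes found in the text
        return {m.lastgroup for m in pattern.finditer(text)}

    return match
```

Recompiling amounts to calling compile_matcher again with the updated name lists, which can happen in the background while the old matcher stays in use.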
5. Conclusion and Outlook<br />
In this paper, we described a new technical service dealing with the integration of social networking channels into customer interaction management tools. Mining social networks for classification purposes is not new; assigning customer messages to business processes instead of classifying them by topic, however, did not exist before. Above all, our system features
effective named entity recognition because of its name<br />
paraphrasing mechanism dealing with different types of<br />
misspellings in both intra- and interlingual names of<br />
tariffs, products, manufacturers and providers. Future<br />
research will expand upon this study, investigating other<br />
social networking sites and additional companies across a<br />
range of non-telecommunication products or services.<br />
6. Acknowledgements<br />
This work was supported by grant no. KF2713701ED0<br />
awarded by the German Federal Ministry of Economics<br />
and Technology.<br />
7. References<br />
Browne, R., Clements, E., Harris, R., Baxter, S. (2009):<br />
Business and consumer communication via online<br />
social networks: a preliminary investigation. In<br />
ANZMAC 2009.<br />
Gryc, W., Moilanen, K. (2010): Leveraging Textual<br />
Sentiment Analysis with Social Network Modeling:<br />
Sentiment Analysis of Political Blogs in the 2008 U.S.<br />
Presidential Election. In Proceedings of the From Text<br />
to Political Positions Workshop (T2PP 2010), Vrije<br />
Universiteit, Amsterdam, April 9–10 2010.<br />
Krauss, J., Nann, S., Simon, D., Fischbach, K., Gloor, P.A.<br />
(2008): Predicting Movie Success and Academy<br />
Awards Through Sentiment and Social Network<br />
Analysis. In ECIS 2008.<br />
Murrell, P. (2009): Introduction to Data Technologies.<br />
Auckland, New Zealand.<br />
Pal, J.K., Saha, A. (2010): Identifying Themes in Social<br />
Media and Detecting Sentiments. Technical Report<br />
HPL-2010-50, HP Laboratories.
ATLAS – A Robust Multilingual Platform for the Web<br />
Diman Karagiozov*, Svetla Koeva**, Maciej Ogrodniczuk***, Cristina Vertan****<br />
* Tetracom Interactive Solutions Ltd., ** Bulgarian Academy of Sciences,<br />
*** Polish Academy of Sciences, **** University of Hamburg,<br />
*Tetracom Ltd., Sofia, Bulgaria, **52 Shipchenski prohod, bl. 17, Sofia 1113, Bulgaria,
***ul. J.K. Ordona 21, 01-237 Warszawa, Poland, ****Von-Melle-Park 6, 20146 Hamburg, Germany
E-mail: diman@tetracom.com, svetla@dcl.bas.bg, maciej.ogrodniczuk@gmail.com,<br />
cristina.vertan@uni-hamburg.de<br />
Abstract<br />
This paper presents a novel multilingual framework integrating linguistic services around a Web-based content management system.<br />
The language tools provide a semantic foundation for advanced CMS functions such as machine translation, automatic categorization and text summarization. The tools are integrated into processing chains on the basis of the UIMA architecture, using a uniform annotation model. The CMS is used to prepare two sample online services illustrating the advantages of applying language
technology to content administration.<br />
Keywords: content management system, language processing chains, UIMA, language technology<br />
1. Introduction<br />
In recent years, the number of applications which are entirely Web-based, or at least offer a Web front-end, has grown dramatically. In response to the need to manage all this data, a new type of system appeared: the Web content management system, which we will refer to in this article as WCMS. Existing WCMSs focus on the storage of documents in databases and provide mostly full-text search functionality. These systems have limited applicability, for two reasons:
• data available online is often multilingual, and
• documents within a CMS are semantically related (they share some common knowledge or belong to similar topics).
In short, currently available CMSs do not exploit modern techniques from information technology such as text mining, the Semantic Web or machine translation.
The ICT PSP EU project ATLAS¹ – Applied Technology for Language-Aided CMS – aims at filling this gap by providing three innovative Web services within a WCMS.
¹ The work reported here was carried out within the Applied Technology for Language-Aided CMS project co-funded by the European Commission under the Information and Communications Technologies (ICT) Policy Support Programme (Grant Agreement No 250467). The authors would like to thank all representatives of project partners for their contribution.
These three Web services, i-Librarian, EUDocLib and i-Publisher, are not only thematically different but also offer different levels of intelligent information processing.
The ATLAS WCMS makes use of state-of-the-art text technology methods in order to extract information and
cluster documents according to a given hierarchy. A text<br />
summarization module and a machine translation engine<br />
are embedded as well as a cross-lingual semantic search<br />
engine (Belogay et al., 2011).
The cross-lingual search engine implements Semantic<br />
Web technology: the document content is represented as<br />
RDF triples and the search index is built up from these<br />
triples.<br />
The RDF representation of a document not only collects metadata about the whole file but also exploits the linguistic analysis of the document and stores the mapping of the file onto ontological concepts.
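Such a representation can be pictured as plain subject–predicate–object triples (an illustrative sketch only; the predicate prefixes dc: and atlas: are invented here, not taken from the project):

```python
def document_triples(doc_id, metadata, concepts):
    """Express a document's metadata and its ontological concepts as
    RDF-style (subject, predicate, object) triples; a search index can
    then be built directly from these triples."""
    # one triple per metadata field, e.g. ("doc1", "dc:title", "...")
    triples = [(doc_id, "dc:" + key, value) for key, value in metadata.items()]
    # plus one triple per ontological concept the document maps onto
    triples += [(doc_id, "atlas:hasConcept", c) for c in concepts]
    return triples
```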
This paper presents the architecture of the ATLAS system with a particular focus on the embedded language processing components, aiming to show how robust NLP (natural language processing) tools can be wrapped in a common framework.
2. Language resources in<br />
the ATLAS System<br />
The linguistic diversity in the project is a challenge not to<br />
be neglected: the languages belong to four language<br />
families and involve three alphabets. To our knowledge it<br />
is the first WCMS which will offer solutions for<br />
documents written in languages from Central and<br />
South-Eastern Europe.<br />
Whilst the standardised development of tools for widespread languages such as English and German is more common, the situation is quite different for languages from Central and South-Eastern Europe (see
http://www.c-phil.uni-hamburg.de/view/Main/LrecWork<br />
shop2010).<br />
Tools with different processing depths, different output formats and sometimes very particular approaches represent the current state of the art in the language technology landscape of the above-mentioned area (Degórski, Marcińczuk &
Przepiórkowski, 2008). One of the innovative issues in<br />
project ATLAS is the integration of linguistically and<br />
technologically heterogeneous language tools within a<br />
common framework.<br />
The following description presents the steps taken in<br />
order to provide such common representation.<br />
• Starting from the fixed desiderata of text summarisation, automatic document classification, machine translation and cross-lingual information retrieval, the minimal list of tools required by such engines and available for all languages involved in the project has been collated. It includes:
o tokeniser,<br />
o sentence boundary detector,<br />
o paragraph boundary detector,<br />
o lemmatizer,<br />
o PoS Tagger,<br />
o NP (noun phrase) chunker,<br />
o NE (named entity) extractor.<br />
Some of these tools are not completely available for<br />
particular languages (e.g. NP chunker for Croatian) but<br />
can be developed within the project. Regarding the NE extractor, the following entity types have been agreed upon: persons, dates, times, locations and currency.
• The annotation levels in the texts and the minimal features to be annotated have been defined: Paragraph, Sentence, Token, NP and NE. In order to provide a common representation, it has been agreed that all linguistic information regarding lemma, PoS etc. is provided at the token level. For a token, the following features are retained:
o begin – an integer representing the offset of the<br />
first character of the token,<br />
o end – an integer representing the offset of the<br />
last character of the token,<br />
o pos – a string representing the morphosyntactic<br />
tag (PoS, gender, number) associated with the<br />
token,<br />
o lemma – a string containing the lemma of the<br />
token.<br />
• For each of the above-mentioned tools, the list of additional linguistic features to be represented (if necessary and available) has been defined, e.g.
antecedentBegin and antecedentEnd representing<br />
the offset of the first and respectively the last<br />
character of the referent in an NP. This feature is<br />
necessary for processing German NPs and is<br />
therefore included as optional in the NP annotation<br />
frame.<br />
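The agreed token and NP features map naturally onto simple record types; the sketch below (Python dataclasses standing in for the actual UIMA type system definitions, with names chosen by us) mirrors the feature lists above, including the optional antecedent offsets for NPs:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    """Uniform token annotation: character offsets plus the
    morphosyntactic tag and lemma carried at the token level."""
    begin: int   # offset of the first character of the token
    end: int     # offset of the last character of the token
    pos: str     # morphosyntactic tag (PoS, gender, number)
    lemma: str   # lemma of the token

@dataclass
class NP:
    """Noun phrase annotation with optional referent offsets,
    needed e.g. for processing German NPs."""
    begin: int
    end: int
    antecedentBegin: Optional[int] = None
    antecedentEnd: Optional[int] = None
```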
A glossary of tagsets delivered by each tool is also<br />
maintained, ensuring cross-lingual processing.<br />
Each of the language tools can be included as a primitive engine, i.e. as part of a UIMA aggregate engine, but also as an aggregate engine itself. In this way, any language
component can reuse results produced by a particular tool<br />
and exploit its full functionality if required.<br />
3. Language Processing chains<br />
One of the goals of the ATLAS WCMS is to offer<br />
documented language processing chains (LPCs) for text<br />
annotation. A processing chain for a given language<br />
includes a number of existing tools, adjusted and/or<br />
fine-tuned to ensure their interoperability. In most<br />
respects a language processing chain does not require<br />
development of new software modules but rather<br />
combining existing tools.<br />
Most of the basic linguistic tools (sentence splitters,<br />
stopword filters, tokenizers, lemmatizers, part-of-speech<br />
taggers) for the languages in the scope of our interest already existed as standalone offline applications.
The multilinguality of the system services requires a high level of accuracy of each monolingual language chain: a simple example is that a word with part-of-speech ambiguity in one language may correspond to an unambiguous word in another language. The complexity grows at the level of structure, and sense ambiguity differs among languages. Thus, the high precision and performance of the language-specific chains determines to a great extent the quality of the system as a whole.
For example, the Bulgarian PoS tagger has been developed as a modified version of the Brill tagger, applying a rule-based approach and optimization techniques, leading to a precision of 98.3% (Koeva, 2007). The large Bulgarian grammar dictionary used for
the lemmatization is implemented as acyclic and<br />
deterministic finite-state automata to ensure a very fast<br />
dictionary look-up.<br />
The language processing chains have been fine-tuned and<br />
adjusted to facilitate integration into a common UIMA<br />
framework. Other tools (such as noun phrase extractors<br />
or named entity recognizers) had to be implemented or<br />
multilingually ported.<br />
The annotation produced by the chain, together with additional tools (e.g. frequency counters), enables higher-level functions such as the detection of keywords, phrases and improbable phrases in the analyzed content; more sophisticated user functionality relies on complex linguistic functions such as multilingual text summarisation and machine translation.
UIMA is a pluggable component architecture and<br />
software framework designed especially for the analysis<br />
of unstructured content and its transformation into<br />
structured information. Apart from offering common components (e.g. the type system for document and text annotations), it builds on the concept of analysis engines (in our case, language-specific components), which take the form of primitive engines, wrapping NLP (natural language processing) tools that add annotations, and aggregate engines, which define the sequence of execution of chained primitives.
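The primitive/aggregate distinction can be sketched as follows (illustrative Python, not the Java-based UIMA API; a real analysis engine operates on a CAS object rather than a plain dictionary):

```python
class PrimitiveEngine:
    """Wraps one NLP tool as a callable that adds annotations to a
    shared document object (loosely mirroring a UIMA analysis engine)."""
    def __init__(self, name, process):
        self.name = name
        self.process = process  # function(doc) -> adds annotations in place

    def __call__(self, doc):
        self.process(doc)
        return doc

class AggregateEngine:
    """Defines the sequence of execution of chained primitives."""
    def __init__(self, *engines):
        self.engines = engines

    def __call__(self, doc):
        for engine in self.engines:
            engine(doc)
        return doc
```

A chain is then built by listing primitives in order, e.g. AggregateEngine(tokeniser, tagger, lemmatizer), and each later engine can reuse the annotations added by earlier ones.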
Making the tools chainable requires ensuring their<br />
interoperability on various levels. Firstly, compatibility<br />
of formats of linguistic information is maintained within<br />
the defined scope of required annotation (Ogrodniczuk &<br />
Karagiozov, 2011).
The UIMA type system requires the development of a uniform representation model which helps to normalize the heterogeneous annotations of the component NLP tools. In ATLAS, it covers properties vital for further processing of the annotated data, e.g. the lemma and values for attributes such as gender, number and case of tokens, which are necessary to run the coreference module subsequently used for text summarisation, categorization and machine translation.
To facilitate the introduction of further levels of annotation, a general markable type has been introduced, carrying a subtype and a reference to another markable object. This
way new annotation concepts can be tested and later<br />
included into the core model.<br />
4. Integration of language processing chains<br />
in ATLAS<br />
The language chains are used in order to extract relevant<br />
information such as named entities and keywords from<br />
the documents stored within the ATLAS WCMS.<br />
Additionally, they provide the baseline for further engines such as text summarization, clustering and machine translation (Koehn et al., 2007), and as such they are the foundation of the enhanced ATLAS platform.
The core online service of the ATLAS platform is<br />
i-Publisher, a powerful Web-based instrument for<br />
creating, running and managing content-driven Web sites.<br />
It integrates the language-based technology to improve<br />
content navigation e.g. by interlinking documents based<br />
on extracted phrases, words and names, providing short<br />
summaries and suggested categorization concepts.<br />
Currently, two different thematic content-driven Web sites, i-Librarian and EUDocLib, are being built on top of the ATLAS platform, using i-Publisher as the content management layer. i-Librarian is intended to be a
user-oriented web site which allows visitors to maintain a<br />
personal workspace for storing, sharing and publishing<br />
various types of documents and have them automatically<br />
categorized into appropriate subject categories,<br />
summarized and annotated with important words,<br />
phrases and names.<br />
EUDocLib is planned as a publicly accessible repository<br />
of EU legal documents from the EUR-LEX collection<br />
with enhanced navigation and multilingual access.<br />
An important aspect of the ATLAS system is that all three services operate in a multilingual setting. Similar functionality will be implemented within the project for Bulgarian, Croatian, English, German, Greek, Polish and Romanian. The architecture of the system is
modular and allows a new language to be added at any time. It is an asynchronous architecture based on queue processing of requests (see Figure 1).
Figure 1: Linguistic processing support<br />
in ATLAS System<br />
5. Conclusions<br />
In this paper, we presented an architecture which opens the door to standardized multilingual online processing of language and offers localized demonstration tools built on top of the linguistic modules.
The framework is ready for the integration of new types of tools and new languages, to provide wider online coverage of the needed linguistic services in a standardized manner. New versions of the online services
are planned to be launched in the beginning of 2012.<br />
6. References<br />
Belogay, A., Ćavar, D., Cristal, D., Karagiozov, D.,<br />
Koeva, S., Nikolov, R., Ogrodniczuk, M.,<br />
Przepiórkowski, A., Raxis P., Vertan C. (to appear):<br />
i-Publisher, i-Librarian and EUDocLib – linguistic<br />
services for the Web. In: Proceedings of the 8th<br />
Practical Applications in Language and Computers<br />
(PALC 2011) conference. University of Łódź, Poland, 13-15 April 2011.
Degórski, Ł., Marcińczuk, M., Przepiórkowski, A. (2008): Definition extraction using a sequential combination of baseline grammars and machine learning classifiers. In: Proceedings of the 6th
International Conference on Language Resources and<br />
Evaluation, LREC 2008. ELRA, Marrakech,<br />
http://nlp.ipipan.waw.pl/~adamp/Papers/2008-lreclt4el/213_paper.pdf<br />
Koehn, P., Hoang H., Birch A., Callison-Burch, C.,<br />
Federico M., Bertoldi, N., Cowan, B., Shen, W.,<br />
Moran, C., Zens, R., Dyer, C., Bojar O., Constantin,<br />
A., Herbst, E. (2007): Moses: Open Source Toolkit for<br />
Statistical Machine Translation. In: ACL (ed.) Annual<br />
Meeting of the Association for Computational<br />
Linguistics, (ACL), demonstration session. Prague,<br />
http://acl.ldc.upenn.edu/P/P07/P07-2045.pdf
Koeva, S. (2007): Multi-word Term Extraction for<br />
Bulgarian. In: Piskorski, J., Pouliquen, B., Steinberger,<br />
R., Tanev, H. (eds.) Proceedings of the Workshop on<br />
Balto-Slavonic Natural Language Processing, pp.<br />
59–66. Association for Computational Linguistics,<br />
Prague, Czech Republic, June 2007.<br />
http://www.aclweb.org/anthology/W/W07/W07-1708<br />
Ogrodniczuk, M., Karagiozov, D. (to appear): ATLAS –<br />
The Multilingual Language Processing Platform. In:<br />
Proceedings of the 27th Conference of the Spanish<br />
Society for Natural Language Processing. University<br />
of Huelva, Spain, 5-7 September 2011.
Multilingual Corpora at the Hamburg Centre for Language Corpora<br />
Hanna Hedeland, Timm Lehmberg, Thomas Schmidt, Kai Wörner<br />
Hamburger Zentrum für Sprachkorpora (HZSK)
Max Brauer-Allee 60<br />
D-22765 Hamburg<br />
E-mail: hanna.hedeland@uni-hamburg.de, timm.lehmberg@uni-hamburg.de, thomas.schmidt@uni-hamburg.de,<br />
kai.wörner@uni-hamburg.de<br />
Abstract<br />
We give an overview of the content and the technical background of a number of corpora which were developed in various projects of<br />
the Research Centre on Multilingualism (SFB 538) between 1999 and 2011 and which are now made available to the scientific
community via the Hamburg Centre for Language Corpora.<br />
Keywords: corpora, spoken language, multilingualism, digital infrastructures<br />
1. Introduction<br />
In this paper, we give an overview of the content and the<br />
technical background of a number of corpora which were<br />
developed in various projects of the Research Centre on<br />
Multilingualism (SFB 538) between 1999 and 2011 and
which are now made available to the scientific<br />
community via the Hamburg Centre for Language<br />
Corpora.<br />
Between 1999 and 2011, the Research Centre on
Multilingualism (SFB 538) brought together researchers<br />
investigating various aspects of multilingualism<br />
focussing either on the language development of<br />
multilingual individuals, on communication in<br />
multilingual societies, or on diachronic change of<br />
languages in multilingual settings. Without exception, the<br />
projects of the Centre worked empirically, basing their<br />
analyses on corpora of spoken or written language. Over<br />
the years, an extensive and diverse data collection was<br />
thus built up consisting of language acquisition and<br />
attrition corpora, interpreting corpora, parallel<br />
(translation) corpora, corpora with a sociolinguistic<br />
design and historical corpora.<br />
Since corpus creation, management and analysis were<br />
thus crucial to the work of the Research Centre, a project<br />
was set up in June 2000 with the aim of designing and<br />
implementing methods for the computer-assisted<br />
processing of multilingual language data. One major<br />
result of that project is EXMARaLDA, a system for<br />
setting up and analysing spoken language corpora<br />
(Schmidt & Wörner, 2009; Schmidt et al., this volume).
The focus of this paper will be on the spoken language<br />
corpora of the Research Centre which were either created<br />
or curated with the help of EXMARaLDA.<br />
2. Overview of corpora<br />
As the list of resources in the appendix shows, altogether<br />
31 resources constructed at the SFB 538 were transferred<br />
to the inventory of the Hamburg Centre for Language<br />
Corpora. 27 of these are spoken language corpora, 3 are<br />
corpora of modern written language, and one is a corpus<br />
of historical written language. More specifically, we are<br />
dealing with the following resource types:<br />
• Language acquisition corpora which document the
acquisition of two first languages or a second<br />
language. Most of these corpora are longitudinal<br />
studies of child language in different bilingual<br />
language combinations (German-French, German-<br />
Portuguese, German-Spanish, German-Turkish), but<br />
other corpus designs (e.g. cross-sectional studies)<br />
and other speaker types (e.g. adult learners or<br />
monolingual children) are also present.<br />
• Language attrition corpora which document the
development of a “weaker” language in adult<br />
bilinguals. Three different language combinations<br />
(German-Polish, German-Italian, German-French)<br />
are involved.<br />
• Interpreting corpora which document consecutive and simultaneous interpreting involving trained and ad-hoc interpreters for different language combinations (German-Portuguese, German-Turkish, German-Russian, German-Polish, German-Romanian) and in different settings (doctor-patient communication and expert discussion).
• Corpora with a sociolinguistic corpus design whose data are stratified according to biographic characteristics (e.g. age) of the speakers and/or their regional provenance. This comprises a corpus documenting Faroese-Danish bilingualism on the Faroe Islands and a corpus documenting the use of Catalan in different districts of Barcelona.
• Parallel and comparable corpora in which originals and translations of texts are aligned or which consist of original texts from specific genres in different languages.
The entirety of spoken language resources amounts to<br />
approximately 5500 transcriptions with approximately<br />
5.5 million transcribed words (not counting secondary<br />
annotations).<br />
3. Data model<br />
The spoken language corpora, while sharing the common<br />
theme of multilingualism, are still highly heterogeneous<br />
with respect to many parameters. As far as their content is concerned, they not only cover a spectrum of fourteen different languages, but also differ greatly with respect to
the recorded discourse types (e.g. interviews, free<br />
conversation, expert discussion, classroom discourse,<br />
semi-controlled settings, and institutional discourse).<br />
Even more variation is to be found with respect to the<br />
research interests pursued with the help of the corpora<br />
and, consequently, the methodology used to record,<br />
transcribe and annotate the data. To begin with, either<br />
only audio or both video and audio data are recorded,<br />
depending on whether or not non-verbal behavior plays a<br />
role for analysis (as is the case, for example, for data of<br />
young children). As some projects focused their research
on syntactic aspects of language, while others were
interested in phonological properties or discourse
structures, different systems were applied in
transcribing (e.g. orthographic vs. phonetic transcription
or complete vs. selective transcription) and annotating
(e.g. prosodic annotations, annotation of code switches)
the data.
The challenge in representing the corpora on a common<br />
technical basis was thus to find a degree of abstraction<br />
which, on the one hand, allows operations common to all<br />
resources (such as time alignment of transcription and<br />
media) to be carried out efficiently on a unified structure,<br />
but, on the other hand, also makes it possible to apply<br />
theory or resource specific functions (such as<br />
segmentation according to a specific model) to the data.<br />
A data model based on annotation graphs (Bird &<br />
Liberman, 2001), but supplemented with additional<br />
semantic specifications and structural constraints, turned<br />
out to be suitable for this task (Schmidt, 2005).<br />
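The annotation-graph idea can be made concrete with a minimal sketch (illustrative Python only, not the actual EXMARaLDA data model or API; the real model adds the semantic specifications and structural constraints mentioned above). Transcriptions are graphs whose nodes are timeline points, optionally anchored to media time, and whose labelled arcs are events on speaker tiers:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class TimePoint:
    """A node of the annotation graph; offset anchors it to media time (seconds)."""
    id: str
    offset: Optional[float] = None  # None = node not anchored to the recording

@dataclass
class Event:
    """A labelled arc between two timeline nodes, belonging to one tier."""
    tier: str         # e.g. "SPK0:verbal" or "SPK0:prosody" (illustrative names)
    start: TimePoint
    end: TimePoint
    label: str        # transcribed text or annotation value

@dataclass
class Transcription:
    timeline: List[TimePoint] = field(default_factory=list)
    events: List[Event] = field(default_factory=list)

    def aligned_events(self) -> List[Event]:
        """Events anchored at both ends, i.e. those directly usable for
        time alignment of transcription and media."""
        return [e for e in self.events
                if e.start.offset is not None and e.end.offset is not None]
```

Operations common to all resources (such as time alignment) work on this unified structure, while theory-specific segmentation can be layered on top of the same graph.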
4. Data curation<br />
The construction of a non-negligible part of the resources<br />
had been completed or started before EXMARaLDA was<br />
available as a working system. A number of legacy
software tools (syncWriter, HIAT-DOS, LAPSUS,
WordBase) were used for the construction of these
corpora, resulting in data for which there was hardly a
chance of sustainable maintenance. The resources
therefore had to be converted to EXMARaLDA in a
laborious process described in detail in Schmidt &
Bennöhr (2008).
From about 2003 onwards, all projects used<br />
EXMARaLDA or other compatible tools (e.g. Praat) for<br />
corpus construction. Although these resources were<br />
much easier to process once they had been completed,<br />
there was still a considerable amount of data curation to<br />
be done before they could be published. This involved<br />
various completeness and consistency checks on the<br />
transcription and annotation data and the construction of<br />
valid metadata descriptions for all parts of a resource.<br />
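A hedged sketch of what such completeness and consistency checks can look like (hypothetical structure and names, not the actual EXMARaLDA validation): a transcription is taken here as a list of (id, offset) timeline points plus (tier, start, end, text) events, and the checks verify referential integrity and monotonic time anchoring:

```python
def check_transcription(timeline, events):
    """Basic consistency checks of the kind run before publication (a sketch,
    not the actual EXMARaLDA checks): every event must reference existing
    timeline points, timeline ids must be unique, and anchored offsets must
    be non-decreasing in timeline order.
    Returns a list of human-readable problems (empty list = consistent)."""
    problems = []
    ids = [tid for tid, _ in timeline]
    if len(ids) != len(set(ids)):
        problems.append("duplicate timeline point ids")
    known = set(ids)
    for tier, start, end, text in events:
        for ref in (start, end):
            if ref not in known:
                problems.append(f"tier {tier}: unknown timeline point {ref!r}")
    offsets = [off for _, off in timeline if off is not None]
    if any(b < a for a, b in zip(offsets, offsets[1:])):
        problems.append("anchored offsets not in increasing order")
    return problems
```

Checks of this kind can be run over every transcription of a resource before metadata descriptions are finalized.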
5. Data dissemination<br />
Completed resources are made available to interested<br />
users via the WWW 1<br />
through several methods:<br />
• A hypermedia representation of transcriptions,
annotations, recording and metadata allows users to<br />
browse corpora online (see figure 1).<br />
1 http://www.corpora.uni-hamburg.de
• Resources can be downloaded in the EXMARaLDA
format and then edited and queried with the system’s<br />
tools (Partitur-Editor for editing transcriptions,<br />
Coma for editing and querying metadata, EXAKT<br />
for querying transcription and annotation data).<br />
• Queries via EXAKT can also be carried out on
remote data, i.e. without downloading the resource<br />
first, or through a web interface, i.e. without the need<br />
to install local software first.<br />
• A number of export formats are offered for each
annotation file making it possible to edit or query the<br />
data also with non-EXMARaLDA tools. Most<br />
importantly, most data are also available in the<br />
CHAT format of the CHILDES system, as ELAN<br />
annotation files, as Praat TextGrids and as TEI files.<br />
Access to all corpora is password protected. The process<br />
for obtaining a password varies from resource to resource,<br />
but always requires the data owner’s consent. Due to<br />
privacy protection issues, a part of the spoken resources<br />
can only be made accessible in the form of transcriptions,<br />
not audio or video recordings.<br />
6. Future plans<br />
In order to cater for the long term archiving and<br />
availability of the data beyond the finite funding period<br />
of the Research Centre, in January 2011 the
Hamburg Centre for Language Corpora (HZSK,<br />
http://www.corpora.uni-hamburg.de) was set up. This
institution is intended to provide a permanent basis not
only for the corpora and tools referred to in this paper, but
also for further resources existing or under construction
at the University of Hamburg. The HZSK is part of the
CLARIN-D network and will, in the years to come,
integrate its resources into this infrastructure by
providing protocols for metadata harvesting, assigning
PIDs to resources, allowing for single-sign-on
mechanisms and implementing interfaces as defined by
CLARIN for access to metadata and annotations.

Figure 1: Hypermedia representation of a transcription from the Hamburg Map Task Corpus (HAMATAC)
7. References<br />
Bird, S., Liberman, M. (2001): A formal framework for<br />
linguistic annotation. In: Speech Communication (33),<br />
pp. 23-60.<br />
Schmidt, T. (2005): Computergestützte Transkription -<br />
Modellierung und Visualisierung gesprochener<br />
Sprache mit texttechnologischen Mitteln. Frankfurt a.<br />
M.: Peter Lang.<br />
Schmidt, T., Bennöhr, J. (2008): Rescuing Legacy Data.<br />
In: Language Documentation and Conservation (2),<br />
pp. 109-129.<br />
Schmidt, T., Wörner, K. (2009): EXMARaLDA –<br />
Creating, analysing and sharing spoken language<br />
corpora for pragmatic research. In: Pragmatics 19(4),<br />
pp. 565-582.<br />
Appendix: List of resources<br />
Spoken resources<br />
Each entry gives the corpus name; project / data owner; type; short description; language(s); and size.

HABLA (Hamburg Adult Bilingual LAnguage). E11 / Tanja Kupisch. spoken/audio/exmaralda.
Audio recordings of semi-spontaneous interviews (elicited grammaticality judgments and production data are collected from the same speakers). Language(s): deu, fra, ita. Size: 169 communications, 127 speakers, 737797 transcribed words, 169 transcriptions.

DUFDE (Deutscher und Französischer doppelter Erstspracherwerb). E2 / Jürgen Meisel. spoken/video/exmaralda.
Video recordings (longitudinal study) of seven French-German bilingual children aged between 1 year;6 months and 6 years;11 months (+ some later recordings). Language(s): deu, fra. Size: 562 communications, 14 speakers, ca. 1000000 transcribed words, 849 transcriptions.

BIPODE (Bilingualer Portugiesisch-Deutscher Erstspracherwerb). E2 / Jürgen Meisel. spoken/video/exmaralda.
Video recordings (longitudinal study) of three Portuguese-German bilingual children aged between 1 year;6 months and 5 years;6 months. Language(s): deu, por. Size: 250 communications, 48 speakers, ca. 250000 transcribed words, 227 transcriptions.

CHILD-L2. E2 / Jürgen Meisel. spoken/video/exmaralda.
Video recordings of children who start acquiring French or German as a second language at the age of three or four years. Language(s): deu, fra. Size: 181 communications, 69 speakers, 376114 transcribed words, 181 transcriptions.

ZISA (Zweitspracherwerb Italienischer und Spanischer Arbeiter). E2 / Jürgen Meisel. spoken/audio/exmaralda.
Recordings of adult L2-German learners. Language(s): deu. Size: 101 communications, 5 speakers, 119667 transcribed words, 100 transcriptions.

BUSDE (Baskischer und Spanischer doppelter Erstspracherwerb). E2 / Jürgen Meisel. spoken/video/other.
Longitudinal language acquisition study on bilingual Basque-Spanish children. Language(s): eus, spa. Size: unknown.

PAIDUS (Parameterfixierung im Deutschen und Spanischen). E3 / Conxita Lleó. spoken/audio/exmaralda.
Audio recordings of monolingual children. Language(s): deu, spa. Size: 253 communications, 66 speakers, 166976 transcribed words, 253 transcriptions.

PHONBLA Longitudinalstudie Hamburg. E3 / Conxita Lleó. spoken/audio+video/exmaralda.
Longitudinal data of Spanish/German bilingual children. Language(s): deu, spa. Size: 413 communications, 61 speakers, 303792 transcribed words, 413 transcriptions.

PHONBLA Querschnittsstudie Madrid. E3 / Conxita Lleó. spoken/audio+video/exmaralda.
Cross-sectional study of bilingual German-Spanish L1 acquisition. Language(s): deu, spa. Size: 113 communications, 34 speakers, 56722 transcribed words, 113 transcriptions.

PEDSES (Phonologie-Erwerb Deutsch-Spanisch als Erste Sprachen). E3 / Conxita Lleó. spoken/audio/exmaralda.
Longitudinal data of Spanish/German bilingual children. Language(s): deu, spa. Size: 127 communications, 21 speakers, 101292 transcribed words, 127 transcriptions.

PHON-CL2. E3 / Conxita Lleó. spoken/audio/exmaralda.
Recordings of German subjects/children who have learned (or are learning) Spanish after the age of two. Language(s): deu, spa. Size: 26 communications, 22 speakers, 17412 transcribed words, 26 transcriptions.

PHONMAS. E3 / Conxita Lleó. spoken/audio/exmaralda.
Recordings of monolingual Spanish children (as comparable data for Madrid-PhonBLA). Language(s): spa. Size: 49 communications, 4 speakers, 3067 transcribed words, 49 transcriptions.

TÜ_DE-cL2-Korpus. E4 / Monika Rothweiler. spoken/video/exmaralda.
Video recordings (spontaneous and elicited language) of eight bilingual children with Turkish as their first language. Language(s): deu. Size: 112 communications, 19 speakers, 348292 transcribed words, 112 transcriptions.

TÜ_DE-L1-Korpus. E4 / Monika Rothweiler. spoken/audio/exmaralda.
Video recordings (spontaneous and elicited language) of twelve bilingual children with Turkish as their first language. Language(s): tur. Size: 12 communications, 22 speakers, 13 transcriptions.
Rehbein-ENDFAS/Rehbein-SKOBI-Korpus. E5 / Jochen Rehbein. spoken/audio/exmaralda.
Audio recordings of evocative field experiments with Turkish and German monolingual and Turkish/German bilingual children. Language(s): deu, tur. Size: 1017 communications, 523 speakers, 289012 transcribed words, 836 transcriptions.

ENDFAS/SKOBI Gold Standard. E5 / Jochen Rehbein. spoken/audio/exmaralda.
Audio recordings of Turkish and German monolingual and Turkish/German bilingual children; demo excerpt from the larger Rehbein-ENDFAS/Rehbein-SKOBI-Korpus. Language(s): deu, tur. Size: 3 communications, 8 speakers, 4862 transcribed words, 3 transcriptions.

Catalan in a bilingual context. H6 / Conxita Lleó. spoken/audio/exmaralda.
Prompted, read and spontaneous speech data of Catalan speakers from Barcelona, stratified according to district and age of speakers. Language(s): cat. Size: 225 communications, 234 speakers, 187967 transcribed words, 875 transcriptions.

Hamburg Corpus of Polish in Germany. H8 / Bernhard Brehmer. spoken/audio/exmaralda.
Audio recordings of bilingual (Polish and German) and monolingual (Polish) adults (16-46 years); recordings of semi-spontaneous data (3 topics) and renarration of a picture story (from 'Vater und Sohn'). Language(s): pol. Size: 354 communications, 94 speakers, ca. 350000 transcribed words, 358 transcriptions.

Hamburg Corpus of Argentinean Spanish (HaCASpa). H9 / Christoph Gabriel. spoken/audio/exmaralda.
Recordings of spontaneous speech and laboratory data of speakers of Porteño Spanish in Argentina (read speech, story retelling, read question-answer pairs, intonation questionnaires, free interviews); 7 experiments altogether. Language(s): spa. Size: 259 communications, 63 speakers, 141321 transcribed words, 261 transcriptions.

Dolmetschen im Krankenhaus. K2 / Kristin Bührig, Bernd Meyer. spoken/audio/exmaralda.
Monolingual and interpreted doctor-patient communication in hospitals. Language(s): deu, por, tur. Size: 91 communications, 189 speakers, 165689 transcribed words, 92 transcriptions.

SkandSemiko (Skandinavische Semikommunikation). K5 / Kurt Braunmüller. spoken/audio/exmaralda.
Radio recordings, recordings of group discussions and classroom discourse with speakers of two or more Scandinavian languages (Swedish, Danish, Norwegian) interacting. Language(s): dan, nor, swe. Size: 162 communications, 515 speakers, 269945 transcribed words, 74 transcriptions.

CoSi (Consecutive and Simultaneous Interpreting). K6 / Bernd Meyer. spoken/audio+video/exmaralda.
Recordings of simultaneously and consecutively interpreted lectures. Language(s): deu, por. Size: 3 communications, 8 speakers, 35432 transcribed words, 5 transcriptions.

FADAC Hamburg (Faroese Danish Corpus Hamburg). K8 / Kurt Braunmüller. spoken/audio/exmaralda.
Recordings of semi-structured interviews in Faroese and Danish with bilingual speakers living on the Faroe Islands. Language(s): dan, fao. Size: 92 communications, 82 speakers, 440194 transcribed words, 92 transcriptions.

ALCEBLA. T4 / Conxita Lleó. spoken/audio/exmaralda.
Recordings of Spanish-German bilingual children living in Germany and attending the Spanish complementary school at the first level. Language(s): deu, spa. Size: 66 communications, 23 speakers, 36717 transcribed words, 66 transcriptions.

Simuliertes Dolmetschen im Krankenhaus. T5 / Kristin Bührig, Bernd Meyer. spoken/audio+video/exmaralda.
Simulations of interpreted doctor-patient communication. Language(s): deu, pol, ron, rus. Size: 4 communications, 12 speakers, 4018 transcribed words, 4 transcriptions.

EXMARaLDA Demo Corpus. Z2 / Hamburger Zentrum für Sprachkorpora. spoken/audio+video/exmaralda.
A selection of short audio and video recordings in different languages for demonstration of the EXMARaLDA system. Language(s): deu, eng, fra, ita, nor, pol, spa, swe, tur, vie. Size: 19 communications, 50 speakers, 11659 transcribed words, 19 transcriptions.

Hamburg Map Task Corpus. Z2 / Hamburger Zentrum für Sprachkorpora. spoken/audio/exmaralda.
Audio recordings of map tasks with advanced learners of German. Language(s): deu. Size: 24 communications, 26 speakers, 24409 transcribed words, 24 transcriptions.
Written resources<br />
HaCOSSA (Hamburg Corpus of Old Swedish with Syntactic Annotations). H3 / Kurt Braunmüller. written/tei.
Bible translations, religious and secular prose, law texts, non-fiction literature (geographical, theological, historic, natural science), diploma. Language(s): dan, deu, isl, lat, nob, swe. Size: 35 texts.

Covert translation: popular science. K4 / Juliane House. written/tei.
Translation corpora of original texts with translations and comparable texts from the genre popular scientific prose. Language(s): deu, eng. Size: 114 texts, 500446 words.

Covert Translation: business communication (old). K4 / Juliane House. written/tei.
Translation corpora of original texts with translations and comparable texts from the genre external business communication. Language(s): deu, eng. Size: 119 texts, 169154 words.

Covert Translation: business communication (new). K4 / Juliane House. written/tei.
Translation corpora of original texts with translations and comparable texts from the genre external business communication. Language(s): deu, eng. Size: 198 texts.
The English Passive and the German Learner –<br />
Compiling an Annotated Learner Corpus<br />
to Investigate the Importance of Educational Settings<br />
Verena Möller, Ulrich Heid<br />
<strong>Universität</strong> Hildesheim<br />
Institut <strong>für</strong> Informationswissenschaft und Sprachtechnologie<br />
- Sprachtechnologie / Computerlinguistik -<br />
Marienburger Platz 22<br />
31141 Hildesheim<br />
E-mail: verena.moeller@uni-hildesheim.de, ulrich.heid@uni-hildesheim.de<br />
Abstract<br />
In the south of Germany, a number of changes have recently been effected with respect to the possible environments in which pupils<br />
in primary and secondary schools learn/acquire English. The current co-existence of various educational settings allows for<br />
investigation of the effects that each of these settings has on the structure of learners' interlanguage. As different text types are used as<br />
input in the various educational environments which have been created in secondary schools, the English passive has been chosen as<br />
a diagnostic criterion for the analysis of the learners' production of written text. The present article describes the compilation of a<br />
corpus of teaching materials and a learner corpus. It outlines the procedures involved in annotating metadata, esp. those obtained<br />
from questionnaires and psychological tests. Tools for linguistic annotation (POS-taggers and a parser) are compared with respect to<br />
their effectiveness in dealing with data from students after 6-10 years of instruction and/or immersion.<br />
Keywords: second language acquisition, learner corpus, metadata, POS-tagging, parsing<br />
1. Co-Existence of Educational Settings<br />
In recent years, a number of changes in the educational<br />
system in Baden-Württemberg (Germany) have been<br />
effected, some of them directly related to language<br />
learning and acquisition. In addition to English as a<br />
Foreign Language (EFL) lessons in secondary schools,<br />
more and more CLIL (content and language integrated<br />
learning) programmes have been established. CLIL<br />
learners are taught History and Biology, as well as a<br />
combination of Geography, Economics and Politics in<br />
English during certain years specified by the curriculum.<br />
In addition, 'immersive-reflective' lessons (IRL) have<br />
been introduced at the primary level. These focus on<br />
situational context and communication, while at the same<br />
time allowing for reflection on language whenever this is<br />
deemed necessary.<br />
Due to the current co-existence of various educational<br />
settings, it is timely to compile a learner corpus in order<br />
to investigate the effects of educational settings on the<br />
interlanguages of the following four groups of learners:<br />
1) participants in EFL, but neither IRL nor CLIL;<br />
2) participants in EFL and IRL, but not CLIL;<br />
3) participants in EFL and CLIL, but not IRL;<br />
4) participants in EFL, CLIL and IRL.<br />
All learners participating in the study described below are<br />
in Year 11, i. e. they have entered the final stage of their<br />
school career.<br />
2. The Passive and the German Learner<br />
To test the impact of educational settings on the learner<br />
groups outlined above, grammatical structures need to be<br />
analysed with respect to the question of which ones are
most likely to occur with differing frequencies in the
types of input available to these learners. For the purpose of
the present study, the English passive has been chosen as<br />
an indicator.<br />
Being exposed to scientifically-oriented writing, CLIL<br />
learners receive input from a genre that differs from those<br />
used in EFL classes. Based on the findings of Svartvik
(1966), this genre may be assumed to contain a relatively
larger number of passive structures. This will be tested on<br />
a corpus of teaching materials. It is likely that passive<br />
constructions will also occur with higher frequency in the<br />
written output of CLIL learners.<br />
Different types of be Ved constructions, i. e. central
passives with solely verbal features and semi-passives
carrying verbal as well as adjectival characteristics, are
included in an analysis of teaching materials and of
written learner language. Questions of verb valency are
also taken into account.
3. The Teaching Materials Corpus<br />
3.1. Input and Norm: TMCinp and TMCref<br />
To determine whether or not the various groups of<br />
learners are indeed exposed to different types of written<br />
input, a corpus of teaching materials (TMC) is being<br />
compiled. It includes written material for learners from<br />
Year 7 onwards, as both CLIL and the treatment of the<br />
English passive start in that year.<br />
The TMC serves two purposes. First, it compares input
from EFL lessons to input from CLIL by
means of an input subcorpus (TMCinp). An analysis of<br />
Year 7-10 materials for both groups will establish<br />
whether or not passive structures do indeed occur with<br />
higher frequency in CLIL materials than in EFL<br />
materials.<br />
Secondly, the TMC represents a reference norm. All four<br />
groups of learners take the same EFL exams at the end of<br />
their school career. Hence a target norm, against which<br />
the learners' written performance at that stage can be<br />
measured, is defined by compiling a reference subcorpus<br />
(TMCref). TMCref comprises Year 11-12 materials<br />
designed for use in the EFL classroom.<br />
The overall structure of the TMC is presented in Fig. 1.<br />
[Figure 1 depicts the corpus structure by school year: TMCinp comprises Year 7-10 materials, divided into EFL and CLIL; TMCref comprises Year 11-12 EFL materials.]
Figure 1: Teaching Materials Corpus (TMC)
3.2. Metadata<br />
The TMC is annotated with, among other things, the
following metadata to enable efficient querying:
• learning environment (EFL vs. CLIL);
• publisher and title;
• targeted age group;
• type of material (textbook, workbook, newspaper, fiction not included in textbooks, etc.);
• genre.
The TMC includes written text as well as supplementary<br />
information and instructions referring to written text only,<br />
rather than to film sequences or listening comprehension<br />
exercises that may accompany textbooks. Skills files,<br />
which are used to acquire the techniques and vocabulary<br />
needed for various types of text production, are also<br />
excluded from the TMC.<br />
3.3. POS-Tagging and Parsing<br />
To linguistically annotate the TMC, the English versions<br />
of TreeTagger (Schmid, 1994) and of the MATE parser<br />
(Bohnet, 2010) were used. TreeTagger is a stochastic<br />
part-of-speech (POS) tagger that uses annotated<br />
reference texts, lexical entries (word form, lemma, POS),<br />
word endings and three-word windows (two items left of<br />
the candidate) as an input. It performs lemmatization<br />
together with tagging (U Penn Treebank tagset, 36 tags).<br />
MATE is a trainable dependency parser (trained on the U<br />
Penn Treebank). Both tools also perform sentence<br />
tokenization. Having the TMC tagged, lemmatized and<br />
parsed, we expect to be able to extract occurrences of<br />
passives with good precision and recall.<br />
4. Learner Corpus: Data Elicitation<br />
4.1. Personal Data<br />
If a difference in the use of passive constructions in<br />
learner text is to be attributed to a specific educational
setting, it is essential to ensure that all groups of
learners are comparable with respect to a number of<br />
individual parameters. The collection of these personal<br />
data centres around two methods – a questionnaire and<br />
psychological testing. In the questionnaire, learners are<br />
asked to provide information e. g. on age, sex, mother<br />
tongue, learning environment, etc. (cf. sec. 5.1.).<br />
Moreover, information on cognitive capacities and<br />
motivation needs to be gathered by means of
psychological testing. Participation in CLIL lessons is<br />
not compulsory and there is room for the possibility that<br />
learners opt for these programmes because they possess<br />
better overall or language-related cognitive skills, or a<br />
higher level of motivation.<br />
The intelligence test used in this study (PSB-R 6-13,<br />
Horn, 2003) provides information on the two cognitive<br />
factors mentioned above, along with individual scales on<br />
lexical fluency in German and language-related logical<br />
thinking. Data from a pilot study with 28 subjects<br />
(cf. Table 1) suggest that the most reasonable procedure<br />
will be to sort participants into two groups according to<br />
the scores attained on the scales for overall and<br />
language-related cognitive capacities (SW 100-109/IQ<br />
100-114 vs. SW 110-119/IQ 115-129).<br />
SW        General (PSB-R 6-13 GL)   Lang.-related (PSB-R 6-13 V)
100-109   14                        19
110-119   10                         9
> 119      4                         0
Table 1: Pilot study – cognitive skills
The psychological test related to motivational factors<br />
(FLM 7-13, Petermann & Winkel, 2007) provides, among
other things, information on orientation towards
performance and success as well as perseverance and<br />
effort. The study aims at learners with an average<br />
motivation (T-score 40-60), allowing for a margin on<br />
both sides (T-score 36-64). The results of the pilot study<br />
show that 23 and 24, respectively, of the 28 learners fall
into this category for the two scales.
4.2. Learner Text Data<br />
Learners are invited to write two short argumentative<br />
essays within a time frame of about 70 minutes. Students<br />
at this level are used to this kind of task, as it is widely<br />
practised throughout the years preceding their final<br />
exams. Learners key in their texts using a simple editor<br />
without a spellchecker. However, they are allowed to use<br />
a printed version of a monolingual dictionary.<br />
Some of the essay topics to choose from involve passive<br />
constructions, others do not. The following enumeration<br />
lists the topics most frequently chosen:<br />
1) In order to fight teenage drinking, the legal drinking<br />
age should be raised to 21. (18 essays)<br />
2) In Germany, the education system offers equality of<br />
opportunity to everyone, rich or poor. (9 essays)<br />
3) Privacy is a thing of the past. (9 essays)<br />
4) The death penalty should be reintroduced in<br />
Germany. (9 essays)<br />
In the pilot study, the average number of words produced<br />
in one essay was 308, resulting in a corpus of slightly<br />
more than 17,000 words.<br />
4.3. Experimental Data<br />
A study on the International Corpus of Learner English<br />
(ICLE) has revealed a marked underuse of the English<br />
passive even in more advanced German learners<br />
(cf. Granger, 2009). It can therefore be assumed that this<br />
will be the case with less advanced learners as well. Thus,<br />
to make sure that additional information is available as a<br />
backup, text data elicitation is supplemented with an<br />
experimental task to find out whether or not learners are<br />
able to transform active sentences into their passive<br />
counterparts. Not only are learners tested on the<br />
morphology of the English passive in various tenses<br />
(cf. sentences 1 and 2), but the task also involves<br />
ditransitive verbs to find out which object is most likely<br />
to be moved to the subject position of the passive<br />
sentence (cf. sentence 3). Moreover, learners are<br />
presented with constructions that have not or only<br />
marginally been part of their EFL instruction<br />
(e.g. prepositional verbs or complex-transitive verbs,<br />
cf. sentences 4 and 5).<br />
1) My sister's friends often invite me to parties.<br />
2) The teams will play the last match of the season next<br />
Friday.<br />
3) My grandparents have promised me a new computer.<br />
4) People look upon the construction of the railroad as<br />
a fantastic achievement.<br />
5) Everyone considered Pat a nice person.<br />
In the experimental task, learners respond to 12 sentences<br />
in about 20 minutes. In addition, they are asked to rate the<br />
reliability of their own responses on a 5-point Likert scale.<br />
These reliability scores are included into the learner<br />
corpus as metadata.<br />
5. Learner Corpus: Annotation<br />
5.1. Metadata<br />
As a result of the procedures described in sec. 4, the<br />
learner corpus comprises information on the following<br />
aspects, annotated as metadata:<br />
• age and sex;
• mother tongue and languages spoken at home;
• other second and foreign languages, duration of acquisition and self-rated competence;
• duration of the learner's longest stay in an English-speaking country;
• number of school years skipped or repeated;
• attendance of German primary school and participation in immersive-reflective lessons;
• textbooks used in the EFL classroom;
• participation in CLIL programmes and school subjects affected;
• exposure to English during the learner's spare time;
• aspects of cognitive capacities;
• aspects of motivation;
• self-rated reliability of responses in the experimental task;
• essay topic.
5.2. POS-Tagging<br />
The Learner Corpus was POS-tagged by means of
TreeTagger, in the same way as the TMC. In addition, the
CLAWS4 tagger was applied, a hybrid tagger that
involves both probabilistic and rule-based procedures
(Garside & Smith, 1997). For the purpose of the present
pilot study, we have used the C7 tagset, which comprises
146 tags. CLAWS4 provides probability
scores for tags assigned to potentially ambiguous word<br />
forms. For the 17,000 word pilot learner corpus,<br />
CLAWS4 lists 5,255 ambiguities; of these, 88.4 % are<br />
assigned a first tag alternative with 80 % probability or<br />
more.<br />
TreeTagger assigned an <unknown> tag to 423 words
that were misspelled.
nevertheless received a correct POS-tag. When CLAWS4<br />
was used, only two items received the unknown tag;
these were misspellings identified as truncations.
However, 51 misspelled words received an additional
tag alongside their POS-tag. 16 of these were
correctly POS-tagged despite their spelling error. It is<br />
remarkable that 19 of the 35 mistagged words involved<br />
proper nouns or adjectives denoting nationalities, spelt<br />
without a capital letter. In seven cases, the omission of<br />
apostrophes to mark either a genitive or a clitic made it<br />
impossible to assign a correct POS-tag. As CLAWS4<br />
operates using the probability of POS-tags for both<br />
individual words and tag sequences, this had rather<br />
far-reaching consequences for the tagging of the<br />
preceding and following units.<br />
5.3. Parsing<br />
As is the case for the TMC, the Learner Corpus was also<br />
parsed by means of MATE. As the parser assigns<br />
POS-tags to the word forms analysed, a comparison with<br />
TreeTagger and CLAWS4 was performed (cf. sec. 6.2.<br />
for details). Tested on the misspelled words tagged as<br />
unknown by TreeTagger, MATE performed<br />
slightly better on the assignment of correct POS-tags<br />
(245 vs. 219). MATE and CLAWS4 were almost equally<br />
successful on partly erroneous occurrences of be Ved<br />
(cf. Table 3). To retrieve English passive constructions<br />
from the Learner Corpus, in principle no parsing would<br />
be needed. Correct syntagms can be found by means of<br />
patterns formulated in terms of POS and lemmas; most<br />
erroneous occurrences are not classifiable for the parser<br />
and thus need to be searched with partial patterns<br />
(e.g. participle alone).<br />
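The pattern-based retrieval of correct syntagms can be sketched as follows. This is not the authors'<br />
actual implementation; the tag names are assumptions based on TreeTagger-style English tags (VB* for<br />
forms of be, VVN for past participles of lexical verbs, RB for adverbs).<br />

```python
# Sketch: pattern-based retrieval of "be + past participle" syntagms from
# POS-tagged text. Tag names are assumptions: TreeTagger-style English
# tags with VB* for forms of "be", VVN for past participles of lexical
# verbs, and RB for adverbs. Input: a list of (word, tag) pairs.

def find_be_ved(tagged_sentence, max_gap=2):
    """Yield (be_index, participle_index) for be (+ adverbs) + VVN."""
    for i, (_word, tag) in enumerate(tagged_sentence):
        if not tag.startswith("VB"):  # not a form of "be"
            continue
        j = i + 1
        gap = 0
        # allow a few intervening adverbs, e.g. "is often used"
        while (j < len(tagged_sentence) and gap < max_gap
               and tagged_sentence[j][1].startswith("RB")):
            j += 1
            gap += 1
        if j < len(tagged_sentence) and tagged_sentence[j][1] == "VVN":
            yield i, j

sent = [("The", "DT"), ("rule", "NN"), ("is", "VBZ"),
        ("often", "RB"), ("applied", "VVN"), (".", "SENT")]
print(list(find_be_ved(sent)))  # [(2, 4)]
```

Erroneous occurrences (e.g. an omitted be) escape such a full pattern and, as noted above, have to be<br />
searched with partial patterns such as a bare participle tag.<br />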
6. Retrieval of Passive Constructions<br />
6.1. Manual Analysis<br />
Before an automatic analysis was undertaken, instances<br />
of English be Ved constructions were retrieved manually<br />
from the pilot corpus. 151 occurrences were found, 22 of<br />
which were erroneous. The following types of error<br />
occurred:<br />
• Omission of be (6 instances): *Should the death<br />
penalty reintroduced in Germany?<br />
• Morphological and/or orthographic errors in the<br />
form of be or related clitics (3 instances): *You arent<br />
forced to post anything in the internet.<br />
• Morphological and/or orthographic errors in the past<br />
participle (11 instances): *[...] if the alcohol can just<br />
be buyed by 21 old people.<br />
• Lexical errors (1 instance): *[...] so he is already<br />
prisoned by the police.<br />
• A combination of these (1 instance): *[...] because<br />
it´s forbideden. 1<br />
1 The fact that learners frequently use accents on the keyboard<br />
instead of apostrophes presents POS-taggers with problems.<br />
However, this will be solved by combining automatic<br />
annotation with manual editing (cf. Granger, 1997).<br />
In addition, 9 instances of get-passives were retrieved,<br />
three of which were ungrammatical.<br />
6.2. Automatic Analysis<br />
An analysis of which POS-tags TreeTagger (TT),<br />
CLAWS4 (CL) and the tagger integrated into the MATE<br />
parser (MA) assign to the learners' grammatical be Ved<br />
and get Ved constructions has shown that only<br />
TreeTagger was able to find all instances 2 (cf. Table 2).<br />
                                 TT    CL    MA<br />
be + past participle (n=129)     129   128   123<br />
get + past participle (n=6)        6     4     5<br />
Table 2: Retrieval of be Ved and get Ved<br />
An analysis of how the three taggers deal with erroneous<br />
occurrences of be Ved constructions has revealed that<br />
both CLAWS4 and MATE seem to have less difficulty in<br />
dealing with ungrammatical past participles than<br />
TreeTagger (cf. Table 3).<br />
                                               TT    CL      MA<br />
correct tag for be (n=16)                      12    12      11<br />
correct tag for the past participle (n=22)     11    15 3    15<br />
correct tags for be and past participle (n=16)  4     8       8<br />
Table 3: Tags in erroneous occurrences of be Ved<br />
7. Conclusion<br />
In this paper, work towards a richly annotated corpus of<br />
teaching materials (TMC) and of learner text was<br />
described. The corpora are particularly rich in metadata<br />
(both on the sources of the TMC and on learner parameters),<br />
and they have been processed with two POS-taggers<br />
(TreeTagger and CLAWS4) and a dependency parser<br />
2 MATE had some difficulty processing said as a participle in<br />
passive constructions (4 instances).<br />
3 It is interesting to note that in some cases in which learners<br />
overgeneralize the -ed suffix for the formation of past<br />
participles (e.g. *buyed, *payed, *splitted), CLAWS4 will add<br />
a marker to the POS-tag of the respective form, indicating that<br />
the occurrence is deemed unlikely.<br />
Multilingual Resources and Multilingual Applications - Posters<br />
(MATE). Metadata and linguistic annotations can be<br />
queried together.<br />
As of summer 2011, the corpora are still very small<br />
(TMC: 420,000 words, LC: 17,000 words); they will<br />
gradually be enlarged. Both TreeTagger and CLAWS4<br />
will continue to be used concurrently, as TreeTagger<br />
seems to perform better on correct forms, and CLAWS4<br />
to be more robust towards erroneous ones. All relevant<br />
passive constructions will be extracted from the enlarged<br />
corpora, with pattern-based search for the correct forms<br />
and semi-automatic procedures for erroneous ones. The<br />
retrieved data, together with the pertinent metadata,<br />
should allow for an interpretation in terms of the impact<br />
of educational settings on the interlanguage of learners.<br />
8. Acknowledgements<br />
The authors would like to thank the following companies:<br />
Alfred Kärcher Vertriebs-GmbH, Cornelsen Verlag<br />
GmbH, Ernst Klett Verlag GmbH, SWN Kreissparkasse,<br />
Pearson Assessment & Information GmbH.<br />
9. References<br />
Bohnet, B. (2010): Very High Accuracy and Fast<br />
Dependency Parsing is not a Contradiction. In<br />
Proceedings of the 23rd International Conference on<br />
Computational Linguistics (Coling 2010), Beijing,<br />
pp. 89–97.<br />
Garside, R., Smith, N. (1997): A hybrid grammatical<br />
tagger: CLAWS4. In R. Garside, G. Leech & A.<br />
McEnery (Eds.), Corpus Annotation: Linguistic<br />
Information from Computer Text Corpora. London:<br />
Longman, pp. 102-121.<br />
Granger, S. (2009): More lexis, less grammar? What does<br />
the (learner) corpus say? Paper presented at the<br />
Grammar & Corpora conference, Mannheim,<br />
22-24 September 2009.<br />
Granger, S. (1997): Automated Retrieval of Passives<br />
from Native and Learner Corpora. Precision and<br />
Recall. In Journal of English Linguistics 25(4),<br />
pp. 365-374.<br />
Horn, W. (2003): PSB-R 6-13. Prüfsystem <strong>für</strong> Schul- und<br />
Bildungsberatung <strong>für</strong> 6. bis 13. Klassen – revidierte<br />
Fassung. Göttingen: Hogrefe.<br />
Petermann, F. & Winkel, S. (2007): FLM 7-13.<br />
Fragebogen zur Leistungsmotivation <strong>für</strong> Schüler der 7.<br />
bis 13. Klasse. Frankfurt/Main: Harcourt.<br />
Schmid, H. (1994): Probabilistic Part-of-Speech Tagging<br />
Using Decision Trees. In Proceedings of the International<br />
Conference on New Methods in Language Processing.<br />
Svartvik, J. (1966): On Voice in the English Verb. The<br />
Hague/Paris: Mouton.<br />
Register, Genre, Rhetorical Functions:<br />
Variation in English Native-Speaker and Learner Writing<br />
Ekaterina Zaytseva<br />
Johannes Gutenberg-<strong>Universität</strong> Mainz, Department of English and Linguistics<br />
Jakob-Welder-Weg 18, 55099 Mainz<br />
E-mail: zaytseve@uni-mainz.de<br />
Abstract<br />
The present paper explores patterns and determinants of variation found in the writing of two groups of novice academic writers:<br />
advanced learners of English and English native speakers. It focuses on lexico-grammatical means for expressing the rhetorical<br />
function of contrast in academic and argumentative writing. The study’s aim is to explore and to compare stocks of meaningful ways<br />
of expressing the rhetorical function of contrast employed by native and learner novice academic writers in two different written<br />
genres: argumentative essays and research papers. The following corpora are used for that purpose: the Louvain Corpus of Native<br />
English Essays (LOCNESS), the Michigan Corpus of Upper-level Student Papers (MICUSP), the British Academic Written English<br />
corpus (BAWE) and two corpora of learner English, i.e. the International Corpus of Learner English (ICLE) and the Corpus of<br />
Academic Learner English (CALE) – the latter being a corpus of advanced learner academic writing, currently being compiled at<br />
Johannes Gutenberg-<strong>Universität</strong> Mainz, Germany. The study adopts a variationist perspective and a functional-pedagogical<br />
perspective on learner writing, aiming at contributing to the field of second language acquisition (SLA), by focusing on advanced<br />
stages of acquisition and teaching English for academic purposes.<br />
Keywords: novice academic writing, rhetorical function of contrast, variation, function-oriented annotation<br />
1. Introduction<br />
The branch of SLA focusing on advanced levels of<br />
proficiency puts forward issues that are problematic for<br />
researchers, EAP teachers, and foreign language learners<br />
alike. Those include the need for an exhaustive<br />
description of language performance on an advanced<br />
level and a set of defining characteristics which could be<br />
further developed into assessment criteria.<br />
One of the factors responsible for the problematic nature<br />
of “advancedness” is a somewhat narrow view of this<br />
stage of language acquisition as on the one hand, “no<br />
more than ‘better than intermediate level’ structural and<br />
lexical ability for use”, as pointed out by Ortega and<br />
Byrnes (2008:283); and yet, on the other hand, as<br />
language performance, not “flawless” enough to be<br />
considered native-like.<br />
2. Theoretical Background<br />
Advanced learner writing has recently been the object of<br />
a number of corpus-based studies (cf. e.g. Callies, 2008;<br />
Gilquin & Paquot, 2008; Paquot, 2010). It has generally<br />
been analysed from a pedagogical perspective, i.e.<br />
against the yardstick of English native-speakers’ writing,<br />
where features of learner writing have often been<br />
characterized as non-native-like. Among the areas<br />
identified as problematic for advanced learners are most<br />
notably accurate and appropriate use of lexis, register<br />
awareness, and information structure management. Yet,<br />
studies adopting a variationist perspective on advanced<br />
learners’ output and considering a possible influence of<br />
different kinds of variables are still scarce (cf., however,<br />
Ädel, 2008; Paquot, 2010; Wulff & Römer, 2009). One of<br />
the reasons for this could be the lack of corpora<br />
representing advanced academic learner writing (Granger<br />
& Paquot, forthcoming), which makes it difficult, for<br />
example, to analyse the importance of genre and writer’s<br />
genre (un)awareness as possible determinants of<br />
variation. The existing corpora include the following<br />
projects in progress: the ‘Varieties of English for Specific<br />
Purposes’ database (VESPA) (cf. Granger, 2009), the<br />
Corpus of Academic Learner English (CALE) 1 , and the<br />
Cologne-Hanover Advanced Learner Corpus (CHALC)<br />
(Römer, 2007).<br />
The pedagogical approach to learners’ language<br />
production has brought forward particular kinds and<br />
methods of learner data analysis. One of them is<br />
annotating a learner corpus for errors (cf. Granger, 2004).<br />
Valuable as it is, this kind of corpus annotation, however,<br />
does not allow for a truly usage-based perspective on<br />
learner language production, where learners’ experience<br />
with language in particular social settings is the focus of<br />
attention.<br />
Corpus-based analyses of native English academic<br />
writing, meanwhile, have revealed that this register is<br />
characterised by a specific kind of vocabulary on the one<br />
hand (Biber et al., 1999; Coxhead, 2000; Paquot, 2010)<br />
and by certain kinds of grammatical structures on the<br />
other hand (e.g. Biber, 2006; Kerz & Haas, 2009). In<br />
addition, it has been pointed out that the register of native<br />
English academic writing displays a certain degree of<br />
variation as well, e.g. there is discipline- and genre-based<br />
variation in the form and use of lexico-grammatical<br />
structures used in written discourse (Hyland, 2008).<br />
However, there is little information on possible variation<br />
in different genres produced by novice native English<br />
academic writers (cf., however, Wulff & Römer, 2009).<br />
3. Project Aims and Objectives<br />
The present paper reports on work in progress exploring<br />
patterns and determinants of variation found in the<br />
writing of two groups of novice academic writers:<br />
advanced learners of English and English native speakers.<br />
It focuses on lexico-grammatical ways for expressing the<br />
rhetorical function of contrast in academic and<br />
argumentative writing. The study’s aim is to explore and<br />
subsequently to compare stocks of meaningful ways of<br />
expressing contrast employed by native and learner<br />
novice academic writers in two different written genres:<br />
argumentative essays and research papers. For that<br />
purpose the following corpora are used: three corpora of<br />
native English: the Louvain Corpus of Native<br />
English Essays (LOCNESS) (Granger, 1996), the<br />
Michigan Corpus of Upper-level Student Papers<br />
1 http://www.advanced-learner-varieties.info<br />
(MICUSP) 2 , the British Academic Written English corpus<br />
(BAWE) (Nesi, 2008) as well as two corpora of learner<br />
English, i.e. the International Corpus of Learner English<br />
(ICLE) (Granger, 2003) and the Corpus of Academic<br />
Learner English (CALE) 3 - a corpus of advanced learner<br />
academic writing, currently being compiled at<br />
Johannes-Gutenberg-<strong>Universität</strong> Mainz, Germany.<br />
Another aim of the study is to investigate to what extent<br />
the influence of the variable ‘genre’ is a possible<br />
determinant of variation in the written production of<br />
various groups of academic writers. In this respect, it is<br />
important to address the issue of novice writers’ genre<br />
awareness and to discuss the question of native-speaker<br />
norm. In addition, the paper explores the existence of<br />
interlanguage (IL)-specific strategies used by advanced<br />
learners to express rhetorical functions in writing.<br />
The latter will be achieved by annotating both corpora of<br />
advanced learner writing for the rhetorical function of<br />
contrast. This kind of function-oriented annotation,<br />
though still rare in English learner corpus research,<br />
presents researchers with a valuable opportunity to view<br />
learners as active language users, rather than learners<br />
demonstrating deficient knowledge of the target language.<br />
In addition, the potential of multidimensional corpus<br />
analysis (Biber & Conrad, 2001) is currently being<br />
considered as a highly useful method of distinguishing<br />
between different registers and genres.<br />
The study, thus, adopts a variationist perspective on<br />
novice academic writing, considering advanced<br />
interlanguage as a variety in its own right. At the same<br />
time, a functional-pedagogical perspective allows for a<br />
further analysis of those areas of language use that are<br />
still problematic for advanced learners, and reveals<br />
meaningful ways in which learners cope with<br />
writing-related tasks.<br />
4. Function-oriented annotation<br />
The advantage of adding a function-driven annotation is<br />
that it makes it possible to generally identify contrast in<br />
learner writing and to pin down an extensive stock of<br />
language means, treated as writers’ lexico-grammatical<br />
preferences for signaling this rhetorical function in<br />
written discourse.<br />
2 http://micusp.elicorpora.info/www.micusp.org<br />
3 http://www.advanced-learner-varieties.info<br />
Furthermore, the encoded information allows for<br />
function-driven as well as form-driven searches in<br />
learner writing, resulting in a comprehensive and<br />
accurate picture of the variety of lexico-grammatical<br />
means for expressing contrast used by two groups of<br />
(advanced) German learners in their writing. In addition,<br />
a subsequent quantitative analysis can provide valuable<br />
insights into general and individual preferences of<br />
learners in terms of which items are particularly favoured<br />
in the context of a specific writing-related task set in a<br />
specific situation of language use. Moreover, its<br />
combination with a qualitative analysis of patterns and<br />
determinants of variation in the ways of expressing<br />
contrast in writing promises to shed more light on general<br />
written argumentation strategies employed by (advanced)<br />
German learners.<br />
In order for this kind of annotation to be reliable, several<br />
conditions have to be met, which when applied to the<br />
present project, imply clarification of the concept of a<br />
rhetorical function and a clear definition of the rhetorical<br />
function of contrast in terms of its aim and distinctive<br />
characteristics, complemented by a list of possible<br />
language items for its realization in writing.<br />
The next step involves annotating each instance of<br />
contrast being expressed in written discourse in both<br />
corpora of (advanced) German learner writing (i.e.<br />
CALE-GE and ICLE-GE). This stage is followed by a<br />
detailed description and categorization of the<br />
lexico-grammatical means for expressing contrast in<br />
learner writing. Subsequently, comparative analyses,<br />
quantitative as well as qualitative, are carried out, in<br />
order to reveal possible patterns and determinants of<br />
variation that exist in the novice academic writing.<br />
Preliminary findings reveal a slight degree of<br />
genre-induced variation in German learners’ writing in<br />
terms of sentence placement of the contrastive item<br />
however, see Table 1 below.<br />
Corpus     Corpus size (N of tokens)   Initial     Non-initial   Total<br />
ICLE-GE    234,423                     103 (45%)   125 (55%)     228<br />
CALE-GE     55,000                      49 (64%)    27 (36%)      76<br />
Table 1: Position of the contrastive item however<br />
As the table shows, German learners seem to prefer the<br />
initial sentence positioning of however in academic<br />
(CALE-GE), rather than in argumentative (ICLE-GE)<br />
writing. Thus, the item however found in the sentence<br />
initial position is almost 1.5 times more frequent in term<br />
papers than in argumentative essays. This seems to tie in<br />
well with one of the findings recently reported by Wagner<br />
(2011). In her empirical study, she points out a tendency<br />
for however to take up the initial sentence position in<br />
literature and cultural studies texts, rather than in<br />
linguistic texts and general corpora (2011:43). Due to a<br />
modest number of words contained in the version of the<br />
CALE corpus used at the time of analysis (see Table 1),<br />
the preliminary finding reported in the current paper<br />
should be treated with caution. A further analysis of a<br />
greater number of occurrences in a bigger corpus is<br />
needed in order to provide more empirical evidence for<br />
supporting and accounting for this finding.<br />
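The comparison above can be made explicit by recomputing the proportions in Table 1 from the raw<br />
counts; the sketch below only re-derives the figures already reported.<br />

```python
# Re-deriving the percentages in Table 1 from the raw counts.
counts = {
    "ICLE-GE": {"initial": 103, "non_initial": 125},  # argumentative essays
    "CALE-GE": {"initial": 49,  "non_initial": 27},   # term papers
}

def initial_share(corpus):
    """Proportion of sentence-initial occurrences of 'however'."""
    c = counts[corpus]
    return c["initial"] / (c["initial"] + c["non_initial"])

icle = initial_share("ICLE-GE")
cale = initial_share("CALE-GE")
print(round(icle, 2), round(cale, 2), round(cale / icle, 2))  # 0.45 0.64 1.43
```

The ratio of the two proportions (about 1.43) is the "almost 1.5 times" difference discussed above.<br />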
5. Conclusion<br />
The project presented in the present paper sets out to<br />
explore advanced IL-specific strategies for coping with a<br />
writing-related task in the context of English academic<br />
and argumentative writing. This is achieved by<br />
combining a functional-pedagogical view with a<br />
variationist perspective on learner writing and annotating<br />
the rhetorical function of contrast in the two corpora of<br />
learner writing. At the same time, the findings of the<br />
project will contribute to the area of variation in novice<br />
native English academic writing and will further a<br />
definition of the native speaker norm, which advanced<br />
learners are generally expected to aim at.<br />
6. References<br />
Ädel, A. (2008): Involvement features in writing: do time<br />
and interaction trump register awareness? In G.<br />
Gilquin, S. Papp. & M. B. Díez-Bedmar (Eds.),<br />
Linking up Contrastive and Learner Corpus Research.<br />
Amsterdam, Atlanta: Rodopi, pp. 35-53.<br />
Biber, D., Johansson, S., Leech, G., Conrad, S., Finegan,<br />
E. (1999): Longman Grammar of Spoken and Written<br />
English. Harlow: Pearson Education.<br />
Biber, D., Conrad, S. (2001): Introduction:<br />
Multidimensional analysis and the study of register<br />
variation. In S. Conrad & D. Biber (Eds.), Variation in<br />
English: Multidimensional Studies. London: Longman,<br />
pp. 3-13.<br />
Biber, D. (2006): University Language: A Corpus-Based<br />
Study of Spoken and Written Registers. Amsterdam:<br />
John Benjamins.<br />
Callies, M. (2008): Easy to understand but difficult to use?<br />
Raising constructions and information packaging in<br />
the advanced learner variety. In G. Gilquin, S. Papp. &<br />
M. B. Díez-Bedmar (Eds.), Linking up Contrastive and<br />
Learner Corpus Research. Amsterdam, Atlanta:<br />
Rodopi, pp. 201-226.<br />
Coxhead, A. (2000): A new academic word list. TESOL<br />
Quarterly, 34(2), pp. 213-238.<br />
Gilquin, G., Paquot, M. (2008): Too chatty: Learner<br />
academic writing and register variation. English Text<br />
Construction, 1(1), pp. 41-61.<br />
Granger, S. (1996): From CA to CIA and back: An<br />
integrated approach to computerized bilingual and<br />
learner corpora. In K. Aijmer, B. Altenberg & M.<br />
Johansson (Eds.), Languages in Contrast. Text-Based<br />
Cross-Linguistic Studies. Lund Studies in English 88.<br />
Lund: Lund University Press, pp. 37-51.<br />
Granger, S. (2003): The international corpus of learner<br />
English: A new resource for foreign language learning<br />
and teaching and second language acquisition research.<br />
TESOL Quarterly, 37(3), pp. 538-546.<br />
Granger, S. (2004): Computer learner corpus research:<br />
Current status and future prospects. In U. Connor & T.<br />
Upton (Eds.), Applied Corpus Linguistics: A<br />
Multidimensional Perspective. Amsterdam, Atlanta:<br />
Rodopi, pp. 123-145.<br />
Granger, S. (2009): In search of a general academic<br />
vocabulary: A corpus-driven study. Paper Presented at<br />
the International Conference ‘Options and Practices of<br />
L.S.A.P Practitioners’, 7-8 February 2009. University<br />
of Crete, Heraklion, Crete.<br />
Granger, S., Paquot, M. (Forthcoming): Language<br />
for Specific Purposes. Retrieved from<br />
http://sites.uclouvain.be/cecl/archives/GRANGER_P<br />
AQUOT_Forthcoming_Language_for_Specific_Purp<br />
oses_Learner_Corpora.pdf , 17.12.2010.<br />
Hyland, K. (2008): As can be seen: lexical bundles and<br />
disciplinary variation. English for Specific Purposes,<br />
27(1), pp. 4-21.<br />
Kerz, E., Haas, F. (2009): The aim is to analyse NP: the<br />
function of prefabricated chunks in academic texts. In<br />
R. Corrigan, E. Moravcsik, H. Ouali & K. Wheatley<br />
(Eds.), Formulaic Language: Volume 1. Distribution<br />
and historical change. Amsterdam, Philadelphia: John<br />
Benjamins, pp. 97-117.<br />
Nesi, H. (2008): BAWE: An introduction to a new<br />
resource. In A. Frankenberg-Garcia, T. Rkibi, M.<br />
Braga da Cruz, R. Carvalho, C. Direito & D.<br />
Santos-Rosa (Eds.), Proceedings of the 8th Teaching<br />
and Language Corpora Conference. Held 4-6 July<br />
2008 at the Instituto Superior de Línguas e<br />
Administração. Lisbon, Portugal: ISLA, pp. 239-246.<br />
Ortega, L., Byrnes, H. (2008): Theorizing advancedness,<br />
setting up the longitudinal research agenda. In L.<br />
Ortega & H. Byrnes (Eds.), The Longitudinal Study of<br />
Advanced L2 Capacities. New York: Routledge/Taylor<br />
& Francis, pp. 3-20.<br />
Paquot, M. (2010): Academic Vocabulary in Learner<br />
Writing: From Extraction to Analysis. United States:<br />
Continuum Publishing Corporation.<br />
Römer, U. (2007): Learner language and the norms in<br />
native corpora and EFL teaching materials: a case<br />
study of English conditionals. In: S. Volk-Birke & J.<br />
Lippert (Eds.), Anglistentag 2006 Halle. Proceedings.<br />
Trier: Wissenschaftlicher Verlag, pp. 355–63.<br />
Wagner, S. (<strong>2011</strong>): Concessives and contrastives in<br />
student writing: L1, L2 and genre differences.<br />
In J. Schmied (Ed.), Academic Writing in Europe:<br />
Empirical Perspectives. Göttingen: Cuvillier,<br />
pp. 23-49.<br />
Wulff, S. & Römer, U. (2009): Becoming a proficient<br />
academic writer: Shifting lexical preferences in the use<br />
of the progressive. Corpora, 4(2), pp. 115-133.
Tools to Analyse German-English Contrasts in Cohesion<br />
Kerstin Kunz, Ekaterina Lapshinova-Koltunski<br />
<strong>Universität</strong> des Saarlandes<br />
<strong>Universität</strong> Campus, 66123 Saarbrücken<br />
E-mail: k.kunz@mx.uni-saarland.de, e.lapshinova@mx.uni-saarland.de<br />
Abstract<br />
In the present study, we elaborate resources to semi-automatically analyse German-English contrasts in the area of cohesion. This<br />
work is an example of applications for corpus data extraction that is designed for the analysis of cohesion from both a system-based<br />
and a text-based contrastive perspective.<br />
Keywords: cohesion, contrastive analysis, corpus linguistics, extraction of linguistic knowledge, German-English contrasts<br />
1. Introduction<br />
To obtain empirical evidence of cohesion in English and<br />
German texts we carry out a corpus-linguistic analysis,<br />
which includes investigating a broad range of cohesive<br />
phenomena. We particularly focus on the analysis of<br />
various types of cohesive devices, the linguistic<br />
expressions to which they connect (the antecedents), the<br />
nature of the semantic ties established as well as the<br />
properties of cohesive chains. Our main research<br />
questions are 1) Which cohesive resources provided by<br />
the language systems of English and German are<br />
instantiated in different registers? 2) How frequent are<br />
they? 3) Which cohesive meanings do they express?<br />
Substantial research gaps in these areas justify such an<br />
enterprise: On the one hand, comprehensive accounts of<br />
cohesion are only existent from a monolingual<br />
perspective, e.g. in (Halliday & Hasan, 1976), (Schubert,<br />
2008), (Linke et al., 2001), (Brinker, 2005). On the other<br />
hand, empirical monolingual or contrastive analyses on<br />
the level of text and discourse mainly deal with<br />
individual phenomena, cf. (Fabricius-Hansen, 1999) and<br />
(Doherty, 2006) for certain aspects of information<br />
packaging and (Bosch et al., 2007), (Gundel et al., 2004)<br />
for the investigation of particular cohesive devices.<br />
Thus, both system-based and text-based contrastive<br />
methods to compare English and German in terms of<br />
textuality have to our knowledge not received much<br />
attention so far, cf. table 1.<br />
With our research, we intend to focus on cohesion as one<br />
particular aspect of textuality. As a starting point for our<br />
empirical analysis, we take the classification by (Halliday<br />
& Hasan, 1976), according to which cohesion mainly<br />
includes five categories: reference, substitution, ellipsis,<br />
conjunctive relations and lexical cohesion.<br />
Table 1: Contrastive system- and text-based studies available for English and German<br />
2. Corpus Resources<br />
In this contribution, we describe our tools to extract<br />
evidence for these categories from the English-German<br />
corpus GECCo, cf. (Amoia et al., submitted). Currently<br />
there are no comprehensive resources known to us that<br />
offer a repository of the coherence building systems of<br />
one or more language(s) 1 . Our analysis design permits<br />
1 We can only name some resources providing annotations of<br />
individual cohesive phenomena, e.g. pronoun coreference in the<br />
BBN Pronoun Coreference and Entity Type Corpus, cf.<br />
(Weischedel and Brunstein 2005), verbal phrase ellipsis in (Bos<br />
and Spenader <strong>2011</strong>) or conjunctive relations in PDTB, cf.<br />
(Prasad et al. 2008) for English, or annotation of anaphora in<br />
(Dipper and Zinsmeister 2009) for German.<br />
new insights into cohesive phenomena across languages,<br />
contexts and registers. The elaboration of the procedures<br />
to extract such phenomena includes compilation,<br />
annotation and exploitation of GECCo, which consists of<br />
10 registers of both written and spoken texts, as shown in<br />
table 2. The written part of GECCo includes 8 registers 2<br />
which are based on the CroCo corpus, cf. (Neumann,<br />
2005).<br />
languages                 registers<br />
EO, GO, ETrans, GTrans    written (imported from CroCo): FICTION, ESSAY,<br />
                          INSTR, POPSCI, SHARE, SPEECH, TOU, WEB<br />
EO, GO                    spoken: INTERVIEW, ACADEMIC<br />
Table 2: Registers in GECCo<br />
The spoken part contains interviews (INTERVIEW) and<br />
academic speeches (ACADEMIC) produced by native<br />
speakers of the two languages 3 . We have chosen such a<br />
corpus constellation as we expect considerable<br />
differences in frequency and function of cohesive devices<br />
between written and spoken registers. Moreover, we<br />
depart from the assumption that there is a continuum<br />
from written to spoken mode rather than a clear dividing<br />
line.<br />
The written part of the multilingual corpus is already<br />
annotated with information on lemma, morphology, pos<br />
on the word level; sentences, grammatical functions,<br />
predicate-argument structures on the chunk level;<br />
2 popular-scientific texts (POPSCI), tourism leaflets (TOU),<br />
prepared speeches (SPEECH), political essays (ESSAY),<br />
fictional texts (FICTION), corporate communication (SHARE),<br />
instruction manuals (INSTR) and websites (WEB).<br />
3 This corpus part will be public and available on the web.<br />
registers and metadata on the text level as shown in figure<br />
1. It additionally contains clause-based alignment of<br />
originals and translations 4 . We intend to<br />
semi-automatically annotate spoken registers with the<br />
information available for the written part, developing a<br />
set of automatic procedures for this task. The annotation<br />
layer on text level will also be enhanced with metadata<br />
information on language variation, speaker age, etc.<br />
Further annotations such as coreference, lexical chaining<br />
and cohesion disambiguation based on the analyses in<br />
(Kunz & Steiner, in progress) and (Kunz, 2010) will be<br />
integrated into both parts of GECCo.<br />
3. Procedures to Analyse Cohesion<br />
The annotated corpus is encoded to be queried with CQP<br />
(Corpus Query Processor) 5 . We also plan to encode it for<br />
further existing query engines, e.g. ANNIS2 described in<br />
(Zeldes et al., 2009). The extracted information on<br />
cohesion will be imported into semiautomatic annotation<br />
tools in order to refine the corpus annotations on different<br />
levels, cf. figure 2.<br />
Figure 1: Annotation layers in GECCo<br />
As mentioned above, the annotated corpus can already be<br />
queried with CQP, which allows two types of attributes:<br />
positional (e.g. for part-of-speech and morphological<br />
features) and structural (e.g. for clauses or metadata).<br />
With the help of CQP-based queries that include string,<br />
part-of-speech, text and register constraints we are able to<br />
extract linguistic items expressing the cohesion<br />
categories introduced in section 1 above and classify
them according to their specific textual functions. We use<br />
our linguistic knowledge on cohesive devices to develop<br />
sets of complex queries with CQP that enable the<br />
extraction of cohesion from GECCo. The obtained data<br />
are subject to statistical validation (e.g. significance tests or variation and cluster analysis) with R, with the help of which we can disambiguate and classify cohesive devices.
4 EO = English originals, GO = German originals, ETrans = English translations, GTrans = German translations in table 2.
5 cf. Christ (1994).
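A significance test of the kind mentioned can be sketched without R as well; the following Python function computes the Pearson chi-square statistic for a contingency table, and the counts in the example are invented for illustration, not taken from GECCo:

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# hypothetical counts of cohesive vs. non-cohesive sentence-initial it/es
# in two subcorpora; the numbers are made up for illustration
table = [[30, 70],   # EO: cohesive, non-cohesive
         [55, 45]]   # GO: cohesive, non-cohesive
print(round(chi_square(table), 2))   # 12.79
```

A large statistic relative to the chi-square distribution with one degree of freedom would indicate that cohesive use is distributed differently across the two subcorpora.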
Moreover, CQP can also be employed to incrementally<br />
improve the corpus annotations, which allows us to<br />
semi-automatically enrich the corpus with annotations on the extracted information, as shown in
figure 3. However, our observations show that<br />
representing nested structures or constituents containing<br />
gaps (necessary for annotation of coreference or ellipsis)<br />
within CQP is rather problematic, cf. (Amoia et al.,<br />
submitted). As mentioned above, we therefore plan to exploit GECCo with further available query engines, e.g.
ANNIS2.<br />
4. Preliminary Results<br />
Our preliminary extraction results already show systematic language- and register-dependent contrasts in the frequency of personal reference. As an example, consider our findings
for the distribution of neuter forms of third person<br />
pronouns at sentence-initial position in figure 4 (EO =<br />
English Original, GO = German Original, ETrans =<br />
English Translation, GTrans = German Translation, cf.<br />
figure 2). The left side shows the distribution in<br />
percentage of sentence initial occurrences of cohesive<br />
it/es. The right side displays the total numbers for all<br />
instances and cohesive instances of sentence initial it/es.<br />
Multilingual Resources and Multilingual Applications - Posters
Figure 2: Procedures to analyse Cohesion in GECCo
In addition, we could already show in the analysis of the German demonstrative pronouns der, die, das that there is a heterogeneity in frequency and function across
registers which goes beyond assumptions drawn in the<br />
frame of earlier systemic and also textual accounts. For<br />
instance, the findings displayed in table 3 suggest a<br />
written-spoken continuum, with the register INSTR at<br />
one end and INTERVIEW at the other end of the<br />
continuum, rather than a clear-cut distinction between<br />
written and spoken registers (as already postulated<br />
above). Moreover, the differences in numbers between<br />
das and der, die call for an in-depth analysis with respect<br />
to distinct functions.<br />
der die das<br />
GO_SPEECH 4 4 173<br />
Gtrans_SPEECH 3 - 38<br />
GO_FICTION 15 12 113<br />
Gtrans_FICTION 10 7 100<br />
GO_POPSCI 4 1 110<br />
Gtrans_POPSCI 3 1 44<br />
GO_TOU 9 2 31<br />
Gtrans_TOU 2 1 14<br />
GO_SHARE 3 1 44<br />
Gtrans_SHARE 3 - 46<br />
GO_ESSAY 1 3 90<br />
Gtrans_ESSAY - - 49<br />
GO_INSTR - - 20<br />
Gtrans_INSTR - - 18<br />
GO_WEB 1 2 31<br />
Gtrans_WEB 1 - 27<br />
GO_INTERVIEW 19 47 506<br />
Table 3: Occurrences of der, die, das in German<br />
subcorpora<br />
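The register comparison behind table 3 can be reproduced in a few lines; the counts below are copied from selected GO rows of table 3, and the proportion shown is merely one illustrative measure, not the paper's functional analysis:

```python
# occurrences of der / die / das in selected German original subcorpora (table 3)
counts = {
    "GO_INSTR":     (0, 0, 20),
    "GO_SPEECH":    (4, 4, 173),
    "GO_FICTION":   (15, 12, 113),
    "GO_INTERVIEW": (19, 47, 506),
}

def das_share(der, die, das):
    """Proportion of das among all der/die/das occurrences in a subcorpus."""
    return das / (der + die + das)

for register, c in counts.items():
    # e.g. GO_INSTR -> 1.0, GO_INTERVIEW -> 0.88
    print(register, round(das_share(*c), 2))
```

Such proportions make the differences in numbers between das and der, die directly comparable across registers before any in-depth functional analysis.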
5. Conclusion<br />
The described resources to extract comprehensive<br />
linguistic knowledge on cohesion will find application in<br />
various linguistic areas. First, they should provide us with<br />
evidence for our hypotheses on English-German<br />
contrasts in cohesion described in (Kunz & Steiner, in<br />
progress). Second, they should yield an initial<br />
understanding of how contrast and contact phenomena on<br />
the level of cohesion affect language understanding and<br />
language production. Furthermore, the obtained<br />
information on cohesive mechanisms of English and<br />
German will provide valuable insights for language<br />
teaching, particularly for translator/interpreter training.
Our tools will also offer new incentives for the automatic<br />
exploitation of cohesion, e.g. in machine translation, as<br />
they permit extraction from parallel corpora.<br />
6. Acknowledgements<br />
The authors thank the DFG (Deutsche Forschungsgemeinschaft)<br />
and the whole GECCo team for supporting<br />
this project.<br />
7. References<br />
Brinker, K. (2005): Linguistische Textanalyse: Eine Einführung in Grundbegriffe und Methoden. 6th edition. Berlin: Erich Schmidt.
Christ, O. (1994): A modular and flexible architecture for<br />
an integrated corpus query system. In Proceedings of<br />
the 3rd Conference on Computational Lexicography<br />
and Text Research. Budapest, Hungary.<br />
Dipper, S., Zinsmeister, H. (2009): Annotating discourse anaphora. In Proceedings of the Third Linguistic Annotation Workshop (LAW III), ACL-IJCNLP 2009. Suntec, Singapore, pp. 166-169.
Doherty, M. (2006): Structural Propensities. Translating<br />
nominal word groups from English into German.<br />
Amsterdam/ Philadelphia: Benjamins.<br />
Fabricius-Hansen, C. (1999): Information packaging and<br />
translation: Aspects of translational sentence splitting<br />
(German - English/ Norwegian). In Studia Grammatica,<br />
47, pp. 175-214.<br />
Gundel, J. K., Hedberg, N., Zacharski, R. (2004):<br />
Demonstrative pronouns in natural discourse. In<br />
Proceedings of the Fifth Discourse Anaphora and<br />
Anaphora Resolution Colloquium. Sao Miguel, Portugal.<br />
pp. 81-86.<br />
Halliday, M.A.K., Hasan, R. (1976): Cohesion in English.<br />
London, New York: Longman.<br />
Kunz, K., Steiner, E. (in progress): Towards a comparison of cohesion in English and German - contrasts and contact. Submitted to Functional Linguistics. London: Equinox Publishing Ltd.
Kunz, K. (2010): Variation in English and German<br />
Nominal Coreference. A Study of Political Essays.<br />
Frankfurt am Main: Peter Lang.<br />
Linke, A., Nussbaumer, M., Portmann, P.R. (2001): Studienbuch Linguistik. 4th edition. Tübingen: Niemeyer.
Neumann, S. (2005): Corpus Design. Deliverable No. 1<br />
of the CroCo Project.<br />
Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., Webber, B. (2008): The Penn Discourse TreeBank 2.0. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008). Marrakech.
Schubert, C. (2008): Englische Textlinguistik. Eine Einführung. Berlin: Schmidt.
Weischedel, R., Brunstein, A. (2005): BBN Pronoun Coreference and Entity Type Corpus. Linguistic Data Consortium, Philadelphia.
Zeldes, A., Ritz, J., Lüdeling, A., Chiarcos, C. (2009): ANNIS: A search tool for multi-layer annotated corpora. In Proceedings of Corpus Linguistics 2009, Liverpool, July 20-23, 2009.
Multilingual Resources and Multilingual Applications - Posters<br />
Comparison and Evaluation of ontology extraction systems<br />
Stefanie Reimers<br />
University of Hamburg<br />
E-mail: 4reimers@informatik.uni-hamburg.de<br />
Abstract<br />
This paper presents the results of an evaluation and comparison of the two semi-automatic, corpus-based ontology extraction systems OntoLT and Text2Onto. Both systems were applied to a German corpus and their outputs were evaluated in two steps. First, the Text2Onto ontology was evaluated against a gold standard, represented by a manually created ontology. Second, the automatically extracted ontologies of both systems were compared to each other. In addition, the usability of the tools is discussed in order to provide some hints for improving the design of future ontology extraction systems.
Keywords: ontology, ontology learning, ontology extraction, ontology evaluation<br />
1. Introduction<br />
During the last years, the application area of ontologies has expanded massively. They are no longer only part of the vision of the Semantic Web but are also used in intelligent search engines, information systems and in the field of model-based systems engineering. The need for ontologies is therefore growing accordingly. However, the creation of ontologies is often accompanied by a huge manual effort, so that this process remains very time- and cost-intensive. Existing editors like Protégé¹ support the work of ontology developers and make it more comfortable, but they can only reduce a small part of the required effort. Hence, techniques which reduce the manual part of the process by employing automatic methods are desirable. Corpus-based ontology extraction tools seem to be the solution. They take a domain-specific text corpus as input and output a domain ontology. Text is an especially valuable data source nowadays because of its permanently updated availability on the web. Under ideal circumstances, the ontology extraction process would be fully automatic and produce a domain ontology of good quality. To date this remains unrealizable, because an important part of knowledge cannot be inferred from text corpora: commonsense knowledge. Consequently, semi-automatic extraction systems represent the maximal degree of support during the ontology engineering process. Several tools have been developed and are partly freely available on the web. But how well do they perform? After all, they are only useful if they substantially reduce the manual effort compared to the traditional ontology engineering process. This implies on the one hand that a tool should be easy to use and on the other hand that the resulting ontology should be of good quality, comparable to a manually created one. Another interesting question is how the outputs of the systems differ when they are applied to the same corpus.
1 http://protege.stanford.edu/
Several works on the evaluation of ontology extraction systems have been published in recent years, but none of them considered a gold-standard evaluation against a manually created ontology. Furthermore, there has been no attempt that used a German text corpus as data source. This work aims at exploring these missing aspects by determining how great the advantage of OntoLT² and Text2Onto³ is compared to the manual creation of an ontology. For this purpose, both systems were applied to the German text corpus of the Language Technology for eLearning project (LT4eL⁴), their outputs were compared to each other and, finally, the Text2Onto ontology was evaluated against the manually created LT4eL ontology.
Section 2 introduces the ontology extraction systems<br />
OntoLT and Text2Onto as well as the LT4eL-corpus and<br />
the LT4eL-ontology. Section 3 gives a short review of
2 http://olp.dfki.de/OntoLT/OntoLT.htm<br />
3 http://code.google.com/p/text2onto<br />
4 http://www.let.uu.nl/lt4el<br />
current studies dealing with the evaluation of tools of this<br />
kind. Section 4 deals with the actual evaluation of the<br />
systems and the produced ontologies.<br />
2. Presentation of the used systems<br />
and the data resources<br />
The ontology extraction systems were chosen because they are freely available and because they are able to process German texts.
2.1. OntoLT<br />
OntoLT is a Java-based Protégé plugin. The version 2.0 employed in this work is exclusively compatible with Protégé 3.2, which is also freely available online. It takes as input a corpus of linguistically annotated texts in XML⁵ format. There are no requirements for a specific XML format, because the user can customize the tool for various formats. This is done by changing the implemented XPath⁶ expressions, which address specific linguistic elements (like sentences, noun phrases, head nouns, etc.). They are needed for the extraction process, which is performed via so-called mapping rules. Those rules determine which concepts, instances and relations will be extracted automatically. Some rules are already implemented, but the user can also integrate new ones using OntoLT's native precondition language. Rules consist of two parts: constraints and operators. If certain constraints are satisfied, one or more operators take effect. Operators can create concepts and concept properties as well as attach instances to existing concepts.
In this work, only the implemented rules were used. They specify that concepts are created for all heads of noun phrases in the corpus. If adjectives modify these nouns, they are combined with the concept, resulting in a subconcept. Another rule triggers the extraction of relations, which are inferred from the predicates of sentences together with their subjects and direct objects. After the application of the rules, the extracted concepts, relations and instances can be manipulated with the help of Protégé (Buitelaar et al., 2004).
5 http://www.w3.org/standards/xml<br />
6 http://www.w3.org/TR/xpath<br />
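The constraint/operator logic of such a mapping rule can be sketched as follows; this is a simplified re-implementation in Python, not OntoLT's actual precondition language, and it mirrors only the implemented head-noun rule described above:

```python
def head_noun_rule(noun_phrase, ontology):
    """Simplified mapping rule: create a concept for the head noun of an NP;
    for each adjective modifying the noun, additionally create a subconcept."""
    head = noun_phrase.get("head")      # constraint: the NP must have a head noun
    if not head:
        return
    ontology.setdefault(head, set())    # operator: create the concept
    for adj in noun_phrase.get("adjectives", []):
        # operator: adjective + noun combine into a subconcept
        ontology[head].add(f"{adj} {head}")

ontology = {}
head_noun_rule({"head": "Editor", "adjectives": ["einfacher"]}, ontology)
print(ontology)   # {'Editor': {'einfacher Editor'}}
```

The real rules additionally extract relations from predicates with their subjects and objects, which this sketch omits.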
2.2. Text2Onto<br />
Text2Onto is also a Java-based application, realized as a standalone system. It requires the prior installation of GATE 4.0⁷ and WordNet 2.0⁸, which are both open source. The input consists of a corpus in text, HTML or PDF format. No linguistic preprocessing is required because the system provides its own preprocessing. Supported languages are English and, partially, also Spanish and German. The extraction process consists of several steps, each of which draws on different implemented algorithms. The user can choose between the algorithms or employ a combination of them. For example, there are three different methods for identifying concept candidates: rtf⁹, tf-idf¹⁰ and C/NC-value. The results of the algorithms are saved in a so-called probabilistic ontology model (POM). It consists of a set of instantiated modeling primitives, which are independent of a specific ontology representation language. Each instance is assigned a numerical value between 0 and 1 (computed by the algorithms), indicating the probability that it constitutes an element relevant for the ontology. The elements, together with their values, are then presented to the user, who is supported in the selection process by the assigned values. The instantiation of the primitives takes place by accessing the declarative definitions in the modeling primitive library (MPL). Modeling primitives are: concepts, subconcepts, instances and relations. Ontology writers are responsible for the translation of the POMs into a specific ontology language like OWL¹¹ or RDFS¹² (Cimiano & Völker, 2005).
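The idea of scoring concept candidates and filtering them by a POM-style relevance value can be sketched in Python; the normalization of the rtf score to the [0, 1] range by the most frequent term is an assumption made here for illustration, not Text2Onto's exact computation:

```python
from collections import Counter

def rtf_scores(tokens):
    """Relative term frequency of each candidate term, scaled to [0, 1]
    by the frequency of the most frequent term (illustrative only)."""
    freq = Counter(tokens)
    top = max(freq.values())
    return {term: count / top for term, count in freq.items()}

# toy POM: candidate concepts with probability-like relevance values
tokens = ["Tabelle", "Zelle", "Tabelle", "Tabelle", "Zelle", "Menü"]
pom = rtf_scores(tokens)

# the user reviews candidates guided by the values; here we simply
# keep all candidates above a hypothetical threshold
kept = sorted(t for t, p in pom.items() if p >= 0.5)
print(kept)   # ['Tabelle', 'Zelle']
```

In the real system several such algorithms run in parallel and the user makes the final selection; the threshold here stands in for that manual review step.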
2.3. The LT4eL-corpus<br />
The corpus originates from the LT4eL project and<br />
consists of 69 German texts. They were selected by the<br />
project participants and belong to the domain<br />
Information Technology for End Users & eLearning. All<br />
texts are introductions to the use of programs (like Excel and Word), the internet and eLearning. The
corpus includes 69 files, on average 5732 words per file<br />
and a total of 395547 words. 752 different domain<br />
relevant keywords were (manually) identified, which are<br />
7 http://gate.ac.uk/download/index.html<br />
8 http://wordnet.princeton.edu<br />
9 relative term frequency<br />
10 Term Frequency Inverse Document Frequency<br />
11 http://www.w3.org/TR/2004/REC-owl-features-20040210<br />
12 http://www.w3.org/TR/2004/REC-rdf-concepts-20040210
all covered by the LT4eL-ontology. Ideally, an ontology automatically extracted on the basis of this corpus should also semantically cover all keywords.
The files of the corpus are available in two formats: plain text and a linguistically annotated XML format, all encoded in UTF-8¹³. The text files serve as input for Text2Onto, the XML files for OntoLT. The XML format was determined by the LT4eL members. Sentence structure, noun phrases and tokens as well as corresponding lemmas, parts of speech and some morpho-syntactic information (person, number, gender, case) are annotated.
A snippet of an annotated file is presented in figure 1.<br />
Figure 1: Sample linguistic annotation
The linguistic information is located in the values of the attributes of the token tags: base references the lemma, ctag the part of speech¹⁴ and msd contains morpho-syntactic data. These are the aspects relevant for OntoLT.
The complete corpus of 69 files is used for the gold-standard evaluation of the Text2Onto ontology. Unfortunately, not all files could be processed by OntoLT. The reason for this could not be determined during this work. It was therefore not possible to perform a gold-standard evaluation of the OntoLT ontology, because the gold-standard ontology was generated on the basis of the whole corpus, so that a comparison would be unfair. Instead, the OntoLT ontology was compared to a Text2Onto ontology extracted on the basis of a reduced form of the corpus. This reduced corpus consists of the files which could be processed by OntoLT. It contains 43 files, on average 4760 words per file and a total of 204378 words (Mossel, 2007).
2.4. The LT4eL-ontology<br />
The LT4eL-ontology was created on the basis of<br />
manually annotated keywords of the corpus. The project<br />
members modeled adequate concepts, corresponding to<br />
those keywords. They also added further sub- and<br />
superconcepts (for example: if Notepad was identified as a concept, text editor and editor were also added as superconcepts). Finally, the ontology was connected to the upper ontology DOLCE Ultralite¹⁵. All in all, the ontology contains 1275 concepts – 1002 of them are domain concepts – 1612 subconcept relations and 116 further relations, including 42 subrelations. Each concept comes with an English definition and a natural language representation. The ontology is available as an OWL file in XML representation (Mossel, 2007).
13 Universal Character Set Transformation Format-8-bit
14 STTS (Stuttgart-Tübingen-TagSet)
3. State of the art<br />
During the last two years there were amongst others three<br />
publications of studies in the field of evaluation of<br />
semi-automatic ontology extraction tools, which used<br />
OntoLT and/or Text2Onto.<br />
Hatala et al. (2009) tested the systems OntoGen¹⁶ and
Text2Onto mainly according to their usability but also in<br />
relation to the quality of the produced ontologies. They<br />
used English corpora. 28 participants used the tools and<br />
answered questionnaires afterwards. The evaluation<br />
showed that the ontology extraction process via<br />
Text2Onto was accompanied by two central issues: 1)<br />
Due to a missing user guide, the participants were not able to preview what effects the different algorithms or their combination would have on the resulting ontology. 2) The integrated extraction methods identified an enormous number of concept candidates (several thousand), and the user was supposed to review all items according to their adequacy. Furthermore, the quality of the produced ontologies was categorized as very poor,
the produced ontologies was categorized as very poor,<br />
because they were flat and not appropriate to represent<br />
the demanded domain knowledge. The OntoGen-Tool<br />
was judged as more comfortable and user-friendly than<br />
Text2Onto. The participants felt more involved in the extraction process and were satisfied with the well-structured ontologies, which included several relations
(Hatala et al., 2009).<br />
Ahrens published her studies of OntoLT in 2010. She<br />
implemented her own extraction rules and applied them<br />
to an English corpus. Since the extracted ontology was<br />
very flat, additional superconcepts were inserted. Finally, the ontology was adequate to represent the
domain of the corpus. Ahrens concluded that OntoLT –<br />
though having some issues – would be a good support during the ontology engineering process (Ahrens, 2010).
15 http://wiki.loa-cnr.it/index.php/LoaWiki:DOLCE-UltraLite
16 http://ontogen.ijs.si/
Also in 2010, Park et al. made their work public. They evaluated the systems OntoLT, Text2Onto, OntoBuilder¹⁷ and DODDLE¹⁸ by applying them to an English corpus. They took the usability as well as the quality of the produced ontologies into account. They considered OntoLT to be less user-friendly because the input corpus has to be linguistically preprocessed. In the end, Text2Onto was judged the best tool because of its flexibility with respect to the input format on the one hand and the applicability of different extraction algorithms on the other (Park et al., 2010).
All presented studies treat the evaluation of ontology<br />
extraction tools. Nevertheless, one cannot infer predictions or expectations for the evaluation scenario in this work. The results are somewhat contradictory: Hatala et al. were not satisfied with Text2Onto, but Park et al. judged it the best of all tested systems. Ahrens classified OntoLT as helpful, though Park et al. criticized its user-friendliness. In addition, none of the studies includes a comparison between an automatically constructed and a manually created ontology. This fact and the application of a German corpus distinguish this work from all published so far.
4. Evaluation<br />
4.1. Gold Standard Evaluation<br />
The Text2Onto ontology contained 10174 concepts, 13 subconcept relations and 945 instances, but only 981 concepts, 3 subconcept relations and 18 instances made sense. Most of the extracted items were either not domain-relevant or consisted of strings which could not be interpreted (due to the only partially supported linguistic analysis for German texts). No further relations were identified. The ontology covers ca. 56 % of all domain-relevant terms of the corpus. Altogether, its quality is not as high as that of the manually created, well-structured LT4eL ontology. The Text2Onto ontology includes only few hierarchical relations, so that it is more a list of concepts than a real ontology. Also, the coverage of the domain-relevant terms is very low. Most of the concepts
are very specific, e.g. PowerPoint and Excel are included, but more general concepts like editor are missing (although they appear in the texts).
17 http://ontobuilder.bitbucket.org/
18 http://doddle-owl.sourceforge.net/en/
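The kind of gold-standard comparison performed here can be made concrete as lexical precision and recall over concept labels; the sketch below uses toy sets, not the actual Text2Onto or LT4eL data:

```python
def lexical_precision_recall(extracted, gold):
    """Share of extracted concepts found in the gold ontology (precision)
    and share of gold concepts that were extracted (recall)."""
    hits = extracted & gold
    return len(hits) / len(extracted), len(hits) / len(gold)

# toy concept sets for illustration only
gold = {"Editor", "Texteditor", "Tabelle", "Zelle"}
extracted = {"Editor", "Tabelle", "xyzstring"}

p, r = lexical_precision_recall(extracted, gold)
print(round(p, 2), round(r, 2))   # 0.67 0.5
```

Matching on exact labels is the simplest variant; a fuller evaluation would also credit near-matches and hierarchy placement.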
4.2. OntoLT vs. Text2Onto<br />
The OntoLT ontology consisted of 3939 concepts, 2565 subconcept relations, 105 further relations and no instances. 829 concepts, 299 subconcept relations and 87 further relations were considered domain-relevant. The ontology covers ca. 58 % of all domain-relevant terms of the corpus. Many relevant concepts are missing, because the system only extracted terms which appeared together with a modifier in the text.
The comparison of both semi-automatically extracted ontologies showed that OntoLT had more problems detecting acronyms, whereas Text2Onto often failed to identify compounds. The degree of coverage of domain-relevant terms was similar.
It turns out that both systems need to be improved. Especially Text2Onto extracts an enormous number of irrelevant concept candidates, so that the user has to spend a lot of time deleting them. In general, the underlying algorithms are not adequate for identifying suitable items, because they are based on statistical methods: the domain relevance of a term need not depend on the number of its occurrences in a text corpus (Lame, 2004).
5. References<br />
Ahrens, M. (2010): Semi-automatische Generierung einer OWL-Ontologie aus domänenspezifischen Texten. Diploma thesis.
Buitelaar, P., Olejnik, D., Sintek, M. (2004): A Protégé Plug-in for Ontology Extraction from Text. In: Proc. of the 1st European Semantic Web Symposium.
Cimiano, P., Völker, J. (2005): Text2Onto – A Framework for Ontology Learning and Data-driven Change Discovery.
Hatala, M., Siadaty, M., Gasevic, D., Jovanovic, J., Torniai, C. (2009): Utility of Ontology Extraction Tools in the Hands of Educators. In: Proc. of the ICSC, USA.
Lame, G. (2004): Using NLP Techniques to Identify Legal Ontology Components. In: Artificial Intelligence and Law 12, Nr. 4, pp. 379-396.
Mossel, E. (2007): Crosslingual Ontology-Based Document Retrieval. In: Proc. of the RANLP 2007.
Park, J., Cho, W., Rho, S. (2010): Evaluating ontology extraction tools using a comprehensive evaluation framework. In: Data & Knowledge Engineering 69, pp. 1043-1061.
System Presentations
Multilingual Resources and Multilingual Applications - System Presentations<br />
New and future developments in EXMARaLDA<br />
Thomas Schmidt, Kai Wörner, Hanna Hedeland, Timm Lehmberg<br />
<strong>Hamburger</strong> <strong>Zentrum</strong> <strong>für</strong> <strong>Sprachkorpora</strong> (HZSK)<br />
Max Brauer-Allee 60<br />
D-22765 Hamburg<br />
E-mail: thomas.schmidt@uni-hamburg.de, kai.wörner@uni-hamburg.de, hanna.hedeland@uni-hamburg.de,<br />
timm.lehmberg@uni-hamburg.de<br />
Abstract<br />
We present some recent and planned future developments in EXMARaLDA, a system for creating, managing, analysing and<br />
publishing spoken language corpora. The new functionality concerns the areas of transcription and annotation, corpus management,<br />
query mechanisms, interoperability and corpus deployment. Future work is planned in the areas of automatic annotation,<br />
standardisation and workflow management.<br />
Keywords: annotation tools, corpora, spoken language, digital infrastructure<br />
1. Introduction
EXMARaLDA¹ (Schmidt & Wörner, 2009) is a system
for creating, managing, analysing and publishing spoken<br />
language corpora. It was developed at the Research<br />
Centre on Multilingualism (SFB 538) between 2000 and<br />
<strong>2011</strong>. EXMARaLDA is based on a data model for<br />
time-aligned multi-layer annotations of audio or video<br />
data, following the general idea of the annotation graph<br />
framework (Bird & Liberman, 2001). It uses open<br />
standards (XML, Unicode) for data storage, is largely<br />
compatible with many other widely used media<br />
annotation tools (e.g. ELAN, Transcriber, CLAN) and<br />
can be used with all major operating systems (Windows,<br />
Macintosh, Linux). The principal software components<br />
of the system are a transcription editor (Partitur-Editor), a<br />
corpus management tool (Corpus Manager) and a KWIC<br />
concordancing tool (EXAKT).<br />
EXMARaLDA has been used to construct the corpus<br />
collection of the Research Centre on Multilingualism<br />
comprising 23 multilingual corpora of spoken language<br />
(see Hedeland et al., this volume). It is also used for<br />
several larger corpus projects outside Hamburg such as the METU corpus of Spoken Turkish² (Middle East Technical University Ankara, see Ruhi et al., this
1 http://www.exmaralda.org<br />
2 http://std.metu.edu.tr/<br />
volume), the GEWISS corpus of spoken academic discourse³ (Universities of Leipzig, Wroclaw and Aston), the Corpus of Northern German Language Variation⁴ (SiN – Universities of Hamburg, Bielefeld, Frankfurt/O., Münster, Kiel and Potsdam) and the Corpus of Spoken Language in the Ruhrgebiet⁵ (KgSR, University of Bochum).
This paper focuses on new functionality added or<br />
improved during the last two years and sketches some<br />
plans for the future development of the system.<br />
2. New and improved functionality<br />
2.1. Transcription and annotation<br />
The Partitur-Editor now provides additional support for<br />
time alignment of transcription and audio and/or video in<br />
the form of a time-based visualisation of the media signal.<br />
Navigation in this visualization is synchronised with<br />
navigation in the transcript, and the visualization can be<br />
used to specify the temporal extent of new annotations<br />
and to modify the start and end points of existing<br />
annotations. This has turned out to be a way to significantly
improve transcription speed and accuracy.<br />
3 https://gewiss.uni-leipzig.de/de/
4 http://sin.sign-lang.uni-hamburg.de/drupal/
5 http://www.ruhr-uni-bochum.de/kgsr/
Similarly, systematic manual annotation with (closed) tag<br />
sets is now supported through a configurable annotation<br />
panel which allows the user to define one or several<br />
hierarchical tag sets, assign tags to keyboard shortcuts<br />
and link them to specific labels of annotation layers. It is<br />
also possible to specify dependencies between different<br />
tag sets so that the user is offered only those tags which<br />
are applicable in a certain context. Annotation speed and<br />
consistency can thus be improved considerably.<br />
Figure 1: Annotation Panel in the Partitur-Editor<br />
For large scale standoff annotation of corpora, a separate<br />
tool – Sextant (Standoff EXMARaLDA Transcription<br />
Annotation Tool, Wörner, 2010) – was added to the<br />
system’s tool suite. Sextant can be used to efficiently add<br />
standoff tags from closed tag sets to a segmented<br />
EXMARaLDA transcription. Annotations are stored as<br />
TEI conformant feature structures which point into<br />
transcriptions via ID references. For further processing,<br />
the standoff annotation can also be integrated into the<br />
main file.<br />
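The principle of standoff annotation via ID references can be illustrated in Python: tags live in a separate structure and point to transcription segments by ID. The dictionary layout is a deliberate simplification, not Sextant's actual TEI feature-structure format:

```python
# segmented transcription: each segment has a stable ID
transcription = {"seg1": "ja", "seg2": "genau", "seg3": "ähm"}

# standoff annotations point into the transcription via ID references
standoff = [
    {"ref": "seg1", "tag": "response"},
    {"ref": "seg3", "tag": "hesitation"},
]

def merge(transcription, standoff):
    """Integrate standoff tags into the main structure for further processing."""
    merged = {sid: {"text": text, "tags": []}
              for sid, text in transcription.items()}
    for ann in standoff:
        merged[ann["ref"]]["tags"].append(ann["tag"])
    return merged

print(merge(transcription, standoff)["seg3"]["tags"])   # ['hesitation']
```

Keeping annotations standoff means the transcription itself never changes, while the merge step corresponds to the optional integration into the main file mentioned above.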
2.2. Corpus management<br />
The Corpus Manager was supplemented with a set of<br />
operations to aid in the maintenance of transcriptions,<br />
recordings and metadata. This includes functionality for<br />
checking the structural consistency (e.g. temporal<br />
integrity of time-alignment, correct assignment of<br />
annotations to primary layers etc.), the validity of<br />
transcriptions with respect to a given transcription<br />
convention, as well as the completeness and consistency<br />
of metadata descriptions. Furthermore, a set of analysis<br />
functions operating on a corpus as a whole was added.<br />
Users can now generate and manipulate global<br />
type/token and frequency lists for a given corpus,<br />
perform global search and replace routines or generate<br />
corpus statistics according to different parameters. These<br />
new features are intended to facilitate both corpus<br />
construction and corpus use.<br />
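One of the consistency checks mentioned above, the temporal integrity of the time-alignment, can be sketched as follows; the data layout (a timeline of named points and events spanning two points) is an illustrative simplification of the actual transcription model.

```python
# Sketch of a structural consistency check: timeline points must be
# strictly increasing, and every event must span a positive duration.
def check_temporal_integrity(timeline, events):
    """timeline: {point_id: time} in document order;
    events: [(start_id, end_id), ...].
    Returns a list of problems (empty list = consistent)."""
    problems = []
    times = list(timeline.values())
    if any(b <= a for a, b in zip(times, times[1:])):
        problems.append("timeline not strictly increasing")
    for start, end in events:
        if start not in timeline or end not in timeline:
            problems.append(f"event ({start}, {end}) uses unknown timeline point")
        elif timeline[end] <= timeline[start]:
            problems.append(f"event ({start}, {end}) has non-positive duration")
    return problems
```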
2.3. Query mechanisms<br />
For the query tool EXAKT, several new features were<br />
added to support the user in formulating complex queries<br />
to a corpus.<br />
A Levenshtein calculation was made available which<br />
selects from a given list of words all entries which are<br />
sufficiently similar to a form selected by the user. This<br />
can help to minimize the risk that (potentially<br />
unpredictable) variants – as are common in spoken<br />
language corpora – are accidentally overlooked in<br />
queries.<br />
Figure 2: Word list with Levenshtein functionality in<br />
EXAKT<br />
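A minimal sketch of such a similarity filter, assuming a plain edit-distance threshold (EXAKT's actual similarity parameters may differ):

```python
# Levenshtein-based word list filter: keep all entries of a word list
# whose edit distance to a selected form is at most max_dist.
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similar_forms(word_list, selected, max_dist=2):
    """All word list entries sufficiently similar to the selected form."""
    return [w for w in word_list if levenshtein(w, selected) <= max_dist]
```

Applied to a spoken-language word list, this catches variants such as clippings or dialectal spellings of a form that an exact-match query would overlook.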
A regular expression library can now be used to store and<br />
retrieve common queries. This is meant mainly as a help<br />
for those users who are not experts in the design of formal<br />
queries.<br />
Through an extension of the original KWIC functionality,<br />
EXAKT is now also able to carry out queries across two<br />
or more annotation layers. This is achieved by adding one<br />
or more so called annotation columns in which<br />
annotation data from a specified annotation level<br />
overlapping with the existing search results are added to<br />
the concordance. The type of overlap between<br />
annotations can be specified as illustrated in figure 3.
Figure 3: Specifying the overlap type for a multilevel search in EXAKT
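The overlap relation between a search hit and an annotation span reduces to simple interval arithmetic; the relation names below are illustrative, not EXAKT's actual terminology.

```python
# Classify how an annotation span relates to a search hit on the
# transcription timeline (intervals as (start, end) pairs).
def overlap_type(hit, ann):
    hs, he = hit
    as_, ae = ann
    if ae <= hs or he <= as_:
        return "none"                      # spans do not overlap
    if as_ == hs and ae == he:
        return "identical"
    if as_ <= hs and he <= ae:
        return "hit inside annotation"
    if hs <= as_ and ae <= he:
        return "annotation inside hit"
    return "partial"                       # spans overlap at one edge only
```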
2.4. Interoperability<br />
Much work has been invested to further improve and<br />
optimise EXMARaLDA’s compatibility with other<br />
widely used transcription and annotation tools. Wizards<br />
for importing entire corpora from Transcriber, FOLKER,<br />
CLAN and ELAN were integrated into EXAKT, thereby considerably extending the tool’s area of application.
Moreover, a proposal for a spoken language transcription<br />
standard based on the P5 version of the TEI guidelines<br />
was formulated (Schmidt, 2011), and a droplet
application (TEI-Drop) was added to the EXMARaLDA<br />
toolset which enables users to easily transform<br />
Transcriber, FOLKER, CLAN, ELAN or EXMARaLDA<br />
files into this TEI conformant format.<br />
Figure 4: Screenshot of TEI-Drop<br />
2.5. Corpus deployment<br />
Completed EXMARaLDA corpora can now also be made<br />
available (i.e. queried) via a relational database with<br />
EXAKT. Compared to the deployment in the form of<br />
individual XML files which are then queried either<br />
locally or via http with EXAKT, this method not only<br />
facilitates data access, but also considerably improves<br />
query performance (by a factor of about 10 for smaller<br />
corpora, probably more for larger corpora) and allows for<br />
a more fine-grained access management. Furthermore,<br />
making data available in this way is also a prerequisite<br />
for integrating EXMARaLDA data into evolving<br />
distributed infrastructures like CLARIN.<br />
With the general availability of HTML5, methods for<br />
visualizing corpus data for web presentations could also<br />
be simplified and improved considerably. The integration<br />
of transcription text and underlying audio or video<br />
recording now no longer depends on Flash technology,<br />
but can be efficiently realised with standard browser<br />
technology.<br />
3. Future work<br />
With the end of the maximum funding period of the<br />
Research Centre on Multilingualism in June 2011,
EXMARaLDA’s original context of development has<br />
also ceased to exist. Although the system is now in a<br />
stable state and should remain usable for quite some time<br />
with some minimal maintenance work, we still see much<br />
potential for future development in at least three areas.<br />
3.1. Automatic annotation<br />
Additional manual and automatic annotation methods are<br />
required in order to make spoken language corpora more<br />
useful for corpus linguistic research. We have<br />
consequently started to explore the application of<br />
methods developed for written language, such as<br />
automatic part-of-speech tagging or lemmatisation, to
EXMARaLDA corpora.<br />
First tests were carried out on the Hamburg Map Task<br />
Corpus (HAMATAC, Hedeland & Schmidt, 2012) with<br />
TreeTagger (Schmid, 1995), which was integrated via the<br />
TT4J interface (Eckart de Castilho et al., 2009) into<br />
EXMARaLDA. HAMATAC was POS-tagged and<br />
lemmatised with the default German parameter file,<br />
trained on written newspaper texts. The data were first<br />
tokenized using EXMARaLDA’s segmentation<br />
functionality which segments and distinguishes words,<br />
punctuation, pauses and non-phonological segments.<br />
Only words and punctuation were fed as input into the<br />
tagger in the sequence in which they occur in the<br />
transcription. The tagging results were saved as<br />
EXMARaLDA standoff annotation files which can be<br />
further processed in the Sextant tool (see above). A<br />
student assistant was instructed to manually check and<br />
correct all POS tags. An evaluation shows that roughly<br />
80% of POS tags were assigned correctly. The error rate<br />
is thus considerably higher than for the best results which<br />
can be obtained on written texts (about 97% correct tags).<br />
By far most of the tagging errors, however, occurred with
word forms which are specific to spoken language, such<br />
as hesitation markers (“äh”, “ähm”), interjections and<br />
incomplete forms (cut-off words). Since especially the<br />
former are highly frequent but very limited in form (the three forms äh, ähm and hm account for about half of the
tagging errors), we expect a retraining of the TreeTagger<br />
parameter file on the corrected data to lead to a much<br />
lower error rate.<br />
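The evaluation described above amounts to comparing automatic tags with the manually corrected ones and counting which forms account for the errors. The sketch below illustrates this; the token data and tag labels are invented for illustration, not taken from HAMATAC.

```python
# Compare automatic POS tags against corrected (gold) tags and count
# which word forms the errors concentrate on.
from collections import Counter

def evaluate(tokens):
    """tokens: [(form, auto_tag, gold_tag), ...] -> (accuracy, error counts)."""
    errors = Counter(form.lower() for form, auto, gold in tokens if auto != gold)
    accuracy = 1 - sum(errors.values()) / len(tokens)
    return accuracy, errors

# Invented example tokens (tags illustrative only).
tokens = [("äh", "NN", "ITJ"), ("die", "ART", "ART"),
          ("Karte", "NN", "NN"), ("ähm", "NN", "ITJ"),
          ("liegt", "VVFIN", "VVFIN")]
accuracy, errors = evaluate(tokens)
```

Ranking `errors` by frequency is how one would verify that a handful of hesitation forms dominate the error mass, which motivates the retraining suggested above.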
3.2. Standardisation<br />
Further work in standardisation of data models, metadata<br />
descriptions, file formats and transcription conventions is<br />
needed in order to integrate spoken language data on<br />
equal footing with written data into the language resource<br />
landscape. EXMARaLDA as one of the most<br />
interoperable systems of its kind already provides a solid<br />
basis for developing and establishing such standards.<br />
Future work in this area should attempt to consolidate<br />
this basis with more general approaches like the<br />
guidelines of the Text Encoding Initiative,<br />
standardisation efforts within the ISO framework and<br />
emerging standards for digital infrastructures.<br />
3.3. Workflow management<br />
As we survey, train and support users in constructing and<br />
analysing spoken language corpora with EXMARaLDA,<br />
we observe how important it is to organise the tools’<br />
functionalities into an efficient workflow. Right now, the<br />
EXMARaLDA tools operate in a standalone fashion on<br />
local file systems, leaving many important aspects of the
workflow (e.g. version control, consistency checking etc.)
to the users. A tight integration of the tools
with a repository solution may make it much easier,<br />
especially for larger projects, to organise their workflows<br />
and construct and publish their corpora in a maximally<br />
efficient and effective manner. We plan to explore this<br />
possibility further in the follow-up projects at the<br />
Hamburg Centre for Language Corpora (HZSK). 6<br />
4. Acknowledgements<br />
Work on EXMARaLDA was funded by the University of<br />
Hamburg and by grants from the Deutsche<br />
Forschungsgemeinschaft (DFG).<br />
5. References<br />
Bird, S., Liberman, M. (2001): A formal framework for<br />
linguistic annotation. In: Speech Communication (33),<br />
pp. 23-60.<br />
Eckart de Castilho, R., Holtz, M., Teich, E. (2009):<br />
Computational support for corpus analysis work flows:<br />
The case of integrating automatic and manual<br />
annotations. In: Linguistic Processing Pipelines
Workshop at GSCL 2009 - Book of Abstracts<br />
(electronic proceedings), October 2009.<br />
Hedeland, H., Schmidt, T. (2012): Technological and<br />
methodological challenges in creating, annotating and<br />
sharing a learner corpus of spoken German. To appear
in: Schmidt, T., Wörner, K.: Multilingual Corpora and
Multilingual Corpus Analysis (Hamburg Studies in
Multilingualism). Amsterdam: Benjamins.
Schmid, H. (1995): Improvements in Part-of-Speech<br />
Tagging with an Application to German. Proceedings<br />
of the ACL SIGDAT-Workshop. March 1995.<br />
Schmidt, T., Wörner, K. (2009): EXMARaLDA –<br />
Creating, analysing and sharing spoken language<br />
corpora for pragmatic research. In: Pragmatics (19:4),<br />
pp. 565-582.<br />
Schmidt, T. (2011): A TEI-based approach to
standardising spoken language transcription. In:<br />
Journal of the Text Encoding Initiative (1).<br />
Wörner, K. (2010): Werkzeuge zur flachen Annotation<br />
von Transkriptionen gesprochener Sprache. PhD<br />
Thesis, Universität Bielefeld,
http://bieson.ub.uni-bielefeld.de/volltexte/2010/1669/.<br />
6 http://www.corpora.uni-hamburg.de
The VLC Language Index
Dirk Schäfer, Jürgen Handke
Institut für Anglistik und Amerikanistik, Philipps-Universität Marburg
Wilhelm-Röpke-Straße 6D
E-mail: {dirk.schaefer,handke}@staff.uni-marburg.de
Abstract
The Language Index is a collection of audio data from the world’s languages. As part of the online learning platform “Virtual
Linguistics Campus”, the Language Index presents speech recordings in a standardised form, together with typological information,
using web technologies, so that the data can be used for analysis, e.g. in teaching.
Keywords: audio corpus, typology, web
1. Overview
The Language Index, part of the online learning platform
“Virtual Linguistics Campus”, is a collection of
structured audio data from the world’s languages. In this
system demonstration we show how the data are
presented and how researchers can use the available
audio recordings. The remainder of this article describes
the data format for the speech recordings and the user
interfaces.
2. Creating Speech Recordings
The speech recordings constitute a parallel corpus, since
every speaker produced the same words, phrases and
sentences. For this purpose there is a collection of
standardised data sheets, which is extended whenever a
new language is added. For some languages, several
slightly diverging data sheets exist, because speakers
translated the data according to their regional dialect. At
present we have data sheets for 110 languages and
regional dialects, as well as recordings from 850
speakers.
2.1. Ensuring the Quality of Speech Recordings
To ensure the quality of the speech recordings, the
following procedure has proven effective:
a. The speaker checks the existing data sheet for his or
her language. If no data sheet is available for the
language, the speaker translates the keywords and
sentences.
b. If the language has no writing system, the data sheets
are created on the basis of the IPA alphabet in
interaction with the speaker.
c. The speaker reads the keywords and sentences aloud
at normal speed. The recording is made on site with a
digital recording device, over the web using Skype, or
with a headset on the speaker’s home computer.
d. The recorded speech data are post-processed and
supplied with cue points.
e. The complete recording, together with its transcription
and transliteration, is presented to the speaker for
verification.
f. The recording is made available via the VLC Language
Index.
Figure 1: User interface
3. User Interface
Representing such audio-based parallel corpora poses
particular difficulties. For example, the interface must be
simple to use, giving fast access to all desired data
without a learning period. Moreover, it lies in the nature
of a parallel corpus that means of comparison must be
provided.
The Language Index is a web-based application with
substantial Flash and Flex components. Since 2006, the
Google Maps API has been used to display language
data on maps. As the amount of data grew, the use of a
database became necessary, and special techniques had
to be employed to ensure performant communication
between PHP and the Flex-based user interfaces.
The audio data in the VLC Language Index can be
accessed in several ways:
• a list of speech recordings, sorted by language;
• a Google map on which each recording is shown as a
pin; clicking on a pin opens a popup window;
• a filter interface in which syntactic, morphological,
phonological and other parameters can be set.
4. Special Features
Additional features can be realised with the data of the
parallel corpus. With the “Cognate Comparison” tool,
users can select a cognate and compare it acoustically
across languages by choosing pins on a map or entries in
a list.
“Acoustic Vowel Charts” visualise the frequencies of the
same vowels as produced by different speakers.
5. Outlook
The VLC Language Index can be used in a variety of
scenarios, including teaching, theses, and research at
Master’s and PhD level.
The most recent feature is an mp3 download service with
a bibliographic referencing system for all language data,
so that the data can easily be used in other tools and
attached to scholarly work based on them.
6. Weblink
Virtual Linguistics Campus (VLC):
http://www.linguistics-online.de
Figure 2: Acoustic Vowel Chart
Topological Fields, Constituents and Coreference:<br />
A New Multi-layer Architecture for TüBa-D/Z<br />
Thomas Krause*, Julia Ritz+, Amir Zeldes*, Florian Zipser*‡<br />
* Humboldt-<strong>Universität</strong> zu Berlin, Unter den Linden 6, 10099 Berlin<br />
+ <strong>Universität</strong> Potsdam, Karl-Liebknecht-Straße 24-25, 14476 Potsdam<br />
‡ INRIA<br />
E-mail: krause@informatik.hu-berlin.de, jritz@uni-potsdam.de, amir.zeldes@rz.hu-berlin.de, f.zipser@gmx.de<br />
Abstract<br />
This presentation is concerned with a new multi-layer representation of the German TüBa-D/Z Treebank, which allows users to<br />
conveniently query and visualize annotations for syntactic constituents, topological fields and coreference either separately or in<br />
conjunction.<br />
Keywords: corpus search tool, multi-layer annotation, treebank, German<br />
1. The Original Corpus<br />
The TüBa-D/Z corpus (Tübinger Baumbank des<br />
Deutschen / Zeitungskorpus, Telljohann et al., 2003)<br />
was already at its release, in a sense, a multi-layer<br />
corpus, since it combined information about constituent<br />
syntax with topological field annotation. However, the<br />
corpus was originally constructed using the TigerXML<br />
format (Mengel & Lezius, 2000), which only allowed<br />
for one type of internal node: the syntactic category,<br />
which was used to express both types of annotation.<br />
Figure 1 shows a representation of a sentence from the<br />
corpus in the TigerSearch tool (Lezius, 2002).<br />
Though the layers of topological and constituent syntax<br />
annotations are in principle separate, users must take<br />
into account the intervening topological nodes when<br />
formulating syntactic queries, and vice versa.<br />
With the addition of coreference annotations in
Version 5 of the corpus (Hinrichs et al., 2004), which<br />
were created using the MMAX tool (see Müller &<br />
Strube, 2006 for the latest version), the facilities of the<br />
TigerSearch software could no longer be used to search<br />
through all annotations of the different layers (syntax,<br />
coreference and topological fields), since TigerSearch<br />
indexes only individual sentences, whereas coreference<br />
annotations require a full document context.<br />
Figure 1: TüBa-D/Z Version 3 in TigerSearch. Topological fields (VF: Vorfeld, LK:<br />
linke Klammer, MF: Mittelfeld) are represented as syntactic categories.<br />
2. The New Architecture<br />
Our goal is to make all existing layers of annotation<br />
available for simultaneous search, but in a way that<br />
allows each one to be searched separately without<br />
intervening nodes from other annotation layers. For this<br />
purpose, we have converted the latest Version 6 of<br />
TüBa-D/Z to the multi-layer XML format PAULA<br />
(Dipper, 2005). We then converted and edited the corpus<br />
using the SaltNPepper converter framework (Zipser &<br />
Romary, 2010), which gives us an in-memory<br />
representation of the corpus that can be manipulated<br />
more easily. During this step, we disentangled the<br />
syntactic, topological and coreference annotations. The<br />
resulting corpus was then exported and fed into ANNIS<br />
(Zeldes et al., 2009), a corpus search and visualization<br />
tool for multi-layer corpora. The resulting annotation<br />
layers are visualized in Figure 2, which shows a separate<br />
syntax tree (without topological fields), spans<br />
representing fields, and a full document view for the<br />
coreference annotation in which coreferent expressions<br />
are highlighted in the same color.<br />
Figure 2: TüBa-D/Z in ANNIS with separate<br />
annotation layers.<br />
3. Corpus Search<br />
Using the new architecture and the ANNIS Query<br />
Language (AQL) 1 it becomes possible to query syntax,<br />
topological fields and coreference more easily and<br />
intuitively, both simultaneously and separately. In the<br />
following, we will discuss three example applications<br />
briefly: one investigating topological fields only, one<br />
combining all three annotation layers, and one extracting<br />
syntactic frames with the help of the exporter<br />
functionality in ANNIS.<br />
3.1. Application 1: Topological Fields<br />
As a simple example of the easily accessible topological<br />
field information, we can consider the following query,<br />
which retrieves clauses before the left sentence bracket,<br />
in the main-clause preverbal domain (Vorfeld, VF),<br />
which contain two complementizer fields (C) one after
another (the operator ‘>’ represents dominance, and ‘.*’
represents indirect precedence, the numbers ‘#1’ etc.<br />
refer to the nodes declared at the beginning of the query,<br />
in order):<br />
(1) field="VF" & field="C" & field="C"<br />
& #1 > #2 & #1 > #3 & #2 .* #3<br />
Figure 3 shows an example result with its separate field<br />
grid and syntax tree, for the sentence: Daß und wie<br />
Demokratie funktionieren kann, hat der zähe Kampf der<br />
Frauen um den Paragraphen 218 gezeigt ‘The women’s<br />
tenacious fight for paragraph 218 has shown that, and<br />
how, democracy can work.’ By directly querying the<br />
topological fields we can avoid having to consider<br />
possible syntactic nodes intervening between VF and C.<br />
3.2. Application 2: Coreference, Syntax and Fields<br />
Let us first search for objects that are cataphors,
but not reflexive pronouns. In TüBa-D/Z, cataphors are<br />
linked to their subsequents via the ‘cataphoric’ relation.<br />
The AQL expression is given in (2a): there is a node –<br />
any node – number 1 and another node, number 2, and<br />
node 1 points to node 2 using the ‘cataphoric’ relation.<br />
(2a.) node & node & #1 ->cataphoric #2<br />
We now add syntactic constraints: the cataphor, node 1,<br />
shall be an object (OA or OD, i.e. accusative or dative).<br />
In TüBa-D/Z, the grammatical function of a noun phrase<br />
1 A tutorial of the query language can be found at<br />
http://www.sfb632.uni-potsdam.de/~d1/annis/.
Figure 3: Separate fields and syntactic phrases for VF with two C positions<br />
(NX) is specified as a label of the dominance edge<br />
connecting this NX and its parent.<br />
(2b.) node & node & #1 ->cataphoric<br />
#2 & cat="NX" & #3 _=_ #1 & cat & #4<br />
>[func=/O[AD]/] #3 & pos!="PRF" & #5<br />
_=_ #1<br />
(read: there is a node, number 3, of category NX, and<br />
node 3 covers the same tokens as node 1, and there is a<br />
node of any category, and this node number 4 dominates<br />
‘>’ node number 3, with the edge label ‘func’ (function)<br />
= OA or OD. We use regular expressions to specify the<br />
label. To exclude reflexive pronouns (part of speech<br />
‘PRF’), we use negation (‘!=’)). The search yields 51<br />
results, with scalable contexts and color-highlighting of<br />
the matches (cataphors and their subsequents).<br />
Secondly, let us query for noun phrases in the VF, with a
definite determiner and their antecedents in the
left-neighbour sentences.
(2c.) field="VF" & cat="NX" & #1 _=_<br />
#2 & pos="ART" & #2 > #3 &<br />
tok=/[Dd]../ & #3 _=_ #4 & node & #5<br />
_=_ #2 & node & #5 ->coreferential #6<br />
& cat="TOP" & #7 _i_ #1 & cat="TOP" &<br />
#8 _i_ #6 & #8 . #7<br />
(‘>*’ represents indirect dominance)<br />
This query yields 766 results. Using the match counts of<br />
(2c.) and similar queries, we can create a contingency
table of definite vs. pronominal VF-constituents and
whether their respective antecedents occur in the
left-neighbour sentence (‘close’) or more distantly: 43% of
the definites and 61% of the pronouns in VF have a<br />
‘close’ antecedent – a difference that is highly
significant (χ²=142.72, p < .001).
3.3. Application 3: Syntactic Frames
(3) … & #1 > #2 & #2 > #3 & #1
>[func="OA"] #4 & #4 >[func="HD"] #5
This query searches for a verbal phrase dominated by a<br />
clause (SIMPX) and dominating the lemma schreiben,<br />
where the same clause also dominates a nominal phrase<br />
with the function OA, which in turn dominates its head<br />
noun (pos="NN", func="HD"). Using the built-in<br />
WEKA exporter, we can produce a list of all the nominal<br />
object arguments of a verb much like in a dependency
treebank, along with the form and part-of-speech of the
relevant verb, as shown in Figure 4. Note that both finite
and non-finite clauses are found, as well as verb-second
and verb-final clauses, which now all have similar tree
structures regardless of topological fields.
'271192', 'wer immer seine Texte schreibt', 'SIMPX',
'271186', 'Texte', 'apm', 'NN', '271187', 'seine Texte',
'NX', '271189', 'schreibt', '3sis', 'VVFIN', '271190',
'schreibt', 'VXFIN'
'1134826', 'Songs schreiben', 'SIMPX', '1134820',
'Songs', 'apm', 'NN', '1134821', 'Songs', 'NX',
'1134823', 'schreiben', '--', 'VVINF', '1134824',
'schreiben', 'VXINF'
'1526561', 'Ich schreibe Satire', 'SIMPX', '1519602',
'Satire', 'asf', 'NN', '1526559', 'Satire', 'NX',
'1519599', 'schreibe', '1sis', 'VVFIN', '1526557',
'schreibe', 'VXFIN'
Figure 4: Excerpt of results from the WEKA Exporter
for query (3).
The exporter gives the values of all annotations for the<br />
nodes we have searched for, in order, as well as the text<br />
covered by those nodes. We can therefore easily get<br />
tabular access to the contents of the clause (e.g. Songs<br />
schreiben ‘to write songs’), the object (Songs), the form<br />
and part-of-speech of the verb (schreiben, VVINF),<br />
morphological annotation (apm for a plural masculine<br />
noun in the accusative), etc.<br />
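Since the exporter rows shown above are plain single-quoted, comma-separated records, they can be read back into fields with the standard `csv` module; the sample row is taken from the excerpt, and this parsing approach is a sketch, not part of ANNIS itself.

```python
# Parse one row of the WEKA exporter output (single-quoted fields,
# comma-separated, with a space after each comma).
import csv
import io

row = ("'271192', 'wer immer seine Texte schreibt', 'SIMPX', "
       "'271186', 'Texte', 'apm', 'NN'")
fields = next(csv.reader(io.StringIO(row), quotechar="'",
                         skipinitialspace=True))
```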
4. Conclusion<br />
We have suggested an advanced, layer-separated<br />
representation architecture for TüBa-D/Z. This<br />
architecture facilitates corpus querying and exploitation.<br />
By means of examples, we have shown that the corpus<br />
search tool ANNIS allows for a qualitative and<br />
quantitative study of the interplay of syntactic,<br />
topological and information structural factors annotated<br />
in TüBa-D/Z.<br />
5. References<br />
Dipper, S. (2005): XML-based Stand-off Representation<br />
and Exploitation of Multi-Level Linguistic<br />
Annotation. Proceedings of Berliner XML Tage 2005<br />
(BXML 2005). Berlin, Germany, pp. 39-50.<br />
Hinrichs, E. W., Kübler, S., Naumann, K., Telljohann,<br />
H., Trushkina, J. (2004): Recent developments in<br />
linguistic annotations of the TüBa-D/Z treebank.<br />
Proceedings of the Third Workshop on Treebanks and<br />
Linguistic Theories.<br />
Lezius, W. (2002): Ein Suchwerkzeug <strong>für</strong> syntaktisch<br />
annotierte Textkorpora. PhD Thesis, Institut <strong>für</strong><br />
maschinelle Sprachverarbeitung Stuttgart.<br />
Mengel, A., Lezius, W. (2000): An XML-based<br />
encoding format for syntactically annotated corpora.<br />
Proceedings of the Second International Conference<br />
on Language Resources and Engineering (LREC<br />
2000). Athens.<br />
Müller, C., Strube, M. (2006): Multi-Level Annotation<br />
of Linguistic Data with MMAX2. In: Braun, Sabine,<br />
Kohn, Kurt & Mukherjee, Joybrato (eds.), Corpus<br />
Technology and Language Pedagogy. Frankfurt: Peter<br />
Lang, pp. 197-214.<br />
Telljohann, H., Hinrichs, E. W., Kübler, S. (2003):<br />
Stylebook for the Tübingen Treebank of Written<br />
German.<br />
Zeldes, A., Ritz, J., Lüdeling, A., Chiarcos, C. (2009):<br />
ANNIS: A Search Tool for Multi-Layer Annotated<br />
Corpora. Proceedings of Corpus Linguistics 2009,<br />
Liverpool, July 20-23, 2009.<br />
Zipser, F., Romary, L. (2010): A model oriented<br />
approach to the mapping of annotation formats using<br />
standards. Proceedings of the Workshop on Language<br />
Resource and Language Technology Standards,<br />
LREC 2010. Malta, pp. 7-18.
MT Server Land Translation Services<br />
Christian Federmann<br />
DFKI – Language Technology Lab<br />
Stuhlsatzenhausweg 3, D-66123 Saarbrücken, GERMANY<br />
E-mail: cfedermann@dfki.de<br />
Abstract<br />
We demonstrate MT Server Land, an open-source architecture for machine translation services that is developed by the MT group at<br />
DFKI. The system can flexibly be extended and allows lay users to make use of MT technology within a web browser or by using<br />
simple HTTP POST requests from custom applications. A central broker server collects and distributes translation requests to several<br />
worker servers that create the actual translations. User access is realized via a fast and easy-to-use web interface or through an<br />
XML-RPC-based API that allows integrating translation services into external applications. We have implemented worker servers<br />
for several existing translation systems such as the Moses SMT decoder or the Lucy RBMT engine. We also show how other,<br />
web-based translation tools such as Google Translate can be integrated into the MT Server Land application. The source code is<br />
published under an open BSD-style license and is freely available from GitHub.<br />
Keywords: Machine Translation, Web Service, Translation Framework, Open-Source Tool<br />
1. Introduction<br />
Machine translation (MT) is a field of active research<br />
with lots of different MT systems being built for shared<br />
tasks and experiments. The step from the research<br />
community towards real-world application of available<br />
technology requires easy-to-use MT services that are<br />
available via the Internet and allow collecting feedback<br />
and criticism from real users. Such applications are<br />
important means to increase the visibility of MT research and
to help shape the multi-lingual web. Applications such
as Google Translate 1<br />
allow lay users to quickly and<br />
effortlessly create translations of texts or even complete<br />
web pages; the continued success of such services shows<br />
the potential that lies in usable machine translation,<br />
something both developers and researchers should be<br />
targeting.<br />
In the context of ongoing MT research projects at DFKI's<br />
language technology lab, we have decided to design and<br />
implement such a translation application. We have<br />
released the source code under a permissive open-source<br />
license and hope that it becomes a useful tool for the<br />
MT community. A screenshot of the MT Server Land<br />
application is shown in Figure 1.<br />
1 http://translate.google.com<br />
Figure 1: Screenshot of MT Server Land<br />
2. System Architecture<br />
The system consists of two different layers: first, we have<br />
the so-called broker server that handles all direct<br />
requests from end users or via API calls alike. Second, we<br />
have a layer of worker servers, each implementing some<br />
sort of machine translation functionality. All<br />
communication between users and workers is channeled<br />
through the broker server that acts as a central “proxy”<br />
server. An overview of the system architecture is given in<br />
Figure 2.<br />
For users, both broker and workers “constitute” the MT<br />
Server Land system; the broker server is the “visible”<br />
part of the application while the various worker servers<br />
perform the “invisible” translation work. The system has<br />
been designed to make it easier for lay users to access and<br />
use machine translation technology without the need to<br />
fully dive into the complexities of current MT research.<br />
Within MT Server Land, translation functionality is made available by starting up a suitable worker server instance for a specific MT engine. The startup process for workers is standardized using a small set of easy-to-understand parameters, e.g., the hostname/IP address or port number of the worker server process. All “low-level” work (de-/serialization, transfer of requests/results, etc.) is handled by the worker server instances. Of course, it is possible to design and create new worker server instances, e.g., to demonstrate new features in a research translation system or to integrate other MT systems.
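The standardized startup parameters described above can be sketched with a standard argument parser. The flag names below are illustrative assumptions for this sketch, not the actual MT Server Land command-line options.

```python
import argparse

def parse_worker_args(argv=None):
    """Parse standardized worker startup parameters (hypothetical flag names)."""
    parser = argparse.ArgumentParser(description="Start an MT worker server.")
    parser.add_argument("--host", default="127.0.0.1",
                        help="hostname/IP address the worker binds to")
    parser.add_argument("--port", type=int, default=8080,
                        help="port number of the worker server process")
    parser.add_argument("--source", required=True,
                        help="source language code, e.g. 'de'")
    parser.add_argument("--target", required=True,
                        help="target language code, e.g. 'en'")
    return parser.parse_args(argv)
```

A worker for German-to-English translation might then be launched with something like `--host 10.0.0.5 --port 9000 --source de --target en`.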
Human users connect to the system using any modern web browser; API access can be implemented using HTTP POST and/or XML-RPC requests. It would be relatively easy to extend the current API to support other protocols such as SOAP or REST. By design, all internal method calls that connect to the worker layer are implemented with XML-RPC. In order to prevent encoding problems with the input text, all data sent between broker and workers is encoded as Base64 strings; the broker server takes care of the necessary conversion steps. Translation requests are converted into serialized, binary strings using Google protocol buffer compilation.
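The Base64 conversion step between broker and workers can be illustrated as follows. This is a minimal sketch assuming UTF-8 as the text encoding, not the broker's actual code.

```python
import base64

def encode_for_worker(text: str) -> str:
    """Encode source text as a Base64 string before sending it over XML-RPC."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def decode_from_worker(payload: str) -> str:
    """Decode a Base64 payload received from a worker back into text."""
    return base64.b64decode(payload.encode("ascii")).decode("utf-8")
```

The round trip preserves arbitrary Unicode input, so text such as "Übersetzung" survives the transfer unchanged regardless of the transport's own encoding assumptions.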
Figure 2: Architecture overview of MT Server Land
2.1. Broker Server

The broker server has been implemented using the django web framework², which takes care of low-level tasks and allows for rapid development and a clean component design. We have used the framework in other projects before and consider it well suited to the task. The framework itself is available under an open-source BSD license.
2.1.1. Object Models

The broker server implements two main django models, which we describe below. Note that we have also developed additional object models, e.g. for quota management; see the source code for more information.

• WorkerServer stores all information related to a remote worker server. This includes source and target language, the respective hostname and port address, as well as a name and a short description. Available worker servers within MT Server Land can be constrained to work for specific user and/or API accounts only.

• TranslationRequest models a translation job and related information such as the chosen worker server, the source text and the assigned request id. Furthermore, we store additional metadata. Once the translation result has been obtained from the translation worker server, it is also stored within the instance so that it can be removed from the worker server's job queue.
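The two models described above can be sketched in plain Python as follows. The actual system defines these as django models; the field names here are inferred from the description and may differ from the source code.

```python
from dataclasses import dataclass, field
from typing import Optional
import uuid

@dataclass
class WorkerServer:
    """Information about a remote worker server (sketch of the django model)."""
    name: str
    description: str
    hostname: str
    port: int
    source_language: str
    target_language: str

@dataclass
class TranslationRequest:
    """A translation job: chosen worker, source text, assigned request id."""
    worker: WorkerServer
    source_text: str
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    result: Optional[str] = None  # filled in once the worker has finished
```

Storing `result` on the request instance is what allows the broker to drop the job from the worker server's queue once the translation has been fetched.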
2.1.2. User Interface

We developed a browser-based web interface to access and use the MT Server Land application. End users first have to authenticate before they can access their dashboard, which lists all known translation requests for the current user and also allows creating new requests. When creating a new translation request, the user may choose which translation worker server should be used to generate the translation for the chosen language pair. A validation step ensures that the respective worker server supports the selected language pair and is currently able to receive new translation requests from the broker server; after successful validation, the new translation request is sent to the worker server, which starts processing the given source text.
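The validation step can be sketched as a single check. The worker record shape and the `is_busy` flag are assumptions for illustration; the broker's actual checks live in its django views.

```python
def validate_request(worker, source_lang, target_lang):
    """Return True iff the worker serves the requested language pair
    and can currently accept new translation requests."""
    supports_pair = (worker["source"] == source_lang
                     and worker["target"] == target_lang)
    can_accept = not worker.get("is_busy", False)
    return supports_pair and can_accept
```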
² http://www.djangoproject.com/
Once the chosen worker server has completed a translation request, the result is transferred to (and also cached by) the object instance inside the broker server's data storage. The user can view the result within the dashboard or download the file to a local hard disk. Translation requests can be deleted at any time, effectively terminating the corresponding thread within the connected worker server (if the translation is still running). If an error occurs during translation, the system recognizes this and deactivates the respective translation request.
2.1.3. API Interface

In parallel to the browser interface, we have designed and implemented an API that allows applications to connect to the MT functionality provided by our application using HTTP POST requests. Again, authentication is required before any machine translation can be used. We provide methods to list all requests for the current “user” (i.e. the application account) and to create, download, or delete translation requests. An extension to REST or SOAP protocols is possible.
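A connecting application might build its HTTP POST body along the following lines. The endpoint path and field names are hypothetical illustrations, not the documented MT Server Land API.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real broker URL and path will differ.
API_CREATE_URL = "http://broker.example.org/api/requests/create/"

def build_create_payload(token, worker_name, source_text):
    """Serialize the form fields of a create-request POST call."""
    fields = {
        "token": token,          # application account credential (assumed name)
        "worker": worker_name,   # chosen worker server
        "source_text": source_text,
    }
    return urlencode(fields).encode("utf-8")
```

The resulting bytes can then be passed, e.g., to `urllib.request.Request(API_CREATE_URL, data=...)`.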
2.2. Worker Server Implementations

The actual machine translation functionality is implemented by a layer of so-called worker servers that are connected to the central broker server. For MT Server Land, we have implemented worker servers for the following MT systems:

• Moses SMT: a Moses (Koehn et al., 2007) worker is configured to serve exactly one language pair. We use the Moses server mode to keep the translation and language models in memory, which helps to speed up the translation process. As the limitation to one language pair effectively means that a huge number of Moses worker server instances has to be started in a typical usage scenario, we have also worked on an improved implementation that allows serving any number of language pairs from one worker instance. Further improvements could be achieved by integrating “on-the-fly” configuration switching and remote language models to reduce the amount of resources required by the Moses worker server.
• Lucy RBMT: our Lucy (Alonso & Thurmair, 2003) worker is implemented using a Lucy server-mode wrapper. This is a small Python program running on the Windows machine on which Lucy is installed. We have implemented a simple XML-RPC-based API to send translation requests to the Lucy engine and later fetch the corresponding results. For integration into MT Server Land, we simply had to “tunnel” our Lucy worker server calls to this Lucy server-mode implementation.

• Joshua SMT: similar to the Moses worker, we have created a Joshua (Li et al., 2010) worker, which creates a new Joshua instance for each translation request.
We have also created worker servers for popular online translation engines such as Google Translate, Microsoft Translator, and Yahoo! BabelFish. We will demonstrate these worker servers in our presentation.
3. Acknowledgements

This work was supported by the EuroMatrixPlus project (IST-231720), which is funded by the European Community under the Seventh Framework Programme for Research and Technological Development.
4. References

Alonso, J. A., Thurmair, G. (2003). The Comprendium Translator System. In Proceedings of the Ninth Machine Translation Summit.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C. J., Bojar, O., Constantin, A., Herbst, E. (2007). Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pp. 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Li, Z., Callison-Burch, C., Dyer, C., Ganitkevitch, J., Irvine, A., Khudanpur, S., Schwartz, L., Thornton, W., Wang, Z., Weese, J., Zaidan, O. (2010). Joshua 2.0: A Toolkit for Parsing-based Machine Translation with Syntax, Semirings, Discriminative Training and other Goodies. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pp. 133–137, Uppsala, Sweden. Association for Computational Linguistics.
WORKING PAPERS IN MULTILINGUALISM • Series B
ARBEITEN ZUR MEHRSPRACHIGKEIT • Folge B
Publications to date • Bisher erschienen:
1. Jürgen M. Meisel: On transfer at the initial state of L2 acquisition: Revisiting Universal Grammar.
2. Kristin Bührig, Latif Durlanik & Bernd Meyer: Arzt-Patienten-Kommunikation im Krankenhaus: konstitutive Handlungseinheiten, institutionelle Handlungslinien.
3. Georg Kaiser: Dialect contact and language change. A case study on word-order change in French.
4. Susanne S. Jekat & Lorenzo Tessiore: End-to-End Evaluation of Machine Interpretation Systems: A Graphical Evaluation Tool.
5. Thomas Ehlen: Sprache - Diskurs - Text. Überlegungen zu den kommunikativen Rahmenbedingungen mittelalterlicher Zweisprachigkeit für das Verhältnis von Latein und Deutsch. / Nikolaus Henkel: Lateinisch-Deutsch.
6. Kristin Bührig & Jochen Rehbein: Reproduzierendes Handeln. Übersetzen, simultanes und konsekutives Dolmetschen im diskursanalytischen Vergleich.
7. Jürgen M. Meisel: The Simultaneous Acquisition of Two First Languages: Early Differentiation and Subsequent Development of Grammars.
8. Bernd Meyer: Medizinische Aufklärungsgespräche: Struktur und Zwecksetzung aus diskursanalytischer Sicht.
9. Kristin Bührig, Latif Durlanik & Bernd Meyer (Hrsg.): Dolmetschen und Übersetzen in medizinischen Institutionen. Beiträge zum Kolloquium ‘Dolmetschen in Institutionen’ vom 17.-18.03.2000 in Hamburg.
10. Juliane House: Concepts and Methods of Translation Criticism: A Linguistic Perspective.
11. Bernd Meyer & Notis Toufexis (Hrsg.): Text/Diskurs, Oralität/Literalität unter dem Aspekt mehrsprachiger Kommunikation.
12. Hans Eideneier: Zur mittelalterlichen Vorgeschichte der neugriechischen Diglossie.
13. Kristin Bührig, Juliane House, Susanne J. Jekat: Abstracts of the International Symposium on Linguistics and Translation, University of Hamburg, 20th-21st November 2000.
14. Sascha W. Felix: Theta Parametrization. Predicate-Argument Structure in English and Japanese.
15. Mary A. Kato: Aspects of my Bilingualism: Japanese as L1 and Portuguese and English as L2.
16. Natascha Müller, Katja Cantone, Tanja Kupisch & Katrin Schmitz: Das mehrsprachige Kind: Italienisch – Deutsch.
17. Kurt Braunmüller: Semicommunication and Accommodation: Observations from the Linguistic Situation in Scandinavia.
18. Tessa Say: Feature Acquisition in Bilingual Child Language Development.
19. Kurt Braunmüller & Ludger Zeevaert: Semikommunikation, rezeptive Mehrsprachigkeit und verwandte Phänomene. Eine bibliographische Bestandsaufnahme.
20. Nicole Baumgarten, Juliane House & Julia Probst: Untersuchungen zum Englischen als ‘lingua franca’ in verdeckter Übersetzung. Theoretischer Hintergrund, Weiterentwicklung des Analyseverfahrens und erste Ergebnisse.
21. Per Warter: Lexical Identification and Decoding Strategies in Interscandinavian Communication.
22. Susanne J. Jekat & Patricia J. Nüßlein: Übersetzen und Dolmetschen: Grundlegende Aspekte und Forschungsergebnisse.
23. Claudia Böttger & Julia Probst: Adressatenorientierung in englischen und deutschen Texten.
24. Anja Möhring: The acquisition of French by German children of pre-school age. An empirical investigation of gender assignment and gender agreement.
25. Jochen Rehbein: Turkish in European Societies.
26. Katja Francesca Cantone & Marc-Olivier Hinzelin: Proceedings of the Colloquium on Structure, Acquisition, and Change of Grammars: Phonological and Syntactic Aspects. Volume I.
27. Katja Francesca Cantone & Marc-Olivier Hinzelin: Proceedings of the Colloquium on Structure, Acquisition, and Change of Grammars: Phonological and Syntactic Aspects. Volume II.
28. Utta v. Gleich: Multilingualism and multilingual Literacies in Latin American Educational Systems.
29. Christine Glanz & Utta v. Gleich: Mehrsprachige literale Praktiken im religiösen Alltag. Ein Vergleich literaler Ereignisse in Uganda und Bolivien.
30. Jürgen M. Meisel: From bilingual language acquisition to theories of diachronic change.
31. Florian Coulmas & Makoto Watanabe: Japan's Nascent Multilingualism.
32. Tanja Kupisch: The acquisition of the DP in French as the weaker language.
33. Utta v. Gleich, Mechthild Reh & Christine Glanz: Mehrsprachige literale Praktiken im Kulturvergleich: Uganda und Bolivien. Die Datenerhebungs- und Auswertungsmethoden.
34. Thomas Schmidt: EXMARaLDA - ein System zur Diskurstranskription auf dem Computer.
35. Esther Rinke: On the licensing of null subjects in Old French.
36. Bernd Meyer & Ludger Zeevaert: Sprachwechselphänomene in gedolmetschten und semikommunikativen Diskursen.
37. Annette Herkenrath & Birsel Karakoç: Zum Erwerb von Verfahren der Subordination bei türkisch-deutsch bilingualen Kindern – Transkripte und quantitative Aspekte.
38. Guntram Haag: Illokution und Adressatenorientierung in der Zwettler Gesamtübersetzung und der Melker Rumpfbearbeitung der ‘Disticha Catonis’: funktionale und sprachliche Einflussfaktoren.
39. Kristin Bührig: Multimodalität in gedolmetschten Aufklärungsgesprächen. Grafische Abbildungen in der Wissensvermittlung.
40. Jochen Rehbein: Pragmatische Aspekte des Kontrastierens von Sprachen – Türkisch und Deutsch im Vergleich.
41. Christine Glanz & Okot Benge: Exploring Multilingual Community Literacies. Workshop at the Ugandan German Cultural Society, Kampala, September 2001.
42. Christina Janik: Modalisierungen im Dolmetschprozess.
43. Hans Eideneier: „Rhetorik und Stil“ – der griechische Beitrag.
44. Annette Herkenrath, Birsel Karakoç & Jochen Rehbein: Interrogative elements as subordinators in Turkish – aspects of Turkish-German bilingual children's language use.
45. Marc-Olivier Hinzelin: The Acquisition of Subjects in Bilingual Children: Pronoun Use in Portuguese-German Children.
46. Thomas Schmidt: Visualising Linguistic Annotation as Interlinear Text.
47. Nicole Baumgarten: Language-specific Realization of Extralinguistic Concepts in Original and Translation Texts: Social Gender in Popular Film.
48. Nicole Baumgarten: Close or distant: Constructions of proximity in translations and parallel texts.
49. Katrin Monika Schmitz & Natascha Müller: Strong and clitic pronouns in monolingual and bilingual first language acquisition: Comparing French and Italian.
50. Bernd Meyer: Bilingual Risk communication.
51. Bernd Meyer: Dolmetschertraining aus diskursanalytischer Sicht: Überlegungen zu einer Fortbildung für zweisprachige Pflegekräfte.
52. Monika Rothweiler, Solveig Kroffke & Michael Bernreuter: Grammar Acquisition in Bilingual Children with Specific Language Impairment: Prerequisites and Questions. / Solveig Kroffke & Monika Rothweiler: The Bilingual's Language Modes in Early Second Language Acquisition – Contexts of Language Use and Diagnosis of Language Disorders.
53. Gerard Doetjes: Auf falsche[r] Fährte in der interskandinavischen Kommunikation.
54. Angela Beuerle & Kurt Braunmüller: Early Germanic bilingualism? Evidence from the earliest runic inscriptions and from the defixiones in Roman utility epigraphy. / Kurt Braunmüller: Grammatical indicators for bilingualism in the oldest runic inscriptions?
55. Annette Herkenrath & Birsel Karakoç: Zur Morphosyntax äußerungsinterner Konnektivität bei mono- und bilingualen türkischen Kindern.
56. Jochen Rehbein, Thomas Schmidt, Bernd Meyer, Franziska Watzke & Annette Herkenrath: Handbuch für das computergestützte Transkribieren nach HIAT.
57. Kristin Bührig & Bernd Meyer: Ad hoc-interpreting and the achievement of communicative purposes in specific kinds of doctor-patient discourse.
58. Margaret M. Kehoe & Conxita Lleó: The emergence of language specific rhythm in German-Spanish bilingual children.
59. Christiane Hohenstein: Japanese and German ‘I think–constructions’.
60. Christiane Hohenstein: Interactional expectations and linguistic knowledge in academic expert discourse (Japanese/German).
61. Solveig Kroffke & Bernd Meyer: Verständigungsprobleme in bilingualen Anamnesegesprächen.
62. Thomas Schmidt: Time-based data models and the Text Encoding Initiative's guidelines for transcription of speech.
63. Anja Möhring: Against full transfer during early phases of L2 acquisition: Evidence from German learners of French.
64. Bernadette Golinski & Gerard Doetjes: Sprachverstehensuntersuchungen im semikommunikativen Kontext.
65. Lukas Pietsch: Re-inventing the ‘perfect’ wheel: Grammaticalisation and the Hiberno-English medial-object perfects.
66. Esther Rinke: Wortstellungswandel in Infinitivkomplementen kausativer Verben im Portugiesischen.
67. Imme Kuchenbrandt, Tanja Kupisch & Esther Rinke: Pronominal Objects in Romance: Comparing French, Italian, Portuguese, Romanian and Spanish.
68. Javier Arias, Noemi Kintana, Martin Rakow & Susanne Rieckborn: Sprachdominanz: Konzepte und Kriterien.
69. Matthias Bonnesen: The acquisition of questions by two German-French bilingual children.
70. Chrystalla A. Thoma & Ludger Zeevaert: Klitische Pronomina im Griechischen und Schwedischen: Eine vergleichende Untersuchung zu synchroner Funktion und diachroner Entwicklung klitischer Pronomina in griechischen und schwedischen narrativen Texten des 15. bis 18. Jahrhunderts.
71. Thomas Johnen: Redewiedergabe zwischen Konnektivität und Modalität: Zur Markierung von Redewiedergabe in Dolmetscheräußerungen in gedolmetschten Arzt-Patientengesprächen.
72. Nicole Baumgarten: Converging conventions? Macrosyntactic conjunction with English ‘and’ and German ‘und’.
73. Susanne Rieckborn: Entwicklung der ‚schwachen Sprache‘ im unbalancierten L1-Erwerb.
74. Ludger Zeevaert: Variation und kontaktinduzierter Wandel im Altschwedischen.
75. Belma Haznedar: Is there a relationship between inherent aspect of predicates and their finiteness in child L2 English?
76. Bernd Heine: Contact-induced word order change without word order change.
77. Matthias Bonnesen: Is the left periphery a vulnerable domain in unbalanced bilingual first language acquisition?
78. Tanja Kupisch & Esther Rinke: Italienische und portugiesische Possessivpronomina im diachronischen Vergleich: Determinanten oder Adjektive?
79. Imme Kuchenbrandt, Conxita Lleó, Martin Rakow, Javier Arias Navarro: Große Tests für kleine Datenbasen?
80. Jürgen M. Meisel: Exploring the limits of the LAD.
81. Steffen Höder, Kai Wörner, Ludger Zeevaert: Corpus-based investigations on word order change: The case of Old Nordic.
82. Lukas Pietsch: The Irish English “After Perfect” in context: Borrowing and syntactic productivity.
83. Matthias Bonnesen & Solveig Kroffke: The acquisition of questions in L2 German and French by children and adults.
84. Julia Davydova: Preterite and present perfect in Irish English: Determinants of variation.
85. Ezel Babur, Solveig Chilla & Bernd Meyer: Aspekte der Kommunikation in der logopädischen Diagnostik mit ein- und mehrsprachigen Kindern.
86. Imme Kuchenbrandt: Cross-linguistic influences in the acquisition of grammatical gender?
87. Anne Küppers: Sprecherdeiktika in deutschen und französischen Aktionärsbriefen.
88. Demet Özçetin: Die Versprachlichung mentaler Prozesse in englischen und deutschen Wirtschaftstexten.
89. Barbara Miertsch: The acquisition of gender markings by early second language learners of French.
90. Kurt Braunmüller: On the relevance of receptive multilingualism in a globalised world: Theory, history and evidence from today's Scandinavia.
91. Jill P. Morford & Martina L. Carlson: Sign perception and recognition in non-native signers of ASL.
92. Andrea Bicsar: How the “Traveling Rocks” of Death Valley become “Moving Rocks”: The Case of an English-Hungarian Popular Science Text Translation.
93. Anne-Kathrin Preißler: Subjektpronomina im späten Mittelfranzösischen: Das Journal de Jean Héroard.
94. Ingo Feldhausen, Ariadna Benet, Andrea Pešková: Prosodische Grenzen in der Spontansprache: Eine Untersuchung zum Zentralkatalanischen und porteño-Spanischen.
95. Manuela Schönenberger, Franziska Sterner, Tobias Ruberg: The realization of indirect objects and case marking in experimental data from child L1 and child L2 German.
96. Hanna Hedeland, Thomas Schmidt, Kai Wörner (eds.): Multilingual Resources and Multilingual Applications – Proceedings of the Conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011.