
ARBEITEN ZUR MEHRSPRACHIGKEIT
WORKING PAPERS IN MULTILINGUALISM
Folge B • Series B
96 • 2011

Hanna Hedeland, Thomas Schmidt, Kai Wörner (eds.)
Multilingual Resources and Multilingual Applications
Proceedings of the Conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011

Sonderforschungsbereich Mehrsprachigkeit

ISSN 0176-599X
© Hanna Hedeland, Thomas Schmidt, Kai Wörner
Hamburger Zentrum für Sprachkorpora
Max-Brauer-Allee 60
D-22765 Hamburg

ARBEITEN ZUR MEHRSPRACHIGKEIT
WORKING PAPERS IN MULTILINGUALISM
Folge B, Nr. 96 • 2011

Hanna Hedeland, Thomas Schmidt, Kai Wörner (eds.)
Multilingual Resources and Multilingual Applications
Proceedings of the Conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011

The „Arbeiten zur Mehrsprachigkeit – Folge B“ (Working Papers in Multilingualism – Series B) publish research from the Collaborative Research Centre 538 Multilingualism (Sonderforschungsbereich 538 Mehrsprachigkeit), which was established by the Deutsche Forschungsgemeinschaft at the Universität Hamburg in July 1999. We thank the DFG for its support.

The „Arbeiten zur Mehrsprachigkeit – Folge B“ are registered with the Deutsche Bibliothek in Frankfurt/Main under the serial number ISSN 0176-559X.

Editorial staff: Martin Elsig, Svenja Kranich, Thomas Schmidt, Manuela Schönenberger

Technical implementation: Thomas Schmidt


Collaborative Research Centre: Multilingualism
Sonderforschungsbereich 538: Mehrsprachigkeit
University of Hamburg

Founded in July 1999, the Collaborative Research Centre on Multilingualism conducts research on patterns of language use in multilingual environments, bilingual language acquisition, and the role of multilingualism and language contact in language change.

In the current, fourth funding period (2008–2011), the Centre comprises two main research branches, each of which focuses on a central set of common issues, and a third branch of projects dealing with practical application solutions. Branch E, Multilingual Language Acquisition, consists of four projects with a common focus on the nature of "critical phases" in language acquisition. Branch H, Historical Aspects of Multilingualism and Variation, consists of eight projects dealing with questions of language change and language contact. This branch also comprises projects of the former separate Branch K, Multilingual Communication. Since 2007, a new Branch T, Transfer Projects, has been active. It consists of five projects whose goal is to develop concrete solutions for practical problems relating to multilingual situations, based on results derived from the Centre's research activities.

Languages currently studied at the Research Centre include Danish, Catalan, English, Faroese, French, German, German Sign Language, Icelandic, Irish, Polish, Portuguese, Spanish, Swedish, and Turkish, as well as several historical or regional sub-varieties of some of these languages.

Chair:
Prof. Dr. Christoph Gabriel
christoph.gabriel@uni-hamburg.de

Co-chairs:
Prof. Dr. Kurt Braunmüller
braunmueller@rrz.uni-hamburg.de
Prof. Dr. Barbara Hänel-Faulhaber
Barbara.Haenel@uni-hamburg.de

University of Hamburg, SFB 538, Max-Brauer-Allee 60, D-22765 Hamburg.
Tel. +49 40-42838-6432. http://www.uni-hamburg.de/sfb538/


Local Organizing Committee
• Thomas Schmidt
• Kai Wörner
• Timm Lehmberg
• Hanna Hedeland

Program Committee
• Maja Bärenfänger (Universität Gießen)
• Stefanie Dipper (Universität Bochum)
• Kurt Eberle (Lingenio Heidelberg)
• Alexander Geyken (Berlin-Brandenburgische Akademie der Wissenschaften)
• Ullrich Heid (Universität Hildesheim)
• Claudia Kunze (Qualisys GmbH)
• Lothar Lemnitzer (Berlin-Brandenburgische Akademie der Wissenschaften)
• Henning Lobin (Universität Gießen)
• Ernesto de Luca (Technische Universität Berlin)
• Cerstin Mahlow (Universität Zürich)
• Alexander Mehler (Universität Bielefeld)
• Wolfgang Menzel (Universität Hamburg)
• Georg Rehm (Deutsches Forschungszentrum für Künstliche Intelligenz)
• Josef Ruppenhofer (Universität Saarbrücken)
• Thomas Schmidt (Universität Hamburg)
• Roman Schneider (Institut für Deutsche Sprache Mannheim)
• Bernhard Schröder (Universität Duisburg)
• Manfred Stede (Universität Potsdam)
• Angelika Storrer (Universität Dortmund)
• Maik Stührenberg (Universität Bielefeld)
• Thorsten Trippel (Universität Tübingen)
• Cristina Vertan (Universität Hamburg)
• Andreas Witt (Institut für Deutsche Sprache Mannheim)
• Christian Wolff (Universität Regensburg)
• Kai Wörner (Universität Hamburg)


Call for Papers

The Conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011 will take place from 28th to 30th September 2011 at the University of Hamburg. The main conference theme is Multilingual Resources and Multilingual Applications.

Contributions on any topic related to Computational Linguistics and Language Technology are invited, but we especially encourage submissions related to the main theme. The topic “Multilingual Resources and Multilingual Applications” comprises all aspects of computational linguistics and speech and language technology in which issues of multilingualism, of language contrasts, or of language-independent representations play a major role. This includes, for instance:

• representation and analysis of parallel corpora and comparable corpora
• multilingual lexical resources
• machine translation
• annotation and analysis of learner corpora
• linguistic variation in linguistic data and applications
• localisation and internationalisation

Conference languages are English and German; contributions are welcome in both languages. Three types of submission are invited:

• Regular talk – Submission of an extended abstract
• Poster – Submission of an abstract
• System presentation – Submission of an abstract

Only contributions in electronic form will be accepted. We do not provide style sheets for submissions at this stage; constraints on length are given below. Contributions must be submitted via the electronic conference system. Accepted abstracts will be published as a book of abstracts on the occasion of the conference. Extended versions of the papers will be published as a special issue of the Journal for Language Technology and Computational Linguistics after the conference.

German Society for Computational Linguistics & Language Technology


Contents

I Invited Talks

Constructing Parallel Lexicon Fragments Based on English FrameNet Entries: Semantic and Syntactic Issues .......... 9
Hans C. Boas

The Multilingual Web: Opportunities, Borders and Visions .......... 19
Felix Sasaki

Combining Various Text Analysis Tools for Multilingual Media Monitoring .......... 25
Ralf Steinberger

II Regular Papers

Generating Inflection Variants of Multi-Word Terms for French and German .......... 33
Simon Clematide, Luzia Roth

Tackling the Variation in International Location Information Data: An Approach Using Open Semantic Databases .......... 39
Janine Wolf, Manfred Stede, Michaela Atterer

Towards Multilingual Biographical Event Extraction - Initial Thoughts on the Design of a New Annotation Scheme .......... 45
Michaela Geierhos, Jean-Leon Bouraoui, Patrick Watrin

The Corpus of Academic Learner English (CALE): A New Resource for the Study of Lexico-Grammatical Variation in Advanced Learner Varieties .......... 51
Marcus Callies, Ekaterina Zaytseva

From Multilingual Web-Archives to Parallel Treebanks in Five Minutes .......... 57
Markus Killer, Rico Sennrich, Martin Volk

Querying Multilevel Annotation and Alignment for Detecting Grammatical Valence Divergencies .......... 63
Oliver Čulo

SPIGA – A Multilingual News Aggregator .......... 69
Leonhard Hennig, Danuta Ploch, Daniel Prawdzik, Benjamin Armbruster, Christoph Büscher, Ernesto William De Luca, Holger Düwiger, Şahin Albayrak

From Historic Books to Annotated XML: Building a Large Multilingual Diachronic Corpus .......... 75
Magdalena Jitca, Rico Sennrich, Martin Volk

Visualizing Dependency Structures .......... 81
Chris Culy, Verena Lyding, Henrik Dittmann

A Functional Database Framework for Querying Very Large Multi-Layer Corpora .......... 87
Roman Schneider


Hybrid Machine Translation for German in taraXÜ: Can Translation Costs Be Decreased Without Degrading Quality? .......... 93
Aljoscha Burchardt, Christian Federmann, Hans Uszkoreit

Annotation of Explicit and Implicit Discourse Relations in the TüBa-D/Z Treebank .......... 99
Anna Gastel, Sabrina Schulze, Yannick Versley, Erhard Hinrichs

Devil’s Advocate on Metadata in Science .......... 105
Christina Hoppermann, Thorsten Trippel, Claus Zinn

Improving an Existing RBMT System by Stochastic Analysis .......... 111
Christian Federmann, Sabine Hunsicker

Terminology Extraction and Term Variation Patterns: A Study of French and German Data .......... 117
Marion Weller, Helena Blancafort, Anita Gojun, Ulrich Heid

Ansätze zur Verbesserung der Retrieval-Leistung kommerzieller Translation-Memory-Systeme .......... 123
Dino Azzano, Uwe Reinke, Melanie Sauer

WikiWarsDE: A German Corpus of Narratives Annotated with Temporal Expressions .......... 129
Jannik Strötgen, Michael Gertz

Translation and Language Change with Reference to Popular Science Articles: The Interplay of Diachronic and Synchronic Corpus-Based Studies .......... 135
Sofia Malamatidou

A Comparable Wikipedia Corpus: From Wiki Syntax to POS Tagged XML .......... 141
Noah Bubenhofer, Stefanie Haupt, Horst Schwinn

A German Grammar for Generation in OpenCCG .......... 145
Jean Vancoppenolle, Eric Tabbert, Gerlof Bouma, Manfred Stede

Multilingualism in Ancient Texts: Language Detection by Example of Old High German and Old Saxon .......... 151
Zahurul Islam, Roland Mittmann, Alexander Mehler

Multilinguale Phrasenextraktion mit Hilfe einer lexikonunabhängigen Analysekomponente am Beispiel von Patentschriften und nutzergenerierten Inhalten .......... 157
Daniela Becks, Julia Maria Schulz, Christa Womser-Hacker, Thomas Mandl

Die Digitale Rätoromanische Chrestomathie – Werkzeuge und Verfahren für die Korpuserstellung durch kollaborative Volltexterschließung .......... 163
Claes Neuefeind, Jürgen Rolshoven, Fabian Steeg

Ein umgekehrtes Lehnwörterbuch als Internetportal und elektronische Ressource: Lexikographische und technische Grundlagen .......... 169
Peter Meyer, Stefan Engelberg

Localizing a Core HPSG-Based Grammar for Bulgarian .......... 175
Petya Osenova


III Poster Presentations

Autorenunterstützung für die Maschinelle Übersetzung .......... 183
Melanie Siegel

Experimenting with Corpus-Based MT Approaches .......... 187
Monica Gavrila

Method of POS-Disambiguation Using Information about Words Co-Occurrence (For Russian) .......... 191
Edward Klyshinsky, Natalia Kochetkova, Maxim Litvinov, Vadim Maximov

Von TMF in Richtung UML: In drei Schritten zu einem Modell des übersetzungsorientierten Fachwörterbuchs .......... 197
Georg Löckinger

Annotating for Precision and Recall in Speech Act Variation: The Case of Directives in the Spoken Turkish Corpus .......... 203
Şükriye Ruhi, Thomas Schmidt, Kai Wörner, Kerem Eryılmaz

The SoSaBiEC Corpus: Social Structure and Bilinguality in Everyday Conversation .......... 207
Veronika Ries, Andy Lücking

DIL, ein zweisprachiges Online-Fachwörterbuch der Linguistik (Deutsch-Italienisch) .......... 211
Carolina Flinz

Knowledge Extraction and Representation: The EcoLexicon Methodology .......... 215
Pilar León Araúz, Arianne Reimerink

Processing Multilingual Customer Contacts via Social Media .......... 219
Michaela Geierhos, Yeong Su Lee, Matthias Bargel

ATLAS – A Robust Multilingual Platform for the Web .......... 223
Diman Karagiozov, Svetla Koeva, Maciej Ogrodniczuk, Cristina Vertan

Multilingual Corpora at the Hamburg Centre for Language Corpora .......... 227
Hanna Hedeland, Timm Lehmberg, Thomas Schmidt, Kai Wörner

The English Passive and the German Learner – Compiling an Annotated Learner Corpus to Investigate the Importance of Educational Settings .......... 233
Verena Möller, Ulrich Heid

Register, Genre, Rhetorical Functions: Variation in English Native-Speaker and Learner Writing .......... 239
Ekaterina Zaytseva

Tools to Analyse German-English Contrasts in Cohesion .......... 243
Kerstin Kunz, Ekaterina Lapshinova-Koltunski

Comparison and Evaluation of Ontology Extraction Systems .......... 247
Stefanie Reimers


IV System Presentations

New and Future Developments in EXMARaLDA .......... 253
Thomas Schmidt, Kai Wörner, Hanna Hedeland, Timm Lehmberg

Der VLC Language Index .......... 257
Dirk Schäfer, Jürgen Handke

Topological Fields, Constituents and Coreference: A New Multi-Layer Architecture for TüBa-D/Z .......... 259
Thomas Krause, Julia Ritz, Amir Zeldes, Florian Zipser

MT Server Land Translation Services .......... 263
Christian Federmann


Invited Talks





Constructing parallel lexicon fragments based on English FrameNet entries: Semantic and syntactic issues

Hans C. Boas
The University of Texas at Austin
Department of Germanic Studies and Department of Linguistics
1 University Station, C3300, Austin, TX 78712-0304, U.S.A.
E-mail: hcb@mail.utexas.edu

Abstract

This paper investigates how semantic frames from FrameNet can be re-used for constructing FrameNets for other languages. Section 1 provides a brief overview of Frame Semantics (Fillmore, 1982). Section 2 introduces the main structuring principles of the Berkeley FrameNet project. Section 3 presents a typology of FrameNets for different languages, highlighting a number of important issues surrounding the universal applicability of semantic frames. Section 4 shows that while it is often possible to re-use semantic frames across languages in a principled way, this is not always straightforward because of systematic syntactic differences in how lexical units express the semantics of frames. Section 5 summarizes the issues discussed in this paper.

Keywords: Computational Lexicography, FrameNet, Frame Semantics, Syntax

1. Frame Semantics

Research in Frame Semantics (Fillmore, 1982; 1985) is empirical, cognitive, and ethnographic in nature. It seeks to describe and analyze what users of a language understand about what is communicated by their language (Fillmore & Baker, 2010). Central to this line of research is the notion of semantic frame, which provides the basis for the organization of the lexicon, thereby linking individual word senses, relationships between the senses of polysemous words, and relationships among semantically related words. In this conception of the lexicon, there is a network of hierarchically organized and intersecting frames through which semantic relationships between collections of concepts are identified (Petruck et al., 2004). A frame is any system of concepts related in such a way that to understand any one concept it is necessary to understand the entire system; introducing any one concept results in all of them becoming available. In Frame Semantics, word meanings are thus characterized in terms of experience-based schematizations of the speaker's world, i.e. frames. It is held that understanding any element in a frame requires access to an understanding of the whole structure (Petruck & Boas, 2003).[1] The following section shows how the concept of semantic frame has been used to structure the lexicon of English for the purpose of creating a lexical database.

[1] See Petruck (1996), Ziem (2008), and Fillmore & Baker (2010) on how different theories employ the notion of “frame.”

2. The Berkeley FrameNet Project

The Berkeley FrameNet Project (Lowe et al., 1997; Baker et al., 1998; Fillmore et al., 2003a; Ruppenhofer et al., 2010) is building a lexical database that aims to provide, for a significant portion of the vocabulary of contemporary English, a body of semantically and syntactically annotated sentences from which reliable information can be reported on the valences or combinatorial possibilities of each item targeted for analysis (Fillmore & Baker, 2001). The method of inquiry is to find groups of words whose frame structures can be described together, by virtue of their sharing common schematic backgrounds and patterns of expressions that can combine with them to form larger phrases or sentences. In the typical case, words that share a frame can be used in paraphrases of each other. The general purposes of the project are both to provide reliable descriptions of the syntactic and semantic combinatorial properties of each word in the lexicon, and to assemble information about alternative ways of expressing concepts in the same conceptual domain (Fillmore & Baker, 2010).

To illustrate, consider the sentence Joe stole the watch from Michael. The verb steal is said to evoke the Theft frame, which is also evoked by a number of semantically related verbs such as snatch, shoplift, pinch, filch, and thieve, among others, as well as nouns such as thief and stealer.[2] The Theft frame represents a scenario with different Frame Elements (FEs) that can be regarded as instances of more general semantic roles such as AGENT, PATIENT, INSTRUMENT, etc. More precisely, the Theft frame describes situations in which a PERPETRATOR (the person or other agent that takes the GOODS away) takes GOODS (anything that can be taken away) that belong to a VICTIM (the person (or other sentient being or group) that owns the GOODS before they are taken away by the PERPETRATOR). Sometimes more specific information is given about the SOURCE (the initial location of the GOODS before they change location).[3] The necessary background information to interpret steal and other semantically related verbs as evoking the Theft frame also requires an understanding of illegal activities, property ownership, taking things, and a great deal more (see Boas, 2005b; Bertoldi et al., 2010; Dux, 2011).

[2] Names of frames are in courier font. Names of Frame Elements (FEs) are in small caps font.
[3] Besides so-called core Frame Elements, there are also peripheral Frame Elements that describe more general aspects of a situation, such as MEANS (e.g. by trickery), TIME (e.g. two days ago), MANNER (e.g. quietly), or PLACE (e.g. in the city).
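To make the shape of such a frame description concrete, the following is a minimal sketch of the Theft frame as a small Python data structure. The class layout and field names are illustrative simplifications, not the actual FrameNet data model.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    """Simplified stand-in for a FrameNet frame definition (illustrative only)."""
    name: str
    core_fes: dict[str, str]       # frame element name -> short gloss
    peripheral_fes: list[str]      # e.g. MEANS, TIME, MANNER, PLACE
    lexical_units: list[str]       # words said to evoke the frame

theft = Frame(
    name="Theft",
    core_fes={
        "PERPETRATOR": "person or other agent that takes the GOODS away",
        "GOODS": "anything that can be taken away",
        "VICTIM": "person or group that owns the GOODS before they are taken",
        "SOURCE": "initial location of the GOODS before they change location",
    },
    peripheral_fes=["MEANS", "TIME", "MANNER", "PLACE"],
    lexical_units=["steal.v", "snatch.v", "shoplift.v", "pinch.v",
                   "filch.v", "thieve.v", "thief.n", "stealer.n"],
)

print(theft.name, "->", theft.lexical_units[:3])
```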

Based on the frame concept, FrameNet researchers follow a lexical analysis process that typically consists of the following steps according to Fillmore & Baker (2010:321-322): (1) Characterizing the frames, i.e. the situation types for which the language has provided special expressive means; (2) Describing and naming the Frame Elements (FEs), i.e. the aspects and components of individual frames that are likely to be mentioned in the phrases and sentences that are instances of those frames; (3) Selecting lexical units (LUs) that belong to the frame, i.e. words from all parts of speech that evoke and depend on the conceptual background associated with the individual frames; (4) Creating annotations of sentences sampled from a very large corpus showing the ways in which individual LUs in the frame allow frame-relevant information to be linguistically presented; (5) Automatically generating lexical entries, and the valence descriptions contained in them, that summarize observations derivable from them (see also Atkins et al., 2003; Fillmore & Petruck, 2003; Fillmore et al., 2003b; Ruppenhofer et al., 2010).

The results of this work-flow are stored in FrameNet (http://framenet.icsi.berkeley.edu), an online lexical database (Baker et al., 2003) currently containing information about more than 1,000 frames and more than 10,000 LUs.[4] Users can access FrameNet data in a variety of ways. The most prominent methods include searching for individual frames or specific LUs.

[4] For differences between FrameNet and other lexical databases such as WordNet, see Boas (2005a/2005b/2009).
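For readers who want to inspect the data programmatically, the following sketch uses the FrameNet release bundled with the NLTK corpus readers, a third-party access path that is not part of the FrameNet project itself. It assumes that the framenet_v17 corpus has been downloaded; attribute names (FE, lexUnit) follow NLTK's corpus reader rather than the FrameNet website.

```python
import nltk
from nltk.corpus import framenet as fn

# One-time download of the bundled FrameNet release (version 1.7 here).
nltk.download("framenet_v17")

# Look up a frame by name and inspect its frame elements and lexical units.
theft = fn.frame("Theft")
print(theft.name)
print(sorted(theft.FE.keys()))        # frame elements, e.g. Perpetrator, Goods, Victim, ...
print(sorted(theft.lexUnit.keys()))   # lexical units, e.g. 'steal.v', 'thief.n', ...

# Search for lexical units whose names match a regular expression.
for lu in fn.lus(r"(?i)steal"):
    print(lu.name, "->", lu.frame.name)
```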

Figure 1: Partial valence table for steal.v in the Theft frame

Each entry for a LU in FrameNet consists of the following parts: (1) A description of the frame together with definitions of the relevant FEs, annotated example sentences illustrating the relevant FEs in context, and a list of other LUs evoking the same frame; (2) An annotation report displaying all the annotated corpus sentences for a given LU; (3) A lexical entry report which summarizes the syntactic realization of the FEs and the valence patterns of the LU in two separate tables (see Fillmore et al., 2003b; Fillmore, 2007).

Figure 1 above illustrates an excerpt from the valence patterns in the lexical entry report of steal in the Theft frame. The column on the far left lists the number of annotated example sentences (in the annotation report) illustrating the individual valence patterns. The rows represent so-called frame element configurations together with their syntactic realizations in terms of phrase type and grammatical function. For example, the third frame element configuration from the top lists the FEs GOODS, MANNER, and PERPETRATOR. The GOODS are realized syntactically as an NP Object, the MANNER as a dependent ADVP, and the PERPETRATOR as an external NP. Such systematic valence tables allow researchers to gain a better understanding of how the semantics of frames are realized syntactically.[5]

[5] For details about the different phrase types and grammatical functions, including the different types of null instantiation (CNI, DNI, and INI) (Fillmore, 1986), see Fillmore et al. (2003b), Boas (2009), Fillmore & Baker (2010), and Ruppenhofer et al. (2010).
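A valence table of this kind can be thought of as a list of frame element configurations, each paired with its syntactic realizations. The sketch below shows one way such a record might be represented; the field names are hypothetical and the annotation count is a placeholder, since the text does not give the exact figure.

```python
from dataclasses import dataclass

@dataclass
class Realization:
    """Syntactic realization of one frame element (hypothetical field names)."""
    fe: str              # frame element, e.g. "GOODS"
    phrase_type: str     # e.g. "NP", "AVP"
    gram_function: str   # e.g. "Obj", "Ext", "Dep"

@dataclass
class ValencePattern:
    """One row of a valence table: an FE configuration plus its realizations."""
    lexical_unit: str
    frame: str
    n_sentences: int     # number of annotated sentences showing this pattern
    realizations: list[Realization]

# The third frame element configuration for steal.v described above:
# GOODS as NP object, MANNER as dependent ADVP, PERPETRATOR as external NP.
pattern = ValencePattern(
    lexical_unit="steal.v",
    frame="Theft",
    n_sentences=1,  # placeholder; the real count comes from the annotation report
    realizations=[
        Realization("GOODS", "NP", "Obj"),
        Realization("MANNER", "AVP", "Dep"),
        Realization("PERPETRATOR", "NP", "Ext"),
    ],
)
print(pattern.frame, [r.fe for r in pattern.realizations])
```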

3. FrameNets for other languages

3.1. Similarities and differences

Following the success of the Berkeley FrameNet for English, a number of FrameNets for other languages were developed over the past ten years. Based on ideas outlined in Heid (1996), Fontenelle (1997), and Boas (2001/2002/2005a), researchers aimed to create parallel FrameNets by re-using frames constructed by the Berkeley FrameNet project for English. While FrameNets for other languages aim to re-use English FrameNet frames to the greatest extent possible, they differ in a number of important points from the original FrameNet (see Boas, 2009).

For example, projects such as SALSA (Burchardt et al., 2009) aim to create full-text annotation of an entire German corpus instead of finding isolated corpus sentences to identify lexicographically relevant information, as is the case with the Berkeley FrameNet and Spanish FrameNet (Subirats, 2009). FrameNets for other languages also differ in what types of resources they use as data pools. That is, besides exploiting a monolingual corpus, as is the case with Japanese FrameNet (Ohara, 2009) or Hebrew FrameNet (Petruck, 2009), projects such as French FrameNet (Pitel, 2009) or BiFrameNet (Fung & Chen, 2004) also employ multilingual corpora and other existing lexical resources.

Another difference concerns the tools used for data extraction and annotation. While the Japanese and Spanish FrameNets adopted the Berkeley FrameNet software (Baker et al., 2003) with slight modifications, other projects such as SALSA developed their own tools to conduct semi-automatic annotation on top of existing syntactic annotations found in the TIGER corpus, or they integrate off-the-shelf software, as is the case with French FrameNet or Hebrew FrameNet. FrameNets for other languages also differ in the methodology used to produce parallel lexicon fragments. While German FrameNet (Boas, 2002) and Japanese FrameNet (Ohara, 2009) rely on manual annotations, French FrameNet and BiFrameNet use semi-automatic and automatic approaches to create parallel lexicon fragments for French and Chinese. Furthermore, FrameNets for other languages also differ in their semantic domains and the goals they pursue. While most non-English FrameNets aim to create databases with broad coverage, other projects focus on specific lexical domains such as football (a.k.a. soccer) language (Schmidt, 2009) or the language of criminal justice (Bertoldi et al., 2010). Finally, while the data from almost all non-English FrameNets are intended to be used by a variety of audiences, Multi FrameNet[6] is intended to support vocabulary acquisition in the foreign language classroom (see Atzler, 2011).

[6] http://www.coerll.utexas.edu/coerll/taxonomy/term/627

3.2. Re-using (English) semantic frames

To exemplify how English FrameNet frames can be re-used for the creation of parallel lexicon fragments, consider Boas' (2005a) discussion of the English verb answer evoking the Communication_Response frame and its counterpart responder in Spanish FrameNet. The basic idea is that since the two verbs are translation equivalents they should evoke the same semantic frame, which should in turn be used as a common structuring device for combining the respective English and Spanish lexicon fragments. Since the MySQL databases representing each of the non-English FrameNets are similar in structure to the English MySQL database in that they share the same type of conceptual backbone (i.e., the semantic frames and frame relations), this step involves determining which English LUs are equivalent to corresponding non-English LUs.
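As a rough illustration of that determination step, candidate LU pairs can be generated by grouping the lexical units of two FrameNet-style resources by the frame they evoke. The structures below are toy stand-ins, not the actual FrameNet or Spanish FrameNet database schema, and most of the listed LUs are invented examples; as the following paragraphs stress, such candidates still have to be verified manually.

```python
from itertools import product

# Toy stand-ins for two FrameNet-style LU inventories (frame name -> LUs);
# the real data live in the English and Spanish FrameNet MySQL databases.
english_lus = {
    "Communication_Response": ["answer.v", "respond.v", "reply.v"],
    "Theft": ["steal.v", "shoplift.v"],
}
spanish_lus = {
    "Communication_Response": ["responder.v", "contestar.v"],
    "Theft": ["robar.v", "hurtar.v"],
}

def candidate_pairs(lus_a, lus_b):
    """Pair LUs that evoke the same frame. These are only candidates:
    actual equivalence has to be checked manually against bilingual
    dictionaries, parallel corpora, and the full valence tables."""
    pairs = {}
    for frame in sorted(set(lus_a) & set(lus_b)):
        pairs[frame] = list(product(lus_a[frame], lus_b[frame]))
    return pairs

for frame, lu_pairs in candidate_pairs(english_lus, spanish_lus).items():
    print(frame, lu_pairs)
```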

However, before creating parallel lexicon fragments for Spanish and linking them to their English counterparts via their semantic frame, it is necessary to first conduct a detailed comparison of the individual LUs and how they realize the semantics of the frame. To begin, consider the different ways in which the FEs of the Communication_Response frame are realized with answer.

FE Name     Syntactic Realization
SPEAKER     NP.Ext, PP_by.Comp, CNI
MESSAGE     INI, NP.Obj, PP_with.Comp, QUO.Comp, Sfin.Comp
ADDRESSEE   DNI
DEPICTIVE   PP_with.Comp
MANNER      AVP.Comp, PPing_without.Comp
MEANS       PPing_by.Comp
MEDIUM      PP_by.Comp, PP_in.Comp, PP_over.Comp
TRIGGER     NP.Ext, DNI, NP.Obj, Swh.Comp

Table 1: Partial realization table for the verb answer (Boas 2005a)

Table 1 shows that there is a significant amount of variation in how FEs of the Communication_Response frame are realized with answer. For example, the FE DEPICTIVE has only one option for its syntactic realization, i.e. a PP complement headed by with. Other FEs such as SPEAKER and MANNER exhibit more flexibility in how they are realized syntactically, while yet another set of FEs such as MESSAGE and TRIGGER exhibit the highest degree of syntactic variation. Now that we know the full range of how the FEs of the Communication_Response frame are realized syntactically with answer, we can take the next step towards creating a parallel lexical entry for its Spanish counterpart responder.

This step involves the use of bilingual dictionaries and parallel corpora in order to identify possible Spanish translation equivalents of answer. While this procedure may seem trivial, it is a rather lengthy and complicated process because it is necessary to consider the full range of valence patterns (the combination of FEs and their syntactic realizations) of the English LU answer listed in FrameNet. The FrameNet entry lists a total of 22 different frame element configurations, totaling 32 different combinations in which these sequences may be realized syntactically. As the full valence table for answer is rather long, we focus on only one out of the 22 frame element configurations, namely that of SPEAKER (Sp), MESSAGE (M), TRIGGER (Tr), and ADDRESSEE (A) in Table 2.

      Sp       M              Tr    A
a.    NP.Ext   NP.Obj         DNI   DNI
b.    NP.Ext   PP_with.Comp   DNI   DNI
c.    NP.Ext   QUO.Comp       DNI   DNI
d.    NP.Ext   Sfin.Comp      DNI   DNI

Table 2: Excerpt from the Valence Table for answer (Boas 2005a)

As Table 2 shows, the frame element configuration exhibits a certain amount of variation in how the FEs are realized syntactically: all four valence patterns have the FE SPEAKER realized as an external noun phrase, and the FEs TRIGGER and ADDRESSEE not realized overtly at the syntactic level, but null instantiated as Definite Null Instantiation (DNI). In other words, in sentences such as He answered with another question the FEs TRIGGER and ADDRESSEE are understood in context although they are not realized syntactically.

With the English-specific information about answer and the more general frame information in place, we are now in a position to search for the corresponding frame element configuration of its Spanish translation equivalent responder. Taking a look at the lexical entry of responder in Spanish FrameNet, we see that the variation of syntactic realizations of FEs is similar to that of answer in Table 1.

FE Name     Syntactic Realizations
SPEAKER     NP.Ext, NP.Dobj, CNI, PP_por.Comp
MESSAGE     AVP.AObj, DNI, QUO.DObj, queSind.DObj, queSind.Ext
ADDRESSEE   NP.Ext, NP.IObj, PP_a.IObj, DNI, INI
DEPICTIVE   AJP.Comp
MANNER      AVP.AObj, PP_de.AObj
MEANS       VPndo.AObj
MEDIUM      PP_en.AObj
TRIGGER     PP_a.PObj, PP_de.PObj, DNI

Table 3: Partial realization table for the verb responder (Boas 2005a)

Spanish FrameNet also offers a valence table that includes for responder a total of 23 different frame element configurations. Among these, we find a combination of FEs and their syntactic realization that is comparable in structure to that of its English counterpart in Table 2 above.

      Sp       M              Tr    A
a.    NP.Ext   QUO.DObj       DNI   DNI
b.    NP.Ext   queSind.DObj   DNI   DNI

Table 4: Excerpt from the Valence Table for responder (Boas 2005a)

Comparing Tables 2 and 4, we see that answer and responder exhibit comparable valence combinations, with the FEs SPEAKER and MESSAGE realized syntactically while the FEs TRIGGER and ADDRESSEE are not realized syntactically, but are instead implicitly understood (they are definite null instantiations). With a Spanish counterpart in place, it now becomes possible to link the Spanish set of frame element configurations in Table 4 with its English counterpart in Table 2 via the Communication_Response frame, as the following figure illustrates.

Figure 2: Linking partial English and Spanish lexicon fragments via semantic frames (Boas 2005a)

Figure 2 shows how the lexicon fragments of answer and responder are linked via the Communication_Response frame. The 'a' index points to the respective first lines in the valence tables of the two LUs (cf. Tables 2 and 4) and identifies the two syntactic frames as being translation equivalents of each other. At the top of Figure 2 we see the verb answer with one of its 22 frame element configurations, i.e. SPEAKER, TRIGGER, MESSAGE, and ADDRESSEE. Figure 2 shows for this configuration one possible set of syntactic realizations of these FEs, namely that given in row (a) in Table 2 above. The 9a designation following answer indicates that this lexicon fragment is the ninth configuration of FEs out of a total of 22 frame element configurations listed in the complete realization table. Within the ninth frame element configuration, 'a' indicates that it is the first of a list of various possible syntactic realizations of these FEs (there are a total of four, cf. Table 2 above). As already pointed out, the FE SPEAKER is realized syntactically as an external NP, MESSAGE as an object NP, and both TRIGGER and ADDRESSEE are null instantiated.

The bottom of Figure 2 shows responder with the seventeenth of its frame element configurations (recall that there are a total of 23). For this configuration, we see one subset of syntactic realizations of these FEs, namely the first row catalogued by Spanish FrameNet for this configuration (see row (a) in Table 4).

The two parallel lexicon fragments at the top and the bottom of Figure 2 are linked by indexing their specific semantic and syntactic configurations as equivalents within the Communication_Response frame. This linking is indicated by the arrows pointing from the top and the bottom of the partial lexical entries to the mid-section in Figure 2, which symbolizes the Communication_Response frame at the conceptual level, i.e. without any language-specific specifications. Note that this procedure does not automatically link the entire lexical entries of answer and responder to each other. Establishing such a correspondence link connects only the relevant frame element configurations and their syntactic realizations in Tables 2 and 4 via the common semantic frame, because they can be regarded as translation equivalents.
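Such a correspondence link can be pictured as a small record that ties one valence pattern of each LU to the shared frame. The sketch below is illustrative only: the record layout is invented, the English index '9a' comes from the discussion above, and the Spanish index '17a' is merely a hypothetical label for the corresponding responder pattern.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PatternRef:
    """Points at one row of one LU's valence table (illustrative layout)."""
    lexical_unit: str
    config_index: int     # which frame element configuration (e.g. 9 of 22)
    pattern_label: str    # which syntactic realization of that configuration

@dataclass(frozen=True)
class TranslationLink:
    """Links two language-specific valence patterns via a shared frame."""
    frame: str
    source: PatternRef
    target: PatternRef

link = TranslationLink(
    frame="Communication_Response",
    source=PatternRef("answer.v", 9, "a"),      # '9a' as described in the text
    target=PatternRef("responder.v", 17, "a"),  # hypothetical label for the Spanish row
)

# A full linking of the two lexical entries would require one such record per
# matching pair of valence patterns, established by manual comparison.
print(link)
```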

Although linking the two lexicon fragments this way results in a systematic way of creating parallel lexicon fragments based on semantic frames (which serve as interlingual representations), it is not yet possible to automatically create or connect such parallel lexicon fragments. This means that one must carefully compare each individual part of the valence table of a LU in the source language with each individual part of the valence table of a LU in the target language. This step is extremely time-intensive because it involves a detailed comparison of bilingual dictionaries as well as electronic corpora to ensure matching translation equivalents. Recall that Figure 2 represents only a very small set of the full lexical entries of answer and responder. The procedure outlined above will have to be repeated for each of the 32 different valence patterns of answer – and its (possible) Spanish equivalents. The following section addresses a number of other issues that need to be considered carefully when creating parallel lexicon fragments based on semantic frames.

4. Cross-linguistic problems

Creating parallel lexicon entries based on existing English FrameNet entries and linking them to their English counterparts raises a number of important issues, most of which require careful (manual) linguistic analysis. While some of these issues apply to the creation of parallel entries across the board, others differ depending on the individual languages or the semantic frame. The following subsections, based on Boas (to appear), briefly address some of the most important issues, which all have direct bearing on how the semantics of a frame are realized syntactically across different languages.

4.1. Polysemy and profiling differences

While translation equivalents evoking the same frame are typically taken to describe the same types of scenes, they sometimes differ in how they profile FEs. For example, Boas (2002) discusses differences in how announce and various German translation equivalents evoke the Communication_Statement frame. When announce occurs with the syntactic frame [NP.Ext _ NP.Obj] to realize the SPEAKER and MESSAGE FEs, as in They announced the birth of their child, German offers a range of different translation equivalents, including bekanntgeben, bekanntmachen, ankündigen, or anzeigen. Each of these German LUs comes with its own specific syntactic frames that express the semantics of the Communication_Statement frame. When announce is used to describe situations in which a message is communicated via a medium such as a loudspeaker (e.g. Joe announced the arrival of the pizza over the intercom), German offers ansagen and durchsagen as more specific translation equivalents of announce besides the more general ankündigen. Thus, by providing different LUs German offers the option of profiling particular FEs of the Communication_Statement frame, thereby allowing for the representation of subtle meaning differences of the frame and the perspective given of a situation (see Ohara, 2009 on similar profiling differences between English and Japanese LUs evoking the Risk frame).

4.2. Differences in lexicalization patterns

Languages differ in how they lexicalize particular types of concepts (see Talmy, 1985), which may directly influence how the semantics of a particular frame are realized syntactically. For example, in a comparative study of English, Spanish, Japanese, and German motion verbs in The Hound of the Baskervilles (and its translations), Ellsworth et al. (2006) find that there are a number of differences in how the various concepts of motion are associated with different types of semantic frames. More specifically, they show that English return (cf. The wagonette was paid off and ordered to return to Coombe Tracey forthwith, while we started to walk to Merripit House) and Spanish regresar both evoke the Return frame, whereas the corresponding German zurückschicken evokes the Sending frame. These differences demonstrate that although the concept of motion is incorporated into indirect causation, the frames expressing indirect causation may vary from language to language (see Burchardt et al., 2009 for a discussion of more fine-grained distinctions between verbs evoking the same frame in English and German).

4.3. Polysemy and translation equivalents

Finding proper translation equivalents is typically a difficult task because one has to consider issues surrounding polysemy (Fillmore & Atkins, 2000; Boas, 2002), zero translations (Salkie, 2002; Boas, 2005a; Schmidt, 2009), and contextual and stylistic factors (Altenberg & Granger, 2002; Hasegawa et al., 2010), among others. To illustrate, consider Bertoldi's (2010) discussion of contrastive legal terminology in English and Brazilian Portuguese. Based on the English Criminal_Process frame (see FrameNet), Bertoldi finds that while there are some straightforward translation equivalents of English LUs in Portuguese, others involve a detailed analysis of the relevant polysemy patterns.

Consider Figure 3, which compares English and Portuguese LUs in the Notification_of_charges frame. The first problem discussed by Bertoldi (2010) addresses the fact that although there are corresponding Portuguese LUs such as denunciar, they do not evoke the same semantic frame as the English LUs, but rather a frame that could best be characterized as the Accusation frame. The second problem is that six Portuguese translation equivalents of the English LUs evoking only the Notification_of_charges frame, i.e. acusar, acusação, denunciar, denuncia, pronunciar, and pronuncia, potentially evoke three different frames.

Figure 3: English LUs from the Notification_of_charges frame and their Portuguese translation equivalents (Bertoldi, 2010: 6)

Figure 4: LUs evoking multiple frames in the Portuguese Crime_scenario frame (Bertoldi, 2010: 7)

This leads Bertoldi to claim that the LUs acusar, acusação, denunciar, and denuncia may evoke two different Criminal_Process sub-frames, besides other general language, non-legal specific frames, as illustrated by Figure 4. Bertoldi's analysis shows that finding translation equivalents is not always an easy task and that one needs to pay close attention to different polysemy networks across languages, which may sometimes be influenced by systematic differences such as differences between legal systems.
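The second problem can be pictured at the data level as a one-to-many mapping from Portuguese LUs to candidate frames. The mapping below is schematic: the frame set assigned to each LU is a placeholder meant to show the shape of the ambiguity, not Bertoldi's actual analysis, and "General_language_frame" stands in for the unnamed non-legal senses mentioned above.

```python
# Schematic one-to-many mapping from Portuguese LUs to candidate frames
# (placeholder frame assignments, for illustration only).
portuguese_lu_frames = {
    "acusar.v":     {"Notification_of_charges", "Accusation", "General_language_frame"},
    "acusação.n":   {"Notification_of_charges", "Accusation", "General_language_frame"},
    "denunciar.v":  {"Notification_of_charges", "Accusation", "General_language_frame"},
    "denuncia.n":   {"Notification_of_charges", "Accusation", "General_language_frame"},
    "pronunciar.v": {"Notification_of_charges"},
    "pronuncia.n":  {"Notification_of_charges"},
}

# LUs whose frame assignment cannot be read off the translation alone:
ambiguous = sorted(lu for lu, frames in portuguese_lu_frames.items() if len(frames) > 1)
print(ambiguous)
```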

4.4. Universal frames?

Claims about the universality of certain linguistic features are abundant in the literature. When it comes to semantic frames, the question is whether frames derived on the basis of English are applicable to the description and analysis of other languages (and vice versa). While a number of studies on motion verbs (Fillmore & Atkins, 2000; Boas, 2002; Burchardt et al., 2009; Ohara, 2009) and communication verbs (Boas, 2005a; Subirats, 2009), among other semantic domains, suggest that there are frames that can be re-used for the description and analysis of other languages, there also seem to be culture-specific frames that may not be re-usable without significant modification.

One set of examples comes from the English Personal_Relationship frame, whose semantics appears to be quite culture-specific. Atzler (2011) shows that concepts such as dating (to date) seem to be quite specific to Anglo culture and may not be directly applicable to the description of similar activities in German. Another, perhaps more extreme example is the term sugar daddy, which has no exact counterpart in German, but instead requires a lengthy paraphrase to render the concept of this particular type of relationship.

A second example comes from the intransitive Finnish verb saunoa (literally 'to sauna'), which has no direct English counterpart because it is very culture-specific, and in effect evokes a particular type of frame. To this end, Leino (2010:131) claims that this verb (and correspondingly the Finnish Sauna frame) “expresses a situation in which the referent of the subject goes to the sauna, is in the sauna, participates in the sauna event, or something of the like.” Dealing with such culture-specific frames thus requires quite lengthy paraphrases to arrive at an approximation of the semantics of the frame in English.

5. Conclusions and outlook

This paper has outlined some of the basic steps underlying the creation of parallel lexicon fragments. Employing semantic frames for this purpose is still a work in progress, but the successful compilation of several FrameNets for languages other than English is a good indication that this methodology should be pursued further.

Clearly, the problems outlined in the previous section need to be solved. The first problem, polysemy and profiling differences, is perhaps the most daunting one. Decades of linguistic research into these issues (see, e.g., Leacock & Ravin, 2000; Altenberg & Granger, 2002) seem to suggest that there is no easy solution that could be implemented to arrive at an automatic way of analyzing, comparing, and classifying different polysemy and lexicalization patterns across languages. This means that for now these issues need to be addressed manually, in the form of careful linguistic analysis.

The same can be said about the problems surrounding lexicalization patterns, zero translations, and the universality of frames. Without a detailed catalogue of linguistic analyses of these phenomena in different languages, and a comparison across language pairs, any efforts regarding the effective linking of parallel lexicon fragments, whether on the basis of semantic frames or not, will undoubtedly hit many roadblocks.

6. Acknowledgements

Work on this paper was supported by a fellowship for experienced researchers from the Alexander von Humboldt Foundation, as well as by Title VI grant #P229A100014 (Center for Open Educational Resources and Language Learning) to the University of Texas at Austin.

7. References

Altenberg, B., Granger, S. (2002): Recent trends in cross-linguistic studies. In B. Altenberg & S. Granger (Eds.), Lexis in Contrast. Amsterdam/Philadelphia: John Benjamins, pp. 3-50.
Atkins, B.T.S., Fillmore, C.J., Johnson, C.R. (2003): Lexicographic relevance: Selecting information from corpus evidence. International Journal of Lexicography, 16(3), pp. 251-280.

Atzler, J. (2011): Twist in the line: Frame Semantics as a vocabulary teaching and learning tool. Doctoral Dissertation, The University of Texas at Austin.
Baker, C.F., Fillmore, C.J., Lowe, J.B. (1998): The Berkeley FrameNet Project. In COLING-ACL '98: Proceedings of the Conference, pp. 86-90.
Baker, C.F., Fillmore, C.J., Cronin, B. (2003): The Structure of the FrameNet Database. International Journal of Lexicography, 16(3), pp. 281-296.
Bertoldi, A. (2010): When translation equivalents do not find meaning equivalence: a contrastive study of the frame Criminal_Process. Manuscript. UT Austin.
Bertoldi, A., Chishman, R., Boas, H.C. (2010): Verbs of judgment in English and Portuguese: What contrastive analysis can say about Frame Semantics. Calidoscopio, 8(3), pp. 210-225.
Boas, H.C. (2001): Frame Semantics as a framework for describing polysemy and syntactic structures of English and German motion verbs in contrastive computational lexicography. In Proceedings of Corpus Linguistics 2001, pp. 64-73.
Boas, H.C. (2002): Bilingual FrameNet dictionaries for machine translation. In Proceedings of the Third International Conference on Language Resources and Evaluation, Vol. IV, pp. 1364-1371.
Boas, H.C. (2005a): Semantic frames as interlingual representations for multilingual lexical databases. International Journal of Lexicography, 18(4), pp. 445-478.
Boas, H.C. (2005b): From theory to practice: Frame Semantics and the design of FrameNet. In S. Langer & D. Schnorbusch (Eds.), Semantik im Lexikon. Tübingen: Narr, pp. 129-160.
Boas, H.C. (2009): Recent trends in multilingual lexicography. In H.C. Boas (Ed.), Multilingual FrameNets in Computational Lexicography: Methods and Applications. Berlin/New York: Mouton de Gruyter, pp. 1-36.
Boas, H.C. (to appear): Frame Semantics and Translation. In I. Antunano and A. Lopez (Eds.), Translation in Cognitive Linguistics. Berlin/New York: Mouton de Gruyter.
Burchardt, A., Erk, K., Frank, A., Kowalski, A., Pado, S., & Pinkal, M. (2009): Using FrameNet for the semantic analysis of German: annotation, representation, and automation. In H.C. Boas (Ed.), Multilingual FrameNets in Computational Lexicography: Methods and Applications. Berlin/New York: Mouton de Gruyter, pp. 209-244.
Dux, R. (2011): A frame-semantic analysis of five English verbs evoking the Theft frame. M.A. Report. The University of Texas at Austin.
Ellsworth, M., Ohara, K., Subirats, C., & Schmidt, T. (2006): Frame-semantic analysis of motion scenarios in English, German, Spanish, and Japanese. Presentation given at the 4th International Conference on Construction Grammar (ICCG-4), Tokyo.
Fillmore, C.J. (1982): Frame Semantics. In Linguistic Society of Korea (Ed.), Linguistics in the Morning Calm. Seoul: Hanshin, pp. 111-138.
Fillmore, C.J. (1985): Frames and the semantics of understanding. Quaderni di Semantica, 6, pp. 222-254.
Fillmore, C.J. (1986): Pragmatically controlled zero anaphora. BLS, 12, pp. 95-107.
Fillmore, C.J. (2007): Valency issues in FrameNet. In T. Herbst & K. Götz-Vetteler (Eds.), Valency: theoretical, descriptive, and cognitive issues. Berlin/New York: Mouton de Gruyter, pp. 129-160.
Fillmore, C.J., Atkins, B.T.S. (2000): Describing polysemy: The case of 'crawl'. In Y. Ravin and C. Leacock (Eds.), Polysemy. Oxford: Oxford University Press, pp. 91-110.
Fillmore, C.J., Baker, C.F. (2010): A frames approach to semantic analysis. In B. Heine and H. Narrog (Eds.), The Oxford Handbook of Linguistic Analysis. Oxford: Oxford University Press, pp. 313-340.
Fillmore, C.J., Petruck, M.R.L. (2003): FrameNet Glossary. International Journal of Lexicography, 16(3), pp. 359-361.
Fillmore, C.J., Johnson, C.R., Petruck, M.R.L. (2003a): Background to FrameNet. International Journal of Lexicography, 16(3), pp. 235-251.
Fillmore, C.J., Petruck, M.R.L., Ruppenhofer, J., Wright, A. (2003b): FrameNet in action: The case of Attaching. International Journal of Lexicography, 16(3), pp. 297-333.

17


Multilingual Resources and Multilingual Applications - Invited Talks<br />

Fontenelle, T. (1997): Using a bilingual dictionary to<br />

18<br />

create semantic networks. International Journal of<br />

Lexicography, 10(4), pp. 275-303.<br />

Fung, P., Chen, B. (2004): BiFrameNet: Bilingual frame<br />

semantics resource construction by cross-lingual<br />

induction. In Proceedings of COLING 2004.<br />

Hasegawa, Y., Lee-Goldman, R., Ohara, K., Fujii, S.,<br />

Fillmore, C.J. (2010): On expressing measurement<br />

and comparison in English and Japanese. In H.C.<br />

Boas (Ed.), Contrastive Studies in Construction<br />

Grammar. Amsterdam/Philadelphia: John Benjamins,<br />

pp. 169-200.<br />

Heid, U. (19<strong>96</strong>): Creating a multilingual data collection<br />

for bilingual lexicography from parallel monolingual<br />

lexicons. In Procedings of the VIIth EURALEX<br />

International Congress, pp. 559-573.<br />

Leino, J. (2010): Results, cases, and constructions:<br />

Argument structure constructions in English and<br />

Finnish. In H.C. Boas (Ed.), Contrasive Studies in<br />

Construction Grammar. Amsterdam/Philadelphia:<br />

John Benjamins, pp. 103-136.<br />

Lowe, J.B., Baker, C.F., Fillmore, C.J. (1997): A frame-<br />

semantic approach to semantic annotation. In<br />

Proceedings of the SIGLEX Workshop on Tagging<br />

Text with Lexical Semantics: Why, What, and How?<br />

Held April 4-5, in Washington, D.C., in conjunction<br />

with ANLP-97.<br />

Ohara, K. (2009): Frame-based contrastive lexical<br />

semantics in Japanese FrameNet: The case of risk and<br />

kakeru. In H.C. Boas (Ed.), Multilingual FrameNets<br />

in Computational Lexicography: Methods and<br />

Applications. Berlin/New York: Mouton de Gruyter,<br />

pp. 163-182.<br />

Petruck, M.R.L. (19<strong>96</strong>): Frame Semantics. In J.<br />

Verschueren, J.-O. Östman, J. Blommaert, C. Bulcaen<br />

(Eds.), Handbook of Pragmatics.<br />

Amsterdam/Philadelphia: John Benjamins, pp. 1-13.<br />

Petruck, M.R.L. (2009): Typological considerations in<br />

constructing a Hebrew FrameNet. In H.C. Boas (Ed.),<br />

Multilingual FrameNets in Computational<br />

Lexicography: Methods and Applications. Berlin/New<br />

York: Mouton de Gruyter, pp. 183-208.<br />

Petruck, M.R.L, Boas, H.C. (2003): All in a day's week.<br />

In . Hajicova, A. Kotesovcova, J. Mirovsky (Eds.),<br />

Proceedings of CIL 17. CD-ROM. Prague:<br />

Matfyzpress.<br />

Petruck, M.R.L., Fillmore, C.J., Baker, C.F., Ellsworth,<br />

M., Ruppenhofer, J. (2004): Reframing FrameNet<br />

data. In Proceedings of the 11 th EURALEX<br />

International Conference, pp. 405-416.<br />

Pitel, G. (2009): Cross-lingual labeling of semantic<br />

predicates and roles: A low-resource method based on<br />

bilingual L(atent) S(emantic) A(nalysis). In H.C. Boas<br />

(Ed.), Multilingual FrameNets in Computational<br />

Lexicography: Methods and Applications. Berlin/New<br />

York: Mouton de Gruyter, pp. 245-286.<br />

Ruppenhofer, J., Ellsworth, M., Petruck, M.R.L.,<br />

Johnson, C., Scheffczyk, J. (2010): FrameNet II:<br />

Extended theory and practice. Available at http://<br />

framenet.icsi.berkeley.edu<br />

Salkie, R. (2002): Two types of translation equivalence.<br />

In B. Altenberg and S. Granger (Eds.), Lexis in<br />

Contrast. Amsterdam/Philadelphia: John Benjamins,<br />

pp. 51-72.<br />

Schmidt, T. (2009): The Kicktionary – A multilingual<br />

lexical resource of football language. In H.C. Boas<br />

(Ed.), Multilingual FrameNets in Computational<br />

Lexicography: Methods and Applications. Berlin/New<br />

York: Mouton de Gruyter, pp. 101-134.<br />

Subirats, C. (2009): Spanish FrameNet: A frame-<br />

semantic analysis of the Spanish lexicon. In H.C.<br />

Boas (Ed.), Multilingual FrameNets in Computational<br />

Lexicography: Methods and Applications. Berlin/New<br />

York: Mouton de Gruyter, pp. 135-162.<br />

Talmy, L. (1985): Lexicalization patterns: semantic<br />

structures in lexical forms. In T. Shopen (Ed.),<br />

Language Typology and Syntactic Description.<br />

Cambridge: Cambridge University Press, pp. 57-149.<br />

Ziem, A. (2008): Frames und sprachliches Wissen.<br />

Berlin/New York: Mouton de Gruyter.


Multilingual Resources and Multilingual Applications - Invited Talks<br />

The Multilingual Web: Opportunities, Borders and Visions

Felix Sasaki
DFKI, LT-Lab / Univ. of Applied Sciences Potsdam
Alt-Moabit 91c, 10559 Berlin
E-mail: felix.sasaki@dfki.de

Abstract

The Web is growing more and more in languages other than English, opening up the opportunity of a truly multilingual, global information space. However, the full potential of multilingual information creation and access across language borders has not yet been realized. We report on a workshop series and project called "MultilingualWeb" that aims at analyzing borders or "gaps" within Web technology standardization which hinder multilinguality on the Web. MultilingualWeb targets scientific, industrial and user communities which need to collaborate closely. We conclude with a first concrete outcome of MultilingualWeb: the upcoming formation of a cross-community W3C standardization activity that will close some of the gaps that have already been recognized.

Keywords: Multilinguality, Web, standardization, language technology, metadata

1. Introduction: Missing Links between Languages on the Web

A recent blog post discussed the "Languages of the World (Wide Web)" (see http://googleresearch.blogspot.com/2011/07/languages-of-world-wide-web.html). Via impressive visualizations, it showed the amount of content per language and the number of links between languages. Unsurprisingly, English is a dominant language on the Web, and every other language has a certain number of links to English web pages. Nevertheless, the amount of content in many other languages is growing continuously and rapidly. Unfortunately, the links between these languages and the links to English are rather few.

What does this mean? First, it demonstrates that English is a lingua franca on the Web. Users who are not capable or willing to use this lingua franca cannot communicate with others and are not part of the global information society; they are residents of local silos on the Web. Second, the desire to communicate in one's own language is high and growing.

Several issues need to be resolved to tear down the walls between language communities on the Web. One key issue is the availability of standardized technologies to create content in one's own language and to access content across languages. The need to resolve this issue led to the creation of the "MultilingualWeb" project.

2. MultilingualWeb: Overview

MultilingualWeb (http://www.multilingualweb.eu/) is an EU-funded thematic network project exploring standards and best practices that support the creation, localization and use of multilingual web-based information. It is led by the World Wide Web Consortium (W3C), the major stakeholder for creating the technological building blocks of the Web. MultilingualWeb encompasses 22 partners (http://www.multilingualweb.eu/partners), both from research and from various industries related to content creation, localization, software provision etc. The main part of the project is a series of four public workshops to discuss what standards and best practices currently exist and what gaps need to be filled. The project started in April 2010; as of writing, two workshops have been held. They have been a great success in terms of the number of participants, the awareness raised (especially in social media), and the outcome of the discussions. In the remainder of this abstract, we discuss current findings of the project (more details can be found in the workshop reports on the project website, http://www.multilingualweb.eu) and take a look at what the two upcoming workshops and future projects might bring.

3. About Terminology and Communities

One gap is related to the communities, industries and technology stacks that need to be aware of standards related to the multilingual Web. Internationalization deals with the prerequisites for creating content in many languages. This involves technologies and standards related to character encoding, language identification, font selection etc. The proper internationalization of (web) technologies is required for localization: the adaptation to local markets and cultures. Localization often involves translation. With more and more content that needs to be translated and a growing number of target languages, the use of language technologies (e.g. machine translation) comes into play.

A huge success of the MultilingualWeb project is that major stakeholders from the areas of internationalization, localization and language technologies have been brought together. This is important since, both in research and in industry projects, these communities have so far not overlapped. The same is true for conference series; see e.g. the (non-)overlap of attendees at Localization World, LREC or the Unicode conferences.

4. Workshop Sessions and Topics

MultilingualWeb provides a common umbrella for these communities via a set of labels used for the workshop sessions:

• Developers provide the basic technological building blocks (e.g. browsers) for multilingual web content creation.
• Creators use the technologies to create multilingual content.
• Localizers adapt the content to regions and cultures.
• Machines are used to support multilingual content creation and access, e.g. via machine translation.
• Users increasingly do not only consume content, but at the same time become contributors; see e.g. the growing number of users in social networks.
• Policy makers decide about strategies for fostering multilinguality on the Web. They play an important role in big multinational companies, regional or international governmental or non-governmental bodies, standardization bodies etc.

Of course the above labels serve only as a rough orientation. But especially for the detection of gaps (see below) they have proven to be quite useful. The following subsections provide a brief summary of some outcomes, ordered by these labels and based on the workshop reports. For further details the reader is referred to these reports.

4.1. Developers

Developers provide the technological building blocks that are needed for multilingual content creation and content access on the Web. Many of these building blocks are still under development, and web browsers play a crucial role. During the workshops, many presentations dealt with the enhancement of character and font support, locale data formats, internationalized domain names and typographic support.

Gaps in this area are also related to the handling of translations: although more and more web content is being translated, the key web technology HTML so far has no means to support this process. Here it is important that the need for such means is brought to the standards development organizations, namely the W3C, and especially to the browser implementers.

Another gap concerns which technology stacks are being developed and how content providers are actually adopting them. HTML5 plays a crucial role in the future of web technology development, but for many content providers its relation to other parts of the technology ecosystem is not yet clear.

4.2. Creators

Creators increasingly need to bring content to many devices, especially mobile ones. Since these devices lack computing power, many aspects of multilinguality (e.g. the use of large fonts) need to be taken care of in a specific manner. "Content" does not only mean text. It also encompasses information for multimodal and voice applications, or SMS, especially in developing countries. Navigation of content, especially across languages, is another area without standardized approaches or best practices.

As in the developer area, translation is important for content creation too. There is no standardized way to identify non-translatable content, to identify the tools used for translation, translation quality etc.

4.3. Localizers

Localizers deal with the process of localization, which involves many aspects: content creation, the distribution of content to language service providers, further distribution to translators, etc. To organize this process there is a need to improve standards and to integrate them better. Metadata plays a crucial role in this respect, as we will discuss later.

Content itself is becoming more complex and fast-changing, and localization approaches need to be adapted accordingly. In the area of localization, many standards have been developed: for the representation of content in the translation process, for terminology management, translation memories etc. The gap here is to understand how these standards interact. This is not an easy task, since sometimes there are competing technologies available. Hence, there are currently quite a few initiatives dedicated to interoperability in the localization area, including the integration with web content creation and machine translation.

4.4. Machines

For machines, that is, applications based on language technology, the need for standardization, especially related to metadata and the localization process, is of utmost importance. Language resources are crucial in this area, including their standardized representation and means to share resources. The META-SHARE infrastructure currently being developed is expected to play an important role here.

While discussing developers, creators and localizers, machine translation has been mentioned already. It has become clear that a close integration of machine translation technologies into these areas is a major requirement for better translation quality.

Machines play a crucial role in building bridges between smaller and larger languages, and in changing the picture of "languages on the web" that we mentioned at the beginning of this paper.

4.5. Users

Users normally have no strong voice in the development of multilingual or other technologies. At the MultilingualWeb workshops, it became clear that the worldwide interest in multilingual content is high, but significant organizational and technical challenges need to be addressed to reach people in continents such as Africa and Asia.

Multilingual social media are becoming more important and can be supported by language technology applications such as on-the-fly machine translation. However, it is important to draw a clear border between controlled and uncontrolled environments of content creation. Only in this way can the right tools be chosen to achieve high-quality translation of small amounts of text, versus gist translation for larger bodies of text.

4.6. Policy Makers

The topic of policy makers was not discussed as a separate session in the first workshop, but only in the second one. Nevertheless it is of high importance: many gaps related to the multilingual web are not technical ones, but are related, for example, to political decisions about the adoption of standards. Especially in the localization and language technology area, proprietary solutions prevailed for a long time. Here we are on the verge of a radical change, and MultilingualWeb will play a crucial role in bringing the right people together.

Some technological pieces have many political aspects. The META-SHARE infrastructure mentioned before is a good example. A key aspect of this infrastructure is the licensing model it will provide, since not everybody will be willing to share language resources for free.

5. Metadata for Language-Related Technology in the Web

5.1. Introduction

After this broad overview of the various gaps that have been detected, we will now look more closely at gaps related to metadata. All the communities mentioned before have already been using such metadata for a while:

• in internationalization, metadata is used to identify character encoding or language;
• in localization, metadata helps to organize the localization workflow, e.g. to identify the parts of content that need to be translated;
• in language technology, metadata serves as a heuristic to complement language technology applications.

Such heuristics can be useful, for example, for the automatic detection of the language of content. The heuristic here can be the language identifier given in a web page. However, to be able to judge its reliability, it is important that many stakeholders work together and that there are stable bridges between internationalization, localization and language technology. As one concrete outcome of the MultilingualWeb project, a project has been prepared that will work on creating these bridges. The basic project idea is summarized below.
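To make this heuristic concrete, here is a minimal sketch (our illustration, not part of any MultilingualWeb specification) that trusts the language identifier declared in a page only when a naive content-based guess agrees; the tiny stop-word lists and the agreement rule are simplifying assumptions.

import re

# Tiny illustrative stop-word lists; a real system would use a proper
# language identification component trained on far more data.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in"},
    "de": {"der", "die", "und", "nicht", "ist"},
    "fr": {"le", "la", "et", "les", "des"},
}

def declared_language(html):
    """Extract the language identifier declared in the page, e.g. <html lang="de">."""
    match = re.search(r'<html[^>]*\blang="([A-Za-z-]+)"', html)
    return match.group(1).split("-")[0].lower() if match else None

def guessed_language(text):
    """Guess the language from stop-word overlap (deliberately naive)."""
    tokens = set(re.findall(r"[a-zäöüéèà]+", text.lower()))
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def page_language(html, text):
    """Trust the declared identifier only when the content-based guess agrees."""
    declared, guessed = declared_language(html), guessed_language(text)
    if declared and (guessed is None or guessed == declared):
        return declared
    return guessed

print(page_language('<html lang="de">', "Die Katze ist nicht der Hund und ..."))  # -> "de"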

5.2. Three Gaps Related to Metadata

Language technology applications (machine translation, automatic summarization, cross-language information retrieval, automatic quality assurance etc.) and resources (grammars, translation memories, corpora, lexica etc.) are increasingly becoming available on the Web, integrated into HTML and Web-based content, and accessible via web applications and web service APIs. This approach has been partially successful in fostering interoperability between language technology resources and applications. However, it lacks integration with the "Open Web Platform", i.e. with the growing set of technologies used for creating and consuming the Web in many applications, on many devices, for many (and ever more) users.

From the view of this current platform, language technology is a black box: services like online machine translation receive textual input and produce some output. End users have no means to adjust language technology to their needs, and they are not able to influence language-technology-based processes in detail. On the other hand, providers of language technology face difficulties in adapting to specific demands of users in a timely and cost-effective manner, a problem also experienced by language service providers as they increasingly adopt language technologies.

To address this "black box" problem, three gaps that have been detected during the MultilingualWeb workshops need to be filled. They play a role in the chain of multilingual content processing and consumption on the Web (see also the sketch after this list):

• An online machine translation service might make mistakes like translating fixed terminology or named entities. This demonstrates gap no. 1: language technology does not know about metadata in the source content, e.g. "What parts of the input should be translated?"
• In the database from which the translated text has been generated, the information about translatability might have been available. However, the machine translation service does not know about that kind of "hidden Web" information. This reveals gap no. 2: there is no description available of the processes which were the basis for generating "surface Web" pages.
• Gap no. 3 concerns a standardized approach to identification. This means, first, that the information needed to fill gaps 1 and 2 is so far not described in a standardized manner. For example, there is no commonly identified translate flag available in core web technologies like HTML. Second, it means that so far the resources used by language technology applications (e.g. "what lexicon is used for machine translation?") and the applications themselves (e.g. "general-purpose machine translation versus an application tailored towards a specific domain") cannot be identified uniquely. This hinders the ad hoc creation of language technology applications on the Web, i.e. the re-combination of resources and application modules.
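As an illustration of gaps no. 1 and 3, the following sketch shows how a hypothetical translate flag in HTML could be respected before text is sent to a machine translation service; the attribute name, the TranslateFlagFilter class and the stand-in MT function are our own assumptions for illustration, not an existing standard or API.

from html.parser import HTMLParser

class TranslateFlagFilter(HTMLParser):
    """Collect text segments, marking those inside a hypothetical
    translate="no" scope as protected from machine translation.
    Simplification: assumes well-nested, non-void markup."""

    def __init__(self):
        super().__init__()
        self.flag_stack = [True]          # True = translatable
        self.segments = []                # list of (text, should_translate)

    def handle_starttag(self, tag, attrs):
        parent = self.flag_stack[-1]
        own = dict(attrs).get("translate", "inherit").lower()
        self.flag_stack.append(parent if own == "inherit" else own != "no")

    def handle_endtag(self, tag):
        if len(self.flag_stack) > 1:
            self.flag_stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.segments.append((data.strip(), self.flag_stack[-1]))

def translate_segments(html, translate):
    """Send only the translatable segments to the (stand-in) MT function."""
    parser = TranslateFlagFilter()
    parser.feed(html)
    return [translate(text) if ok else text for text, ok in parser.segments]

# Toy usage with a stand-in "MT system" that just tags the text.
html = '<p>Install <span translate="no">W3C Validator</span> before the workshop.</p>'
print(translate_segments(html, lambda s: f"[DE] {s}"))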

5.3. Addressing the Gaps: MultilingualWeb-LT

To close the gaps mentioned above, a project called MultilingualWeb-LT has been formed that is planned to start in early 2012. The MultilingualWeb-LT consortium consists of 14 partners from the areas of CMS systems, localization service providers, the language technology industry, research etc. As the forum for working on the gaps, the project will start a working group within the W3C.

The goal of MultilingualWeb-LT is to define a standard that fills the gaps, including three mostly open source reference implementations around three topic areas in which metadata is being used:

• Integration of CMS and the localization chain. Modules for the Drupal CMS will be built that support the creation of the metadata. The metadata will then be taken up in web-based tools that support the localization chain: from the gathering of localizable content, through its distribution to translators, to the re-aggregation of the results into localized output.
• Online MT systems. MT systems will be made aware of the metadata, which will lead to more satisfactory translation results. An online MT system will be made sensitive to the outputs of the modified CMS described above.
• MT training. Metadata-aware tools for training MT systems will be built. Again, these are closely related to CMS that produce the necessary metadata. They will lead to better quality MT training corpora harvested from the Web.

The above description shows that CMS systems play a crucial role in MultilingualWeb-LT. The use of language identifiers for deciding about the language of content (see Section 4) can be enhanced, for example, by the MT training module mentioned above. However, since MultilingualWeb-LT will be a regular W3C working group, other W3C member organizations may join the group. This is highly desirable, in the hope not only that further implementations will be built, but also that consensus about, and usage of, the metadata spreads to the wider web community.

5.4. MultilingualWeb-LT: Already a Success Story

Although MultilingualWeb-LT has not started yet, it is already a success story: it is a direct outcome of the MultilingualWeb project and of two other projects that play an important role, among others, for community building in the area of language technology research and industry.

• FLaReNet (Fostering Language Resources Network) has developed a common vision for the area of language resources. The FLaReNet "Blueprint of Actions and Infrastructures" is a set of recommendations to support this vision in terms of (technical) infrastructure, R&D, and politics. As part of these recommendations, the task of "putting standards into action" has been described as highly important; MultilingualWeb-LT is a direct implementation of this task.
• META-NET is dedicated to fostering the technological foundations of a multilingual European information society, by building a shared vision and strategic research agenda, an open distributed facility for the sharing and exchange of resources (META-SHARE), and bridges to relevant neighbouring technology fields. MultilingualWeb-LT is a bridge to support the exchange between the language technology community and the web community at large.

These projects and the formation of MultilingualWeb-LT itself demonstrate that a holistic view prevails, in which the differences between internationalization, localization and language technology mentioned before become less important, for the common aim of a truly multilingual web.

6. Upcoming Workshops and the Future

At the time of writing, two more workshops are planned for the MultilingualWeb project. A workshop in September 2011 will take place in Ireland. Naturally it will have a focus on localization, since many software-related companies in Ireland work on this topic.

The last workshop will take place in Luxembourg in March 2012 and will wrap up the MultilingualWeb project. However, the holistic view of a multilingual web, including the communities of internationalization, localization, language technology and the web community itself, will be carried forward under the MultilingualWeb brand. The MultilingualWeb-LT project is one means to carry on that brand. It is the hope of the author that other activities will follow and that cross-community collaboration will become commonplace. Only in this way will we be able to tear down language barriers on the web and to achieve a truly global information society.

7. Acknowledgements

This extended abstract has been supported by the European Commission as part of the Competitiveness and Innovation Framework Programme and through ICT PSP Grants: Agreement No. 250500 (MultilingualWeb contract) and 249119 (META-NET T4ME contract).

Combining various text analysis tools for multilingual media monitoring

Ralf Steinberger
European Commission – Joint Research Centre (JRC)
21027 Ispra (VA), Italy
E-mail: Ralf.Steinberger@jrc.ec.europa.eu, URL: http://langtech.jrc.ec.europa.eu/

Abstract

There is ample evidence that the information contained in media reports is complementary across countries and languages. This holds both for facts and for opinions. Monitoring multilingual and multinational media therefore gives a more complete picture of the world than monitoring the media of only one language, even if it is a world language like English. Wide-coverage and highly multilingual text processing is thus important. The JRC-developed Europe Media Monitor (EMM) family of applications gathers about 100,000 media reports per day in 50 languages from the internet, groups related articles, classifies them, detects and follows trends, produces statistics and issues automatic alerts. For a subset of 20 languages, it also extracts and disambiguates entities (persons, organisations and locations) and reported speech, links related news over time and across languages, gathers historical information about entities and produces various types of social networks. More recent R&D efforts focus on event scenario template filling, opinion mining, multi-document summarisation, and machine translation. This extended abstract gives an overview of EMM from a functionality point of view rather than providing technical detail.

Keywords: news analysis; multilingual; automatic alerting; text mining; information extraction.

1. EMM: Background and Objectives

The JRC, with its 2,700 employees working in five different European locations in a wide variety of scientific-technical fields, is a Directorate General of the European Commission (EC). It is thus a governmental body free of national interests and without commercial objectives. Its main mandate is to provide scientific advice and technical know-how to European Union (EU) institutions and their international partners, as well as to EU member state organisations, with the purpose of supporting a wide range of EU policies. Lowering the language barrier in order to increase European integration and competitiveness is a declared EU objective.

The JRC-developed Europe Media Monitor (EMM) is a publicly accessible family of four news gathering and analysis applications consisting of NewsBrief, the Medical Information System MedISys, NewsExplorer and EMM-Labs. They are accessible via the single URL http://emm.newsbrief.eu/overview.html. The first EMM website went online in 2002 and has since been extended and improved continuously. The initial objective was to complement the manual news clipping services of the EC by searching for news reports online, categorising them according to user needs, and providing an interface for human moderation (selection and re-organisation of articles; creation of a layout to print in-house newspapers). EMM users thus typically have a specific information need and want to be informed about any media reports concerning their subject of interest.

Monitoring the media for events that are dangerous to public health (PH) is a typical example. EMM continuously gathers news from the web, automatically selects PH-related news items (e.g. on chemical, biological, radiological and nuclear (CBRN) threats, including disease outbreaks, natural disasters and more), presents the information on targeted web pages, detects unexpected information spikes and alerts users about them. In addition to PH, EMM categories cover a very wide range of further subject areas, including the environment, politics, finance, security, various scientific and policy areas, general information on all countries of the globe, etc. For an overview of EMM, see Steinberger et al. (2009).

Figure 1. Various aggregated statistics and graphs showing category-based information for one category (ECOLOGY) derived from reports in multiple languages.

2. Information complementarity across languages and countries; news bias

While national EMM clients are mostly interested in the news of their own country and that of surrounding countries (e.g. for disease outbreak monitoring), they also need to follow mass gatherings (e.g. for religious, sport-related or political reasons) because participants may bring back diseases. In addition to the news in the 23 official EU languages, EMM thus also monitors news in Arabic, Chinese, Croatian, Farsi, Swahili, etc., to mention just a few of the altogether 50 languages. While major events such as wars or natural disasters are usually well covered in world languages such as English, French and Spanish, many small events are typically only mentioned in the national or even the regional press. For instance, disease outbreaks, small-scale violent events and accidents, fraud cases, etc. are usually not reported outside the national borders. The study by Piskorski et al. (2011), comparing media reports in six languages, showed that only 51 out of 523 events (of the event types violence, natural disasters and man-made disasters) were reported in more than one language. 350 of the 523 events were found in non-English news.

Due to this information complementarity across languages and countries, it is crucial that monitoring systems like EMM process texts in many different languages. Using Machine Translation (MT) into one language (usually English) and filtering the news in that language is only a partial solution because specialist terms and names are often badly translated. The benefits of processing texts in the original language were also formulated by Larkey et al. (2004) in their native language hypothesis.

We observed the following benefits of applying multilingual text mining tools:

1) Different languages cover different geographical areas of the world, for specific subject areas as well as generally. EMM-NewsBrief's news clouds (see http://emm.newsbrief.eu/geo?type=cluster&format=html&language=all) show this clearly.
2) More information on entities (persons and organisations; see the NewsExplorer entity pages) can be extracted from multilingual text. This is due to the different contents found, but also to the varying linguistic coverage of the text mining software.
3) Many more named entity spelling variants (including across scripts) are found when analysing different languages (see the NewsExplorer entity pages). These variant spellings can then be used for improved retrieval, for generating multilingual social networks, and more.
4) News bias – regarding the choice of facts as well as the expression of opinions – will be reduced by looking at the media coming from different countries. News bias becomes visually evident when looking at automatically generated social networks (see, e.g., Pouliquen et al., 2007, and Tanev, 2007). For instance, mentions of national politicians are usually preferred in national news, resulting in an inflated view of the importance of one's own political leaders.

From the point of view of an organisation with a close relationship to many international users, there is thus no doubt that highly multilingual text mining applications are necessary and useful.

3. Ways to harness the benefits of multilinguality

Extracting information from multilingual media reports and merging the information into a single view is possible, but developing text mining tools for each of the languages costs effort and is time-consuming. However, there are various ways to limit the effort per language (for an overview of documented methods, see Steinberger, forthcoming). Some monitoring and automatic alerting functionality can even be achieved with relatively simple means. This section summarises the main multilingual media monitoring functionality provided by the EMM family of applications.

3.1. Multilingual category alerting

EMM categorises all incoming news items into over 1,000 categories, using mostly Boolean combinations or weighted lists of search words and regular expressions. As the categories are the same across all languages, simple statistics can show differences in reporting across languages and countries and highlight any bias (see Figure 1 for some examples). Even automatic alerting about imported events reported in any of the languages is possible: EMM keeps two-week averages for the number of articles falling into any country-category combination (e.g. POLAND-TUBERCULOSIS), so that a higher influx of articles in only one of these combinations can trigger an alert even if the overall number of articles about this category has hardly changed. That way, users are visually alerted to the sudden increase of articles in that combination, even for languages they cannot read (see Figure 2). Once aware, they can translate the articles or search for the cause of the news spike via their professional contacts. This functionality is much used and appreciated by centres for disease prevention and control around the world.

Figure 2. Visual alerting of country-category combinations for all Public-Health-related categories. The alert level decreases from left to right.
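The following sketch illustrates this kind of alerting rule in simplified form; it is not the actual EMM implementation, and the alert threshold (three times the two-week average) and the minimum article count are assumptions made for the example.

from collections import deque

class CategoryAlerter:
    """Illustrative alerting rule: flag a country-category combination when
    today's article count clearly exceeds its recent (two-week) average."""

    def __init__(self, window_days=14, factor=3.0, min_articles=5):
        self.window_days = window_days
        self.factor = factor              # assumed threshold, not EMM's actual setting
        self.min_articles = min_articles
        self.history = {}                 # (country, category) -> deque of daily counts

    def update(self, combination, todays_count):
        counts = self.history.setdefault(combination, deque(maxlen=self.window_days))
        average = sum(counts) / len(counts) if counts else 0.0
        counts.append(todays_count)
        # Alert if today's count is well above the running average.
        return todays_count >= self.min_articles and todays_count > self.factor * average

alerter = CategoryAlerter()
for day, count in enumerate([2, 3, 1, 2, 2, 3, 2, 2, 1, 3, 2, 2, 3, 2, 9]):
    if alerter.update(("POLAND", "TUBERCULOSIS"), count):
        print(f"Day {day}: alert for POLAND-TUBERCULOSIS ({count} articles)")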

3.2. Linking of related news across languages

Every 24 hours, EMM-NewsExplorer clusters the related news of the day, separately for each of the 20 languages it covers, and then links the news clusters to the equivalent clusters in the other languages (see Figure 3). Following the links allows users, for any news cluster of choice, to investigate how, and how intensely, the same event is reported in the different languages. For each news cluster, the number of articles and meta-information such as the entity names found (and more) are displayed. Links to Google Translate allow users to get a rough translation so that they can judge the relevancy of the articles and get an idea of what actually happened.

Figure 3. English news cluster with 26 articles and automatically generated links pointing to equivalent news in the other 19 NewsExplorer languages.

The software additionally tracks related news over time, produces timelines and displays extracted meta-information about the news event. For details about the linking of related news items across languages and over time, see Pouliquen et al. (2008).
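As a rough illustration of how such cross-lingual links might be established (the actual NewsExplorer method is described in Pouliquen et al., 2008), the sketch below scores cluster pairs by the overlap of language-independent features such as entity identifiers and category labels; the feature representation and the linking threshold are our assumptions.

def link_score(cluster_a, cluster_b):
    """Cosine-like overlap between two clusters represented as sets of
    language-independent features (entity IDs, category labels, ...)."""
    features_a, features_b = set(cluster_a["features"]), set(cluster_b["features"])
    if not features_a or not features_b:
        return 0.0
    return len(features_a & features_b) / (len(features_a) * len(features_b)) ** 0.5

def link_clusters(clusters_lang1, clusters_lang2, threshold=0.5):
    """Link each cluster of language 1 to the best-matching cluster of language 2."""
    links = []
    for a in clusters_lang1:
        best = max(clusters_lang2, key=lambda b: link_score(a, b), default=None)
        if best is not None and link_score(a, best) >= threshold:
            links.append((a["id"], best["id"], round(link_score(a, best), 2)))
    return links

english = [{"id": "en-17", "features": {"entity:Angela_Merkel", "cat:FINANCE", "geo:DE"}}]
german  = [{"id": "de-05", "features": {"entity:Angela_Merkel", "cat:FINANCE", "geo:DE", "cat:EU"}},
           {"id": "de-09", "features": {"cat:SPORTS", "geo:FR"}}]
print(link_clusters(english, german))   # -> [('en-17', 'de-05', 0.87)]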

3.3. Multilingual information gathering on named entities

EMM-NewsExplorer identifies references to person and organisation names in twenty languages. It automatically determines whether newly found names (within the same script or across different scripts) are simply spelling variants of an already known name or whether they are new names (for details, see Pouliquen & Steinberger, 2009). The EMM database currently contains up to 400 different automatically collected spellings for the same entity. Any EMM application making use of named entity information uses unique entity identifiers instead of concrete name spellings, which allows information to be merged across documents, languages and scripts. The EMM software furthermore keeps track of titles and other expressions found next to the name, keeps statistics on where and when the names were found, and records which entities are frequently mentioned together. The latter information is used to generate social networks that are derived from the international media and are thus independent of national viewpoints. EMM software also detects quotations by and about each entity. The accumulated multilingual results are displayed on the NewsExplorer entity pages (see Figure 4), through which users can explore entities, their relations and related news. Click on any entity name in any of the EMM applications to explore this functionality.

Figure 4. Information automatically gathered over time by EMM-NewsExplorer from media reports in twenty or more languages on one named entity.
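A much simplified illustration of the variant-merging decision is sketched below: a new spelling is attached to a known entity when, after normalisation, it is sufficiently similar to one of the stored spellings. The normalisation steps and the 0.8 similarity threshold are illustrative assumptions and not the method of Pouliquen & Steinberger (2009); merging across different scripts would additionally require transliteration, which is not shown.

import difflib
import unicodedata

def normalise(name):
    """Lower-case, strip accents and unify hyphen/space variation before comparison."""
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    return " ".join(name.lower().replace("-", " ").split())

def merge_variant(new_spelling, known_entities, threshold=0.8):
    """Return the entity ID the new spelling belongs to, or None if it looks new."""
    target = normalise(new_spelling)
    best_id, best_score = None, 0.0
    for entity_id, spellings in known_entities.items():
        for spelling in spellings:
            score = difflib.SequenceMatcher(None, target, normalise(spelling)).ratio()
            if score > best_score:
                best_id, best_score = entity_id, score
    return best_id if best_score >= threshold else None

known = {"ent-42": ["Recep Tayyip Erdoğan", "Recep Tayyip Erdogan"]}
print(merge_variant("Reçep Tayip Erdogan", known))   # matched to 'ent-42'
print(merge_variant("Angela Merkel", known))         # no match -> None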

3.4. Multilingual event scenario template filling

For a smaller subset of currently seven languages, the EMM-NEXUS software extracts structured descriptions of events relevant for global crisis monitoring, such as natural disasters; accidents; violent, medical and humanitarian events, etc. (Tanev et al., 2009; Piskorski et al., 2011). For each news cluster about any such event, the software detects the event type; the event location; the counts of dead, wounded, displaced, arrested etc. persons; the perpetrator of the event; as well as the weapons used, if applicable. Contradictory information found in different news articles (such as differing victim counts) is resolved to produce a best guess. The aggregated event information is then displayed on NewsBrief (in text form) and on EMM-Labs (in the form of a geographic map; continuously updated live maps are available at http://emm.newsbrief.eu/geo?type=event&format=html&language=all; see Figure 5).

Figure 5. EMM-Labs geographical visualisation of events extracted from media reports in seven languages.
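How contradictory counts might be reconciled can be shown with a simple aggregation rule; taking the median of the reported numbers per slot is our own stand-in for the actual resolution heuristics used in EMM-NEXUS.

from statistics import median

def best_guess_counts(reports):
    """Aggregate per-slot victim counts reported by different articles
    into a single best guess (here: the median of the reported values)."""
    slots = {}
    for report in reports:                      # e.g. {"dead": 12, "wounded": 30}
        for slot, value in report.items():
            slots.setdefault(slot, []).append(value)
    return {slot: int(median(values)) for slot, values in slots.items()}

articles = [{"dead": 12, "wounded": 30},
            {"dead": 15},
            {"dead": 12, "wounded": 25}]
print(best_guess_counts(articles))   # -> {'dead': 12, 'wounded': 27}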

4. JRC's multilingual text mining resources

The previous section gave a rather brief overview of EMM functionality without giving technical detail. Scientific-technical details and evaluation results for all applications have been described in various publications available at http://langtech.jrc.ec.europa.eu/.

The four main EMM applications are freely accessible to everybody. Additionally, the JRC has made available a number of resources (via the same website) that will hopefully be useful for developers of multilingual text mining systems. The JRC-Acquis parallel corpus in 22 languages (Steinberger et al., 2006), comprising altogether over 1 billion words, was publicly released in 2006, followed by the DGT-Translation Memory in 2007. A new resource that can be used both as a translation memory and as a parallel corpus for text mining is currently under preparation. JRC-Names, a collection of over 400,000 entity names and their multilingual spelling variants gathered in the course of seven years of daily news analysis (see Section 3.3), was released in September 2011 (Steinberger et al., 2011). JRC-Names also comprises software to look up these known entities in multilingual text. Finally, the JRC Eurovoc Indexing software JEX, which categorises text in 23 different languages according to the thousands of subject domain categories of the Eurovoc thesaurus (see http://eurovoc.europa.eu/), will also be released soon.

5. Ongoing and forthcoming work

EMM customers have been making daily use of the media monitoring software for years. While being generally satisfied with the service, they would like to have more functionality and even higher language coverage. The JRC's ongoing research and development work focuses on three main text mining areas. (1) Multilingual multi-document summarisation: the purpose is to automatically summarise the thousands of news clusters generated every day. (2) Machine Translation (MT): while commercial MT software currently translates Arabic and Chinese EMM texts into English and hyperlinks to Google Translate are offered for all other languages, the JRC is working on developing its own MT software, based on Moses (Koehn et al., 2007). (3) Opinion mining / sentiment analysis: EMM users are not only interested in receiving content, but would also like to see opinions on certain subjects. They would like to see differences of opinion across different countries and media sources, as well as trends showing changes over time. See the JRC's Language Technology website for publications showing the current progress in these fields.

6. Acknowledgements

Developing the EMM family of applications was a major multi-annual team effort. We would like to thank our present and former colleagues in the OPTIMA group for all their hard work.

7. References

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E. (2007): Moses: Open Source Toolkit for Statistical Machine Translation. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.

Larkey, L., Feng, F., Connell, M., Lavrenko, V. (2004): Language-specific Models in Multilingual Topic Tracking. Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 402-409.

Piskorski, J., Belyaeva, J., Atkinson, M. (2011): Exploring the usefulness of cross-lingual information fusion for refining real-time news event extraction: A preliminary study. Proceedings of the 8th International Conference 'Recent Advances in Natural Language Processing'. Hissar, Bulgaria, 14-16 September 2011.

Pouliquen, B., Steinberger, R. (2009): Automatic Construction of Multilingual Name Dictionaries. In: C. Goutte, N. Cancedda, M. Dymetman & G. Foster (eds.), Learning Machine Translation. MIT Press - Advances in Neural Information Processing Systems Series (NIPS), pp. 59-78.

Pouliquen, B., Steinberger, R., Deguernel, O. (2008): Story tracking: linking similar news over time and across languages. Proceedings of the 2nd Workshop 'Multi-source Multilingual Information Extraction and Summarization' (MMIES'2008) held at CoLing'2008. Manchester, UK, 23 August 2008.

Pouliquen, B., Steinberger, R., Belyaeva, J. (2007): Multilingual multi-document continuously updated social networks. Proceedings of the Workshop 'Multi-source Multilingual Information Extraction and Summarization' (MMIES'2007) held at RANLP'2007, pp. 25-32. Borovets, Bulgaria, 26 September 2007.

Steinberger, R. (forthcoming): A survey of methods to ease the development of highly multilingual Text Mining applications. Language Resources and Evaluation Journal (LRE).

Steinberger, R., Pouliquen, B., van der Goot, E. (2009): An Introduction to the Europe Media Monitor Family of Applications. In: F. Gey, N. Kando & J. Karlgren (eds.), Information Access in a Multilingual World - Proceedings of the SIGIR 2009 Workshop (SIGIR-CLIR'2009), pp. 1-8. Boston, USA, 23 July 2009.

Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., Varga, D. (2006): The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006), pp. 2142-2147. Genoa, Italy, 24-26 May 2006.

Steinberger, R., Pouliquen, B., Kabadjov, M., van der Goot, E. (2011): JRC-Names: A freely available, highly multilingual named entity resource. Proceedings of the 8th International Conference 'Recent Advances in Natural Language Processing'. Hissar, Bulgaria, 14-16 September 2011.

Tanev, H. (2007): Unsupervised Learning of Social Networks from a Multiple-Source News Corpus. Proceedings of the Workshop 'Multi-source Multilingual Information Extraction and Summarization' (MMIES'2007) held at RANLP'2007, pp. 33-40. Borovets, Bulgaria, 26 September 2007.

Tanev, H., Zavarella, V., Linge, J., Kabadjov, M., Piskorski, J., Atkinson, M., Steinberger, R. (2009): Exploiting Machine Learning Techniques to Build an Event Extraction System for Portuguese and Spanish. LinguaMÁTICA, 2, pp. 55-66.


Regular Papers

Generating Inflection Variants of Multi-Word Terms for French and German

Simon Clematide, Luzia Roth
Institute of Computational Linguistics, University of Zurich
Binzmühlestr. 14, 8050 Zürich
E-mail: simon.clematide@uzh.ch, luzia.roth@access.uzh.ch

Abstract

We describe a free Web-based service for the inflection of single words and multi-word terms for French and German. Its primary purpose is to provide glossary authors (instructors or students) of an open electronic learning management system with a practical way to add inflected variants for their glossary entries. The necessary morpho-syntactic processing for analysis and generation is implemented with finite-state transducers and a unification-based grammar framework in a declarative and principled way. The techniques required for German and French terms cover two typologically different types of term creation, and both can easily be transferred to other languages.

Keywords: morphological generation, morphological analysis, multi-word terms, syntactic analysis, syntactic generation

1. Introduction

In the age of electronic media and the rapid proliferation of technical terms and concepts, the use of glossaries and their dynamic linkage into running text seems important and self-evident in the area of e-learning. However, depending on the morphological properties of a language, e.g. the use of compounds or multi-word terms, or the degree of surface modification that inflection imposes on words, constructing inflected term variants from typically uninflected glossary entries is not a trivial task.

In this article, we describe two Web services for inflected term variant generation that illustrate the different requirements regarding morphological and syntactic processing. Whereas French shows modest inflectional variation in comparison to German, French requires more effort regarding the syntactic analysis of complex nominal phrases. For German, guessing the correct inflection class of unknown compounds is more important.

A linguistically informed method for inflected term variant generation involves morphological and syntactic analysis and generation. In order to ensure this bidirectional processing, declarative linguistic frameworks such as finite-state transducers and rule-based unification grammars are beneficial. For a practical system, however, one wants to be able to analyze a wider range of expressions than what should actually be generated and presented to the user; e.g. entries in the form of back-of-the-book indexes should be understood by the system, but these forms will not appear in running text.

Figure 1: Screenshot of the glossary author interface

The main application domain for our services is the e-Learning Management Framework OLAT (Online Learning and Training; see http://www.olat.org for further information about the open source project), where we provide glossary authors with an easy but fully controllable way to add inflected variants for their glossary entries. Our free Web-based generation service is realized as a Common Gateway Interface (CGI) and delivers a simple XML document customized for further processing in the glossary back-end of the e-learning management software OLAT (see http://kitt.cl.uzh.ch/kitt/olat). The service is only called once for a given term, viz. when the glossary author edits an entry. As shown in Fig. 1, the glossary author is free to select or deselect any of the generated word forms.

2. Methods and Resources

In this section, we first describe the lexical and morphological resources used for French and German. In Section 2.2 we discuss the implementation of the syntactic processing module.

2.1. Lexical Resources

2.1.1. Lexical resources for French

Morphalou, a lexicon of inflected word forms for French (95,810 lemmata, 524,725 inflected forms; see http://www.cnrtl.fr/lexiques/morphalou, freely available for educational and academic purposes), was used as the lexical resource to automatically build the finite-state transducer which provides all lexical information, including word forms and morphological tags. We use the Xerox Finite State Tools (XFST) (Beesley & Karttunen, 2003), which seamlessly integrate with the Xerox Linguistic Environment (XLE), see http://www2.parc.com/isl/groups/nltt/xle.

After the first evaluation on our development set, some modifications were made to extend the vocabulary: as derivations with neo-classical elements are quite common in terminological expressions, all adjectives were additionally combined with the prefixes of a list (http://fr.wiktionary.org/wiki/Catégorie:Préfixes_en_français) to create derivational forms such as audiovisuel, interethnique or biomédical.

Furthermore, from all lexicon entries containing a hyphen, the beginning of the entry up to and including the hyphen was extracted. This string was taken as a prefix and combined with nouns to cover cases like demi-charge.
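A minimal sketch of this vocabulary extension step, assuming plain Python data structures instead of the actual transducer build: prefixes taken from a prefix list and harvested from hyphenated lexicon entries are concatenated with existing lemmata to produce additional derivational entries.

# Illustrative vocabulary extension; the actual service compiles such entries
# into a finite-state transducer rather than keeping them in Python sets.
neo_classical_prefixes = {"audio", "inter", "bio"}          # from a prefix list
lexicon = {"visuel", "ethnique", "médical", "charge", "demi-charge"}

# Harvest hyphenated prefixes such as "demi-" from existing lexicon entries.
hyphen_prefixes = {entry.split("-", 1)[0] + "-" for entry in lexicon if "-" in entry}

derived = {p + lemma for p in neo_classical_prefixes for lemma in lexicon if "-" not in lemma}
derived |= {p + lemma for p in hyphen_prefixes for lemma in lexicon if "-" not in lemma}

print(sorted(derived))   # includes 'audiovisuel', 'interethnique', 'biomédical', 'demi-charge', ...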

2.1.2. Lexical resources for German

We use the lexicon molifde (Clematide, 2008), which was mainly built by us by exploiting a full-form lexicon generated by Morphy (Lezius, 2000), the German lexicon of the translation system OpenLogos 6, and the morphological resource Morphisto (Zielinski & Simon, 2008). The manually curated resource contains roughly 40,000 lemmas (nouns, adjectives, verbs), and by applying automatic rules for derivation and conversion an additional set of 100,000 lemmas is created.

Footnote 6: Containing approx. 120,000 entries with inflection class categorizations of varying quality, see http://logos-os.dfki.de.

As noun compounds are the most common and productive form of terms in German, a suffix-based inflection class guesser for nouns is necessary. In an evaluation experiment with 200 randomly selected nouns from a sociology lexicon 7, about 40% of the entries were unknown. We implemented a finite-state based ending guesser by exploiting frequency counts of lemma endings (3 up to 5 characters) from our curated lexicon. Roughly 80% of the 73 unknown singular nouns were assigned their correct inflection class. The finite-state based ending guesser is tightly coupled with the finite-state transducer derived from our lexicon. See Clematide (2009) for technical implementation details.

Footnote 7: http://www.socioweb.org
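To make the ending-guesser idea concrete, the following short Python sketch (our own illustration, not the authors' finite-state implementation; the lexicon entries and class names are hypothetical) collects ending-to-class frequency counts from a curated lexicon and assigns an unknown noun the most frequent class of its longest known ending.

```python
from collections import Counter, defaultdict

# Hypothetical mini-lexicon: lemma -> inflection class (class names invented here).
CURATED_LEXICON = {
    "Automat": "n_masc_weak",
    "Analyse": "n_fem_n",
    "Theorie": "n_fem_n",
    "Soziologie": "n_fem_n",
    "Lexikon": "n_neut_ka",
}

def collect_ending_counts(lexicon, min_len=3, max_len=5):
    """Count how often each lemma ending (3 to 5 characters) occurs per inflection class."""
    counts = defaultdict(Counter)
    for lemma, infl_class in lexicon.items():
        for n in range(min_len, max_len + 1):
            if len(lemma) >= n:
                counts[lemma[-n:].lower()][infl_class] += 1
    return counts

def guess_inflection_class(noun, counts, min_len=3, max_len=5):
    """Return the most frequent class of the longest known ending, or None for short/unseen nouns."""
    for n in range(max_len, min_len - 1, -1):   # prefer the longest matching ending
        if len(noun) < n:
            continue
        ending = noun[-n:].lower()
        if ending in counts:
            return counts[ending].most_common(1)[0][0]
    return None

if __name__ == "__main__":
    counts = collect_ending_counts(CURATED_LEXICON)
    print(guess_inflection_class("Industrie", counts))   # -> 'n_fem_n' (via the ending 'rie')
```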

2.2. Morpho-syntactic Analysis and Generation

While the generation of inflected variants for single words can easily be done with finite-state techniques alone, this is not the case for a proper treatment of complex multi-word terms. Therefore, we decided to use a unification-based grammar framework for syntactic processing.

The Xerox Linguistic Environment (XLE) has several benefits for our purposes:

Firstly, finite-state transducers for morphological processing integrate in a seamless and efficient way. Additionally, different tokenizer transducers can be specified for analysis and generation. This proved to be useful for the treatment of French, e.g. regarding hyphenated compounds.

Secondly, there are predefined commands in XLE for parsing a term to its functional structure, neutralizing certain morpho-syntactic features, and generating all possible strings from an underspecified functional structure (a minimal sketch of this parse-neutralize-generate cycle is given further below).

Thirdly, the implementation of optimality theory in XLE allows a principled way of specifying preference heuristics, for instance for the part of speech of an ambiguous word. Additionally, using optimality marks makes it possible to analyze more constructions than should be generated, e.g. terms in the format of back-of-the-book indexes such as Automat, endlich. With the same technique, different lexical specification conventions for French adjectives can be handled by the XLE grammar: lexicon entries like grand, e or grand/e or grand(e) are parsed and will result in the same output grand, grande, grands, grandes.

Lastly, dealing with unknown words is supported in XLE in such a way that parts of a multi-word term that do not undergo inflection may be analyzed and regenerated verbatim. This is useful for the treatment of postnominal prepositional phrases.

                   Terms   Correct Generation   Incorrect Generation                       Accuracy
  Development Set  400     376                  24 (parse failures: 19, wrong parses: 5)   94%
  Test Set         50      48                   2 (parse failures: 1, wrong parses: 1)     98%

Table 1: Evaluation results for French from the development set and test set

The use of a full-blown grammar engineering framework for the generation of inflected term variants might be seen as too much machinery at first sight. However, the experience we gained with this approach is definitely positive. Despite the expressivity of the framework, the processing time for one multi-word term is about 200 ms on an AMD Opteron at 2200 MHz. Given that our service is only called when an entry is created by a glossary author, this performance is adequate.
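The following minimal Python sketch illustrates the parse, neutralize and generate cycle referred to above. It is our own toy illustration rather than XLE code: the mini-lexicon and feature names are hypothetical stand-ins for the real morphological transducer and grammar.

```python
# Toy illustration (not XLE code) of the parse / neutralize / generate cycle.

TOY_LEXICON = {
    # surface form -> (lemma, features)
    "droit":  ("droit", {"pos": "noun", "num": "sg"}),
    "droits": ("droit", {"pos": "noun", "num": "pl"}),
}

def parse(form):
    """Map a surface form to a flat 'functional structure' (a feature dictionary)."""
    lemma, feats = TOY_LEXICON[form]
    return {"lemma": lemma, **feats}

def neutralize(fstructure, feature="num"):
    """Drop a morpho-syntactic feature so that generation becomes underspecified."""
    return {k: v for k, v in fstructure.items() if k != feature}

def generate(fstructure):
    """Produce every surface form compatible with the underspecified structure."""
    compatible = []
    for form, (lemma, feats) in TOY_LEXICON.items():
        if lemma != fstructure["lemma"]:
            continue
        if all(feats.get(k) == v for k, v in fstructure.items() if k != "lemma"):
            compatible.append(form)
    return sorted(compatible)

if __name__ == "__main__":
    underspecified = neutralize(parse("droit"))
    print(generate(underspecified))   # ['droit', 'droits'] : both number variants
```

In the real system, XLE's built-in generation from an underspecified functional structure plays the role of the generate function sketched here.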

2.2.1. French multi-word terms

As French is more analytic than German, compounding is less prominent. The words in a multi-word term are syntactically dependent on each other and require syntactic processing. The most common construction for multi-word terms is a noun combined with a preposition and a noun phrase (e.g. droit de l'individu). Such constructions typically correspond to German compounds. Each noun may be modified by one or more adjectives. For a correct generation of all inflected variants, the core noun and its core adjectives have to be identified, as these are the only parts to be altered for inflected variants. The core part of a French multi-word term is typically the one preceding the preposition (e.g. droit de l'individu → droits de l'individu). Due to this fact, even terms with unknown words can be handled as long as these words follow the preposition. In our XLE grammar, a default parsing strategy for unknown words occurring after a preposition is built in, and on the generation side such input is copied unchanged.

Further constructions for multi-word terms are: a noun with one or more adjectives, expressions with a hyphen (e.g. éthylène-glycol), noun-noun combinations (e.g. assurance maladie) or combinations of several nouns with et or ou (e.g. cause et effet). For our development set of 400 terms (see section 3.1.1 for further details), we get the following distribution: terms with prepositions (190), terms with adjectives (183), noun-noun combinations (16), terms with hyphens (9), combinations of the type noun et noun (2).

2.2.2. Preference heuristics for French

If the parsing of a one-word input term results in ambiguous structures, nouns are preferred to adjectives and verbs, as glossary entries are often nouns. For ambiguous structures of multi-word input terms, the sequence noun-adjective is preferred to noun-noun, e.g. église moderne = noun + adjective instead of noun + noun. If a term is a combination of two nouns, only the first one is inflected, e.g. assurance maladie → assurances maladie.
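Viewed abstractly, these heuristics amount to a simple ranking over candidate analyses. The sketch below is our own illustration with hypothetical type names, not the authors' actual optimality marks.

```python
# Rank ambiguous analyses; the lowest rank wins.
SINGLE_WORD_RANK = {"noun": 0, "adjective": 1, "verb": 2}

def best_single_word(candidates):
    """Prefer noun over adjective over verb for one-word terms."""
    return min(candidates, key=lambda pos: SINGLE_WORD_RANK.get(pos, len(SINGLE_WORD_RANK)))

def best_multi_word(candidates):
    """Prefer the noun-adjective reading over noun-noun for two-word terms."""
    return min(candidates, key=lambda seq: 0 if seq == ("noun", "adjective") else 1)

if __name__ == "__main__":
    print(best_single_word(["adjective", "noun"]))                     # noun
    print(best_multi_word([("noun", "noun"), ("noun", "adjective")]))  # ('noun', 'adjective'), e.g. "église moderne"
```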

In expressions with a hyphen, inflection is carried out by treating the hyphenated part of the term as a normal word: core adjectives or nouns with a hyphen are inflected, all others are not, e.g. éthylène-glycol → éthylène-glycols, or document quasi-négociable → documents quasi-négociables. In these two examples, the second part of the hyphenated expression is a core noun and has to be inflected. But there are cases where both parts of the hyphenated expression are non-core nouns. They are not inflected, as in the example égalité homme-femme → égalités homme-femme. This example follows the construction of a noun-noun multi-word term and is treated as such.

2.2.3. German multi-word terms

A detailed technical report on the XLE-based generation and analysis part for German can be found in Clematide (2009). Currently, German multi-word terms are restricted to the combination of an attributive adjective and a noun that may be given in the textual form of 'adjective noun' or as a back-of-the-book index entry 'noun, adjective'. For instance, the lexicon entry endlicher Automat (finite state automaton) leads to the following 6 inflected forms: endlichem Automaten, endlicher Automat, endlicher Automaten, endlichen Automaten, endliche Automat, endliche Automate.

2.2.4. Related work

As far as term structures in French are concerned, Daille (2003) gives an overview that provided a basis for our own analysis of multi-word term structures. This classification was adapted and extended according to our potential glossary entries.

Jacquemin (2001) developed FASTR, a system for identifying morphological and syntactical term variants for French and English in which minor lexical modifications may also take place. We did not use this system for two main reasons: we also had to treat German, and the creation of lexical variants was of minor importance for us.

In her contrastive study, Savary (2008) discusses different approaches to the computational inflection of multi-word units. She emphasizes the lexical and sometimes idiosyncratic nature of multi-word expressions, which may lead to problems for simple rule-based syntactic systems. However, our small-scale evaluation presented in the next section does not indicate severe problems for our approach.

3. Evaluation

In this section, we present the results of two small-scale evaluations of our tools.

3.1.1. French

A development set with 400 and a test set with 50 glossary entries were taken randomly from EuroVoc 8, the EU's multilingual thesaurus. Table 1 shows the results for both data sets. Parsing failures were due to unknown vocabulary entries such as abbreviations (e.g. CEC, P et T) or compounds (e.g. désoxyribonucléique, spéctrométrie). Surprisingly, quite common French words like jetable and environnemental (which appeared 5 times in the development set) were not covered by the lexicon. To alleviate the problem of missing vocabulary, additional open resources may be exploited 9. Wrong parses were caused by ambiguities between nouns and adjectives.

Footnote 8: http://eurovoc.europa.eu/drupal
Footnote 9: E.g. wiktionaries (http://fr.wiktionary.org/wiki/Wiktionnaire), or different lexica with inflected forms such as lefff - lexique des formes fléchies du français (http://www.labri.fr/perso/clement/lefff), the Dictionnaire DELA fléchi du français (http://infolingu.univ-mlv.fr), or Lexique3 (http://www.lexique.org), a lexicon with lemmata and grammatical categories.

3.1.2. German

50 German multi-word terms were selected randomly from the preferred terms in EuroVoc. Without the unknown word guesser, the generation of inflected variants fails for 10 terms, resulting in an accuracy of 80%. Applying the unknown word guesser for nouns allows a correct generation in 5 of these cases, thus giving an accuracy of 90%. Of the remaining errors, 2 cases are due to unknown short nouns (the guesser requires a minimal length), 2 cases are due to unknown adjectives, and 1 case originated from an implementation error concerning adjectival nouns such as Beamter (civil servant).

4. Conclusions

We have built a practical morphological generation service for French and German terms based on linguistically motivated processing. For multi-word terms, more constructions can easily be added through modifications of the syntactic term grammar.

In order to achieve a higher lexical coverage, other resources can be integrated. In our French system, there is already an interface that allows for the simple addition of new regular nouns and adjectives. For German, additional syntactic constructions for multi-word terms will be added.

In order to resolve ambiguities on the level of parts of speech within multi-token terms, a part-of-speech tagging approach is feasible. However, for that purpose a specifically trained tagger is necessary. In a future step, we plan to extract nominal groups from a syntactically annotated corpus and use that material for the training of a part-of-speech tagger.

5. Acknowledgements

The University of Zurich supported this work by IIL grant funds. Luzia Roth implemented the French part under the supervision of Simon Clematide. The implementation of the lexicographic interface in OLAT was realized by Roman Haag under the supervision of Florian Gnägi.

6. References

Beesley, K.R., Karttunen, L. (2003): Finite-State Morphology: Xerox Tools and Techniques. CSLI Publications.
Clematide, S. (2008): An OLIF-based open inflectional resource and yet another morphological system for German. In A. Storrer et al. (eds.), Text Resources and Lexical Knowledge: Selected Papers from the 9th Conference on Natural Language Processing, KONVENS, Mouton de Gruyter, pp. 183-194.
Clematide, S. (2009): A morpho-syntactic generation service for German glossary entries. In S. Clematide, M. Klenner, and M. Volk (eds.), Searching Answers: Festschrift in Honour of Michael Hess on the Occasion of His 60th Birthday, Münster, Germany: Monsenstein und Vannerdat, pp. 33-43.
Daille, B. (2003): Conceptual Structuring Through Term Variations. In Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pp. 9-16.
Jacquemin, C. (2001): Spotting and Discovering Terms through Natural Language Processing. MIT Press.
Lezius, W. (2000): Morphy - German morphology, part-of-speech tagging and applications. In Proceedings of the 9th EURALEX International Congress, Stuttgart, pp. 619-623.
Savary, A. (2008): Computational Inflection of Multi-Word Units. A contrastive study of lexical approaches. Linguistic Issues in Language Technology (LiLT), 1(2).
Zielinski, A., Simon, C. (2008): Morphisto: An Open-Source Morphological Analyzer for German. In Proceedings of FSMNLP 2008, pp. 177-184.


Tackling the Variation in International Location Information Data: An Approach Using Open Semantic Databases

Janine Wolf 1, Manfred Stede 2, Michaela Atterer 1
1 Linguistic Search Solutions R&D GmbH, Rosenstraße 2, 10178 Berlin
2 Universität Potsdam, Karl-Liebknecht-Straße 24-25, 14476 Potsdam
E-mail: janine@wolf-velten.de, stede@uni-potsdam.de, michaela.atterer@lssrd.de

Abstract
International location information ranges from mere relational descriptions of places or buildings over semi-structured address-like information up to fully structured postal address data. In order to be utilized, e.g. for associating events or people with geographical information, these location descriptions have to be decomposed and the relevant semantic information units have to be identified. However, they show a high amount of variation in order, occurrence and presentation of these semantic information units. In this work we present a new approach of using a semantic database and a rule-based algorithm to tackle the variation in such data and segment semi-structured location information strings into pre-defined elements. We show that our method is highly suitable for data cleansing and classifying address data into countries, reaching an f-score of up to 97 for the segmentation task, an f-score of 91 for the labelled segmentation task, and a success rate of 99% in the classification task.

Keywords: address parsing, OpenStreetMap, address segmentation, data cleansing

1. Introduction

Databases of international location information, as maintained by most companies, often contain incomplete address data, variation in the order of elements, mixing of international conventions for address formatting or even semi-translated address parts. Moreover, the address data can be structured insufficiently or erroneously according to the database fields, which makes the data unusable for further classification, querying and data cleansing tasks.

Table 1 shows a number of possible variations of the same German address.

  address string                               problem description
  Willy-Brandt Street 1, Berlin                partial translations
  #1 Willy-Brandt Street, Berlin 1000          non-standard format
  Willy-Brand-Str. 1                           incorrect spelling
  Willy-Brandt-Str. 1, 1000 Berlin 20          politically outdated
  Willy-Brandt-Str.1, Haus 1, 3.Et., Zi. 101   presence of more detailed information
  In der Willy-Brandt-Str in Berlin            incomplete, e.g. extracted from free text

Table 1: Examples of variation in postal addresses based on the German address Willy-Brandt-Str. 1, 10557 Berlin

Apart from this kind of variation, we also face variation in the description of location objects, such as colloquial variations (Big Apple for New York), historical variations (Chemnitz/Karl-Marx-Stadt), transcription variants (Peking/Beijing) or translation variants (München/Munich).

International addresses create further variation in address data, as the typical Japanese address shown in Table 2 exemplifies.

  part of description string   element type
  11-1                         street number (mixed information: estate and building no.)
  Kamitoba-hokotate-cho        city district
  Minami-ku                    ward of a city (town)
  Kyoto                        city (here: also prefecture)
  601-8501                     postal code

Table 2: Address elements of the Japanese postal address example 11-1 Kamitoba-hokotate-cho, Minami-ku, Kyoto 601-8501

All these variations pose major problems for data warehousing, such as deduplication, record linkage and identity matching.

In this work we propose a method which is highly suitable for data cleansing. Tests on German, Australian and Japanese data show that it is moreover suitable for classifying address data into countries.

Our approach is rule-based and uses the open geographical database OpenStreetMap 1 as well as country-specific rules and patterns. It is robust and easily extensible to further languages.

Footnote 1: http://www.openstreetmap.org

2. Related Work

Most work concerned with the segmentation of location information is based on statistical techniques (Borkar et al., 2001; Agichtein & Ganti, 2004; Christen & Belacic, 2005; Christen et al., 2002; Peng & McCallum, 2003; Marques & Gonçalves, 2004; Cortez & De Moura, 2010).

However, as the experiments by Borkar et al. (2001) show, these methods decrease drastically in performance once confronted with a mixture of location strings from different countries. While an experiment on uniformly formatted address data from the U.S. reaches 99.6% accuracy, performance drops to 88.9% when trained and tested on addresses from mixed countries 2.

Footnote 2: The accuracy measure used in this article is an overall measure of all element-wise measurements for the address elements under consideration and is similar to the labelled recall measure used in Section 4.

There are only a few published approaches based on rule-based systems (Appelt et al., 1992; Riloff, 1993). Rule-based systems are generally thought to be less robust to noise and harder to adapt to other languages, and are thus considered suitable mainly for small domains. However, a comparison of a rule-based and a statistical system (Borkar et al., 2001) showed that rules can compete with statistical approaches, especially on inhomogeneous data. Given the fact that huge geographical databases have become available in recent years, high-quality rule-based systems can be developed for large unrestricted domains with relatively little effort and can easily be extended to more languages by adding more databases for the relevant countries.

3. Location Information Segmentation

Figure 1 shows the general architecture of our system.

Figure 1: The system architecture

In a preprocessing step, the location information string is tokenised and normalised according to country-specific normalisation patterns (e.g. str becomes straße). Initial grouping is done if applicable, i.e. if indications for grouping already exist. These steps are necessary for the later OpenStreetMap query, because abbreviations or partial street names cannot be found in the database.
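As a rough illustration of such country-specific normalisation patterns, abbreviation expansion can be expressed as a per-country list of regular-expression substitutions. The rules below are our own toy examples, not the actual pattern files of the system.

```python
import re

# Toy country-specific normalisation rules (illustrative only).
NORMALISATION_PATTERNS = {
    "de": [
        (re.compile(r"\bstr\b\.?", re.IGNORECASE), "straße"),
        (re.compile(r"\bpl\b\.?", re.IGNORECASE), "platz"),
    ],
    "au": [
        (re.compile(r"\bst\b\.?", re.IGNORECASE), "street"),
        (re.compile(r"\brd\b\.?", re.IGNORECASE), "road"),
    ],
}

def normalise(tokens, country):
    """Expand abbreviations so that later OpenStreetMap lookups can succeed."""
    out = []
    for token in tokens:
        for pattern, replacement in NORMALISATION_PATTERNS.get(country, []):
            token = pattern.sub(replacement, token)
        out.append(token)
    return out

if __name__ == "__main__":
    print(normalise(["Willy-Brandt-Str.", "1", "10557", "Berlin"], "de"))
    # -> ['Willy-Brandt-straße', '1', '10557', 'Berlin']
```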

In the succeeding tagging step, all identifiable geographical names are tagged by querying OpenStreetMap, allowing tagging ambiguities. Country-specific string patterns aid the tagging of elements containing numbers. One of the difficulties within this step is that address elements often consist of more than one token. The challenge lies in querying for Oxford Street and not erroneously tagging Oxford as the place name while leaving Street untagged. This is achieved by a longest-match policy; however, all match information is preserved. Multiple queries are used to account for diacritical variation such as umlauts in German (e.g. ü), parentheses as parts of official geographical names (as in Frankfurt (Main)/Frankfurt Main) and other non-alphanumerical marks such as hyphens.

As a result of this step, the elements are tagged with OpenStreetMap (OSM) internal types and not yet with the address element types we are looking for. OSM types are often ambiguous. The string Potsdam, Brandenburg is tagged as Potsdam (county/city/hamlet) Brandenburg (town, state), for instance 3.

Footnote 3: For a human reader familiar with the location it is clear that this denotes the city of Potsdam within the state of Brandenburg, even though there is also a city called Brandenburg in the state of Brandenburg, for instance, or a hamlet called Potsdam in the state of Schleswig-Holstein.


The following step maps OSM types to address elements. In the OpenStreetMap project, every country-specific subproject, e.g. the Japanese or the German OSM project, has its own guidelines about how to tag locations according to their administrative unit status (as being a city, town or hamlet 4). Therefore we use country-specific mappings from OSM internal types to one or more of the desired target address element types we define.

Footnote 4: A hamlet is a small town or village.

The enrichment step provides rules for labelling address elements which have not been attributed a tag by a previous step, for instance because they were not found in the knowledge base due to spelling errors. The completion rules are of the following form:

  (type1, type2, ..., typen) --> targetAddressElement

If for each typex, x = 1 .. n, the respective type can be found in the list of possible types for the token at index x, the tokens in the sequence are grouped and labelled with the type targetAddressElement. Examples of language-specific completion token types are given in Table 3.

A token tagged with one of these affix types indicates a (possibly still unlabelled) preceding/following location name, and the token group is labelled appropriately, including the marker token.

  compl. type      examples                   description
  town_suf         ku                         Suffix marking a town/ward (Japan)
  station_suf      Station, Ekimae, Meieki    Word marking a train station (Japan)
  village_suf      mura, son                  Suffix marking a village (Japan)
  city_dist_pref   Aza, Koaza                 Prefix usually preceding a city district or sub-district (Japan)
  street_suf       Avenue, Road               Suffix marking a street name (Australia)
  state_pref       Freistaat                  Prefix marking a state name (Germany)

Table 3: Completion types

Some examples of completion rules are listed in Table 4. The left-hand side of a rule specifies the token type pattern, the right-hand side defines the target address element. An @ means that the token at the respective position must not have any possible types other than the specified one.

  completion rule                                              matching example
  (city_prefix, city) --> city                                 Hansestadt Hamburg
  (orientation_prefix, other, street_suffix) --> street_name   Lower Geoge Street (instead of George Street)
  (orientation_prefix, city)                                   East Launcheston
  (contains_street_suffix) --> street_name                     Ratausstraße (instead of Rathausstraße)
  (city, loc_suffix) --> city_district                         Berlin Mitte
  (state_prefix, state) --> state                              Freistaat Bayern
  (@city, @city) --> city                                      Munich (München)
  (street_number, street_number_ext) --> street_number         34a
  (street_number, sep_last_alphanum) --> street_number         34 - 36

Table 4: Example completion rules

The final disambiguation step provides rules which decide which of the attributed types is selected for each element. In the aforementioned example, Brandenburg would thus be tagged as a state and not as a city.

The disambiguation rules take the form

  (leftNeighbourType, currentType, rightNeighbourType)

where currentType is the target address element type of the token group under consideration. Either rightNeighbourType or leftNeighbourType may be empty (i.e. any type is allowed). If such a rule can be applied, the token group under consideration is labelled with currentType.
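To illustrate how completion and disambiguation rules of this kind could be applied in practice, the following Python sketch gives our own simplified reading of the mechanism; the rule inventories and type names are hypothetical.

```python
# Simplified sketch of rule application over a token sequence whose possible
# types were collected in the tagging step (illustrative, not the real system).

COMPLETION_RULES = [
    # (pattern of token types, target address element)
    (("city_prefix", "city"), "city"),                           # e.g. "Hansestadt Hamburg"
    (("street_number", "street_number_ext"), "street_number"),   # e.g. "34" + "a"
]

DISAMBIGUATION_RULES = [
    # (left neighbour type, current type, right neighbour type); None = any type
    ("city", "state", None),   # "Potsdam, Brandenburg": keep "state" after a city
]

def apply_completion(tokens):
    """tokens: list of (surface, set_of_possible_types); returns grouped, labelled units."""
    units, i = [], 0
    while i < len(tokens):
        for pattern, target in COMPLETION_RULES:
            window = tokens[i:i + len(pattern)]
            if len(window) == len(pattern) and all(
                t in types for t, (_, types) in zip(pattern, window)
            ):
                units.append((" ".join(surface for surface, _ in window), {target}))
                i += len(pattern)
                break
        else:
            units.append(tokens[i])
            i += 1
    return units

def disambiguate(units):
    """Pick one type per unit; a matching context rule overrides the default choice."""
    result = []
    for i, (surface, types) in enumerate(units):
        chosen = sorted(types)[0]   # deterministic fallback
        left = sorted(units[i - 1][1])[0] if i > 0 else None
        right = sorted(units[i + 1][1])[0] if i + 1 < len(units) else None
        for l, cur, r in DISAMBIGUATION_RULES:
            if cur in types and (l is None or l == left) and (r is None or r == right):
                chosen = cur
        result.append((surface, chosen))
    return result

if __name__ == "__main__":
    tagged = [("Hansestadt", {"city_prefix"}), ("Hamburg", {"city"})]
    print(disambiguate(apply_completion(tagged)))   # [('Hansestadt Hamburg', 'city')]
```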

4. Experiments

4.1. Data

We conducted our experiments using two different datasets. The first dataset was collected from the Internet, the second corpus was a company-internal database. Eleven external annotators collected variations of location information data from the Internet and annotated them according to the annotation guidelines given in Wolf (2011). They collected 154 strings for German, 35 of which were used for development and the rest for testing. For Australia they collected 143 strings, 34 of which were used for development. The Japanese data were collected and annotated by the first author; 76 of the 242 data points were used for development.

The company-internal database contained 57 examples for Germany, 162 examples for Australia and 56 for Japan. They were already (sometimes not correctly) attributed to 3 database fields: address, postal code and city. To obtain a gold standard, a correct re-ordering of the elements was done manually by the first author.

4.2. Segmentation

Our first experiment consisted of correctly segmenting the Internet data with our system. As a baseline we used unsophisticated systems for each language, which took about 1.5 hours each to program and which use patterns for the postal code, a small list of endings for street names and knowledge about the typical order of address elements in the country. Our evaluation should thus reflect the superiority of a full-fledged system compared to an ad-hoc solution.

Tables 5, 6 and 7 show the evaluation results for the segmentation task for each country using f-scores based on recall and precision as computed by the PARSEVAL measures (cf. Manning & Schütze, 1999), which are suitable for evaluating systems generating bracketed structures with labels.
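For readers unfamiliar with these measures, the following condensed sketch (our own illustration, assuming gold and system segmentations are represented as sets of labelled spans) shows how the unlabelled and labelled f-scores differ.

```python
# Sketch of unlabelled / labelled PARSEVAL-style f-scores over (start, end, label) spans.

def f_score(gold, system, labelled=True):
    strip = (lambda s: s) if labelled else (lambda s: {(b, e) for b, e, _ in s})
    g, s = strip(set(gold)), strip(set(system))
    if not g or not s:
        return 0.0
    precision = len(g & s) / len(s)
    recall = len(g & s) / len(g)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    gold = {(0, 2, "street_name"), (2, 3, "street_number"), (3, 4, "city")}
    system = {(0, 2, "street_name"), (2, 3, "postal_code"), (3, 4, "city")}
    print(round(f_score(gold, system, labelled=False), 2))  # 1.0  (all boundaries correct)
    print(round(f_score(gold, system, labelled=True), 2))   # 0.67 (one label is wrong)
```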

  F-score type   baseline   system
  unlabelled     87.36      96.91
  labelled       70.23      91.36

Table 5: Evaluation results for German data

  F-score type   baseline   system
  unlabelled     68.05      95.85
  labelled       64.93      86.60

Table 6: Evaluation results for Australian data

  F-score type   baseline   system
  unlabelled     75.45      91.80
  labelled       45.47      73.50

Table 7: Evaluation results for Japanese data

The baseline systems above all showed problems with multi-token address elements (Frankfurt (Main), Bad Homburg) and with addresses that did not conform to the standard ordering.

The full-fledged system clearly outperforms the baselines by a difference in f-score (when counting correct labels and not only correct element boundaries) of 21 points for Germany, 12 for Australia and 28 for Japan.

The contribution of the completion patterns was an increase in f-score of up to 13.03 points for the Japanese data (unlabelled) and a minimum of 0.28 for Australia (labelled).

4.3. Data cleansing

In a second experiment we tested whether the system is suitable for data cleansing. A problem already mentioned in the introduction is erroneous data structuring according to the fields of a database. By using the system for attributing address elements to the database fields, we could reduce the rate of elements in an incorrect database field for the company-internal database by 16.77 percentage points (pp) for German, 19.31 pp for Australia, and 29.84 pp for Japan.

4.4. Address classification

We also conducted an experiment to find out whether the system is able to correctly guess the country of a location information string. Our testing method ignores country information (Japan, Germany, Australia) if present, and selects the country by computing the rate of tokens in the input which could not be classified by the system, neither by the database nor by the country-specific patterns for suffixes, prefixes, special words or alphanumeric strings. As a result, the system selects the country with the lowest rate of unlabelled tokens. For this experiment, we used 518 location information strings from both the Internet and the company-internal data (166 for Germany, 271 from Australia, 81 from Japan), 99.22% of which were correctly attributed to their country.
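The classification heuristic can be summarised in a few lines; the sketch below is our own illustration, with a toy stand-in for the database- and pattern-based tagger.

```python
# Guess the country by picking the candidate that leaves the smallest share of
# tokens unlabelled (illustrative sketch; the tagger below is a hypothetical stub).

def unlabelled_rate(tokens, country, label_token):
    """label_token(token, country) -> label or None; returns the share of unlabelled tokens."""
    unlabelled = sum(1 for t in tokens if label_token(t, country) is None)
    return unlabelled / len(tokens) if tokens else 1.0

def guess_country(tokens, countries, label_token):
    return min(countries, key=lambda c: unlabelled_rate(tokens, c, label_token))

if __name__ == "__main__":
    def toy_label(token, country):
        known = {"de": {"berlin", "straße", "10557"}, "jp": {"kyoto", "minami-ku"}}
        return "known" if token.lower() in known.get(country, set()) else None

    print(guess_country(["Willy-Brandt-Straße", "1", "10557", "Berlin"],
                        ["de", "jp", "au"], toy_label))   # -> 'de'
```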

5. Discussion and Future Work

We have presented a system that successfully deals with the high variability in international textual location information by classifying the components of location strings. The implemented system is robust and easily extensible to more countries. We tested the system with 3 countries with strongly diverging standards for the expression of location information (Germany, Australia and Japan). New countries can be added within a few hours, as only certain country-specific files have to be edited and the corresponding OpenStreetMap knowledge base has to be plugged in. Most European countries are similar to Germany, and the U.S. and Canada are almost identical to the Australian system, so that a large part of the world can easily be covered.

The system was shown to successfully improve the address element segmentation in a company-internal database with high variation in orthography and formatting, even containing translated names. Moreover, the system is able to almost always correctly guess the country that textual location information can be attributed to.

In future work, the system can be further improved to deal with a greater variety of typographical or transcription errors by using phonetic indexing algorithms such as Soundex for English or Traphoty matching rules (Lisbach, 2010) for international languages.

6. Acknowledgements

We would like to thank all external annotators who helped gathering and annotating the test data, and LSS R&D GmbH for making a company-internal address database available to us in order to test the system.

7. References

Agichtein, E., Ganti, V. (2004): Mining reference tables for automatic text segmentation. In KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, ACM.
Appelt, D.E., Hobbs, J.R., Bear, J., Israel, D., Tyson, M. (1992): FASTUS: A finite-state processor for information extraction from real-world text.
Borkar, V., Deshmukh, K., Sarawagi, S. (2001): Automatic segmentation of text into structured records.
Christen, P., Belacic, D. (2005): Automated probabilistic address standardisation and verification. Australasian Data Mining Conference 2005 (AusDM05).
Christen, P., Churches, T., Zhu, J.X. (2002): Probabilistic name and address cleaning and standardisation. The Australasian Data Mining Workshop 2002.
Cortez, E., De Moura, E.S. (2010): ONDUX: On-demand unsupervised learning for information extraction. In Proceedings of the 2010 International Conference on Management of Data (SIGMOD '10), pp. 807-818.
Lisbach, B. (2010): Linguistisches Identity Matching. Vieweg+Teubner. ISBN 978-3-8348-9791-6. URL http://dx.doi.org/10.1007/978-3-8348-9791-6_11.
Manning, C.D., Schütze, H. (1999): Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.
Marques, N.C., Gonçalves, S. (2004): Applying a part-of-speech tagger to postal address detection on the Web, 2004.
Peng, F., McCallum, A. (2003): Accurate information extraction from research papers using Conditional Random Fields. In: Information Processing and Management.
Riloff, E. (1993): Automatically constructing a dictionary for information extraction tasks. AAAI Press / MIT Press, pp. 811-816.
Wolf, J. (2011): Classifying the components of textual location information. Diploma Thesis, Department für Linguistik, Universität Potsdam.


Towards Multilingual Biographical Event Extraction
– Initial Thoughts on the Design of a New Annotation Scheme –

Michaela Geierhos *, Jean-Leon Bouraoui §, Patrick Watrin §
* CIS, Ludwig-Maximilians-Universität München, Geschwister-Scholl-Platz 1, D-80539 München, Germany
§ CENTAL, Université Catholique de Louvain, place Blaise Pascal 1, B-1348 Louvain-la-Neuve, Belgium
E-mail: micha@cis.uni-muenchen.de, mehdi.bouraoui@uclouvain.be, patrick.watrin@uclouvain.be

Abstract
In this paper, we describe the special requirements of a semantic annotation scheme used for biographical event extraction in the framework of the European collaborative research project Biographe. This annotation scheme supports interlingual people search due to its multilingual coverage of four languages: English, German, French and Dutch.

Keywords: biographical event extraction for interlingual people search, semantic annotation scheme

1. Introduction

In everyday life, people search is frequently used for private interests such as locating classmates and old friends, finding partners for relationships or checking someone's background.

1.1. People Search within a Business Context

In a business context, finding the right person with the appropriate skills and knowledge is often crucial to the success of projects being undertaken (Mockus & Herbsleb, 2002). For instance, an employee may want to ascertain who worked on a particular project to find out why particular decisions were made without having to crawl through documentation (if there is any). Or, he may require a highly trained specialist to consult about a very specific problem in a particular programming language, standard, law, etc. Identifying experts may reduce costs and facilitate a better solution than could be achieved otherwise.

Possible scenarios could be the following ones:

- A personnel officer wants to find information about a person who applied for a specific position and has to collect additional career-related information about the applicant;
- A company requires a description of the state of the art in some field and therefore wants to locate an expert in this knowledge area;
- An enterprise has to set up an additional team supporting an existing group and has to find new employees with similar expertise;
- Organizers of a conference have to match submissions with reviewers;
- Job centers or even labor bureaus are interested in mapping appropriate job offers to personal data sheets.

These scenarios demonstrate that it is a real challenge within any commercial, scientific, or governmental organization to manage the expertise of employees such that experts in a particular area can be identified.

1.2. Background: The Biographe Project

A step beyond document retrieval, people search is restricted to person-related facts. The Biographe project 1 develops grammar-based analysis tools to extract person-related facts in four languages (English, German, French, and Dutch). The project received the Eurostars 2 label in 2009. Kick-off was in March 2010 and the project lasts for 24 months. The research consortium is composed of four companies and two public research departments, based in four European countries (France, Belgium, Germany, and Austria). The team creates a multipurpose people search platform able to reconstruct biographies of people. It uses all available information sources such as profiles on social websites, press articles, CVs or private documents. The platform collects, extracts and structures this multilingual information in indexes and relational databases ready to be used by different task-oriented people search applications.

Footnote 1: http://www.biographe.org
Footnote 2: http://www.eurostars-eureka.eu

In this context, a semantic annotation scheme is commonly used. But conceiving such a scheme entails several technical, scientific and task-specific issues, especially when the platform is multilingual, which is still quite rare.

1.3. Multilinguality

One innovative feature of our people search platform is its multilinguality, or, to be precise, its ability to structure information coming from the four different European languages detailed above in a common database. By using this multilingual database, it is possible to create applications that search for people through queries and documents in different languages, a feature known as interlingual search (or Cross-Language Information Retrieval, CLIR). Creating a common multilingual database allows the development of a pan-European and wholly accessible search engine offering interfaces in English and in several major European languages. Besides, this people search engine is able to handle diacritical marks such as accents (circumflex, trema, tilde, double grave accent, etc.). This apparently simple feature is very rare, due to the dominance of American search engines, which neglect all accents. Accents, diacritics and non-Latin symbols are very important in order to differentiate between people.

1.4. Objectives of the Paper

This extended abstract states our initial thoughts on the design of a new annotation scheme in such a specific framework. Since we cooperate with companies providing business and people search solutions, which already have established parsing technologies, our annotation scheme has to fulfill their technical requirements. Therefore, we only mention one of the main state-of-the-art schemes that is commonly used for biographical annotation tasks. We do not give a critical overview of all existing schemes, because we have to develop an integrated solution and therefore, to some extent, reinvent the wheel. Furthermore, we discuss the particular context of multilingual annotation and finally give an example of our annotation scheme.

2. Yet another Annotation Scheme?

2.1. Linguistic Notion of Biographical Events

We define biographical events as predicative relations linking several arguments, one of which is an instance belonging to the argument type "person". There is no restriction on the selection of the other elements participating in a biographical relationship. However, we observed that the other arguments are typically instances of a limited set of further semantic classes, as illustrated by the following examples:

a. John Miller retired as senior accountant in 1909.
b. Michael Caine won the Academy Award for Best Supporting Actor.
c. Jim Sweeney will also be joining AmeriQuest as Vice President.

2.2. Events in the Information Extraction Task

One approach to defining events is used for Information Extraction (IE), "the automatic identification of selected types of entities, relations, or events in free text" (Grishman, 2003:545). In general, information extraction tasks use surface-based patterns to identify concepts and relations between them. Patterns may be handcrafted or learned automatically, but typically include a combination of character strings, part-of-speech or phrasal information (Grishman, 1997). A succession of regular expressions is normally used to identify these structures; they are applied when triggered by keywords (McDonald, 1996). Most information extraction systems either use hand-written extraction patterns or use a machine learning algorithm that is trained on a manually annotated corpus. Both of these approaches require massive human effort and hence prevent information extraction from becoming more widely applicable.

The problem that we are addressing is related to the traditional IE task covered by the sixth and seventh Message Understanding Conferences (MUC) 3 and later replaced by the Automatic Content Extraction (ACE) campaigns. According to the MUC campaigns, identifying an IE event means extracting fillers for a predefined event template. In this framework, IE events were identified by rule-based, lexicon-driven, machine learning or other systems.

Footnote 3: http://www-nlpir.nist.gov/related_projects/muc/

2.3. The ACE Annotation Guidelines for Events

Since 1999, ACE (Automatic Content Extraction) 4 has replaced MUC and has extended the task definition for the campaigns, including more and more scenarios. For the ACE task (Doddington et al., 2004), the participating systems are supposed to recognize several predefined semantic types of events (life, movement, transaction, business, conflict, personnel, etc.) together with the constituent parts corresponding to these events (agent, object, source, target, time, location, etc.). For example, Table 1 provides an overview of the LIFE event type (with several subtypes including BORN, DIED, etc.), together with the arguments which should be extracted for these events.

Footnote 4: http://projects.ldc.upenn.edu/ace/

There are also approaches that identify events according to the TimeML annotation guidelines, using rule-based (Saurí et al., 2005) or machine learning approaches (Bethard & Martin, 2006). The TimeML specification language was used to create the TimeBank corpus (Pustejovsky et al., 2003).

  Life event subtype   Arguments
  BE-BORN              Person, Time, Place
  MARRY                Person, Time, Place
  DIVORCE              Person, Time, Place
  INJURE               Agent, Victim, Instrument, Time, Place
  DIE                  Agent, Victim, Instrument, Time, Place

Table 1: An overview of ACE LIFE event subtypes

2.4. Limits of the ACE annotation scheme

Since we dedicate our research to biographical events, we only address the LIFE and PERSONNEL event types defined by the ACE English Annotation Guidelines for Events (Linguistic Data Consortium, 2005, p. 65 and sq.).

Concerning the ACE English Annotation Guidelines for Events, the number of arguments considered relevant is quite limited. For example, the BE-BORN event type disregards useful information such as the birth name, family background, or birth defects. In particular, birth names are useful to distinguish between people, by identifying that, for example, Stefani Joanne Angelina Germanotta and Lady Gaga are the same person in the following context:

d. Lady Gaga was born as Stefani Joanne Angelina Germanotta on March 28, 1986.

Since we need more detailed information about people, their work and their occupations, we dismiss the ACE annotation standard for biographical event types. Hence we propose a more suitable one, which we present in the next sections.

3. Requirements of the Annotation Scheme

3.1. Compatible with Local Grammars

Within the Biographe project, we focus on a linguistic description of biographical events. For example, the (born ... died ...) parentheses typically used in biographical articles help us to spot the dates of birth and death in the first line of a biography. However, there are variations in expressing a lifetime period, e.g. Jane Smith (June 1965 – September 14, 2001); in this case, the keywords born and died are totally missing. There are many syntactic variations in heterogeneous text expressing the same types of biographical information (e.g. birth, death), which are reduced to the basics in a structured representation. Our project partners created local grammars (Gross, 1997) using the free software tool Unitex 5 (Paumier, 2010) in order to describe the syntactic and lexical structures of biographical information. Formally, local grammars are recursive transition networks (Woods, 1970), symbolized by graphs (cf. Figure 1).

Footnote 5: http://www-igm.univ-mlv.fr/~unitex

In the framework described above, the need for named entity annotation is evident. Indeed, the approach relies on the accurate identification of named entities and their corresponding relations. Consequently, it is necessary to design an annotation scheme that is capable of being integrated into the local grammar concept and can be applied to all languages supported by our system.

Figure 1: Local grammar for the extraction of person data fields belonging to the event "Birth"

3.2. Definition of Annotation Units

As stated above, the scheme is used for biographical information annotation. Hence, we defined a set of named entity categories as well as the relations between them. More precisely, the scheme follows three principles:

1) The definition of "entity patterns": they are the basic components of the annotation scheme. They capture the main characteristics that can be used for describing an entity, e.g. "location", "date", etc. Until now, there are 20 different entity patterns.

2) The next higher level is the definition of "event patterns": they are composed of two or more entity patterns. In event patterns, entity patterns play different roles: one will always be the head of a pattern, and other optional or mandatory patterns can be attached to this head. For instance, the event pattern "awards" has as its head the entity pattern "person", whose arguments are the entity patterns "domain", "date", and optionally another "person" if there is more than one award winner. At the same level, we also define so-called "relation patterns". They build up the relationship between different entity partners in order to express the type of relation between them.

3) The highest level embodies the definition of "template sets". They are derived from different event patterns and/or relation patterns. For example, the template set "career" comprises two event patterns, "profession" and "awards", which themselves consist of different entity patterns, as explained above.

3.3. Annotation Sample

Here is an instance of the use of this annotation scheme.

e. Elio Di Rupo was born on July, 18th, 1951, at Morlanwelz, from Italian parents who arrived in Belgium in 1947.

After applying the annotation scheme described above, we get the following result (use of one event pattern and of two entity patterns):

{Elio Di Rupo,.N+comp+PERS} was born on {July 18th 1951,.ADV+Time+Moment} at Morlanwelz, from Italian {parents,.N+comp+FAMILY+IMMEDIATE} who arrived in Belgium {in 1947,.ADV+Time+Moment+Imprecis}.


3.4. Annotation Features

The sample sentence (e) annotated in Section 3.3 shows that the scheme foresees the future application of anaphora resolution tools. Until now, it only works with anaphoric pronouns, but it is planned to extend its capabilities to more complex anaphoric terms.

The {} notation is used beside XML because the annotation scheme has to be processed by the UNITEX 6 system (Paumier, 2010:44-46), which expects this kind of meta-syntax in order to treat multi-word expressions (e.g. "July 18th 1951") on the one hand and to assign lexico-semantic types (e.g. ADV+Time) to text units on the other hand.

Footnote 6: http://www-igm.univ-mlv.fr/~unitex

Moreover, the attribute "variable" can be assigned the values 0 or 1, depending on whether syntactic variability is possible for a recognized unit (e.g. "Elio de Rupo"). Since the city name given in our sample sentence (here: "Morlanwelz") cannot change its structure, we assign 0 to the attribute "variable". However, "Elio de Rupo" can appear elsewhere in the text as "de Rupo" or "Elio" or "de Rupo, Elio" or "Mr de Rupo" and so on. We therefore assign 1 to the attribute "variable".

3.5. Technical Basis

The scheme is defined in XML format. It will be applied to the text to be annotated in conjunction with the DBpedia ontology 7. Recall that this is an ontology based on the extraction and organisation of Wikipedia 8 information. The ontology features different categories, each one corresponding to the description of a characteristic of an object or concept. These categories are linked using the Web Ontology Language (OWL), defined by the World Wide Web Consortium (W3C) 9.

Footnote 7: http://dbpedia.org
Footnote 8: http://www.wikipedia.org
Footnote 9: http://www.w3.org/TR/owl-features

This fine-grained ontology would be largely sufficient to cover all of the task's needs. Besides, it could easily be used for producing annotations in the Resource Description Framework (RDF) 10 triple format, also defined by the W3C. This entails that it could easily be used for conceiving and implementing a database.

Footnote 10: http://www.w3.org/RDF

Besides, such a database could be queried using the SPARQL language 11, an SQL-like query language especially designed by the W3C to be compatible with OWL and RDF.

Footnote 11: http://www.w3.org/TR/rdf-sparql-query

This solution meets all of the specifications required by the project: knowledge representation, indexation for most human languages (beyond English, German, French, and Dutch), updatability of the database, etc.
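As a hedged illustration of this idea, the Python sketch below stores two biographical facts as RDF triples with rdflib and retrieves them with a SPARQL query; the namespace and property names are hypothetical and not the project's actual vocabulary.

```python
from rdflib import Graph, Literal, Namespace, URIRef

# Hypothetical namespace and properties for biographical facts (illustrative only).
BIO = Namespace("http://example.org/biographe/")

g = Graph()
person = URIRef(BIO["Elio_Di_Rupo"])
g.add((person, BIO.birthDate, Literal("1951-07-18")))
g.add((person, BIO.birthPlace, Literal("Morlanwelz")))

# SQL-like SPARQL query over the triple store.
query = """
PREFIX bio: <http://example.org/biographe/>
SELECT ?person ?place WHERE {
    ?person bio:birthPlace ?place .
}
"""

for row in g.query(query):
    print(row.person, row.place)
```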

4. Conclusion and Future Work

This paper described an annotation scheme conceived and implemented in the framework of a European project. Compared to other schemes, its main advantages are its multilingual support and its generality for any named-entity-related task.

Our short-term perspective is to evaluate its robustness, especially when automatically applied by local grammars. In the future, we will adapt it to other named-entity-related tasks and additional natural languages.

5. Acknowledgements

This work is supported by the Eurostars Programme, an R&D initiative funded by the European Community, the Brussels Institute for Research and Innovation (INNOVIRIS), and by the German Federal Ministry of Education and Research (Grant No. 01QE0902B). We express our sincere thanks to all for financing this research within the collaborative research project Biographe E!4621 (http://www.biographe.org).

6. References<br />

Bethard, S., Martin, J. (2006): Identification of event<br />

mentions and their semantic class, in Proceedings of<br />

the Conference on Empirical Methods in Natural<br />

Language Processing (EMNLP–2006), Association for<br />

Computational Linguistics, Sydney, Australia,<br />

pp. 146-154.<br />

Doddington, G., Mitchell, A., Przybocki, M., Ramshaw,<br />

L., Strassel, S., Weischedel, R. (2004): The Automatic<br />

Content Extraction (ACE) Program. Tasks, Data, and<br />

Evaluation, in Proceedings of the Fourth International<br />

Conference on Language Resources and Evaluation<br />

(LREC 2004), Canary Islands, Spain.<br />

Grishman, R. (1997): Information Extraction: Techniques and Challenges, in M. T. Pazienza (ed.), Proceedings of the Information Extraction International Summer School (SCIE-97), Springer-Verlag.

Grishman, R. (2003): Information Extraction, in R.<br />

Mitkov (ed.), The Oxford Handbook of Computational<br />

Linguistics, Oxford University Press, pp. 545-559.<br />

Gross, M. (1997): The Construction of Local Grammars,<br />

in E. Roche & Y. Schabes (eds), Finite-State Language<br />

Processing, MIT Press, Cambridge, Massachusetts,<br />

USA: 329-354.<br />

Linguistic Data Consortium (2005): ACE English<br />

Annotation Guidelines for Events, Version 5.4.3<br />

2005.07.01,<br />

http://www.ldc.upenn.edu/Projects/ACE/docs/English<br />

-Events-Guidelines_v5.4.3.pdf<br />

McDonald, D. (1996): Internal and External Evidence in

the Identification and Semantic Categorization of<br />

Proper Names, in Corpus Processing for Lexical<br />

Acquisition: MIT Press, pp. 31-43.<br />

Mockus, A., Herbsleb, J.D. (2002): Expertise browser: a<br />

quantitative approach to identifying expertise. In<br />

ICSE’02: Proceedings of the 24th International<br />

Conference on Software Engineering, pp. 503–512.<br />

Paumier, S. (2010): Unitex User Manual 2.1,<br />

http://igm.univ-mlv.fr/~unitex/UnitexManual2.1.pdf.<br />

Pustejovsky, J., Castaño, J, Ingria, R., Saurí, R.,<br />

Gaizauskas, R., Setzer, A., Katz, G., Radev, D. (2003):<br />

TimeML: A specification language for temporal and<br />

event expressions, in Proceedings of the International<br />

Workshop of Computational Semantics (IWCS–2003),<br />

Tilburg, The Netherlands.<br />

Saurí, R., Verhagen, M., Pustejovsky, J. (2005), Evita: A<br />

robust event recognizer for QA systems, in<br />

Proceedings of the Joint Human Language Technology<br />

Conference and Conference on Empirical Methods in<br />

Natural Language Processing (HLT/EMNLP-2005),<br />

Vancouver, Canada, pp. 700-707.<br />

Woods, W. A. (1970): Transition network grammars for<br />

natural language analysis, in Communications of the<br />

ACM, n° 10, vol. 13, ACM, New York, NY, USA,<br />

pp. 591–606.



The Corpus of Academic Learner English (CALE): A new resource for the study<br />

of lexico-grammatical variation in advanced learner varieties<br />

Marcus Callies, Ekaterina Zaytseva<br />

Johannes-Gutenberg-<strong>Universität</strong> Mainz, Department of English and Linguistics<br />

Jakob-Welder-Weg 18, 55099 Mainz<br />

E-mail: mcallies@uni-mainz.de, zaytseve@uni-mainz.de

Abstract<br />

This paper introduces the Corpus of Academic Learner English (CALE), a Language for Specific Purposes learner corpus that is<br />

currently being compiled for the quantitative and qualitative study of lexico-grammatical variation patterns in advanced learners'<br />

written academic English. CALE is designed to comprise seven academic genres produced by learners of English as a foreign<br />

language in a university setting and thus contains discipline- and genre-specific texts. The corpus will serve as an empirical basis to<br />

produce detailed case studies that examine individual (or the interplay of several) determinants of lexico-grammatical variation, e.g.<br />

semantic, structural, discourse-motivated and processing-related ones, but also those that are potentially more specific to the<br />

acquisition of L2 academic writing such as task setting, genre and writing proficiency. Another major goal is to develop a set of<br />

linguistic criteria for the assessment of advanced proficiency conceived of as "sophisticated language use in context". The research<br />

findings will be applied to teaching English for Academic Purposes by creating a web-based reference tool that will give students<br />

access to typical collocational patterns and recurring phrases used to express rhetorical functions in academic writing.<br />

Keywords: learner English, academic writing, lexico-grammatical variation, advanced proficiency<br />

1. Introduction<br />

Recently, second language acquisition (SLA) research has<br />

seen an increasing interest in advanced stages of<br />

acquisition and questions of near-native competence.<br />

Corpus-based research into learner language (Learner<br />

Corpus Research, LCR) has contributed to a much clearer<br />

picture of advanced interlanguages, providing evidence<br />

that learners of various native language (L1) backgrounds<br />

have similar problems and face similar challenges on their<br />

way to near-native proficiency. Despite the growing<br />

interest in advanced proficiency, the fields of SLA and<br />

LCR are still struggling with i) a definition and<br />

clarification of the concept of "advancedness", ii) an<br />

in-depth description of advanced learner varieties (ALVs), and iii) the

operationalization of such a description in terms of criteria<br />

for the assessment of advancedness. In this paper, we<br />

introduce the Corpus of Academic Learner English<br />

(CALE), a Language for Specific Purposes learner corpus<br />

that is currently being compiled for the quantitative and<br />

qualitative study of lexico-grammatical variation patterns<br />

in advanced learners' written academic English.<br />

2. Corpus design and composition<br />

Already existing learner corpora, such as the<br />

International Corpus of Learner English (Granger et al.,<br />

2009) include learner writing of a general argumentative,<br />

creative or literary nature, and thus not academic writing<br />

in a narrow sense. Thus, several patterns of variation that<br />

predominantly occur in academic prose (or are subject to<br />

the characteristic features of this register) are not<br />

represented at all or not frequently enough in general<br />

learner corpora. CALE is designed to comprise academic<br />

texts produced by learners of English as a foreign<br />

language (EFL) in a university setting. CALE may<br />

therefore be considered a Language for Specific Purposes<br />

learner corpus, containing discipline- and genre-specific<br />

texts (Granger & Paquot, forthcoming). Similar corpora<br />

that contain native speaker (NS) writing and may thus<br />

serve as control corpora for CALE are the Michigan<br />

Corpus of Upper-Level Student Papers (MICUSP, Römer<br />

& Brook O'Donnell, forthcoming) and the British<br />

Academic Written English corpus (BAWE, Alsop &<br />

Nesi, 2009).<br />




CALE's seven academic text types ("genres") are written<br />

as assignments by EFL learners in university courses, see<br />

Figure 1.<br />


Figure 1: Academic text types in CALE<br />

We are currently collecting texts and bio data from<br />

German, Chinese and Portuguese students, and are<br />

planning to include data from EFL learners of other L1<br />

backgrounds to be able to draw cross-linguistic and<br />

typological comparisons as to potential L1 influence.<br />

The text classification we have developed for CALE is<br />

comparable with the NS control corpora mentioned<br />

above, but we have created clear(er) textual profiles,<br />

adopting the situational characteristics and linguistic<br />

features identified for academic prose by Biber and<br />

Conrad (2009). A text's communicative purpose or goal<br />

serves as the main classifying principle, which helps to<br />

set apart the seven genres in terms of<br />

a) text's general purpose<br />

b) its specific purpose(s)<br />

c) the skills the author demonstrates, and<br />

d) the author's stance.<br />

In addition, we list the major features of each text type<br />

as to<br />

a) structural features<br />

b) length, and<br />

c) functional features.<br />

3. Corpus annotation<br />

Students submit their texts in electronic form (typically<br />

in .doc, .docx or .pdf file format). Thus, some manual<br />

pre-processing of these incoming files is necessary.<br />

Extensive "non-linguistic" information (such as table of<br />

contents, list of references, tables and figures, etc.) is<br />

deleted and substituted by placeholder tags around their<br />

headings or captions. The body of the text is then<br />

annotated for meta-textual, i.e. underlying structural<br />

features (section titles, paragraphs, quotations, examples,<br />

etc.) with the help of annotation tools. The texts are also<br />

annotated (in a file header) for metadata, i.e. learner<br />

variables such as L1, age, gender, etc. which are collected<br />

through a written questionnaire. The file header also<br />

includes metadata that pertain to each individual text<br />

such as genre, type of course and discipline the text was<br />

written in, the setting in which the text was produced etc.<br />

This information is also collected with the help of a<br />

questionnaire that accompanies each text submitted to the<br />

corpus. In the future, we also intend to implement further<br />

linguistic levels of annotation, e.g. for rhetorical function<br />

or sentence type.<br />
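As a rough illustration of what such a file header could contain, the following Python sketch builds a small XML header with learner and text metadata. The element names and values are invented for this example and do not reproduce the actual CALE header format.

import xml.etree.ElementTree as ET

header = ET.Element("header")                  # hypothetical header element
learner = ET.SubElement(header, "learner")     # learner variables from the questionnaire
ET.SubElement(learner, "L1").text = "German"
ET.SubElement(learner, "age").text = "23"
ET.SubElement(learner, "gender").text = "female"
text_meta = ET.SubElement(header, "text")      # metadata pertaining to the individual text
ET.SubElement(text_meta, "genre").text = "research paper"
ET.SubElement(text_meta, "course").text = "linguistics seminar"
ET.SubElement(text_meta, "setting").text = "untimed, take-home assignment"

print(ET.tostring(header, encoding="unicode"))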

4. Research program<br />

In the following sections, we outline our research<br />

program. We adopt a variationist perspective on SLA,<br />

combining a learner corpus approach with research on<br />

interlanguage variation and near-native competence.<br />

4.1. The study of variation in SLA research<br />

Interlanguages (ILs) as varieties in their own right are<br />

characterized by variability even more than native<br />

languages. Research on IL-variation since the late 1970s<br />

has typically focused on beginning and intermediate<br />

learners and on variational patterns in pronunciation and<br />

morphosyntax, i.e. the (un-)successful learning of<br />

actually invariant linguistic forms and the occurrence of<br />

alternations between native and non-native equivalent<br />

forms. Such studies revealed developmental patterns,<br />

interpreted as indicators of learners' stages of acquisition,<br />

and produced evidence that IL-variation co-varies with<br />

linguistic, social/situational and psycholinguistic context,<br />

and is also subject to a variety of other factors like<br />

individual learner characteristics and biographical<br />

variables (e.g. form and length of exposure to the L2).<br />

Since the early 2000s there has been an increasing<br />

interest in issues of sociolinguistic and sociopragmatic<br />

variation in advanced L2 learners (frequently referred to<br />

as sociolinguistic competence), e.g. learners' use of<br />

dialectal forms or pragmatic markers (mostly in L2<br />

French, see e.g. Mougeon & Dewaele, 2004; Regan,<br />

Howard & Lemée, 2009). This has marked both a shift



from the study of beginning and intermediate to advanced<br />

learners, and a shift from the study of norm-violations to<br />

the investigation of differential knowledge as evidence of<br />

conscious awareness of (socio-)linguistic variation.<br />

4.2. Advanced Learner Varieties (ALVs)<br />

There is evidence that advanced learners of various<br />

language backgrounds have similar problems and face<br />

similar challenges on their way to near-native proficiency.<br />

In view of these assumed similarities, some of which will<br />

be discussed in the following, we conceive of the<br />

interlanguage of these learners as Advanced Learner<br />

Varieties (ALVs).<br />

In a recent overview of the field, Granger (2008:269)<br />

defines advanced (written) interlanguage as "the result of<br />

a highly complex interplay of factors: developmental,<br />

teaching-induced and transfer-related, some shared by<br />

several learner populations, others more specific".<br />

According to her, typical features of ALVs are overuse of<br />

high frequency vocabulary and a limited number of<br />

prefabs, a much higher degree of personal involvement,<br />

as well as stylistic deficiencies, "often characterized by<br />

an overly spoken style or a somewhat puzzling mixture of<br />

formal and informal markers".<br />

Moreover, advanced learners typically struggle with the<br />

acquisition of optional and/or highly L2-specific<br />

linguistic phenomena, often located at interfaces of<br />

linguistic subfields (e.g. syntax-semantics, syntax-pragmatics,

see e.g. DeKeyser, 2005:7ff). As to academic<br />

writing, many of their observed difficulties are caused by<br />

a lack of understanding of the conventions of academic<br />

writing, or a lack of practice, but are not necessarily a<br />

result of interference from L1 academic conventions<br />

(McCrostie, 2008:112).<br />

4.3. Patterns and determinants of variation in L2<br />

academic writing<br />

Our research program involves the study of L2 learners’<br />

acquisition of the influence of several factors on<br />

constituent order and the choice of constructional<br />

variants (e.g. genitive and dative alternation,<br />

verb-particle placement, focus constructions). One<br />

reason for this is that such variation is often located at the<br />

interfaces of linguistic subsystems, an area where<br />

advanced learners still face difficulties. Moreover,<br />

grammatical variation in L2 has not been well researched<br />

to date and is only beginning to attract researchers'<br />

attention (Callies, 2008, 2009; Callies & Szczesniak,<br />

2008).<br />

There are a number of semantic, structural,<br />

discourse-motivated and processing-related determinants<br />

that influence lexico-grammatical variation whose<br />

interplay and influence on speakers' and writers'<br />

constructional choices has been widely studied in<br />

corpus-based research on L1 English. Generally speaking,<br />

in L2 English these determinants play together with<br />

several IL-specific ones such as mother tongue (L1) and<br />

proficiency level, and in (academic) writing, some<br />

further task-specific factors like imagined audience (the<br />

people to whom the text is addressed), setting, and genre<br />

add to this complex interplay of factors, see Figure 2.<br />

Figure 2: Determinants of variation in L1 and L2<br />

academic writing<br />

It is important to note at this point that differences<br />

between texts produced by L1 and L2 writers that are<br />

often attributed to the influence of the learners' L1 may in<br />

fact turn out to result from differences in task-setting<br />

(prompt, timing, access to reference works, see Ädel,<br />

2008), and possibly task-instruction and imagined<br />

audience (see Ädel, 2006:201ff for a discussion of corpus<br />

comparability). Similarly, research findings as to<br />

learners' use of features that are more typical of speech<br />

than of academic prose have been interpreted as<br />

unawareness of register differences, but there is some<br />

evidence that the occurrence of such forms may also be<br />

caused by the influence of factors like the development of<br />

writing proficiency over time (novice writers vs. experts,<br />

see Gilquin & Paquot, 2008; Wulff & Römer, 2009),<br />

task-setting and -instruction, imagined audience and<br />

register/genre (e.g. academic vs. argumentative writing,<br />

see Zaytseva, 2011).




4.4. Case study<br />

In this section, we provide an example of how<br />

lexico-grammatical variation plays out in L2 academic<br />

writing. In a CALE pilot study of the (non-)<br />

representation of authorship in research papers written by<br />

advanced German EFL learners, Callies (2010) examined<br />

agentivity as a determinant of lexico-grammatical<br />

variation in academic prose. He hypothesized that even<br />

advanced students were insecure about the representation<br />

of authorship due to a mixture of several reasons:<br />

conflicting advice by teachers, textbooks and style guides,<br />

the diverse conventions of different academic disciplines,<br />

students' relative unfamiliarity with academic text types<br />

and lack of linguistic resources to report events and<br />

findings without mentioning an agent. Interestingly, the<br />

study found both an overrepresentation of the first person<br />

pronouns I and we, but also an overrepresentation of the<br />

highly impersonal subject-placeholders it and there<br />

(often used in the passive voice) as default strategies to<br />

suppress the agent, see examples (1) and (2).<br />

(1) There are two things to be discussed in this section.<br />

(2) It has been shown that…<br />

While this finding seems to be contradictory, it can be<br />

explained by a third major finding, namely the significant<br />

underrepresentation of inanimate subjects which are,<br />

according to Biber and Conrad (2009:162), preferred<br />

reporting strategies in L1 academic English, exemplified<br />

in (3) and (4).<br />

(3) This paper discusses…<br />

(4) Table 5 shows that…<br />

Callies (2010) concluded that L2 writers have a narrower<br />

inventory of linguistic resources to report events and<br />

findings without an overt agent, and their insecurity and<br />

unfamiliarity with academic texts adds to the observed<br />

imbalanced clustering of first person pronouns,<br />

dummy-subjects and passives. The findings of this study<br />

also suggest that previous studies that frequently explain<br />

observed overrepresentations of informal, speech-like<br />

features by pointing to learners' higher degree of<br />

subjectivity and personal involvement (Granger, 2008) or<br />

unawareness of register differences (Gilquin & Paquot,<br />

2008), may need to be supplemented by studies taking<br />


into account a more complex interplay of factors that also<br />

includes the limited choice of alternative strategies<br />

available to L2 writers.<br />

5. Implications for language teaching<br />

and assessment<br />

The project we have outlined in this paper has some<br />

major implications for EFL teaching and assessment. The<br />

research findings will be used to provide<br />

recommendations for EFL teachers and learners by<br />

developing materials for teaching units in practical<br />

language courses on academic writing and English for<br />

Academic Purposes. In the long run, we plan to create a<br />

web-based reference tool that will help students look up<br />

typical collocations and recurring phrases used to express<br />

rhetorical moves/functions in academic writing (e.g.<br />

giving examples, expressing contrast, drawing<br />

conclusions etc.). This application will be geared towards<br />

students' needs and can be used as a self-study reference<br />

tool at all stages of writing an academic text. Users will<br />

be able to access information in two ways:<br />

1) form-to-function, i.e. looking up words and phrases in<br />

an alphabetical index to see how they can express<br />

rhetorical functions, and 2) function-to-form, i.e.<br />

accessing a list of rhetorical functions to find words and<br />

phrases that are typically used to encode them.<br />
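The two access modes can be thought of as one inventory indexed in both directions. The following Python sketch uses invented example phrases and function labels; it only illustrates the form-to-function and function-to-form look-up and does not reflect actual CALE data.

function_to_form = {
    "expressing contrast": ["by contrast", "in contrast", "on the other hand"],
    "drawing conclusions": ["to conclude", "in conclusion", "taken together"],
    "giving examples": ["for example", "for instance", "such as"],
}

# Derive the reverse index (form-to-function) from the same inventory.
form_to_function = {}
for function, phrases in function_to_form.items():
    for phrase in phrases:
        form_to_function.setdefault(phrase, []).append(function)

print(function_to_form["giving examples"])   # function-to-form access
print(form_to_function["to conclude"])       # form-to-function access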

Most importantly, the tool will present in a comparative<br />

manner structures that emerged as problematic in<br />

advanced learners' writing, for example untypical lexical<br />

co-occurrence patterns and over- or underrepresented<br />

words and phrases, side by side with those structures that<br />

typically occur in expert academic writing. This will<br />

include information on the immediate and wider context<br />

of use of single items and multi-word-units.<br />

While the outcome is thus particularly relevant for future<br />

teachers of English, it may also be useful for students and<br />

academics in other disciplines who have to write and<br />

publish in English. Unlike in the Anglo-American<br />

education system, German secondary schools and<br />

universities do not usually provide courses in academic<br />

writing in the students' mother tongue, so that first-year<br />

students have basically no training in academic writing<br />

at all.<br />

It has been mentioned earlier that the operationalization<br />

of a quantitatively and qualitatively well-founded<br />

description of advanced proficiency in terms of criteria



for the assessment of advancedness is still lacking. Thus,<br />

a major aim of the project is to develop a set of linguistic<br />

descriptors for the assessment of advanced proficiency.<br />

The descriptors and can-do-statements of the Common<br />

European Framework of Reference (CEFR) often appear<br />

too global and general to be of practical value for<br />

language assessment in general, and for describing<br />

advanced learners' competence as to academic writing in<br />

particular. Ortega and Byrnes (2008) discuss four ways in<br />

which advancedness has commonly been operationalised,<br />

ultimately favouring what they call "sophisticated<br />

language use in context", a construct that includes e.g. the<br />

choice among registers, repertoires and voice. This<br />

concept can serve as a basis for the development of<br />

linguistic descriptors that are characteristic of academic<br />

prose, e.g. the use of syntactic structures like inanimate<br />

subjects, phrases to express rhetorical functions (e.g. by<br />

contrast, to conclude, in fact), reporting verbs (discuss,<br />

claim, suggest, argue, propose etc.), and lexical<br />

co-occurrence patterns (e.g. conduct, carry out and<br />

undertake as typical verbal collocates of experiment,<br />

analysis and research).<br />

6. References<br />

Ädel, A. (2006): Metadiscourse in L1 and L2 English.<br />

Amsterdam: Benjamins.<br />

Ädel, A. (2008): Involvement features in writing: do time<br />

and interaction trump register awareness? In G.<br />

Gilquin, S. Papp, & M.B. Diez-Bedmar (Eds.),<br />

Linking up Contrastive and Learner Corpus Research.<br />

Amsterdam: Rodopi, pp. 35-53.<br />

Alsop, S., Nesi, H. (2009): Issues in the development of<br />

the British Academic Written English (BAWE) corpus.<br />

Corpora, 4(1), pp. 71-83.<br />

Biber, D., S. Conrad (2009): Register, Genre, and Style.<br />

Cambridge: Cambridge University Press.<br />

Callies, M. (2008): Easy to understand but difficult to use?<br />

Raising constructions and information packaging in<br />

the advanced learner variety. In G. Gilquin, S. Papp &<br />

M.B. Diez-Bedmar (Eds.), Linking up Contrastive and<br />

Learner Corpus Research. Amsterdam: Rodopi,<br />

pp. 201-226.<br />

Callies, M. (2009): Information Highlighting in<br />

Advanced Learner English. Amsterdam: Benjamins.<br />

Callies, M. (2010): The (non-)representation of<br />

authorship in L2 academic writing. Paper presented at<br />

ICAME 31 "Corpus Linguistics and Variation in<br />

English", 26-30 May 2010, Giessen/Germany.<br />

Callies, M., Szczesniak, K. (2008): Argument realization,<br />

information status and syntactic weight - A<br />

learner-corpus study of the dative alternation. In P.<br />

Grommes & M. Walter (Eds.), Fortgeschrittene<br />

Lernervarietäten. Korpuslinguistik und<br />

Zweitspracherwerbsforschung. Tübingen: Niemeyer,<br />

pp. 165-187.<br />

DeKeyser, R. (2005): What makes learning second<br />

language grammar difficult? A review of issues.<br />

Language Learning, 55(s1), pp. 1-25.<br />

Gilquin, G., Paquot, M. (2008): Too chatty: Learner<br />

academic writing and register variation. English Text<br />

Construction, 1(1), pp. 41-61.<br />

Granger, S. (2008): Learner corpora. In A. Lüdeling & M.<br />

Kytö (Eds.), Corpus Linguistics. An international<br />

handbook, Vol. 1. Berlin & New York: Mouton de<br />

Gruyter, pp. 259-275.<br />

Granger, S., Paquot, M. (forthcoming): Language for<br />

Specific Purposes learner corpora. In T.A. Upton & U.<br />

Connor (Eds.), Language for Specific Purposes. The<br />

Encyclopedia of Applied Linguistics. New York:<br />

Blackwell.<br />

Granger, S., Dagneaux, E., Meunier, F., Paquot, M.<br />

(2009): The International Corpus of Learner English.<br />

Version 2. Handbook and CD-ROM.<br />

Louvain-la-Neuve: Presses Universitaires de Louvain.<br />

McCrostie, J. (2008): Writer visibility in EFL learner<br />

academic writing: A corpus-based study. ICAME<br />

Journal, 32, pp. 97-114.<br />

Mougeon, R., Dewaele, J.-M. (2004): Patterns of<br />

variation in the interlanguage of advanced second<br />

language learners. Special issue of International<br />

Review of Applied Linguistics in Language Teaching<br />

(IRAL), 42(4).<br />

Ortega, L., Byrnes, H. (2008): The longitudinal study of<br />

advanced L2 capacities: An introduction. In L. Ortega<br />

& H. Byrnes (Eds.), The Longitudinal Study of<br />

Advanced L2 Capacities. New York: Routledge/Taylor<br />

& Francis, pp. 3-20.<br />

Regan, V., Howard, M., Lemée, I. (2009): The<br />

Acquisition of Sociolinguistic Competence in a Study<br />

Abroad Context. Clevedon: Multilingual Matters.<br />

Römer, U., Brook O’Donnell, M. (forthcoming): From<br />

student hard drive to web corpus: The design,<br />





compilation, annotation and online distribution of<br />

MICUSP. Corpora.<br />

Wulff, S., Römer, U. (2009): Becoming a proficient<br />

academic writer: Shifting lexical preferences in the use<br />

of the progressive. Corpora, 4(2), pp. 115-133.<br />

Zaytseva, E. (2011): Register, genre, rhetorical functions:

Variation in English native-speaker and learner writing.<br />

Hamburg Working Paper in Multilingualism.



From Multilingual Web-Archives to Parallel Treebanks in Five Minutes<br />

Markus Killer, Rico Sennrich, Martin Volk<br />

University of Zurich<br />

Institute of Computational Linguistics, Binzmühlestrasse 14, CH-8050 Zurich, Switzerland<br />

E-mail: markus.killer@uzh.ch, sennrich@cl.uzh.ch, volk@cl.uzh.ch<br />

Abstract<br />

The Tree-to-Tree (t2t) Alignment Pipe is a collection of Python scripts, generating automatically aligned parallel treebanks from<br />

multilingual web resources or existing parallel corpora. The pipe contains wrappers for a number of freely available NLP software<br />

programs. Once these third party programs have been installed and the system and corpus specific details have been updated, the<br />

pipe is designed to generate a parallel treebank with a single program call from a unix command line. We discuss alignment quality<br />

on a fully automatically processed parallel corpus.<br />

Keywords: parallel treebank, automatic tree-to-tree alignment, TreeAligner, Text-und-Berg<br />

1. Introduction<br />

The process of creating parallel treebanks used to be a<br />

tedious task, involving a tremendous amount of manual<br />

annotation (see e.g. Samuelsson & Volk, 2007). Zhechev<br />

and Way (2008:1) state that ”[b]ecause of this, only a<br />

few parallel treebanks exist and none are of sufficient<br />

size for productive use in any statistical MT<br />

application”. Since Zhechev (2009) introduced the Sub-<br />

Tree Aligner, a program for the automatic generation of<br />

parallel treebanks, the feasibility of obtaining large scale<br />

annotated parallel treebanks has increased. However, the<br />

amount of preprocessing needed as well as the missing<br />

conversion of the output into a more human readable<br />

format might have kept potential users of the Sub-Tree<br />

Aligner at a distance. The collection of Python scripts<br />

combined in the Tree-to-Tree Alignment Pipe (t2t-pipe)<br />

described below takes care of all necessary pre- and<br />

postprocessing of Zhechev’s Sub-Tree Aligner,<br />

supporting German, French and English as source and<br />

target languages. The focus of this paper is on the<br />

following two questions, both aimed at maximizing the<br />

quality of the automatic alignments:<br />

• How big does the parallel corpus have to be in order to get satisfactory results?
• What can be said about the role of the text domain/topic of the parallel corpus?

2. Related Work<br />

Zhechev (2009) and Koehn (2009) provide an overview<br />

of recent developments in tree-to-tree alignment, subtree<br />

alignment and the subsequent generation of parallel<br />

treebanks for use in statistical machine translation<br />

systems.<br />

Tiedemann and Kotzé (2009) and Tiedemann (2010)<br />

propose a supervised approach to tree-to-tree alignment,<br />

requiring a small manually aligned or manually<br />

corrected treebank of at least 100 sentence pairs1 for<br />

training purposes.<br />

In terms of script design, the training-script for the<br />

Moses SMT system (Koehn, 2010b) inspired the<br />

organization of the t2t-pipe into several steps that can be<br />

run independently.<br />

3. Parallel Corpora<br />

In an ideal world, one could be inclined to take a<br />

number of parallel articles from a bilingual text<br />

collection and let the t2t-pipe combined with the Sub-<br />

Tree Aligner do the rest. Yet this is only possible if a<br />

suitable word alignment model 2 is available, as we will show in section 5.

1 See http://stp.lingfil.uu.se/~joerg/Lingua/index.html<br />

(accessed: 21/08/11)<br />

2 All word alignment models used in this paper can be<br />

downloaded from: http://t2t-pipe.svn.sourceforge.net/<br />

(accessed: 21/08/11)<br />




With the aim of collecting information on the role of<br />

corpus size and text domain/topic in creating an<br />

automatically aligned parallel treebank, the following<br />

corpora were used:<br />

3.1. Corpus for Tree-to-Tree Alignment<br />

A subcorpus of the Text+Berg corpus (Volk et al., 2010)<br />

consisting of four parallel articles from the Swiss Alpine<br />

Club Yearbook 1977 served as test corpus (see [TUB-4-<br />

ART] in table 1). Details on the corpus with regard to<br />

the extraction of parallel articles and sentence pairs are<br />

described in Sennrich and Volk (2010). For the purpose<br />

of this paper it is sufficient to note that the vast majority<br />

of texts can be attributed to the journalistic textual<br />

domains article/report/review with a strong topical focus<br />

on activities performed by members of the Swiss Alpine<br />

Club (climbing, hiking, trekking) and the alpine<br />

environment in general. As the corpus has been digitised<br />

from printed books it contains OCR errors.<br />

Corpus        Lang.  Tokens       Sentence Pairs
[TUB-4-ART]   DE     21,689       1,171
              FR     25,388       (GIZA++: 1,023)
[TUB]         DE     1,617,301    92,518
              FR     1,921,583    (GIZA++: 80,698)
[EPARL]       DE     35,371,164   1,562,563
              FR     42,427,755   (GIZA++: 1,190,609)

Table 1: Parallel Corpora
[TUB-4-ART] Text+Berg Corpus, 4 Articles, SAC YB 1977
[TUB] Text+Berg Corpus, SAC Yearbooks 1957-1982
[EPARL] Europarl Corpus 1996-2009

3.2. Corpora for Word Alignment<br />

Additionally, we used the complete Text+Berg corpus<br />

[TUB] , the Europarl corpus (Koehn, 2010a) [EPARL]<br />

and combinations of these two corpora to compute<br />

different word alignment models (see table 1 for basic<br />

corpus information). Word alignment is automatically<br />

computed through GIZA++ (Och & Ney, 2003), which<br />

implements the IBM word alignment models. For<br />

performance reasons, we set the maximum sentence<br />

length to 40 tokens 3. Therefore, we used only 83% of the [TUB] corpus and 76% of the [EPARL] corpus to estimate word alignment probabilities (see table 1 for the absolute values in brackets).

3 See http://www.statmt.org/wmt11/baseline.html (accessed: 21/08/11)

We used [EPARL] to test the impact of corpus size on<br />

the results. Moreover, texts from the [EPARL] corpus<br />

belong to a completely different textual domain<br />

(parliament proceedings) and cover a wide range of<br />

political, economic and cultural topics (see Koehn,<br />

2009:53), making it possible to use the data to figure out<br />

the role of text domain/topic in the alignment process.<br />

4. The t2t-pipe<br />

Taking an existing parallel corpus4 as input, the t2t-pipe<br />

runs through seven steps to generate automatic<br />

alignments for individual words and syntactic<br />

constituents in each parallel sentence pair. The<br />

configuration file is deliberately designed in a way that a<br />

number of different third party programs can be chosen<br />

for most of the steps, enabling easy switching between<br />

different configurations. In the brief outline of the<br />

following steps, the configuration that worked best is<br />

indicated (please refer to the t2t-pipe README file5 for<br />

details on all 12 programs used):<br />

4.1. Steps 1-5 – Preprocessing<br />

1) Extraction of Parallel Articles<br />

2) Tokenization<br />

(Python NLTK Punkt-Tokenizer)<br />

Rudimentary OCR cleaning/<br />

Fixing of word division errors<br />

3) Sentence Alignment<br />

(Hunalign with dict.cc dictionary)<br />

4) Statistical Phrase Structure Parsing<br />

(Stanford Parser for German,<br />

Berkeley Parser for French)<br />

5) Word Alignment<br />

(GIZA++ through Moses training script,<br />

enhanced with dict.cc dictionary,<br />

see section 4.2 for an example),<br />

data not lower-cased<br />
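The configuration-driven design mentioned in section 4 can be sketched as a simple mapping from pipeline steps to interchangeable wrapper functions. The wrapper names and the crude placeholder behaviour below are invented for illustration; this is not the actual t2t-pipe code or its configuration keys.

def tokenize_punkt(document):
    """Stand-in for a wrapper around the NLTK Punkt tokenizer."""
    return [s.strip() for s in document.split(".") if s.strip()]

def align_sentences_hunalign(src_sentences, tgt_sentences):
    """Stand-in for a wrapper around Hunalign with a dict.cc dictionary."""
    return list(zip(src_sentences, tgt_sentences))

CONFIG = {
    "tokenizer": tokenize_punkt,
    "sentence_aligner": align_sentences_hunalign,
}

def preprocess(src_document, tgt_document, config=CONFIG):
    """Run the configured tokenizer and sentence aligner on a document pair."""
    src = config["tokenizer"](src_document)
    tgt = config["tokenizer"](tgt_document)
    return config["sentence_aligner"](src, tgt)

print(preprocess("Man versuche einmal. Gut.", "Essayez donc. Bien."))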

4 If no parallel corpus is available, the pipe includes scripts for<br />

the on-the-fly construction of a parallel corpus from the web<br />

archives of the bilingual Swiss Alpine Club magazine<br />

(German-French).<br />

5 Available from: http://t2t-pipe.svn.sourceforge.net/<br />

(accessed: 21/08/11)



4.2. Step 6 - Tree-to-Tree Alignment<br />

This is the most important step in a complete run of the<br />

t2t-pipe, as the automatic alignments are generated by<br />

Zhechev's Sub-Tree Aligner. The process can best be

described by looking at a parallel sentence pair, taken<br />

from [TUB-4-ART]:<br />

1) German sentence: Man versuche einmal einen<br />

solchen Mann abzubremsen.<br />

2) French sentence: Essayez donc de freiner un tel<br />

homme. 6<br />

� Input:<br />

a. Bracketed parse trees of source and target language<br />

(output of the two parsers combined into one file):<br />

(ROOT (NUR (S (PIS Man) (VVFIN versuche) (ADV<br />

einmal) (VP (NP (ART einen) (PIDAT solchen) (NN<br />

Mann)) (VVIZU abzubremsen))) ($. !))) \n<br />

(ROOT (SENT (VN (V Essayez)) (ADV donc) (VPinf (P<br />

de) (VN (V freiner)) (NP (D un) (A tel) (N<br />

homme))) (. !)))\n\n\n<br />

b. Two lexical translation files generated by the Moses<br />

training script and GIZA++, enhanced using a<br />

dict.cc dictionary:<br />

lex.e2f (French – German – Probability)<br />

Homme Mann 1.0000000<br />

homme Mann 1.0000000<br />

mari Mann 1.0000000<br />

ralentir abzubremsen 0.0666667<br />

freiner abzubremsen 0.0666667<br />

lex.f2e (German – French – Probability)<br />

abzubremsen ralentir 0.0053476<br />

abzubremsen freiner 0.0035842<br />

Mann Homme 1.0000000<br />

Mann homme 1.0000000<br />

Mann mari 1.0000000<br />

� Output:<br />

Indexed bracketed parse trees of source and target<br />

language with alignment indices on a separate line<br />

(see Figure 1 for graphical alignments). In our<br />

example sentence, the Sub-Tree Aligner produced<br />

one wrong alignment, linking the German personal<br />

pronoun man to the French finite verb essayez<br />

(emphasised below):<br />

6 Sentences 1) and 2) translate roughly as: [(Why don't) you try<br />

to slow down a man like that (a heavy man)!]<br />

(ROOT::NUR-2 (S-3 (PIS-4 Man)(VVFIN-5 versuche)<br />

(ADV-6 einmal)(VP-7 (NP-8 (ART-9 einen)(PIDAT-10<br />

solchen)(NN-11 Mann))(VVIZU-12 abzubremsen)))($.-<br />

13 !)) \n<br />

(ROOT::SENT-2 (VN::V-4 Essayez)(ADV-5 donc)<br />

(VPinf-6 (P-7 de)(VN::V-9 freiner)(NP-10 (D-11<br />

un)(A-12 tel)(N-13 homme)))(.-14 !)) \n<br />

2 2 4 4 6 5 7 6 8 10 9 11 10 12 11 13 12 9 13 14<br />
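The lexical translation files shown above follow the simple Moses format of one "word word probability" entry per line. The following Python sketch shows one plausible way of enhancing such a table with 1-to-1 word translations from a bilingual word list, as done with the dict.cc dictionary; the merging strategy (dictionary entries scored 1.0) is an assumption for this illustration, not necessarily the exact procedure used in the t2t-pipe.

def read_lex_table(path):
    """Read 'word1 word2 probability' lines into a dictionary."""
    table = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            w1, w2, prob = line.split()
            table[(w1, w2)] = float(prob)
    return table

def add_dictionary_entries(table, pairs, score=1.0):
    """Add 1-to-1 dictionary translations, keeping the higher of the two scores."""
    for w1, w2 in pairs:
        table[(w1, w2)] = max(score, table.get((w1, w2), 0.0))
    return table

# Hypothetical usage with the German-French example above:
# lex_f2e = read_lex_table("lex.f2e")
# lex_f2e = add_dictionary_entries(lex_f2e, [("Mann", "homme"), ("Mann", "mari")])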

4.3. Step 7 - Conversion to TigerXML/TMX<br />

We converted the output of Zhechev’s Sub-Tree Aligner<br />

into two language specific TigerXML files and an<br />

additional XML file containing information on node<br />

alignments. These files can be easily imported into the<br />

graphical interface of the Stockholm TreeAligner<br />

(Lundborg et al., 2007). Figure 1 shows the previously<br />

introduced sentence pair – including the automatically<br />

computed links – in the treebank browser perspective of<br />

the Stockholm TreeAligner.<br />

Figure 1: Automatically aligned sentence pair in<br />

Stockholm TreeAligner<br />

The second supported output format is TMX, a format<br />

for current translation memory systems (tested with<br />

OmegaT7 ).<br />
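For the TMX export, the aligned sentence pairs only need to be wrapped in translation units. The following sketch writes a minimal TMX 1.4 file with Python's standard library; the header attributes chosen here are generic assumptions and not the exact values produced by the t2t-pipe.

import xml.etree.ElementTree as ET

def write_tmx(sentence_pairs, path, src_lang="de", tgt_lang="fr"):
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "srclang": src_lang, "adminlang": "en", "segtype": "sentence",
        "datatype": "plaintext", "o-tmf": "none",
        "creationtool": "t2t-sketch", "creationtoolversion": "0.1",
    })
    body = ET.SubElement(tmx, "body")
    for src_text, tgt_text in sentence_pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src_text), (tgt_lang, tgt_text)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    ET.ElementTree(tmx).write(path, encoding="utf-8", xml_declaration=True)

write_tmx([("Man versuche einmal einen solchen Mann abzubremsen.",
            "Essayez donc de freiner un tel homme.")], "sample.tmx")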

5. Treebank Alignment Quality<br />

We ran six experiments (summarized in table 2) on the<br />

test corpus [TUB-4-ART] (see table 1). In each<br />

experiment, the corpus used to compute the lexical<br />

translation probabilities with GIZA++ either differed<br />

7 Available from: http://www.omegat.org (accessed: 21/08/11)<br />




Experiment            1             2             3             4           5             6
Corpus                [TUB-4-ART]   [TUB-4-ART]   [EPARL]       [TUB]       [TUB-EPARL]   [TUB-EPARL]
Corpus Size GIZA++    1,023 SP      1,023 SP      1,190,609 SP  80,698 SP   258,971 SP    1,271,307 SP
In-domain (%)         100.0%        100.0%        0.0%          100.0%      31.0%         6.0%
Dict.cc SA/WA         NO            YES           YES           YES         YES           YES
Precision WA          57.8%         61.1%         51.3%         65.9%       69.1%         69.2%
Precision PhA         58.3%         65.4%         51.8%         81.7%       79.5%         80.4%
Precision allA        57.9%         62.1%         51.4%         69.2%       71.3%         71.7%
Correct links per SP  8.66          9.63          9.02          12.48       13.64         13.98

Table 2: Alignment precision and average number of correct links in the treebank of the [TUB-4-ART] corpus (1,171 sentence pairs) with respect to size, enhancement through additional lexical resources and textual domain of the corpus used to compute the lexical translation probabilities.
Precision = Correct Alignments / Suggested Alignments; SP: Sentence Pair(s); SA: Sentence Alignment; WA: Word Alignment; PhA: Phrase Alignment; allA: Word & Phrase Alignments; In-domain: domain correspondence of treebank and WA corpus

with respect to corpus size and textual domain, or was enhanced with external lexical resources (the dict.cc dictionary).

We manually checked an average of 545 alignments (77% word alignments, 23% phrase alignments) in 32 randomly selected sentence pairs 8 for each of the six resulting treebanks, using the Stockholm TreeAligner. Our information on changes in recall is based on the absolute number of correct links in the manually checked sentence pairs (average no. of correct links = average no. of all links 9 × precision 10).
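As a small worked example of this arithmetic (with invented counts, not figures from Table 2): if 545 checked alignments contain 338 correct ones, and the Sub-Tree Aligner proposed on average 15.5 links per sentence pair over the whole treebank, precision and the recall proxy are computed as follows.

checked_links = 545                   # manually inspected alignments (sample size)
correct_links = 338                   # judged correct in the sample (invented figure)
precision = correct_links / checked_links

links_per_sp = 15.5                   # avg. suggested links per sentence pair (invented figure)
correct_links_per_sp = links_per_sp * precision

print(round(precision, 3), round(correct_links_per_sp, 2))   # -> 0.62 9.61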

5.1. Corpus Size<br />

Looking at the configuration outlined in section 4, three<br />

of the seven steps in the t2t-pipe directly depend on the<br />

corpus size (Tokenization (Dehyphenation), Sentence<br />

Alignment and Word Alignment). The analysis of the<br />

alignment quality in the resulting parallel treebank<br />

shows that roughly 1000 sentence pairs are not enough<br />

to get satisfactory results with an overall precision of<br />

57.9% (see table 2, experiment 1). Initial tests have<br />

shown that Zhechev’s Sub-Tree Aligner is highly<br />

8 This number proved to be sufficient to include at least 100<br />

Phrase Alignments in the sample. The identity of the treebank<br />

was masked for the manual evaluation.<br />

9 computed by Sub-Tree Aligner for the whole treebank<br />

10 computed from manually checked sentence pairs<br />

dependent on the quality of the word alignments<br />

supplied. Even though the algorithm does not directly<br />

replicate the GIZA++ alignments:<br />

[M]y system uses a probabilistic bilingual<br />

dictionary derived from the GIZA++ word<br />

alignments, thus being able to side-step errors<br />

present in the original word-alignment data and<br />

to find new possible alignments that GIZA++<br />

had skipped for the particular sentence pair.<br />

(Zhechev, 2009:73)<br />

We employed two measures to increase the precision of<br />

the alignments:<br />

1) We enhanced the lexical translation probabilities<br />

computed by GIZA++ by extracting all 1-to-1 word<br />

translations from the freely available dict.cc<br />

dictionary (DE-FR), leading to a substantial<br />

increase in precision (+ 4.2%) and in recall (+ 0.97<br />

correct links per sentence pair).<br />

2) Step-by-step, we increased the corpus size, making<br />

use of all available resources. In experiment 3 it<br />

becomes clear that a huge increase of corpus size<br />

alone is no guarantee for better alignment results:<br />

When we use the 1,190,609 sentence pair [EPARL]<br />

corpus on its own, the recall drops by 0.61 correct



links per sentence pair and the precision by 10.7%<br />

compared to experiment 2. However, increasing the<br />

size of the [TUB] corpus from 1,023 to 80,698<br />

sentence pairs as a basis for the word alignment<br />

model leads to the biggest leap in the experiment<br />

sequence in both precision (+ 7.1%) and recall<br />

(+2.85 correct links per sentence pair) compared to<br />

experiment 2.<br />

5.2. Domain/Topic Specific Content<br />

The data collected in table 2 suggests that when using<br />

the unsupervised approach proposed by Zhechev (2009)<br />

the domain of the corpus used to compute the lexical<br />

translation probabilities seems to be of great importance.<br />

In experiment 3, we observe the poorest precision of all<br />

experiments with the second biggest corpus [EPARL].<br />

Apart from a few common lexical items (e.g. mountain,<br />

valley, river, ...) there is hardly any overlap in terms of<br />

textual domain/topic (see section 3) and the [TUB- 4-<br />

ART] corpus itself was not used to compute lexical<br />

probabilities in experiment 3 (hence the 0%<br />

correspondence between the two corpora).<br />

Comparing these results to the supervised approach by<br />

Tiedemann and Kotzé (2009), there seems to be an<br />

important difference, as they observe “only a slight drop in performance when training on a different textual domain” (204). The main reason for this might be that

in the supervised approach the program trains phrase<br />

alignments from manually aligned training data<br />

(relatively domain/topic independent), whereas in the<br />

unsupervised approach the parallel corpus is used to<br />

compute lexical translation probabilities (heavily<br />

dependent on domain/topic).<br />

5.3. The Right Balance of Corpus Size and<br />

Domain/Topic Specific Content<br />

Bearing this difference of the two approaches in mind, it<br />

is not surprising that balancing (in terms of textual<br />

domain/topic - experiment 5) or expanding (maximising<br />

corpus size - experiment 6) the word alignment model<br />

affects the results in a different way:<br />

When using a better model for estimating<br />

lexical probabilities (more data:<br />

Europarl+SMULTRON) the performance<br />

improves only slightly to about 58.64% [F-Score compared to 57.57%]

(Tiedemann & Kotzé, 2009:204)<br />

In the unsupervised approach (used in the t2t-pipe)<br />

however, the use of a better word alignment model<br />

[TUB-EPARL] increases the recall by another 1.16 and<br />

1.50 correct links per sentence pair, respectively<br />

(experiments 5/6), compared to the largest corpus with a<br />

100% domain correspondence (experiment 4). For<br />

phrase alignments, we achieved a precision of roughly<br />

80% from a corpus size of approx. 80,000 sentence pairs<br />

of the same domain (experiments 4-6). The maximum<br />

precision of word alignments in this set-up (data not<br />

being lower-cased) seems to be around 70% from a<br />

corpus size of about 250,000 sentence pairs, while the<br />

recall can still be slightly increased by supplying more<br />

and more data to estimate lexical probabilities. As long<br />

as there is a solid basis of several 10,000 sentence pairs<br />

belonging to the same textual domain as the parallel<br />

corpus to be aligned, expanding the corpus used to<br />

compute lexical probabilities with material of another<br />

textual domain does not seem to harm the results but can<br />

still help to increase overall precision and recall by a<br />

small margin.<br />

6. Conclusion and Outlook<br />

We designed the t2t-pipe considering the following areas<br />

of application:<br />

1) Assisting human annotators of a parallel treebank<br />

by supplying good alignment suggestions: The<br />

results discussed in section 5 have shown that this<br />

can be achieved by employing a large enough<br />

parallel corpus of approx. 250,000 sentence pairs<br />

with data of the same textual domain. If the corpus<br />

is not big enough, the results can be improved by<br />

adding language material of a completely different<br />

textual domain. We achieved an overall precision of<br />

71.7% (approx. 80% for phrase alignments). Using<br />

a corpus of 500-1,000 sentence pairs (a common<br />

size for human annotated parallel treebanks) or a<br />

word alignment model trained solely on a different<br />

textual domain does not lead to reasonable<br />

automatic alignments. However, if there already is a<br />

suitable word alignment model for a specific text<br />




domain/topic, the generation of a brand new<br />

treebank is just five minutes away.<br />

2) Visualisation/manual evaluation of the results of<br />

different components of a tree-based SMT system<br />

(e.g. Parsing, Word/Phrase Alignment): The data<br />

collected and analysed in section 5 is one possible<br />

application of the t2t-pipe in this category.<br />

3) As a by-product, the t2t-pipe produces phrase<br />

alignments for translation memory systems: With a<br />

corpus of approx. 80,000 sentence pairs, the<br />

precision of the alignments is around 80%. These<br />

alignments can be manually checked and a new<br />

TMX file can be easily generated from the corrected<br />

alignment data.<br />

In future versions of the program, the two approaches<br />

presented by Zhechev (2009) and Tiedemann and Kotzé<br />

(2009) could be combined. We see additional potential<br />

for improvement in using lower-cased data and a corpus<br />

free of OCR errors for word and subtree alignment.<br />

7. References<br />

Koehn, P. (2009): Statistical Machine Translation.<br />

Cambridge: Cambridge University Press.<br />

Koehn, P. (2010a): European Parliament Proceedings<br />

Parallel Corpus 1996-2009. Release v5. TXT-Format. Description in: Europarl: A Parallel Corpus for

Statistical Machine Translation, Philipp Koehn, MT<br />

Summit 2005. URL: http://www.statmt.org/europarl.<br />

Koehn, P. (2010b): MOSES. Statistical Machine<br />

Translation System. User Manual and Code Guide,<br />

November. URL:<br />

http://www.statmt.org/moses/manual/manual.pdf.<br />

Lundborg J., Marek T., Mettler M., Volk, M. (2007):<br />

Using the Stockholm TreeAligner. In Proceedings of<br />

the Sixth International Workshop on Treebanks and<br />

Linguistic Theories (TLT’06). Bergen, Norway:<br />

Northern European Association for Language<br />

Technology, pp. 73–78.<br />

Och, F. J., Ney, H. (2003): A Systematic Comparison of<br />

Various Statistical Alignment Models. Computational<br />

Linguistics 29, pp. 19–51.<br />

Samuelsson, Y. , Volk, M. (2007): Alignment Tools for<br />

Parallel Treebanks. In Proceedings of the GLDV<br />

Frühjahrstagung, Tübingen, Germany.<br />


Sennrich R., Volk, M. (2010): MT-based Sentence<br />

Alignment for OCR-generated Parallel Texts. In<br />

Proceedings of the Ninth Conference of the<br />

Association for Machine Translation in the Americas<br />

(AMTA 2010).<br />

Tiedemann J., Kotzé, G. (2009): Building a Large<br />

Machine-Aligned Parallel Treebank. In Proceedings<br />

of the Eighth International Workshop on Treebanks<br />

and Linguistic Theories (TLT’08). Milano, Italy:<br />

EDUCatt: pp. 197–208.<br />

Tiedemann J. (2010): Lingua-Align: An Experimental<br />

Toolbox for Automatic Tree-to-Tree Alignment. In<br />

Proceedings of the 7th International Conference on<br />

Language Resources and Evaluation (LREC’2010),<br />

Valetta, Malta.<br />

Volk, M., Bubenhofer, N., Althaus A., Bangerter, M.,<br />

Marek T., Ruef, B. (2010): Text+Berg-Korpus (Pre-<br />

Release 118+ Digitale Edition Die Alpen 1957-1982).<br />

XML-Format, May. Digitale Edition des Jahrbuch des<br />

SAC 1864-1923 und Die Alpen 1925-1995. URL:<br />

http://www.textberg.ch.<br />

Zhechev V., Way, A. (2008): Automatic Generation of<br />

Parallel Treebanks. In Proceedings of the 22nd<br />

International Conference on Computational<br />

Linguistics. Manchester, UK: pp. 1105–1112.

Zhechev, V. (2009): Automatic Generation of Parallel<br />

Treebanks. An Efficient Unsupervised System.<br />

Dissertation, School of Computing, Dublin City<br />

University.



Querying multilevel annotation and alignment for<br />

detecting grammatical valence divergencies<br />

Oliver Čulo<br />

FTSK, <strong>Universität</strong> Mainz<br />

An der Hochschule 2, 76726 Germersheim<br />

E-mail: culo@uni-mainz.de<br />

Abstract<br />

The valence concept has been used in machine translation as well as in didactics in order to build up valence dictionaries for the

respective uses. Most valence dictionaries have been built up manually, but given the growing number of parallel resources, it<br />

would be desirable to automatically exploit them as a basis for building up bilingual valence dictionaries. The present contribution

reports on a pilot study on a German-English parallel corpus. In this study, patterns of verb plus grammatical functions were<br />

extracted from parallel sentences. The paper reports on some of the basic findings of this extraction, regarding divergencies both in<br />

valence patterns as well as syntactic realisations of the predicate, i.e. the verb. These findings set the agenda for further research,<br />

which should focus on how to detect semantic shifts of valence carriers in translation and how this affects valence.<br />

Keywords: valence, valence extraction, parallel corpora, translation<br />

1. Introduction<br />

The concept of valence (Tesnière, 1959) has been<br />

endorsed in multilingual research domains in various<br />

ways. Various machine translation systems use some<br />

notion of valence in the core of their analysis and<br />

transfer structures (see relevant descriptions e.g. for<br />

EUROTRA (Steiner, Schmidt & Zelinsky-Wibbelt,<br />

1988), METAL (Gebruers, 1988), Verbmobil (Emele et<br />

al., 2000) or TectoMT (Žabokrtský, Ptáček & Pajas<br />

2008)). For didactic purposes, various bilingual valence<br />

dictionaries have been compiled (D. Rall, Rall, &<br />

Zorrilla, 1980; Engel & Savin, 1983; Bianco, 1996; Simon-Vandenbergen, Taeldeman & Willems, 1996).

Most of the valence resources mentioned are based on<br />

manually compiled valence dictionaries. Nowadays, as<br />

ever more and larger parallel corpus resources are<br />

available, it is desirable to exploit these in order to gain<br />

more data for bilingual valence dictionary creation.<br />

There have been various attempts at extracting bilingual<br />

valence dictionaries from parallel corpora. In some<br />

cases, the extraction process is tackled from a high-level<br />

semantic level, as in the case of bilingual frame<br />

semantic dictionaries (Boas, 2002; 2005). Other<br />

approaches choose a syntactic annotation, as in the case<br />

of the Prague Czech-English Dependency Treebank<br />

(Čmejrek et al., 2004). In both cases, the semantic or<br />

„deep“ dependency (or tectogrammatical, see (Sgall,<br />

Hajičová & Panevová, 1986)) annotation abstracts away<br />

from syntactic variation, making the extraction task<br />

somewhat less complex. In the course of the FUSE project

(Cyrus, 2006), predicate-argument annotation<br />

and alignment between German and English sentences<br />

serves as basis for the study of both syntactic and<br />

semantic valence divergencies. Padó (2007) investigates<br />

the (frame) semantic dimension of valence divergencies.<br />

In the former case, the annotation is very specifically<br />

tailored to the project itself, making the methods harder<br />

to reproduce when applied to other corpora. In the latter<br />

study, the level of investigation again abstracts away<br />

from syntactic variation.<br />

The study presented here focusses on grammatical<br />

differences in valence pattern between German and<br />

English. Both for the detection and description of<br />

differences, top-level grammatical functions like subject, direct object, etc. are used. This follows the tradition of

using grammatical functions rather than syntactic<br />




categories as e.g. in the previously listed bilingual<br />

valence dictionaries. Grammatical functions abstract<br />

away from syntactic variation but as compared to e.g.<br />

the tectogrammatical approach of (Čmejrek et al., 2004),<br />

no deep annotation is needed in order to retrieve<br />

grammatical functions of a sentence.<br />

The corpus used in the study is annotated and aligned on<br />

multiple linguistic levels, but not with a specific focus<br />

on valence. Also, the method of querying multiple<br />

annotation and alignment levels at once is outlined. On<br />

top of that, valence divergencies are discussed with<br />

respect to factors like contrastive differences, register or<br />

translation properties and strategies.<br />


2. Study setup<br />

2.1. The corpus<br />

The corpus used in the study was built to investigate<br />

contrastive commonalities and differences between<br />

English and German as well as peculiarities in<br />

translations. It consists of English originals (EO), their<br />

German translations (GTrans) as well as German<br />

originals (GO) and their English translations (ETrans).<br />

Both translation directions are represented in eight<br />

registers with at least 10 texts totalling 31,250 words per<br />

register. In the present paper, examples are taken from<br />

the registers SHARE (corporate communications),<br />

SPEECH (political speeches) and FICTION (fictional<br />

texts). Altogether, the corpus comprises one million<br />

words. Additionally, register-neutral reference corpora<br />

are included for German and English including 2,000<br />

word samples from 17 registers.<br />

All texts are annotated with part-of-speech information<br />

using the TnT tagger (Brants, 2000), morphology using<br />

MPRO (Maas, Rösener & Theofilidis, 2009), and<br />

grammatical functions and chunk categories, manually<br />

annotated with MMAX2 (Müller & Strube, 2006).<br />

Furthermore, all texts are aligned on word level using<br />

GIZA++ (Och & Ney, 2003), on chunk level indirectly<br />

by mapping the grammatical functions onto each other,<br />

on clause level manually again using MMAX2, and on<br />

sentence level using the WinAlign component of the<br />

Trados Translator’s Workbench (Heyn, 1996) with

additional manual correction.<br />

2.2. A format independent API for multilevel<br />

queries<br />

The API designed for the corpus is made up of three<br />

parts. On top, there is the interface, containing control<br />

methods with basic read/write and iteration calls for the<br />

corpus. Under the hood, a package called CORETOOL is<br />

used to represent linguistic structures in stratified layers,<br />

and the parallel structures (e.g. aligned words,<br />

sentences, etc.) as sets of pairs. The intermediate level<br />

handles the XML-based data format of the corpus.<br />

Queries are mainly written using the format-independent<br />

CORETOOL data structures and are thus re-usable for<br />

other corpora as well. The layers dealing with corpus<br />

management and format handling can, in theory, be<br />

exchanged depending on the corpus used. This<br />

stratificational approach is a major difference between<br />

this corpus API and other APIs, where programming<br />

data structures and underlying data format are more<br />

closely linked.<br />

Fundamental within CORETOOL is the notion of TEXT. A<br />

CORPUS is made up of an ordered collection of TEXTS, each of which is made up of an ordered collection of SENTENCES, which in turn are made up of ordered collections of TOKENS. This structure is, so to speak, the

backbone of CORETOOL and the minimum of data that we<br />

expect in a corpus. In addition, a CORPUS can be divided<br />

into REGISTERS which also relate to collections of TEXTS<br />

(from the CORPUS). Likewise, a SENTENCE can contain<br />

CLAUSES or CHUNKS which relate to the TOKENS of the<br />

SENTENCE. For each of these sub-units of a text (including<br />

TOKENS), it is possible to have aligned counterparts.<br />

Every single alignment is represented as a pair; so if unit<br />

U is aligned with U' and U'', there will be two pairs, (U, U') and (U, U'').
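To make this backbone more concrete, the following sketch shows one possible way to represent such stratified units and alignment pairs in code; the class and attribute names are illustrative assumptions, not the actual CORETOOL interface.

from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative sketch of a CORETOOL-like backbone; names are assumptions.
@dataclass
class Token:
    text: str

@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)

@dataclass
class Text:
    sentences: List[Sentence] = field(default_factory=list)

@dataclass
class Corpus:
    texts: List[Text] = field(default_factory=list)
    # Every alignment is a pair; a unit aligned with two counterparts
    # yields two pairs, e.g. (U, U') and (U, U'').
    token_alignments: List[Tuple[Token, Token]] = field(default_factory=list)
    sentence_alignments: List[Tuple[Sentence, Sentence]] = field(default_factory=list)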



for every wordPair in wordPairs
    slWord := getSlWord(wordPair)
    tlWord := getTlWord(wordPair)
    slChunk := getChunkForWord(slWord)
    tlChunk := getChunkForWord(tlWord)
    if not mappable(getGramFunc(slChunk), getGramFunc(tlChunk))
    then markCrossingLine(slWord, tlWord, slChunk, tlChunk)
    end if
end for

Figure 1: Pseudo-code of the query for crossing lines between grammatical functions and words

The linguistic representation of CORETOOL is currently restricted to syntactic structures. However, the need to extend the package with further functionalities, e.g. in order to be able to operate with semantic annotation as well, will hopefully soon be rendered unnecessary by the latest developments of query tools like e.g. ANNIS2¹.

¹ http://www.sfb632.uni-potsdam.de/d1/annis/

2.3. Querying for empty links and crossing<br />

lines<br />

Two concepts are used to detect instances of valence<br />

divergencies. These concepts are based on well-known<br />

concepts from translation studies. Elements which have<br />

no alignment exhibit an empty link. Such 0:1-equivalents<br />

have been described e.g. by Koller (2001). Elements<br />

which are aligned, but which are embedded in higher<br />

units that are not aligned, result in crossing lines. This<br />

would e.g. be the case for two aligned words which are<br />

embedded in different grammatical functions. Crossing<br />

lines relate to the concept of shifts (in the given example<br />

a shift in grammatical function) as described e.g. by<br />

Catford (1965).

The corpus is queried for empty links and crossing lines<br />

using the CORETOOL package. Empty links can be<br />

detected by simply querying one alignment level. For<br />

crossing lines, querying combinations of both annotation<br />

and alignment levels is necessary. A query for a shift in<br />

function requires (1) going through pairs of aligned<br />

words, (2) for each pair: getting the chunks the aligned<br />

words are embedded in, and (3) checking the mapping<br />

of these chunks, i.e. checking whether the grammatical functions they have been assigned are compatible (cf. figure 1). As the same set of grammatical functions was used for German and English in this study setup, the mapping was straightforward.
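In Python-like terms, and with hypothetical helper functions standing in for the CORETOOL calls, the query sketched in figure 1 could look as follows:

def find_crossing_lines(word_pairs, chunk_of, gram_func_of, mappable):
    """Return aligned word pairs whose embedding chunks carry grammatical
    functions that cannot be mapped onto each other (crossing lines)."""
    crossing = []
    for sl_word, tl_word in word_pairs:            # (1) iterate over aligned word pairs
        sl_chunk = chunk_of(sl_word)               # (2) chunk embedding the source word
        tl_chunk = chunk_of(tl_word)               #     chunk embedding the target word
        if not mappable(gram_func_of(sl_chunk),    # (3) check the function mapping
                        gram_func_of(tl_chunk)):
            crossing.append((sl_word, tl_word, sl_chunk, tl_chunk))
    return crossing

# With identical function inventories for both languages, mapping reduces to
# label identity:
def same_function(f1, f2):
    return f1 == f2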

3. Divergencies in valence patterns for<br />

grammatical functions<br />

The ideal situation for valence extraction from parallel<br />

corpora would be that of sentence pairs with equivalent<br />

verbs at their core and perfectly matching syntactic<br />

patterns. Minor shifts, e.g. in the type of grammatical<br />

functions governed by the verb, can easily be accounted<br />

for. However, besides differences in realisation of<br />

arguments, there may also be differences in the<br />

realisation of the predicate. Such a typical shift is the<br />

head switch, in examples like Ich schwimme gern – I<br />

like swimming, where the German adverb gern<br />

‘willingly, with pleasure’ becomes the full verb like in<br />

English. As we will see, there may be other factors for<br />

different kinds of shifts in the verb. We will be looking<br />

at more semantically/pragmatically triggered shifts; for a more syntactic investigation, especially of shifts in the realisation of the predicate (e.g. support verb constructions versus full verbs), see Čulo (2010).

Probably the simplest case for a valence divergency on<br />

the level of grammatical functions is that of differences<br />

in the kinds of grammatical function as which an<br />

argument is realised. Compare, for instance, the<br />

sentence pair in figure 2, with the English original on<br />

top and the German translation at the bottom, and let us<br />

focus on the phrase “Most admired Company in<br />


America”. This phrase is embedded in a predicative<br />

complement (tag: COMPL) in English, as governed by verbs like name, appoint, elect etc. The COMPL function has no equivalent in German, resulting in an empty link (indicated by vertical lines that are linked to only one box). In order to understand, though,

what is happening in that case, one has to evaluate the<br />

links from within the phrase: the word Company, for<br />

instance, is aligned with the equivalent word<br />

Unternehmen which is, however, embedded in a<br />

prepositional object (PROBJ) in German. The cause for<br />

this shift lies in a contrastive difference in the valence<br />

patterns of a whole class of verbs (namely the APPOINT<br />

class, following Levin (1993)). But, as there currently is<br />

no semantic annotation present in the corpus, there is no<br />

automatic way of linking the verb sense to this particular<br />


Figure 2: A crossing line for the words Company and Unternehmen and the grammatical<br />

functions COMPL and PROBJ<br />

shift. We will come back to this point when discussing<br />

the last example.<br />

A similar shift from COMPL to a different function is<br />

shown in figure 3. Here, however, the shift is not<br />

triggered by the fact that two equivalent verbs have<br />

different valence patterns, but by a change of the main<br />

verb which does not match known concepts like head<br />

switches.<br />

                be → sein      be → other verb
E2G_SHARE       37% (126)      63% (215)
E2G_FICTION     45% (138)      54% (168)
E2G_SPEECH      60% (224)      40% (147)

Table 1: Proportions of be translated as either sein or with a different verb than sein

Figure 3: From English copular verb to German full verb



Figure 4: Multiple shifts as a result of translation strategies<br />

The English copular verb be is translated with the<br />

transitive verb betragen in German. This particular kind<br />

of verb shift can be observed very often in the register<br />

SHARE, as shown in table 1. The reason for this lies in<br />

differences in style between English and German<br />

SHARE texts: English uses a more colloquial style, whereas German prefers rather formulaic expressions, using more full verbs than copular verbs.

Many of the shifts found in translations can be attributed<br />

to translation strategies as described e.g. by (Vinay &<br />

Darbelnet, 1958) for French and English. An example of<br />

a modulation can be seen in figure 4. Here, what can be described by looking at the surface realisation is that the word order of the German original has been kept in the English translation, probably to preserve the stress which is put on the phrase Die Frauen 'the women'. But,

while in German the first constituent is a direct object,<br />

this order of grammatical functions cannot be easily<br />

reproduced in English. A possible solution, as presented<br />

in the given example, is to shift the direct object to<br />

another function, here: the subject. In the given<br />

example, the verb is shifted, too, from transitive<br />

gemacht 'made' to the copular weren’t. One could

hypothesise that this happens in order to adapt to the<br />

different configuration of functions and their semantic<br />

content. However, in order to really explain the more<br />

complex cases of multiple shifts in one sentence, further<br />

data/annotations may be needed.

If, for instance, we add frame semantic annotation, we<br />

may be able to describe the shift of the verb with<br />

relation to shifts in semantic content. In the example in<br />

figure 4, one could annotate the first sentence with the<br />

Cause_change frame (with das as Cause and Die<br />

Frauen as Entity), the second one with the<br />

State_of_entity frame. The English sentence could thus

be interpreted as a translation of only a partial<br />

component of the sense of the original sentence: the<br />

English translation focusses on the outcome of the<br />

Cause_change process in the German original, giving<br />

more stress to the Entity (the women) in the<br />

State_of_entity by placing it in the sentence-initial

position. How to deal with such shifts – whether to<br />

include them in an extraction process or not – remains a<br />

matter of discussion. Data from process-based<br />

translation experiments may prove helpful for shedding<br />

light on the reasons for such a “partial” translation.<br />

4. Conclusion and outlook<br />

As has been shown, empty links and crossing lines have proven to be reliable indicators for detecting, and in some cases a basis for describing, differences in

grammatical valence patterns. Furthermore, it has been<br />

shown that annotation and alignment on multiple levels<br />

can be used for studying valence divergencies and<br />

possibly for extracting bilingual valence dictionaries,<br />

without resorting to an annotation scheme specialised for these purposes only.


Future work shall concentrate on a broader<br />

categorisation of valence divergencies with respect to<br />

more factors than those listed in this paper. In order to<br />

be able to link verb senses and certain types of shifts, the<br />

next step is to add (frame) semantic annotation to the<br />

corpus. Also, the purely product-based data presented

here could be complemented by process-based studies in<br />

the future, which should yield a more sound explanation<br />

of shifts as depicted in figure 4.<br />

5. References<br />

Bianco, M. T. (1996): Valenzlexikon deutsch-italienisch. Deutsch im Kontrast 17. Heidelberg: Julius Groos.

Boas, H. C. (2002): Bilingual FrameNet dictionaries for machine translation. In Proceedings of the Third International Conference on Language Resources and Evaluation, 4:1364-1371. Las Palmas, Spain.

---- (2005): Semantic frames as interlingual representations for multilingual lexical databases. International Journal of Lexicography 18, no. 4: 445-478.

Catford, J. C. (1965): A linguistic theory of translation. An essay in applied linguistics. Oxford: Oxford University Press.

Čmejrek, M., Cuřín, J., Havelka, J., Hajič, J., Kubon, V. (2004): Prague Czech-English dependency treebank: syntactically annotated resources for machine translation. In Proceedings of LREC 2004, 5:1597-1600. Lisbon, Portugal.

Čulo, O. (2010): Valency, translation and the syntactic realisation of the predicate. In D. Vitaš and C. Krstev (eds.), Proceedings of the 29th International Conference on Lexis and Grammar (LGC), 73-82. Belgrade, Serbia.

Cyrus, L. (2006): Building a resource for studying translation shifts. In Proceedings of LREC 2006.

Emele, M. C., Dorna, M., Lüdeling, A., Zinsmeister, H., Rohrer, C. (2000): Semantic-based transfer. In W. Wahlster (ed.), Verbmobil, 359-376. Artificial intelligence. Berlin/Heidelberg: Springer.

Engel, U., Savin, E. (1983): Valenzlexikon deutsch-rumänisch. Deutsch im Kontrast 3. Heidelberg: Julius Groos.

Gebruers, R. (1988): Valency and MT: recent developments in the METAL system. In Proceedings of the Second Conference on Applied Natural Language Processing, 168-175.

Koller, W. (2001): Einführung in die Übersetzungswissenschaft. Narr Studienbücher. Tübingen: Gunter Narr.

Levin, B. (1993): English verb classes and alternations. The University of Chicago Press.

Padó, S. (2007): Translational equivalence and cross-lingual parallelism: the case of FrameNet frames. In Proceedings of the NODALIDA Workshop on Building Frame Semantics Resources for Scandinavian and Baltic Languages. Tartu, Estonia.

Rall, D., Rall, M., Zorrilla, O. (1980): Diccionario de valencias verbales: alemán-español. Tübingen: Gunter Narr.

Sgall, P., Hajičová, E., Panevová, J. (1986): The meaning of the sentence in its semantic and pragmatic aspects. Springer Netherlands.

Simon-Vandenbergen, A.-M., Taeldeman, J., Willems, D. (eds) (1996): Aspects of contrastive verb valency. Studia Germanica Gandensia 40.

Steiner, E., Schmidt, P., Zelinsky-Wibbelt, C. (1988): From syntax to semantics: insights from machine translation. London: Francis Pinter.

Tesnière, L. (1959): Éléments de syntaxe structurale. Paris: Klincksieck.

Vinay, J.-P., Darbelnet, J. (1958): Stylistique comparée du français et de l'anglais. Méthode de traduction. Paris: Didier.

Žabokrtský, Z., Ptáček, J., Pajas, P. (2008): TectoMT: highly modular MT system with tectogrammatics used as transfer layer. In Proceedings of WMT 2008.



SPIGA - A Multilingual News Aggregator<br />

Leonhard Hennig † , Danuta Ploch † , Daniel Prawdzik § , Benjamin Armbruster § , Christoph<br />

Büscher § , Ernesto William De Luca † , Holger Düwiger § , Sahin Albayrak †<br />

† DAI-Labor, TU Berlin<br />

§ Neofonie GmbH<br />

Berlin, Germany Berlin, Germany<br />

E-mail: {leonhard.hennig,danuta.ploch,ernesto.deluca,sahin.albayrak}@dai-labor.de,<br />

{daniel.prawdzik,benjamin.armbruster,christoph.buescher,holger.düwiger}@neofonie.de<br />

Abstract<br />

News aggregation web sites collect and group news articles from a multitude of sources in order to help users navigate and consume<br />

large amounts of news material. In this context, Topic Detection and Tracking (TDT) methods address the challenges of identifying<br />

new events in streams of news articles, and of threading together related articles. We propose a novel model for a multilingual news<br />

aggregator that groups together news articles in different languages, and thus allows users to get an overview of important events and<br />

their reception in different countries. Our model combines a vector space model representation of documents based on a multilingual<br />

lexicon of Wikipedia-derived concepts with named entity disambiguation and multilingual clustering methods for TDT. We describe<br />

an implementation of our approach on a large-scale, real-life data stream of English and German newswire sources, and present an<br />

evaluation of the Named Entity Disambiguation module, which achieves state-of-the-art performance on a German and an English<br />

evaluation dataset.<br />

Keywords: topic detection and tracking, named entity disambiguation, multilingual clustering, news personalization<br />

1. Introduction<br />

News aggregation web sites such as Google News¹ and Yahoo! News² collect and group news articles from a

multitude of sources in order to help users navigate and<br />

consume large amounts of news material. Such systems<br />

allow users to stay informed on current events, and to<br />

follow a news story as it evolves over time. In this<br />

context, an event is defined as something that happens at<br />

a specific time and place (Fiscus & Doddington, 2002),<br />

e.g. “the earthquake that struck Japan on March 11th,<br />

2011”.

Topic Detection and Tracking (TDT) methods address<br />

two main challenges of such systems: The detection of<br />

new events (topics) and the tracking of articles related to<br />

a known topic in newswire streams (Allan, 2002).<br />

Addressing these tasks typically requires a comparison of<br />

text models. In topic tracking, the comparison is between<br />

a document and a topic, which is often represented as a<br />

centroid vector of the topic’s documents. Topic detection<br />

compares a document to all known topics, to decide if the document is about a novel topic. Text models are often based on the Vector Space Model, or are represented as language models (Larkey, 2004).

¹ http://news.google.com
² http://news.yahoo.com

Going one step further, multilingual news aggregation<br />

enables users to get an overview of the press coverage of<br />

an event in different countries and languages, and has<br />

been a part of TDT evaluations since 1999 (Wayne, 2000).<br />

For multilingual TDT, topic and document comparisons<br />

require the use of multilingual text models, or<br />

alternatively the translation of documents (Larkey, 2004).<br />

Previous research has typically used machine translation<br />

to convert stories to a base language (Wayne, 2000).<br />

Machine-translated documents, however, are of lower<br />

quality than human-translated documents, and<br />

full-fledged machine translation of complete documents<br />

is costly in terms of required models and linguistic tools<br />

(Larkey, 2004). Moreover, real-life TDT systems have to<br />

filter large amounts of new documents as they arrive over<br />

time, and thus require the use of efficient, scalable<br />

approaches.<br />

As news stories typically revolve around people, places,<br />

and other named entities, Shah et al. (2006) show that<br />

using concepts, such as named entities and topical<br />


keywords, rather than all words for vector representations<br />

can lead to a higher TDT performance. While there are<br />

many ways to extract concepts from documents,<br />

Wikipedia has gained much interest recently as a lexical<br />

resource (Mihalcea, 2007), as it covers concepts from a<br />

wide range of domains and is freely available in many<br />

languages. Furthermore, Wikipedia’s inter-language<br />

links can be used to translate multilingual concepts.<br />

However, previous research in multilingual TDT has not<br />

attempted to utilize Wikipedia as a resource for concept<br />

extraction and translation.<br />

Representing documents as concept vectors raises the<br />

additional challenge of dealing with natural language<br />

ambiguities, such as ambiguous name mentions and the<br />

use of synonyms (Cucerzan, 2007). For example, the<br />

name mention ‘Jordan’ may refer to several different<br />

persons, a river, and a country. As these phenomena<br />

lower the quality of vector representations, it is necessary<br />

to resolve ambiguous name mentions against their correct<br />

real-world referent. This task is known as Named Entity<br />

Disambiguation (NED) (Bunescu & Pasca, 2006).<br />

State-of-the-art approaches to NED employ supervised<br />

machine learning algorithms to combine features based<br />

on document context knowledge with entity information<br />

stored in an encyclopedic knowledge base (KB)<br />

(Bunescu & Pasca, 2006; Zhang et al., 2010). Common<br />

features include popularity (Dredze et al., 2010),<br />

similarity metrics exploring Wikipedia’s concept<br />

relations (Han & Zhao, 2009), and string similarity. In<br />

current research, NED has mainly been considered as an<br />

isolated task (Ji & Grishman, <strong>2011</strong>), and has not yet been<br />

applied in the context of TDT.<br />

The contributions of this paper are twofold: We propose a<br />

novel model for a multilingual news aggregator that<br />

combines Wikipedia-based concept extraction, named<br />

entity disambiguation, and multilingual TDT (Section 2).<br />

Our model is based on a representation of documents and<br />

topics as vectors of concepts. This choice of<br />

representation, combined with concept translation,<br />

enables the application of a wide range of well-known<br />

TDT algorithms regardless of the language of the input<br />

documents, and leads to efficient and scalable<br />

implementations. We also describe an implementation of<br />

our model on a large-scale, multilingual news stream.<br />

Furthermore, we extend our NED algorithm previously<br />

proposed in (Ploch, 2010) to a German KB, and present<br />


an evaluation of the Named Entity Disambiguation<br />

module on a newly-created German dataset (Section 3).<br />

2. Multilingual News Aggregation Model<br />

Our approach to multilingual TDT is schematically<br />

outlined in Figure 1. For each news article, we<br />

successively perform language-dependent concept<br />

extraction (Section 2.1), NED (Section 2.2) and<br />

multilingual TDT (Section 2.3). In addition, we outline<br />

an algorithm for news personalization in Section 2.4.<br />

Finally, we give details of the implementation of our<br />

model in Section 2.5, and describe a user interface for the<br />

presentation of news stories in Section 2.6.<br />

Figure 1: Multilingual News Aggregation Model<br />

2.1. Concept extraction<br />

We create a lexicon of terms, phrases and named entities<br />

by collecting titles, internal anchor texts, and redirects,<br />

from Wikipedia articles. The use of Wikipedia as the<br />

basis of our lexicon allows us to construct concept<br />

vectors for news articles in different languages, and<br />

facilitates the creation of new lexicons. We utilize the<br />

inter-language tables of Wikipedia to create a mapping<br />

between concepts in different languages. In the final<br />

lexicon, each concept is represented by an image, which<br />

is used to uniquely identify the concept, and a list of<br />

linguistic variants (inflected forms, synonyms and<br />

abbreviations). For example, the concept ‘Jordan<br />

(Country)’ may be referred to by ‘Jordan’, ‘Urdun’, or<br />

‘Hashemite Kingdom of Jordan’.<br />

After concept extraction, each news article is represented<br />

as a weighted bag-of-concepts. All other words contained



in the document are discarded. We weight concepts using<br />

a variant of the traditional tf.idf-weighting scheme (Allan,<br />

2005). The document frequency is calculated over a<br />

sliding time window in order to better reflect the<br />

changing significance of terms in a dynamic collection of<br />

news articles:<br />

\[
w(c_i, d_j) = \frac{n(c_i, d_j)}{n(c_i, d_j) + 0.5 + 1.5 \times \frac{|d_j|}{\overline{d}}} \times \frac{\log\big((|D| + 0.5) / n_D(c_i)\big)}{\log(1 + |D|)},
\]

where $w(c_i, d_j)$ is the weight of concept $i$ in document $j$, $D$ is the collection of documents, $n(c_i, d_j)$ is the frequency of concept $i$ in document $j$, $n_D(c_i)$ is the number of documents containing $c_i$, $|d_j|$ is the length of document $j$, and $\overline{d}$ is the average document length.
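As a minimal sketch of this weighting (argument names are assumptions; doc_len and avg_doc_len supply the length normalisation term defined above), the weight could be computed as:

import math

def concept_weight(tf, doc_len, avg_doc_len, num_docs, doc_freq):
    """Sketch of the adapted tf.idf concept weight.
    tf       : n(c_i, d_j), frequency of concept i in document j
    doc_freq : n_D(c_i), number of documents containing c_i
    num_docs : |D|, size of the (sliding-window) document collection"""
    tf_part = tf / (tf + 0.5 + 1.5 * doc_len / avg_doc_len)
    idf_part = math.log((num_docs + 0.5) / doc_freq) / math.log(1 + num_docs)
    return tf_part * idf_part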

2.2. Multilingual Named Entity Disambiguation<br />

The concept vector of a document may initially<br />

encompass ambiguous concepts, and in particular<br />

ambiguous name mentions. If a document contains e.g.<br />

the name mention ‘Michael Jordan’ the real-world<br />

referent might be the famous basketball player, but also<br />

the researcher in machine learning known under this<br />

name. The same document may also refer to ‘Air Jordan’,<br />

which is a synonymous name for the basketball player. In<br />

both cases the challenge is to figure out the correct<br />

meaning of the name mention in order to correctly construct the concept vector of the document.

Our approach to NED is based on our earlier work<br />

described in (Ploch, 2010), which we extend here to a<br />

German KB. We disambiguate name mentions found in a<br />

text by utilizing an encyclopedic reference knowledge<br />

base (KB) to link a name mention to at most one entry in<br />

the KB (Bunescu & Pasca, 2006). Furthermore, we also<br />

determine if a name mention refers to an entity not<br />

covered by the KB, which is known as Out-of-KB<br />

detection (Dredze et al., 2010). This may occur for less<br />

popular but still newsworthy entities with no<br />

corresponding KB entry. Especially challenging is the<br />

disambiguation of common names, like for instance ‘Paul<br />

Smith’, of unknown entities sharing their name with a<br />

popular namesake.<br />

Our approach to NED is based on the observation that<br />

entities in texts co-occur with other entities. We therefore<br />

utilize the entities surrounding an ambiguous name for<br />

their resolution. On the basis of Wikipedia’s internal link<br />

graph we create a reference KB containing for each entity<br />

its known surface forms (i.e. name variants) and its links<br />

to other entities and concepts (Wikipedia articles).<br />

Given a name mention identified in a document, the<br />

candidate selection component retrieves a set of<br />

candidate entities from the KB, using a fuzzy, weighted<br />

search on index fields storing article titles, redirect titles,<br />

and name variants. We cast NED as a supervised<br />

classification task and train two Support Vector Machine<br />

(SVM) classifiers (Vapnik, 1995). The first classifier<br />

ranks the candidate KB entities for a given surface form.<br />

Subsequently, the second classifier determines whether<br />

the surface form refers to an Out-of-KB entity. Besides<br />

calculating well-known NED features like the<br />

bag-of-words similarity, the popularity of an entity given<br />

a specific surface form and the string similarity (baseline<br />

feature set), we implement features that exploit<br />

Wikipedia’s link graph. To this end, we represent the<br />

document context of an ambiguous entity and each<br />

candidate as a vector of links that are associated with the<br />

candidate entities in our KB, and compute several<br />

similarity features using the resulting bag-of-links<br />

vectors. The full approach is described in more detail in<br />

(Ploch, 2010).<br />
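As an illustration only (helper names and data layout are assumptions, not the system's actual feature extraction code), a bag-of-links similarity between the document context of a mention and a candidate entity could be computed along these lines:

from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two sparse count vectors (Counters).
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def bag_of_links_similarity(context_entities, candidate_links, kb_links):
    """Represent the context of an ambiguous mention as the multiset of KB links
    associated with its co-occurring entities, and compare it with the candidate
    entity's own link vector."""
    context_vec = Counter()
    for entity in context_entities:              # entities co-occurring with the mention
        context_vec.update(kb_links.get(entity, []))
    candidate_vec = Counter(candidate_links)     # links of the candidate KB entry
    return cosine(context_vec, candidate_vec)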

2.3. Multilingual Topic Detection and Tracking<br />

Given the disambiguated concept vector representation<br />

of a document, we employ a hierarchical agglomerative<br />

clustering approach for TDT. The centroid vector of a<br />

topic is created by averaging the concept weights of the<br />

documents assigned to that topic. The clustering<br />

algorithm then compares a new document to the centroid<br />

vectors of existing topics using a combination of the two<br />

vectors’ cosine similarity and a time-dependent penalty.<br />

The time factor is included to prefer assigning new<br />

documents to more recent events, and to limit the infinite<br />

growth of old events (Nallapati et al., 2004). If a<br />

document’s similarity to all clusters is lower than a<br />

predefined threshold, we assume that this document deals<br />

with a new event, and start a new cluster for it.
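A schematic sketch of one such clustering step is given below; the exponential form of the time penalty, the parameter values and the dictionary layout are assumptions, since they are not specified in this paper.

import math

def cosine(a, b):
    # Cosine similarity between two sparse dict vectors {concept: weight}.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def assign_to_topic(doc_vec, doc_time, topics, threshold=0.3, decay=0.01):
    """Compare a new document to all topic centroids using cosine similarity
    combined with a time-dependent penalty; if no topic is similar enough,
    start a new cluster. Times are assumed to be numeric (e.g. hours)."""
    best, best_score = None, -1.0
    for topic in topics:
        age = doc_time - topic["last_time"]
        score = cosine(doc_vec, topic["centroid"]) * math.exp(-decay * age)
        if score > best_score:
            best, best_score = topic, score
    if best is None or best_score < threshold:
        topics.append({"centroid": dict(doc_vec), "last_time": doc_time})
        return topics[-1]
    best["last_time"] = doc_time   # centroid re-averaging omitted for brevity
    return best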

In order to cluster documents from different languages,<br />

we utilize the inter-language mappings and translate the<br />

concept vectors to a single language. Thus, the document<br />

concept vectors as well as the cluster centroid vectors<br />

share a common space of concepts, to which we can<br />

apply our clustering approach.<br />


2.4. News Personalization<br />

The Personal News Agent (PNA) enables the user to<br />

personalize the news stream to match her information<br />

need. We define a user profile as a weighted vector u consisting of components u⁺ and u⁻, which represent the concepts that a user is and is not interested in, respectively. We include u⁻ to allow for a more

fine-grained control of news selection. Similar to the<br />

centroid vectors of document clusters, this approach<br />

enables a language-independent representation of a<br />

user’s information needs.<br />

The process of identifying relevant news articles is<br />

performed analogously to the TDT algorithm described<br />

in the previous section. The relevance of a new document<br />

with respect to the user profile is calculated as the cosine<br />

similarity of the document’s concept vector and u.

Documents with a similarity higher than a predefined<br />

threshold are assumed to match a user’s information need,<br />

and presented to the user.<br />

2.5. System Implementation<br />

Our implementation of the approach described in the<br />

previous sections consists of three main components, and<br />

is shown in Figure 1. We used a crawler that collects<br />

news articles and associated metadata from<br />

approximately 1400 German and English newswire<br />

sources. The news articles are processed in a pipeline<br />

based on the Apache UIMA framework³. Events and the

news articles associated with them are presented to the<br />

user via a web interface. The system is geared towards<br />

large-scale processing of newswire streams in near<br />

real-time. It processes approximately 70,000 news articles per day, and manages up to 200,000 event

clusters over a time span of four weeks.<br />

The current system processes English and German news,<br />

using a lexicon of 1.5 and 1.1 million concepts<br />

respectively, and is planned to include French, Italian and<br />

Spanish news sources. The usable intersection between<br />

the German and English lexicons amounts to 700K<br />

concepts. Concepts are identified in text with a<br />

longest-matching substring strategy (Gusfield, 1999).<br />

The concept weighting uses a time span of 4 weeks to<br />

determine document frequency.<br />
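The following is a naive greedy sketch of such longest-match lookup over tokenised text (a production system would use efficient string matching as in Gusfield (1999); the lexicon layout is an assumption):

def identify_concepts(tokens, lexicon, max_len=5):
    """Greedy longest-match lookup: at each position, prefer the longest token
    sequence whose lower-cased surface form is in the concept lexicon."""
    concepts, i = [], 0
    while i < len(tokens):
        match = None
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + length]).lower()
            if surface in lexicon:
                match = lexicon[surface]
                i += length
                break
        if match is not None:
            concepts.append(match)
        else:
            i += 1
    return concepts

# Example: surface forms map to language-independent concept identifiers.
lexicon = {"hashemite kingdom of jordan": "Jordan (Country)",
           "jordan": "Jordan (Country)"}
print(identify_concepts("the Hashemite Kingdom of Jordan".split(), lexicon))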

Our implementation of the NED module utilizes<br />

classifier models trained on the TAC-KBP 2009 dataset and a German dataset (see Section 3), both of which are based on newswire documents.

³ Apache UIMA – Unstructured Information Management Architecture (http://uima.apache.org/)

The TDT component’s parameters, such as cluster<br />

similarity thresholds and time penalty values, are<br />

currently tuned manually based on an analysis of the<br />

clusters produced by the algorithm. We utilize the<br />

concept set of the German Wikipedia as the basis for<br />

translating the concept vectors of English news articles.<br />

In addition, concept types are weighted differently, as for<br />

example places and person names are more helpful than<br />

general topics to detect events in news streams.<br />

For the news personalization component, the creation of<br />

a user profile is based on the selection of news articles by<br />

the user according to her interests. Concept vectors are<br />

extracted from user-selected articles as described in<br />

Section 2.1. The concept vectors are then merged and<br />

weighted to create a centroid vector u, with concepts having a negative weight representing the component u⁻.

The news personalization module uses a slightly different<br />

weighting scheme than the TDT component, assigning a<br />

higher weight to general topics (e.g. elections, tax cuts)<br />

than to named entities.<br />

2.6. User Interface<br />

We present events and news articles to users via a web<br />

interface. The interface includes a start page giving an<br />

overview of the most important events in several news<br />

categories, as well as pages for each category. Given the<br />

large amount of news stories published every day, our<br />

system implements several methods to rank event<br />

clusters for presentation to the user. These include<br />

measures based on cluster novelty, size, and hotness. The<br />

hotness measure is calculated as a weighted combination<br />

of a cluster’s total growth since its creation time, and its<br />

recent growth in a sliding time window. For our system,<br />

we determined the weights experimentally over a range<br />

of settings. This approach ensures that breaking news are<br />

presented first both on the start page and on category<br />

pages. In addition, we implement a filtering strategy for<br />

news articles to provide users with an in-depth,<br />

diversity-oriented overview of each event, instead of<br />

merely listing an event’s news articles in order of their<br />

age. Figure 2 shows the overview page of an example<br />

event, displaying the event’s lead article as well as two<br />

earlier news articles in German and English.



Figure 2: A sample multilingual news cluster<br />

3. Evaluation of NED<br />

We evaluate the quality of our NED approach on two<br />

datasets to examine how its performance compares to<br />

other state-of-the-art systems, and which accuracy it

achieves for different languages.<br />

The first dataset is the TAC-KBP 2009 dataset for<br />

English (Simpson et al., 2009). It consists of 3,904<br />

queries (name mention-document pairs) with 57%<br />

queries for Out-of-KB entities. The KB queries are<br />

divided into 69% queries for organizations and 15%<br />

queries for persons and geopolitical entities each. In<br />

addition to the English NED dataset we created a German<br />

dataset with 2,359 queries. This dataset consists of 30%<br />

Out-of-KB queries and 70% KB queries, where 46% of

the queries relate to organizations, 27% to persons and 24%<br />

to geopolitical entities. 3% are of an unknown type<br />

‘UKN’.<br />

[Figure 3 is a bar chart; y-axis: micro-averaged accuracy (0.50–1.00); systems: Baseline, Best feature set, Dredze et al., Zheng et al.; bars per system: All queries, KB, Out-of-KB.]

Figure 3: Micro-averaged accuracy of different approaches to English NED for the TAC-KBP 2009 dataset on all, KB and Out-of-KB queries.

For both datasets, we perform 10-fold cross-validation by<br />

training the SVM classifiers on 90% of the queries and<br />

testing on the remaining 10%. Results reported in this<br />

paper are then averaged across the test folds. We utilize<br />

the official TAC-KBP 2009 evaluation measure of<br />

micro-averaged accuracy, which is computed as the<br />

fraction of correctly answered queries.<br />

Figure 3 and Figure 4 show the micro-averaged<br />

accuracies for all, KB and Out-Of-KB queries. As shown<br />

in Figure 3 for the English dataset, our best feature set<br />

improves the accuracy of the baseline model by 2.7%,<br />

and achieves a micro-averaged accuracy of 0.84.<br />

Compared with other systems tested on the same dataset (Dredze et al., 2010; Zheng et al., 2010), our results are favorable. In particular, the detection of

Out-of-KB entities outperforms that of other systems.<br />

The experiments confirm our assumption that<br />

co-occurring entities and their relations are suitable for<br />

NED. Similar results are obtained for the German dataset,<br />

as shown in Figure 4. The overall accuracy of 0.77 on this<br />

dataset is slightly lower than for the TAC 2009 dataset.<br />

Again, the accuracy for Out-of-KB queries is higher than<br />

the disambiguation accuracy for KB queries, but<br />

compared to TAC 2009 the results are more balanced.<br />

[Figure 4 is a bar chart; y-axis: micro-averaged accuracy (0.50–1.00); datasets: English TAC-KBP 2009, German dataset 2011; bars per dataset: All queries, KB, Out-of-KB.]

Figure 4: Comparison of micro-averaged NED accuracy on the English TAC-KBP 2009 and the German dataset.

4. Conclusions<br />

We described a model for a multilingual news aggregator<br />

which combines Wikipedia-based concept extraction,<br />

named entity disambiguation and multilingual TDT to<br />

detect and track events in multilingual news streams. Our<br />

approach exploits Wikipedia as a large-scale,<br />

multilingual knowledge source both for representing<br />

documents as concept vectors and for resolving<br />

ambiguous named entities. We also described a<br />


fully-operational implementation of our approach on a<br />

real-life, large scale multilingual news stream. Finally,<br />

we presented an evaluation of the Named Entity<br />

Disambiguation module on a German and an English<br />

dataset. Our approach achieves state-of-the-art results on<br />

the TAC-KBP 2009 dataset, and shows similar<br />

performance on a German dataset.<br />

In future work, we plan to evaluate the Topic Detection<br />

and Tracking component using the TDT 3 dataset (Wayne,<br />

2000), in order to verify the validity of our overall<br />

approach. We also plan to evaluate the effect of NED on<br />

the performance of the TDT algorithm.<br />

Furthermore, we intend to include more languages to<br />

provide a pan-European overview of news events. This<br />

will raise additional challenges related to the mapping of<br />

concepts in different languages, the disambiguation of<br />

named entities, and the clustering strategies applicable to<br />

the resulting vector representation, since many Wikipedia<br />

versions are often significantly smaller than the English<br />

one. For example, we plan to extend our link-based NED<br />

approach by exploiting cross-lingual information.<br />

5. Acknowledgments<br />

The authors wish to express their thanks to the Neofonie<br />

GmbH team who strongly contributed to this work. The<br />

project SPIGA is funded by the Federal Ministry of<br />

Economics and Technology (BMWi).<br />

6. References<br />

Allan, J. (2002): Introduction to topic detection and<br />

tracking. In: Topic detection and tracking, pp. 1–16.<br />

Kluwer Academic Publishers.<br />

Allan, J., Harding, S., Fisher, D., Bolivar, A.,<br />

Guzman-Lara, S., Amstutz, P. (2005): Taking topic<br />

detection from evaluation to practice. In: Proc. of<br />

HICSS ’05.<br />

Bunescu, R., Pasca, M. (2006): Using encyclopedic<br />

knowledge for named entity disambiguation. In: Proc.<br />

of EACL-06, pp. 9–16.<br />

Cucerzan, S. (2007): Large-Scale named entity<br />

disambiguation based on Wikipedia data. In: Proc. of<br />

EMNLP-CoNLL’07, pp. 708–716.<br />

Dredze, M., McNamee, P., Rao, D., Gerber, A., Finin, T.<br />

(2010): Entity disambiguation for knowledge base<br />

population. In: Proc. of Coling 2010, pp. 277–285.<br />

Fiscus, J., Doddington G. (2002): Topic detection and<br />


tracking evaluation overview. In: Topic detection and<br />

tracking, pp. 17-31. Kluwer Academic Publishers.<br />

Gusfield, D. (1999): Algorithms on Strings, Trees and<br />

Sequences: Computer Science and Computational<br />

Biology. Cambridge University Press.<br />

Han, X., Zhao, J. (2009): Named entity disambiguation<br />

by leveraging wikipedia semantic knowledge. In: Proc.<br />

of CIKM 2009, pp. 215–224.<br />

Ji, H., Grishman, R. (2011): Knowledge Base Population: Successful Approaches and Challenges. In: Proc. of ACL 2011, pp. 1148-1158.

Larkey, L.S., Feng, F., Connell, M., Lavrenko, V. (2004):<br />

Language-specific models in multilingual topic<br />

tracking. In: Proc. of SIGIR '04, pp. 402-409.<br />

Mihalcea, R., Csomai, A. (2007): Wikify!: linking<br />

documents to encyclopedic knowledge. In: Proc. of<br />

CIKM '07, pp. 233-242.<br />

Nallapati, R., Feng, A., Peng, F., Allan, J. (2004): Event<br />

threading within news topics. In: Proc. of CIKM 2004,<br />

pp. 446–453.<br />

Ploch, D. (2011): Exploring Entity Relations for Named Entity Disambiguation. In: Proc. of ACL 2011, pp. 18–23.

Shah, C., Croft, W., Jensen, D. (2006): Representing<br />

documents with named entities for story link detection<br />

(SLD). In: Proc. of CIKM ’06, pp. 868-869.<br />

Simpson, H., Strassel, S., Parker, R., McNamee, P. (2009):<br />

Wikipedia and the web of confusable entities:<br />

Experience from entity linking query creation for TAC<br />

2009 knowledge base population. In: Proc. of<br />

LREC ’10.<br />

Vapnik, V.N. (1995): The nature of statistical learning<br />

theory. Springer-Verlag, New York, NY, USA.<br />

Wayne, C. (2000): Multilingual topic detection and<br />

tracking: Successful research enabled by corpora and<br />

evaluation. In: Proc. of LREC ’00.<br />

Zhang, W., Su, J., Lim, C., Tan W., Wang, T. (2010):<br />

Entity linking leveraging automatically generated<br />

annotation. In: Proc. of Coling 2010, pp. 1290–1298.<br />

Zheng, Z., Li, F., Huang, M., Zhu, X. (2010): Learning to<br />

link entities with knowledge base. In: Proc. of<br />

NAACL-HLT ’10, pp. 483–491.



From Historic Books to Annotated XML:<br />

Building a Large Multilingual Diachronic Corpus<br />

Magdalena Jitca, Rico Sennrich, Martin Volk<br />

Institute of Computational Linguistics, University of Zurich<br />

Binzmühlestrasse 14, 8050 Zürich<br />

E-mail: mjitca, sennrich, volk @ifi.uzh.ch<br />

Abstract<br />

This paper introduces our approach towards annotating a large heritage corpus, which spans over 100 years of alpine literature. The<br />

corpus consists of over 16,000 articles from the yearbooks of the Swiss Alpine Club, 60% of which represent German texts, 38%

French, 1% Italian and the remaining 1% Swiss German and Romansh. The present work describes the inherent difficulties in<br />

processing a multilingual corpus by referring to the most challenging annotation phases such as article identification, correction of<br />

optical character recognition (OCR) errors, tokenization, and language identification. The paper aims to raise awareness for the<br />

efforts in building and annotating multilingual corpora rather than to evaluate each individual annotation phase.<br />

Keywords: multilingual corpora, cultural heritage, corpus annotation, text digitization<br />

1. Introduction<br />

In the project Text+Berg¹ we are digitizing publications

of the Alpine clubs from various European countries,<br />

which consist mainly of reports on the following topics:<br />

mountain expeditions, the Alpine culture, the flora,<br />

fauna and geology of the mountains.<br />

The resulting corpus is a valuable knowledge base to<br />

study the changes in all these areas. Moreover, it enables<br />

the quantitative analysis of diachronic language changes<br />

as well as the study of typical language structures,<br />

linguistic topoi, and figures of speech in the<br />

mountaineering domain.<br />

This paper describes the particularities of our corpus and<br />

gives an overview of the annotation process. It presents<br />

the most interesting challenges that our multilingual<br />

corpus brought up, such as text structure identification,<br />

optical character recognition (OCR), tokenization, and<br />

language identification. We focus on how the<br />

multilingual nature of the text collection poses new<br />

problems in apparently trivial processing steps (e.g.<br />

tokenization).<br />

¹ See www.textberg.ch

2. The Text+Berg Corpus<br />

The focus of the Text+Berg project is to digitize the<br />

yearbooks of the Swiss Alpine Club from 1864 until<br />

today. The resulting corpus contains texts which focus<br />

on conquering and understanding the mountains and<br />

covers a wide variety of text genres such as expedition<br />

reports, (popular) scientific papers, book reviews, etc.<br />

The corpus is multilingual and contains articles in<br />

German (some also in Swiss German), French, Italian<br />

and even Romansh. Initially, the yearbooks contained<br />

mostly German articles and few in French. Since 1957<br />

the books appeared in parallel German and French<br />

versions (with some Italian articles), summing up to a<br />

total of 53 parallel editions German-French and 90<br />

additional multilingual yearbooks. The corpus contains<br />

16.000 articles, 60% of which represent German texts,<br />

38% French, 1% Italian and the remaining 1% Swiss<br />

German and Romansh. This brings our corpus to 35,75<br />

million words extracted from almost 87.000 book pages,<br />

10% of which representing parallel texts. This feature of<br />

the corpus allows for interesting cross-language<br />

comparisons and has been used as training material for<br />

Statistical Machine Translation systems (Sennrich &<br />

Volk, 2010).<br />


3. The Annotation Phases<br />

This section introduces our pipeline for processing and<br />

annotating the Text+Berg corpus. More specifically, the<br />

input consists of HTML files containing the scanned<br />

yearbooks (for yearbooks in paper format), as they are<br />

exported by the OCR software. We work with two state-of-the-art

OCR programs (Abbyy FineReader 7 and<br />

OmniPage 17) in order to convert the scan images into<br />

text and then export the files in HTML format. Our<br />

processing pipeline takes them through ten consecutive<br />

stages: 1) HTML cleanup, 2) structure reducing, 3) OCR<br />

merging, 4) article identification, 5) parallel book<br />

combination, 6) tokenization, 7) correction of OCR<br />

errors, 8) named entity recognition, 9) Part of Speech<br />

(POS) tagging and 10) additional lemmatization for<br />

German. The final output consists of XML documents<br />

which mark the article structure (title, author), as well as<br />

sentence boundaries, tokens, named entities (restricted<br />

to mountain, glacier and cabin names), POS tags and<br />

lemmas. Our document processing approach is similar to<br />

other annotation pipelines, such as GATE (Cunningham<br />

et al., 2002), but it is customized for our alpine corpus.<br />

In terms of space complexity, the annotated output files<br />

require almost three times more storage space than the<br />

input HTML files and 2.3 times more space than the tokenized XML files.

In the following subsections we expand on the<br />

processing stages that are especially challenging for a<br />

multilingual corpus.<br />

3.1. Article Identification<br />

The identification of articles in the text is performed<br />

during the fourth processing stage. The text is annotated<br />

conforming to an XML schema which marks the article<br />

boundaries (start, end), its title and author, paragraphs,<br />

page breaks, footnotes and captions. Some of the text<br />

structure information can be checked against the table of<br />

contents (ToC) and table of figures (where available),<br />

which are manually corrected in order to have a clean<br />

database of all articles in the corpus. Another relevant<br />

resource for the article boundary identification is the<br />

page mapping file that is automatically generated in the<br />

second stage, which relates the number printed on the<br />

original book page with the page number assigned<br />

during scanning. The process of matching entries from<br />

the table of contents to the article headers in the books is<br />

not trivial, as it requires that the article title, the author<br />

name(s) and the page number in the book are correctly<br />

recognized. We allow small variations and OCR errors,<br />

as long as they are below a specific threshold (usually a<br />

maximum deviation of 20% of characters is allowed).<br />

For example, the string K/aIbard -Eine Reise in die<br />

Eiszeit. will be considered a match for the ToC entry<br />

Svalbard - Eine Reise in die Eiszeit, although not all<br />

their characters coincide.<br />
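A possible sketch of this error-tolerant matching, approximating the 20% deviation threshold with a character-level similarity ratio:

import difflib

def matches_toc_entry(ocr_header, toc_title, max_deviation=0.2):
    """Accept an OCR'd article header as a match for a table-of-contents title
    if the two strings agree on roughly 80% of their characters."""
    ratio = difflib.SequenceMatcher(None, ocr_header.lower(), toc_title.lower()).ratio()
    return ratio >= 1.0 - max_deviation

# The OCR-garbled header from the example above still matches its ToC entry:
print(matches_toc_entry("K/aIbard -Eine Reise in die Eiszeit.",
                        "Svalbard - Eine Reise in die Eiszeit"))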

Proper text structuring relies on the accurate<br />

identification of layout elements such as article<br />

boundaries, graphics and captions, headers and<br />

footnotes. Over the 145 years the layout of the<br />

yearbooks has changed significantly. Therefore we had<br />

to adapt different processing steps for all the various<br />

designs. The particularities of these layouts have been<br />

discussed in (Volk et al., 2010a).<br />

The yearbooks since 1996 are a collection of monthly

editions and their pagination is no longer continuous (it<br />

starts over every month). This change affects the page<br />

mapping process, which performs well only when page<br />

numbers are monotonically increasing. Moreover, article<br />

boundaries are hard to determine when a single page<br />

contains several small articles and not all of them<br />

specify their author's name. These particularities are also<br />

reflected in the layout, as the header lines (where<br />

existing) no longer contain information about author or<br />

title, but about the article genre. Under these<br />

circumstances, we still achieved a percentage of 80%<br />

identified articles for these new yearbooks, a value<br />

comparable to the overall percentage of the corpus.<br />

3.2. Correction of OCR Errors<br />

The correction process aims to detect and overcome the<br />

errors introduced by the OCR systems and is carried out<br />

in two different stages of the annotation process. The<br />

first revision is done in the third stage (OCR merging),<br />

where the input is still raw text, with no additional<br />

information about either the structure or the language of<br />

the articles. At this stage we combine the output of our<br />

two OCR systems. The algorithm computes the<br />

alignments in a page-level comparison of the input files<br />

provided by each system and searches the Longest<br />

Common Subsequence in an n-character window. In case



of mismatch, the system disambiguates among the<br />

different candidates and selects the word with the<br />

highest probability in that context (computed based on<br />

the word's frequency in the Text+Berg corpus). The<br />

implemented algorithm and the evaluation results are<br />

thoroughly discussed in (Volk et al., 2010b).<br />

OCR-merging is a worthwhile approach since there are<br />

many situations where one system can fix the other's<br />

errors. Our experience has shown that Abbyy<br />

FineReader performs the better OCR, with over 99%<br />

accuracy (Volk et al., 2010b). But there are also cases<br />

where it fails to provide the correct output, whereas<br />

OmniPage provides the right one. For example, the<br />

sequence Cependant, les cartes disponibles sont squvent<br />

approximatives (English: However, the available maps<br />

are often approximate) is provided by FineReader. The<br />

system has introduced the spelling mistake squvent,<br />

which doesn't appear in the output of the second system<br />

(here souvent). This triggers the replacement of the non-word squvent with the correct version souvent.

During the seventh annotation stage, after tokenization,<br />

we correct errors caused by graphemic similarities. The<br />

automatic correction is performed at the word-level by<br />

pattern matching over sequences of characters. In order<br />

to achieve this, we have compiled lists of common error<br />

patterns and their possible replacements. For example, a<br />

word-initial 'R' is often misinterpreted as 'K', resulting in<br />

words such as Kedaktion instead of Redaktion (English:<br />

editorial office). For each tentative replacement we<br />

check against the word frequency list in order to decide<br />

whether a candidate word appears in the corpus more<br />

frequently than the original or the other possible<br />

replacement candidates. In this case, Redaktion has 1127<br />

occurrences in the corpus, whereas Kedaktion only 9.<br />

Reynaert (2008) describes a similar statistical approach<br />

for both historical and contemporary texts.<br />
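A condensed sketch of this pattern-based correction with a frequency check (the pattern list is a tiny assumed excerpt, not the project's actual error list):

import re

# Assumed excerpt of common OCR confusion patterns (pattern -> replacement).
ERROR_PATTERNS = [(r"^K", "R"), (r"rn", "m")]

def correct_word(word, freq):
    """Try each error pattern and keep the candidate that is more frequent in
    the corpus-derived frequency list than the original word."""
    best, best_count = word, freq.get(word, 0)
    for pattern, replacement in ERROR_PATTERNS:
        candidate = re.sub(pattern, replacement, word)
        if candidate != word and freq.get(candidate, 0) > best_count:
            best, best_count = candidate, freq.get(candidate, 0)
    return best

# Example from the text: 'Kedaktion' (9 hits) is corrected to 'Redaktion' (1127 hits).
freq = {"Redaktion": 1127, "Kedaktion": 9}
print(correct_word("Kedaktion", freq))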

As the yearbooks until 1957 contained articles written in<br />

several languages, we have used a single word<br />

frequency dictionary for all of them (German, French<br />

and Italian). The dictionary has been built from the<br />

Text+Berg corpus and thus contains all the encountered<br />

word types and their corresponding frequencies,<br />

computed over the same corpus. The interesting aspect<br />

about this dictionary is its reliability, in spite of being<br />

trained with noisy data (text containing OCR-errors).<br />

Correctly spelled words will typically have a higher<br />

frequency than the ones containing OCR errors. The list<br />

contains predominantly German words due to the high<br />

percentage of German articles in the first 90 yearbooks,<br />

thus the frequency of German words is usually higher<br />

than that of French words. This can lead to wrong<br />

substitution choices, such as a German word in a French<br />

sentence (e.g. Neu (approx. 4400 hits) instead of lieu<br />

(approx. 3000 hits)). Therefore we have decided to<br />

create a separate frequency dictionary for French words,<br />

which is used only for the monolingual French editions.<br />

3.3. Tokenization<br />

In this stage the paragraphs of the text are split into<br />

sentences and words, respectively. Tokenization is<br />

considered to be a straightforward problem that can be<br />

solved by applying a simple strategy such as splitting on all

non-alphanumeric characters (e.g. spaces, punctuation<br />

marks). Studies have shown, however, that this is not a<br />

trivial issue when dealing with hyphenated compound<br />

words or other combinations of letters and special<br />

characters (e.g. apostrophes, slashes, periods etc.). He<br />

and Kayaalp (2006) present a comparative study of<br />

several tokenizers for English, showing that their output<br />

varies widely even for the same input language. We<br />

would expect a similar performance from a general<br />

purpose tokenizer dealing with several languages.<br />

We will exemplify the language-specific issues with the<br />

use of apostrophes. In many languages, they are used for<br />

contractions between different parts of speech, such as<br />

verb + personal pronoun es in German (e.g. hab's →<br />

habe + es) or determiner and noun in French or Italian<br />

(e.g. l'abri → le + abri). On the other hand, in old<br />

German written until 1900, like in modern English, it<br />

can also express possession (e.g. Goldschmied's,<br />

Theobald's, Mozart's). Under these circumstances,<br />

which is the desired tokenization, before or after the<br />

apostrophe? The answer is language-dependent and this<br />

underlies our approach towards tokenization.<br />

We use a two-step tokenization and perform the<br />

language recognition in between. The advantage of this<br />

approach is that we can deliver a language-specific<br />

tokenization of any input text (given that it is written in<br />

the supported languages). In the first step we carry out a<br />

rough tokenization of the text and then identify sentence<br />


boundaries. Once this is achieved, we can proceed to the<br />

language identification, which will be discussed in<br />

section 3.4.<br />

Afterwards we do another round of tokenization focused<br />

on word-level, where the language-specific rules come<br />

into play. We have implemented a set of heuristic rules<br />

in order to deal with special characters in a multilingual<br />

context, such as abbreviations, apostrophes or hyphens.<br />

For example, each acronym whose letters are separated<br />

by periods (e.g. C.A.S. or A.A.C.Z.) is considered a<br />

single token, if it is listed in our abbreviations<br />

dictionary. A German apostrophe is split from the<br />

preceding word (e.g. geht's → geht + 's), whereas in<br />

French and Italian it remains with the first word (e.g.<br />

dell'aqua → dell' + aqua, l'eau → l' + eau). Besides, we<br />

have compiled a small set of French apostrophe words<br />

which shouldn't be separated at all (e.g. aujourd'hui).<br />

Disambiguation for hyphens occurring in the middle of a<br />

word is performed by means of the general word<br />

frequency dictionary. For example, if nordouest has 14<br />

hits and nord-ouest 957 hits, we conclude that the<br />

hyphen is part of the compound and thus nord-ouest<br />

should be regarded as a single token. On the other hand,<br />

hyphens marking line breaks may also appear in the<br />

middle, like in the word rou-te. In this case, the<br />

hyphenated word appears 3 times in the dictionary,<br />

whereas the one without, route, 6335 times. Therefore<br />

the hyphen will be removed from the word.<br />
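Both decisions, language-dependent apostrophe splitting and frequency-based hyphen resolution, can be sketched in a few lines; this is an illustration under our own assumptions rather than the project's code, with the frequency counts quoted above.

# Illustrative sketch of two language-specific tokenization decisions.

FRENCH_NO_SPLIT = {"aujourd'hui"}        # French apostrophe words never split

def split_apostrophe(token, lang):
    if "'" not in token or token in FRENCH_NO_SPLIT:
        return [token]
    head, tail = token.split("'", 1)
    if lang == "de":
        # German: the apostrophe stays with the clitic (geht's -> geht + 's)
        return [head, "'" + tail]
    # French/Italian: the apostrophe stays with the first word (l'eau -> l' + eau)
    return [head + "'", tail]

def resolve_hyphen(token, freq):
    # keep the hyphen only if the hyphenated form is at least as frequent as
    # the concatenated form (nord-ouest vs. nordouest, rou-te vs. route)
    joined = token.replace("-", "")
    return token if freq.get(token, 0) >= freq.get(joined, 0) else joined

freq = {"nord-ouest": 957, "nordouest": 14, "rou-te": 3, "route": 6335}
print(split_apostrophe("geht's", "de"), split_apostrophe("l'eau", "fr"))
print(resolve_hyphen("nord-ouest", freq), resolve_hyphen("rou-te", freq))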

3.4. Language Identification<br />

The accuracy of the language identification is crucial for<br />

the automatic text analysis performed during the<br />

annotation process, such as tokenization, part-of-speech<br />

tagging, lemmatization or named entity identification.<br />

Therefore we perform a fine-grained analysis, at<br />

sentence level. We work with a statistical language<br />

identifier2 based on the approach presented in (Dunning,<br />

1994). The module uses two classifiers: one to<br />

distinguish between German, French, English and Italian<br />

and another one in order to discriminate between Italian<br />

and Romansh. In case the identified language is<br />

German, a further analysis based on the frequency<br />

dictionary is being carried out in order to decide whether<br />

or not it is Swiss German (CH-DE). This dictionary<br />

2 http://search.cpan.org/dist/Lingua-Ident/Ident.pm<br />


contains frequently used Swiss German dialect words<br />

which do not have homographs in standard German.<br />

Whenever a sentence contains more than 10% dialect<br />

words from this list, the language of the sentence is set<br />

to CH-DE.<br />

However, the statistical language identification is not<br />

reliable for very short sentences. In order to achieve<br />

higher accuracy, we apply the heuristic rule that only<br />

sentences longer than 40 characters are fed to the<br />

language identifier. All the others are assigned the<br />

language of the article, as it appears in the ToC. The<br />

correctness of this decision relies on the fact that all ToC<br />

files are proofed manually, so that we do not introduce<br />

noisy data.<br />
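The routing logic of this section can be summarised in a short sketch. It is our own paraphrase: the statistical classifier is simply passed in as a function, and the thresholds (40 characters, 10% dialect words) are the ones stated above.

def identify_sentence_language(sentence, article_lang, classify, dialect_words):
    # short sentences are unreliable for n-gram statistics: use the ToC language
    if len(sentence) <= 40:
        return article_lang
    lang = classify(sentence)                  # statistical identifier (Dunning, 1994)
    if lang == "de":
        tokens = sentence.split()
        dialect_ratio = sum(t in dialect_words for t in tokens) / max(len(tokens), 1)
        if dialect_ratio > 0.10:               # more than 10% dialect-only words
            return "ch-de"
    return lang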

Table 1 gives an overview of the distribution of the<br />

identified languages in the articles from the Text+Berg<br />

corpus. We present here only the composition of<br />

German and French articles, as they represent the great<br />

majority of our corpus (approximately 98%). The

values are not 100% accurate, as they are automatically<br />

computed by means of statistical methods. However,<br />

they mirror the global tendency of the corpus that over

95% of the sentences in an article are in the language of<br />

the article, a conclusion which corresponds to our<br />

expectations. An interesting finding is the percentage<br />

variation of foreign sentences. For example, German<br />

sentences are two times more frequent in French articles<br />

than the French sentences in German articles (in<br />

percentage terms). One reason for this is the fact that<br />

some French articles are translated from German and<br />

preserve the original bibliographical references, captions<br />

or footnotes. Other sources of language mixture are<br />

quotations and direct speech, aspects which can be<br />

encountered in both German and French articles.<br />

3.5. Linguistic Processing<br />

In the last two annotation stages we perform some<br />

linguistic processing, namely lemmatization and part-of-speech

tagging. The markup is done by the TreeTagger3.

For our corpus, we have applied the standard<br />

configuration files for German, English and Italian. In<br />

the case of French we adopted a different approach, and<br />

we have trained our own parameter files based on the Le<br />

Monde-Treebank (Abeillé, 2003).<br />

3 www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger



Article language   Number of sentences per language
                   de          en     fr        it     rm     ch-de   total
DE                 1.166.141   1035   11.607    1481   1490   799     1.182.553
FR                 12.392      607    670.599   1187   1277   2       686.064

Table 1: The language distribution of the sentences in the Text+Berg corpus

Figure 1: An annotation snippet<br />

Romansh is not yet supported due to the lack of a<br />

sufficiently large annotated corpus for training the<br />

corresponding parameter file. Figure 1 shows a sample<br />

output: an annotated sentence in XML format.<br />

The TreeTagger assigns only lemmas for word forms<br />

that it knows (that have been encountered during the<br />

training). This results in a substantial number of word<br />

forms with unknown lemmas. Therefore we use an<br />

additional lemmatization tool, in order to increase the<br />

coverage of lemmatization. This approach has been<br />

implemented for German only because of its large<br />

number of compounds.<br />

We use the system Gertwol4 to insert missing German<br />

lemmas. Towards this goal we collect all word form<br />

types from the corpus and have Gertwol analyse them. If<br />

the TreeTagger does not assign a lemma to a word,<br />

whereas Gertwol provides an appropriate alternative, we<br />

choose the output of the latter system. This has resulted<br />

in approximately 700.000 additional lemmas, 80%

of which represent noun lemmas, 15%

adjectives and the remaining 5% other parts of speech.<br />

After performing this step, the remaining unknown<br />

4 http://www2.lingsoft.fi/cgi-bin/gertwol<br />

lemmas are mostly names and words containing OCR<br />

errors. We are interested in extending this strategy for<br />

French and Italian, in order to further increase the<br />

coverage of the annotation.<br />
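The back-off from the TreeTagger to Gertwol amounts to a simple merge over word form types, sketched below; the "<unknown>" marker and the data shapes are assumptions made for illustration.

def merge_lemmas(tagged_tokens, gertwol_lemmas):
    """tagged_tokens: (word, pos, lemma) triples from the tagger;
    gertwol_lemmas: dict mapping a word form to the lemma proposed by Gertwol."""
    merged = []
    for word, pos, lemma in tagged_tokens:
        if lemma == "<unknown>" and word in gertwol_lemmas:
            lemma = gertwol_lemmas[word]       # prefer Gertwol where the tagger failed
        merged.append((word, pos, lemma))
    return merged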

4. Tools for Accessing the Corpus<br />

The Text+Berg corpus can be accessed through several<br />

search systems. For example, we have stored our<br />

annotated corpus in the Corpus Query Workbench<br />

(Christ, 1994), which allows us to browse it via a web<br />

interface5 . The queries follow the POSIX EGREP syntax<br />

for regular expressions. The advantage of this system is<br />

that it provides more precise results than usual search<br />

engines (which perform a full text search) due to our<br />

detailed annotations. For example, it is possible to query<br />

for all mountain names ending in horn that were<br />

mentioned before 1900. Moreover, it is also possible to<br />

restrict queries to particular languages or POS tags.<br />

In addition, we have built a tool for word alignment<br />

searches in our parallel corpus6 . Given a German search<br />

term, the tool displays all hits in the German part of the<br />

corpus together with the corresponding French sentences<br />

with the aligned word(s) highlighted. Other than being a<br />

word alignment visualization tool, it also serves as<br />

bilingual concordance tool to find mountaineering<br />

terminology in usage examples. In this way it is easy to<br />

determine the appropriate translation for words like<br />

Haken (English: hook) or Steigeisen (English: crampon).<br />

Moreover, it enables a consistent view of the possible<br />

translations of ambiguous words such as Kiefer (English: jaw,

pine) or Mönch (English: monk, mountain name). Figure<br />

2 depicts the output of the system for the word Leiter,<br />

which can either refer to leader or ladder.<br />

5 Access to the CQW is password-protected. See<br />

http://www.textberg.ch/index.php?id=4&lang=en for<br />

registration.<br />

6 http://kitt.ifi.uzh.ch/kitt/alignsearch/<br />


Figure 2: Different translations of the German word Leiter in the Text+Berg corpus<br />

5. Conclusion<br />

In this paper we have given an overview of the<br />

annotation workflow of the Text+Berg corpus. The<br />

pipeline is capable of processing multilingual documents<br />

and dealing with both diachronic varieties in language<br />

and noisy data (OCR errors). The flexible architecture of<br />

the pipeline allows us to extend the corpus with more<br />

alpine literature and to process it in a similar manner,<br />

with little overhead.<br />

We have provided insights into the multilingual<br />

challenges in the annotation process, such as OCR<br />

correction, tokenization or language identification. We<br />

intend to further reduce the number of OCR errors by<br />

launching a crowd correction wiki page, where the<br />

members of the Swiss Alpine Club will be able to<br />

correct such mistakes. Regarding linguistic processing,<br />

we will continue investing efforts in improving the<br />

quality of the existing annotation tools with language-specific

resources (e.g. frequency dictionaries,<br />

additional lemmatizers). We will also work on<br />

improving the language models for Romansh and Swiss<br />

German dialects, in order to increase the reliability of<br />

the language identifier.<br />

6. References<br />

Abeillé, A., Clément, L., Toussenel, F. (2003): Building<br />

a Treebank for French. In Building and Using Parsed<br />

Corpora, Text, Speech and Language Technology(20),<br />

pp. 165–187.

Christ, O. (1994): The IMS Corpus Workbench<br />

Technical Manual. Institut für maschinelle

Sprachverarbeitung, Universität Stuttgart.

Cunningham, H., Maynard, D., Bontcheva, K. (2002):<br />

GATE: A framework and graphical development<br />

environment for robust NLP tools and applications. In<br />

Proceedings of the 40th Anniversary Meeting of the<br />

Association for Computational Linguistics.<br />

Dunning, T. (1994): Statistical identification of<br />

language. Technical Report MCCS-94-273, New<br />

Mexico State University.<br />

He, Y., Kayaalp, M. (2006): A comparison of 13<br />

tokenizers on MEDLINE. Technical Report<br />

LHNCBC-TR-2006-003, The Lister Hill National<br />

Center for Biomedical Communications.<br />

Reynaert, M. (2008): Non-interactive OCR post-correction

for giga-scale digitization projects. In A.<br />

Gelbukh (Ed.), Proceedings of the Computational<br />

Linguistics and Intelligent Text Processing 9th<br />

International Conference, Lecture Notes in Computer<br />

Science. Berlin, Springer, pp. 617–630.<br />

Sennrich, R., Volk, M. (2010): MT-based sentence<br />

alignment for OCR-generated parallel texts. In<br />

Proceedings of AMTA. Denver.<br />

Volk, M., Bubenhofer, N., Althaus, A., Bangerter, M.,<br />

Furrer, L., Ruef, B. (2010a): Challenges in building a<br />

multilingual alpine heritage corpus. In Proceedings of<br />

the Seventh international conference on Language<br />

Resources and Evaluation (LREC).<br />

Volk, M., Marek, T., Sennrich, R. (2010b): Reducing<br />

OCR errors by combining two OCR systems. In<br />

Proceedings of the ECAI 2010 Workshop on<br />

Language Technology for Cultural Heritage, Social<br />

Sciences, and Humanities (LaTeCH 2010).



Visualizing Dependency Structures<br />

Chris Culy, Verena Lyding, Henrik Dittmann<br />

European Academy of Bozen/Bolzano<br />

viale Druso 1, 39100 Bolzano, Italy<br />

E-mail: chris@chrisculy.net, verena.lyding@eurac.edu, henrik.dittmann@eurac.edu<br />

Abstract<br />

In this paper we present an advanced visualization tool specialized for the presentation and interactive analysis of language structure,<br />

namely dependency structures. Extended Linguistic Dependency Diagrams (xLDDs) is a flexible tool that provides for the visual<br />

presentation of dependency structures and connected information according to the users’ preferences. We will explain how xLDD<br />

makes use of visual variables like color, shape and position to display different aspects of the data. We will provide details on the<br />

technical background and discuss issues with the conversion of dependency structures from different dependency banks. Insights<br />

from a small user study will be presented and we will discuss future directions and application contexts for xLDD.<br />

Keywords: dependency structures, dependency diagrams, visualization<br />

1. Introduction<br />

Dependency banks, and hence dependency structures, are<br />

becoming ever more widely available for different<br />

languages and are popular for a range of applications,<br />

from theoretical and applied linguistics research to<br />

pedagogy in linguistics and language learning (cf. e.g. the<br />

VISL project 1<br />

; (Hajič et al., 2001; Nivre et al., 2007)). In<br />

this context, also a number of (usually static)<br />

visualizations of dependency structures have been<br />

presented (Gerdes & Kahane, 2009; Nivre et al., 2006).<br />

Generally, visualizations of language and linguistic<br />

information (“LInfoVis”, from “Linguistic Information<br />

Visualization”) are becoming more widespread (see<br />

(Rohrdantz et al., 2010) for an overview), but<br />

visualizations targeted specifically at linguists and their<br />

informational needs are still not very common. Current<br />

attempts to visualize language data are usually either<br />

visually very simple or linguistically uninformed, and<br />

often very much bound to a specific application context.<br />

We are trying to improve this situation with a series of<br />

advanced LInfoVis tools. In this paper, we present<br />

Extended Linguistic Dependency Diagrams (xLDDs), an<br />

example of a LInfoVis tool which combines advanced<br />

visualization techniques with linguistic knowledge to<br />

create a new kind of interactive dependency diagram.<br />

This tool can be easily adapted for a variety of uses in a<br />

1 http://visl.sdu.dk/visl/en/parsing/automatic/dependency.php<br />

variety of environments and can be used with a range of<br />

dependency structure formats.<br />

2. Dependency Structures and<br />

Dependency Diagrams<br />

We will distinguish between dependency structures,<br />

which are mathematical objects (graphs), and<br />

dependency diagrams, which are visual representations<br />

of dependency structures. Unfortunately, the linguistics<br />

literature does not always maintain this distinction, but it<br />

is an important one, since the same dependency structure<br />

can have many different visual representations (see e.g.<br />

ANNIS2 2<br />

for multiple visual representations of the same<br />

structure).<br />

While there is no standard, or even general agreement,<br />

about what information should or should not be included<br />

in a dependency structure, essentially dependency<br />

structures are directed (usually acyclic) graphs that<br />

indicate binary head-dependent relations between parts<br />

of a sentence (see (Hudson, 1984) for early examples of<br />

dependency structures). We will call a dependency<br />

structure basic if it consists only of the tokens of the<br />

sentence and the relations between them, without any<br />

additional information. However, almost all dependency<br />

structures have more information than just relations<br />

between tokens (e.g. often there is lemma or part of<br />

speech (POS) information associated with the tokens).<br />

2 http://www.sfb632.uni-potsdam.de/~d1/annis<br />


We will refer to these dependency structures as<br />

advanced.<br />

We will call a dependency diagram linearized if it shows<br />

the tokens of the sentence in their typical presentation<br />

direction (e.g. left to right for German, right to left for<br />

Arabic). Basic dependency structures allow for basic<br />

diagrams only, as the information to visualize is restricted<br />

to tokens and dependency relations. Figure 1 shows an<br />

advanced dependency diagram of an advanced<br />

dependency structure, in that it includes a variety of<br />

information, including POS information in addition to the<br />

tokens and dependency relations. The dependency<br />

relations are indicated by directed arcs between the<br />

tokens, and the directions of the arrows follow the<br />

EAGLES 3<br />

recommendation of having the arrow pointing<br />

towards the head.<br />

It goes beyond the presentation of a typical linearized<br />

diagram in the use of color and in the positioning of the<br />

arcs. The POS of the words are encoded by colored nodes<br />

and tokens, and hovering over a token shows a tooltip<br />

with the POS type, as in Figure 1 NN (noun) for the word<br />

“Absage”. Color is also used to distinguish different<br />

dependency relations: blue arcs indicate verb–object

relations, red arcs indicate verb-subject relations, green is<br />

used for modifier relations, gray for determiner-noun<br />

relations and black for the root dependency. Furthermore,<br />

the positioning of the arcs above and below the text<br />

visually separates subject and object relations (arcs<br />

below text) from any other type of relation (arcs above<br />

text). The example in Figure 1 is based on Boyd et al.’s<br />

4<br />

(2007) reanalysis (to Decca-XML format) of sentences<br />

from the Tiger Dependency Bank (TiGerDB) (Brants et<br />

al., 2002). We will have more to say about it shortly.<br />

3 http://www.ilc.cnr.it/EAGLES96/segsasg1/node44.html

4 We would like to thank Adriane Boyd and Detmar Meurers<br />

for kindly providing us with the data they describe in<br />

(Boyd et al., 2007).<br />


Figure 1: Basic linearized xLDD with color coding of parts of speech and<br />

dependency types; TiGerDB 8046, structure as in (Boyd et al., 2007)<br />

3. Extended Linguistic Dependency<br />

Diagrams<br />

3.1. Visual encoding of information<br />

One of the key ideas of information visualization is that<br />

we can use different visual features to encode different<br />

aspects of the information being visualized. Dependency<br />

structures, especially advanced dependency structures,<br />

provide lots of information that we can represent in<br />

various ways. xLDDs use three main visual properties to<br />

encode information in addition to the basic token and<br />

dependency information: position, color and size. These<br />

three visual variables are preattentive, meaning that we<br />

perceive strong differences without having to search for<br />

them actively. Information that is encoded in this way<br />

stands out among the other information present in the<br />

diagram and hence is much easier to locate and identify<br />

by the user. For example, in Figure 1 we can immediately<br />

find the verbal argument relations by the position of their<br />

arcs below the text, and the subject relation by its<br />

red color.<br />

Position is used in two ways. First, we can position the<br />

arcs above or below the text, using any kind of property,<br />

simple or calculated, to determine which arcs are below<br />

and which above (as in Figure 1). The second use of<br />

position is that of the vertical placement of tokens. By<br />

varying the standard vertical placement of tokens (i.e. not<br />

all on the same horizontal line) we can also encode<br />

certain kinds of information, as e.g. in Figure 2, where<br />

words that are split into several tokens are placed one<br />

level below the other text. This example shows an<br />

alternative reanalysis of the sentence from Figure 1, here<br />

based on By’s (2009) reanalysis of sentences from the<br />

TiGerDB. By made different choices from Boyd et al. He<br />

did not include POS information, and he split compound<br />

nouns into several tokens. Hence, we are provided with



Figure 2: Advanced xLDD, encoding by levels words that are split into multiple<br />

tokens; TiGerDB 8046, structure as in (By, 2009)<br />

different information for the visualization. While in<br />

Figure 1 color is used on nodes and tokens to encode<br />

token-related information and on arcs to encode<br />

information on the dependency relations, in Figure 2 the<br />

coloring of nodes indicates the linear position of<br />

subject/object nodes relative to their heads:<br />

subject/object nodes left of their head are colored in red,<br />

right of their head in green, between multiple heads in<br />

yellow. Nodes of other relations are colored gray.<br />

The visual feature size is employed in xLDDs in the form of

the thickness of lines of arcs. In Figure 2 it is used to<br />

distinguish arcs between sub-words (thin) and any other<br />

arc (thick). As in Figure 1, arcs of subject and object<br />

relations are placed below the text and others above.<br />

There are several other visual aspects that we could use to<br />

encode information. We could, for example, also use the<br />

size or style/font of the text, or the shape of the nodes<br />

corresponding to the tokens to encode other information.<br />

All of these visual encodings in xLDDs, especially the<br />

preattentive ones, help (potentially) the user see patterns<br />

more quickly and more accurately than a monochrome,<br />

uniformly positioned dependency diagram.<br />

3.2. Visual presentation and interaction<br />

Another major hallmark of contemporary visualizations<br />

is their adjustability and interactivity. Some aspects of the<br />

visualization may not encode information but can be<br />

modified to improve readability, or cater to the individual<br />

user’s preferences. These include curvature and style of<br />

arcs, positioning of words, text size, shape of arrow heads.<br />

More or less circular arcs, staggered words, and smaller<br />

text size help to create compact displays that fit more<br />

information on the screen, which can be an advantage for<br />

displaying long sentences. Note that the same visual<br />

property, e.g. arc width, may either be facultative (when<br />

it doesn’t vary within one xLDD diagram) or may be used<br />

to encode information (when it does vary). Which setup<br />

is most helpful depends on the data to be visualized as<br />

well as on the user. Giving the user the flexibility to set<br />

those variables, besides setting variables for the visual<br />

encoding, is a major benefit of xLDD.<br />

In addition, by interacting with the visualization the user<br />

can get more information about the underlying data than<br />

can be seen in a static diagram. In the case of xLDDs, the<br />

application can provide different kinds of information in<br />

response to actions aimed at different parts of the<br />

diagram, for example clicking on a token, or its<br />

corresponding node, or moving the mouse over an arc or<br />

token. In Figure 2, we see that hovering over an arc<br />

brings up a tooltip with its relation type (here oa (direct<br />

object) between “Absage” and “erteilten”).<br />

Double-clicking on the node for “Absage” shows

token-related information, that is, case, number, gender

and index information, but no POS information, since it<br />

is not available in the underlying data. Since this<br />

information does not involve two tokens, it is not<br />

represented via arcs in the main diagram. It would also be<br />

possible to interactively suppress information, e.g.<br />

eliminating all arcs except the ones of interest. As with<br />

the visual features, which kinds of interaction serve what<br />

kinds of information depends on the particular<br />

application, the particular data, as well as on user<br />

preferences.<br />

3.3. Architecture and technical details<br />

xLDD is implemented in JavaScript, using the Protovis<br />

toolkit (Bostock & Heer, 2009). We have created a simple<br />

JSON exchange format for dependency structures (JSDS).<br />

Input dependency structures, whether from a fixed local<br />

source or from a dynamic web service, are converted into<br />


JSDS before being visualized by the xLDD framework.<br />

The xLDD framework contains an extensible visual<br />

encoding and interactive component, which allow the<br />

application developer complete control over what kinds<br />

of information are visually encoded and how, and<br />

similarly, what kinds of interactions there are. xLDD is<br />

thus intended as a tool that will be incorporated into a<br />

website or web application.<br />

Unfortunately, not all dependency structures contain the<br />

tokens of the source sentence or their order. Dependency<br />

structures following the example of the PARC 700 (King<br />

et al., 2003), for example, do not. These structures cannot<br />

be visualized as linearized dependency diagrams since<br />

they lack the relevant information, and since xLDDs are<br />

necessarily linearized, structures of this type cannot be<br />

visualized using xLDD. However, often these<br />

non-linearizable structures can be converted into<br />

linearizable ones. In fact, both of the presented examples<br />

are based on the TiGerDB, which does not contain the<br />

original tokens, following the model of the PARC 700. In<br />

both cases, the original dependency structures have been<br />

reanalyzed by other researchers to include the original<br />

token and token order information, cf. (Boyd et al., 2007)<br />

for Figure 1 and (By, 2009) for Figure 2. However, these<br />

conversions to a linearizable form are not trivial, and<br />

cannot necessarily be fully automated. An additional<br />

point is that the two conversions make different decisions<br />

about things like tokenization and POS, and so the<br />

resulting dependency structures are different from each<br />

other as well as from the original structures.<br />

Thus, in order for a dependency structure to be usable by<br />

xLDD, it must meet two conditions: first it must be<br />

linearizable (or converted to a linearizable form), and<br />

second it must be converted to the JSDS exchange format.<br />

Regarding the required exchange format, we have<br />

already written converters to JSDS for the CoNLL 2007<br />

Dependency Parsing format 5 , as well as for By’s formats<br />

and the Decca-XML format (Boyd et al., 2007). Our<br />

target format (JSDS format) is quite simple, so that<br />

converters for other (linearizable) formats to JSDS (e.g.

MALT-XML 6<br />

) would be easy to write.<br />

5 http://nextens.uvt.nl/depparse-wiki/DataFormat<br />

6 http://w3.msi.vxu.se/~nivre/research/MaltXML.html<br />
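Since the JSDS format is not specified here, the following is a purely hypothetical sketch of a converter from the CoNLL 2007 column format into a simple JSON dependency encoding; the JSON field names are our own assumptions and not the actual JSDS schema.

import json

def conll_sentence_to_json(conll_lines):
    # CoNLL 2007 columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, ...
    tokens, relations = [], []
    for line in conll_lines:
        cols = line.rstrip("\n").split("\t")
        idx, form, lemma, pos = int(cols[0]), cols[1], cols[2], cols[4]
        head, deprel = int(cols[6]), cols[7]
        tokens.append({"id": idx, "form": form, "lemma": lemma, "pos": pos})
        relations.append({"dependent": idx, "head": head, "label": deprel})
    # field names below are illustrative only, not the real JSDS schema
    return json.dumps({"tokens": tokens, "relations": relations}, ensure_ascii=False)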


4. User evaluation, future directions,<br />

and conclusion<br />

In other work (Culy et al., 2011), we report on an

evaluation study that we did of an earlier version of<br />

xLDD. Two usability tests plus the collection of<br />

subsequent evaluative feedback were carried out with<br />

four subjects with linguistics and language didactics<br />

background. For testing the use of the different visual<br />

features in xLDD the subjects were asked to find<br />

specified dependency relations in nine different xLDD<br />

displays (e.g. with and without the coloring of arcs, with<br />

different types of leveled and staggered text, etc.). In the<br />

tests the users’ reactions to xLDD (thinking aloud) and<br />

their performance (time and errors for task completion)<br />

were recorded. In general, users preferred visual cues<br />

over text-based indications (e.g. details in the pop-up<br />

window for each lemma) for solving the given tasks.<br />

They found color-coding and placement of the arcs to be<br />

very useful, with vertical positioning of the text<br />

somewhat less so. They also would have preferred to<br />

have some control over the visual encodings, which was<br />

not possible in the test situation, but is integrated into<br />

some of the current sample applications of xLDD in<br />

response to the users' requests. Since users did not<br />

understand what, if anything, was being encoded by<br />

vertical positioning, giving them control over the vertical<br />

positioning might have made it more useful. The main<br />

negative reaction was to problems with overlapping<br />

arrows and text, especially when the figure is zoomed out<br />

(i.e. gets smaller). Back on the positive side, there was<br />

consensus that xLDD would be useful in language<br />

learning and teaching.<br />

Finally, there are issues about how to visualize<br />

mismatches between the dependency structure and the<br />

original sentence (which are also issues for linearization).<br />

One case is that of punctuation, which may not be<br />

included in the dependency structure, but which is in the<br />

original sentence. While we might visualize only the<br />

dependency structure proper, it seems useful for some<br />

applications (e.g. language learning) to include the<br />

original punctuation.<br />

A second case is that of null elements of a sentence that<br />

are included in some dependency structures, e.g. the<br />

TiGerDB. For example, the dependency structure for<br />

“Was nicht zur Politik wird, hat keinen Zweck.”



Figure 3: Presentation of corpus query results in the prototype thumbnails application; sentences matching the query<br />

(here “Heute” in corpus of German press releases) are presented as small xLDDs side by side with plain text.<br />

(TiGerDB 8247) has a null subject of “hat”. Since these<br />

null elements are not visible parts of the original sentence<br />

(no token is representing them), it is not clear how to<br />

visualize them. A similar question arises in dealing with<br />

multiple information contained within a single token. By<br />

(2009) and Boyd et al. (2007) make different decisions in<br />

how they handle these cases. For example, By (2009)<br />

inserts a null token following “zur” in the same example,<br />

corresponding to an empty determiner “der” (dative form<br />

of “die”) in the original Tiger structure, but Boyd et al.<br />

(2007) do not. This underscores our earlier comment that<br />

there is no agreement about the nature of dependency<br />

structures. A related issue has to do with abstract nodes,<br />

nodes which correspond to a syntactic category rather<br />

than to a null token. For example, the dependency<br />

diagram in TiGerDB for “Dazu bedarf es Kompetenz und<br />

eines gewissen Apparates.” (TiGerDB 8020) contains a<br />

node “coord” which is the head of a “coord_form”<br />

dependency with “und” as the dependent. “coord” is also<br />

the head of two “cj” dependencies with “Apparat” and<br />

“Kompetenz” as the dependents. Since “coord_form” is<br />

not a token in the sentence, it is not clear how to visualize<br />

it and its relations.<br />

A third visualization issue is where tokenization does not<br />

agree with orthographic boundaries (e.g. compounds in<br />

Tiger, where the compounds are separate elements in the<br />

original and in (By, 2009), but not in (Boyd et al., 2007)).<br />

We have done some preliminary experiments concerning<br />

these mismatches, but we plan on testing a wider range of<br />

examples. Finally, we can point out that all of these<br />

mismatches arise from ideas about dependency structures<br />

that vary from the idea of representing relations between<br />

words.<br />

In addition to addressing the functional difficulties<br />

evident in the evaluation, we have created a series of<br />

examples and prototype applications using xLDD that<br />

also take into account some of the other results of the<br />

evaluation. Several of the examples allow the user to<br />

specify which linguistic properties are encoded by which<br />

visual variables. While we can give the user full control<br />

over these encodings, often it is sufficient to use simple<br />

specifications of arc position and/or color of the arcs or<br />

tokens. Using too many visual variables is just as<br />

confusing as using none, or even more so. The specific<br />

choices of visual encodings depend on what the user is<br />

interested in – there is no single best encoding that<br />

encompasses all tasks and interests.<br />

One of the prototypes is an interactive diagram<br />

constructor for an on-line textbook. Given a sentence, the<br />

student can specify the relations among tokens, and the<br />

diagram will be constructed incrementally. It can also be<br />

verified against a correct diagram provided by the<br />

instructor. A second prototype combines a corpus query<br />

engine with xLDD. The search results (obtained via a<br />

web service) are presented as a table of the sentences and<br />

small versions of the diagrams (as shown in Figure 3). All<br />


these small diagrams can (simultaneously) have their<br />

visual encodings adjusted, and on clicking on any of them<br />

a larger version of that diagram is presented. These two<br />

prototypes underline the point that xLDD is a component<br />

which can be customized and used in any number of<br />

ways, and we hope that it will be adopted and adapted by<br />

others (e.g. in the context of CLARIN, the European

Research Infrastructure, http://www.clarin.eu).

In sum, xLDD is a new way of visualizing dependency<br />

structures, which incorporates advanced visualization<br />

techniques and provides flexibility for customizing the<br />

visualization. Color and position are used to encode<br />

information which is omitted or difficult to see in other<br />

dependency diagrams. Interaction provides even more<br />

opportunities to efficiently explore the structure. The<br />

preliminary results of a small-scale user study are<br />

promising, and give indications about what needs to be<br />

focused on for integration into specialized applications.<br />


5. References<br />

Bostock, M., Heer, J. (2009): Protovis: A Graphical<br />

Toolkit for Visualization. IEEE Transactions on<br />

Visualization and Computer Graphics, 15(6), pp.<br />

1121–1128.<br />

Boyd, A., Dickinson, M., Meurers, D. (2007): On<br />

representing dependency relations – Insights from<br />

converting the German TiGerDB. In Proceedings of<br />

the Sixth International Workshop on Treebanks and<br />

Linguistic Theories (TLT 2007, Bergen, Norway), pp.<br />

31–42.<br />

Brants, S., Dipper, S., Hansen, S., Lezius, W., Smith, G.<br />

(2002): The TIGER Treebank. In Proceedings of the<br />

First Workshop on Treebanks and Linguistic Theories<br />

(TLT 2002, Sozopol, Bulgaria), pp. 24–41.<br />

Buchholz, S., Marsi, E. (2006): CoNLL-X Shared Task<br />

on Multilingual Dependency Parsing. In Proceedings<br />

of the Tenth Conference on Computational Natural<br />

Language Learning (CoNLL-X, New York City, NY,<br />

USA), pp. 149–164.<br />

By, T. (2009): The TiGer Dependency Bank in Prolog<br />

format. In Proceedings of Recent Advances in<br />

Intelligent Information Systems (IIS’09, Warsaw,<br />

Poland), pp. 119–129.<br />

Culy, C., Lyding, V., Dittmann, H. (2011): xLDD:

Extended Linguistic Dependency Diagrams. In

Information Visualization: Proceedings of the 15th

International Conference on Information Visualization

(IV 2011, London, UK), pp. 164–169.

Gerdes, K., Kahane, S. (2009): Speaking in Piles:<br />

Paradigmatic annotation of French spoken corpus. In<br />

Proceedings of the Corpus Linguistics Conference<br />

(CL2009, Liverpool, UK).<br />

Hajič, J., Vidová Hladká, B., Pajas, P. (2001): The Prague<br />

Dependency Treebank: Annotation Structure and<br />

Support. In Proceedings of the IRCS Workshop on<br />

Linguistic Databases (Philadelphia, PA, USA), pp.<br />

105–114.<br />

Hudson, R. (1984): English Word Grammar. London:<br />

Blackwell.<br />

King, T.H., Crouch, R., Riezler, S., Dalrymple, M.,<br />

Kaplan, R.M. (2003): The PARC 700 Dependency<br />

Bank. In Proceedings of the 4th International<br />

Workshop on Linguistically Interpreted Corpora<br />

(LINC-03, Budapest, Hungary).<br />

Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J.,<br />

Riel, S., Yuret, D. (2007): The CoNLL 2007 Shared<br />

Task on Dependency Parsing. In Proceedings of the<br />

CoNLL Shared Task Session of EMNLP-CoNLL 2007<br />

(Prague, Czech Republic), pp. 915–932.<br />

Nivre, J., Hall, J., Nilsson, J. (2006): Maltparser: A<br />

data-driven parser-generator for dependency parsing.<br />

In Proceedings of the Fifth International Conference<br />

on Language Resources and Evaluation (LREC 2006,<br />

Genoa, Italy), pp. 2216–2219.<br />

Rohrdantz, C., Koch, S., Jochim, C., Heyer, G.,<br />

Scheuermann, G., Ertl, T., Schütze, H., Keim, D.A.<br />

(2010): Visuelle Textanalyse. Informatik-Spektrum,<br />

33(6), pp. 601–611.



A functional database framework for querying very large multi-layer corpora<br />

Roman Schneider<br />

Institut für deutsche Sprache

R5 6-13, D-68161 Mannheim<br />

schneider@ids-mannheim.de<br />

Abstract<br />

Linguistic query systems are special purpose IR applications. We present a novel state-of-the-art approach for the efficient<br />

exploitation of very large linguistic corpora, combining the advantages of relational database management systems (RDBMS) with<br />

the functional MapReduce programming model. Our implementation uses the German DEREKO reference corpus with multi-layer<br />

linguistic annotations and several types of text-specific metadata, but the proposed strategy is language-independent and adaptable<br />

to large-scale multilingual corpora.<br />

Keywords: corpus storage, multi-layer corpora, corpus retrieval, database systems<br />

1. Introduction<br />

In recent years, the quantitative examination of natural<br />

language phenomena has become one of the<br />

predominant paradigms within (computational)<br />

linguistics. Both fundamental research on the basic<br />

principles of human language, as well as the<br />

development of speech and language technology,<br />

increasingly rely on the empirical verification of<br />

assumptions, rules, and theories. More data are better<br />

data (Church & Mercer, 1993). Consequently, we notice

a growing number of national initiatives related to the<br />

building of large representative datasets for<br />

contemporary world languages. Besides written (and<br />

sometimes spoken) language samples, these corpora<br />

usually contain vast collections of morphosyntactic,<br />

phonetic, semantic etc. annotations, plus text- or corpus-specific

metadata. The downside of this trend is<br />

obvious: Even with specialized applications, our ability<br />

to store linguistic data is often bigger than the ability to<br />

process all this data.<br />

A lot of essential work towards the querying of<br />

linguistic corpora goes into data representation,<br />

integration of different annotation systems, and the<br />

formulation of query languages (e.g., Rehm et al., 2008;<br />

Zeldes et al., 2009; Kepser, Mönnich & Morawietz,<br />

2010). But the scaling problem still remains: As we go<br />

beyond corpus sizes of some million words, and at the<br />

same time increase the number of annotation systems<br />

and search keys, query costs rise disproportionately.<br />

This is due to the fact that unlike traditional IR systems,<br />

corpus retrieval systems not only have to deal with the<br />

“horizontal” representation of textual data, but with<br />

heterogeneous metadata on all levels of linguistic<br />

description. And, of course, the exploration of interrelationships<br />

between annotations becomes more and<br />

more challenging as the number of annotation systems<br />

increases. Given this context, we present a novel<br />

approach to scale up to billion-word corpora, using the<br />

example of the multi-layer annotated German Reference<br />

Corpus DEREKO.<br />

2. The Data<br />

The German Reference Corpus DEREKO currently<br />

comprises more than four billion words and constitutes<br />

the largest linguistically motivated collection of<br />

contemporary German. It contains fictional, scientific,<br />

and newspaper texts – as well as several other text types<br />

– and is annotated morphosyntactically with three<br />

competing systems (Connexor, Xerox, TreeTagger). The<br />

automated enrichment with additional metadata is<br />

underway.<br />


Figure 1: Response times for nested SQL queries with three search keys (logarithmic scaled axis)<br />

3. Existing Approaches<br />

We empirically evaluated the most prominent existing<br />

querying approaches, and contrasted them with our<br />

functional model (the full paper will contain our detailed<br />

series of measurements). Given the reasonable<br />

assumptions that XML/SGML-based markup languages<br />

are more suitable for data exchange than for efficient<br />

storing and retrieval, and that traditional file-based data<br />

storage is less robust and powerful than database<br />

management systems, we focused on the following<br />

strategies:<br />

i. In-Memory Search: Due to the fact that a<br />

computer’s main memory is still the fastest form of<br />

data storage, there are attempts to implement in-memory

databases even for considerably large<br />

corpora (Pomikálek, Rychlý & Kilgarriff, 2009).<br />

These indexless systems perform well for unparsed<br />

texts, but are strongly limited in terms of storage<br />

size and therefore cannot deal with data-intensive<br />

multi-layer annotations.<br />

ii. N-Gram Tables: In order to overcome physical<br />

limitations, newer approaches use database<br />

management systems and decompose sequences of<br />

strings into indexed n-gram tables (Davies, 2005).<br />

This allows queries over a limited number of search<br />

expressions, but space requirements for increasing<br />

values of n are enormous. Sentence-external queries<br />

with regular expressions or NOT-queries – both are<br />

crucial for comprehensive linguistic exploration –<br />

cannot use the n-gram-based indexes and thus<br />

perform rather poorly.

iii. Advanced SQL: Another strategy is to make use of<br />

the relational power of sub-queries and joins within<br />

a RDBMS. Chiarcos et al. (2008) use an<br />

intermediate language between query formulation<br />

and database backend; Bird et al. (2005) present an<br />

algorithm for the direct translation of linguistic<br />

queries into SQL. This approach uses absolute word<br />

positions, and therefore allows proximity queries<br />

without limitation of word distances. But again,<br />

even with the aid of the integrated cost-based<br />

optimizer (CBO), response times for increasing<br />

numbers of search keys become extremely long. We<br />

evaluated the proposed strategy on 1, 10, 100, 1000,



Figure 2: MapReduce processes for a concatenated query with eight search keys<br />

and 4000 million word corpora with rare-, low-,

mid-, high-, and top-level search keys and found out<br />

that concatenated queries soon exceed the capability<br />

of our reference server because nested loops<br />

generate an immense workload. Figure 1 shows the<br />

response times in seconds for the query

  select count(t1.co_sentenceid)
  from tb_token t1,
       (select co_id, co_sentenceid from tb_token where co_token = token1) t3,
       (select co_id, co_sentenceid from tb_token where co_token = token2) t2
  where co_token = token3
    and t1.co_sentenceid = t2.co_sentenceid
    and t1.co_sentenceid = t3.co_sentenceid
    and t1.co_id > t2.co_id
    and t2.co_id > t3.co_id;

using three search keys on

identical metadata types and a single-column index.<br />

This query simply counts the number of sentences<br />

that contain three specified tokens (token1, token2,<br />

token3) in a fixed order. Compared to a similar<br />

query on the 4000 Mio corpus with one search key<br />

(5s for a top-level search) or two search keys (56s),<br />

the increase of response time is obviously<br />

disproportional (301s). It gets remarkably less<br />

performant for searches on different metadata types<br />

(token, lemma, part-of-speech etc.) using multi-<br />

column indexes. Furthermore, by adding text-specific

metadata restrictions like text type or

publication year, this querying strategy produces<br />

response times of several hours and thereby<br />

becomes fully unacceptable for real-time<br />

applications.<br />

4. Design and Implementation<br />

As our evaluation shows, existing approaches do not<br />

handle queries with complex metadata on very large<br />

datasets sufficiently. In order to overcome bottlenecks,<br />

we propose a strategy that allows the distribution of data<br />

and processor-intensive computation over several<br />

processor cores – or even clusters of machines – and

facilitates the partition of complex queries at runtime<br />

into independent single queries that can be executed in<br />

parallel. It is based on two presuppositions:<br />

i. Mature relational DBMS can be used effectively to<br />

maintain parsed texts and linguistic metadata. We<br />

intensively evaluated different types of tables (heap<br />

tables, partitioned tables, index organized tables) as<br />

well as different index types (B-tree, bitmap,<br />

concatenated, functional) for the distributed storing<br />

and retrieval of linguistic data.<br />


ii. The MapReduce programming model supports<br />


distributed programming and tackles large-data<br />

problems. Though MapReduce is already in use in a<br />

wide range of data-intensive applications (Lin &<br />

Dyer, 2010), its principle of “divide and conquer”<br />

has not been employed for corpus retrieval yet.<br />

In order to prove the feasibility of our approach, we<br />

implemented our corpus storage and retrieval framework<br />

on a commodity low-end server (quad-core<br />

microprocessor with 2.67 GHz clock rate, 16GB RAM).<br />

For the reliable measurement of query execution times,<br />

and especially to avoid caching effects, we always used<br />

a cold-started 64-bit database engine.

Figure 3: Web-based retrieval form with our sample query

Figure 2 illustrates the map/reduce processes for a

complex query, using eight distinct search keys on

different metadata types: Find all sentences containing a

determiner immediately followed by a proper noun<br />

ending on “er”, immediately followed by a noun,<br />

immediately followed by the lemma “oder”, followed by

a determiner (any distance), immediately followed by a<br />

plural noun, followed by the lemma “sein” (any<br />

distance). Within a “map” step, the original query is<br />

partitioned into eight separate key-value pairs. Keys<br />

represent linguistic units (position, token, lemma, part-of-speech,

etc.), values may be the actual content. Thus,<br />

we can simulate regular expressions (a feature that is<br />

often demanded for advanced corpus retrieval systems,<br />

but difficult to implement for very large datasets).<br />

The queries can be processed in parallel and pass their<br />

results (sentence/position) to temporary tables. The



subsequent “reduce” processes filter out inappropriate<br />

results step by step. Usually, this cannot be executed in<br />

parallel, because each reduction produces the basis for<br />

the next step. But our framework, implemented with the<br />

help of stored procedures within the RDBMS,<br />

overcomes this restriction by dividing the process tree<br />

into multiple sub-trees. The reduce processes for each<br />

sub-tree are scheduled simultaneously, and aggregate<br />

their results after they are finished. So the seven reduce<br />

steps of our example can be executed within only four<br />

parallel stages.<br />
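The divide-and-conquer idea can be illustrated with a toy sketch. This is not the stored-procedure implementation described above: it drops the positional and distance constraints and simply intersects sentence-id sets, but it shows how the map phase and the staged, pairwise reduce phase run in parallel.

from multiprocessing import Pool

# toy inverted index standing in for the indexed token table: key -> sentence ids
TOY_INDEX = {
    "Berg": {1, 2, 5, 9},
    "hoch": {2, 5, 7},
    "sein": {2, 3, 5, 9},
}

def run_single_key_query(key):
    # one independent sub-query ("map"); the real system also returns token
    # positions so that word order and distance can be checked later
    return TOY_INDEX.get(key, set())

def intersect_pair(pair):
    left, right = pair
    return left & right        # simplified "reduce": sentences matching both inputs

def answer_query(search_keys):
    with Pool() as pool:
        partial = pool.map(run_single_key_query, search_keys)   # parallel map phase
        while len(partial) > 1:                                  # staged reduce phase
            pairs = list(zip(partial[0::2], partial[1::2]))
            leftover = [partial[-1]] if len(partial) % 2 else []
            partial = pool.map(intersect_pair, pairs) + leftover
    return partial[0] if partial else set()

if __name__ == "__main__":
    print(answer_query(["Berg", "hoch", "sein"]))   # -> {2, 5}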

Our concatenated sample query with eight multi-type

search keys on a four billion word corpus took less than<br />

four minutes, compared with several hours when<br />

employing SQL joins as in 3 (iii). The parallel<br />

MapReduce framework is invoked by an extensible<br />

web-based retrieval form (see figure 3) and stores the<br />

search results within the RDBMS, thus making it easy to<br />

reuse them for further statistical processing. Additional<br />

metadata restrictions (genre, topic, location, date) are<br />

translated into separate map processes and<br />

reduced/merged in parallel to the main search.<br />

5. Summary<br />

The results of our study demonstrate that the joining of<br />

relational DBMS technology with a functional/parallel<br />

computing framework like MapReduce combines the<br />

best of both worlds for linguistically motivated large-scale

corpus retrieval. On our reference server, it clearly<br />

outperforms other existing approaches. For the future,<br />

we plan some scheduling refinements of our parallel<br />

framework, as well as support for additional levels of<br />

linguistic description and metadata types.<br />

6. References<br />

Church, K., Mercer, R. (1993): Introduction to the<br />

Special Issue on Computational Linguistics Using<br />

Large Corpora. Computational Linguistics 19:1,<br />

pp. 1-24.<br />

Rehm, G., Schonefeld, O., Witt, A., Chiarcos, C.,<br />

Lehmberg, T. (2008): A Web-Platform for Preserving,<br />

Exploring, Visualising and Querying Linguistic<br />

Corpora and other Resources. Procesamiento del<br />

Lenguaje Natural 41, pp. 155-162.<br />

Zeldes, A., Ritz, J., Lüdeling, A., Chiarcos, C. (2009):<br />

ANNIS: A Search Tool for Multi-Layer Annotated<br />

Corpora. Proceedings of Corpus Linguistics 2009.<br />

July 20-23, Liverpool, UK.<br />

Kepser, S., Mönnich, U., Morawietz, F. (2010): Regular<br />

Query Techniques for XML-Documents. Metzing, D.,<br />

Witt, A. (Eds): Linguistic modeling of information<br />

and Markup Languages, Springer, pp. 249-266.<br />

Pomikálek, J., Rychlý, P., Kilgarriff, A. (2009): Scaling<br />

to Billion-plus Word Corpora. Advances in<br />

Computational Linguistics 41, pp. 3-13.<br />

Davies, M. (2005): The advantage of using relational<br />

databases for large corpora. International Journal of<br />

Corpus Linguistics 10 (3), pp. 307-334.<br />

Chiarcos, C., Dipper, S., Götze, M., Leser, U., Lüdeling,<br />

A., Ritz, J., Stede, M. (2008): A Flexible<br />

Framework for Integrating Annotations from<br />

Different Tools and Tag Sets. Traitement Automatique<br />

des Langues 49(2), pp. 271-293.<br />

Bird, S., Chen, Y., Davidson, S., Lee, H., Zhen, Y.<br />

(2005): Extending Xpath to Support Linguistic<br />

Queries. Workshop on Programming Language<br />

Technologies for XML (Plan-X).<br />

Lin, J., Dyer, C. (2010): Data-Intensive Text Processing<br />

with MapReduce. Morgan & Claypool Synthesis<br />

Lectures on Human Language Technologies.<br />


Hybrid Machine Translation for German in taraXÜ:<br />

Can translation costs be decreased without degrading quality?<br />

Aljoscha Burchardt, Christian Federmann, Hans Uszkoreit<br />

DFKI Language Technology Lab<br />

Saarbrücken& Berlin, Germany<br />

E-mail: {burchardt,cfedermann,uszkoreit}@dfki.de<br />

Abstract<br />

A breakthrough in Machine Translation is only possible if human translators are taken into the loop. While mechanisms for automatic<br />

evaluation and scoring such as BLEU have enabled fast development of systems, these systems have to be used in practice to get<br />

feedback for improvement and fine-tuning. However, it is not clear if and how systems can meet quality requirements in real-world,<br />

industrial translation scenarios. taraXÜ paves the way for wide usage of hybrid machine translation for German. In a joint consortium<br />

of research and industry partners, taraXÜ integrates human translators into the development process from the very beginning in a<br />

post-editing scenario collecting feedback for improvement of its core translation engines and selection mechanism. taraXÜ also<br />

performs pioneering work by integrating languages like Czech, Chinese, or Russian, that are not well studied to-date.<br />

Keywords: Hybrid Machine Translation, Human Evaluation, Post-Editing<br />

1. Introduction<br />

Machine Translation (MT) is a prime application of<br />

Language Technology. Research on Rule-Based MT<br />

(RBMT) goes back to the early days of Artificial Intelligence
in the 1960s and some systems have reached a high

level of sophistication (e.g. Schwall & Thurmair, 1997;<br />

Alonso & Thurmair, 2003). Since the mid-1990s, Statistical

MT (SMT) has become the prevalent paradigm in<br />

the research community (e.g. Koehn et al., 2007; Li et al.,<br />

2010). In the translation and localization industry,<br />

Translation Memory Systems (TMS) are used to support<br />

human translators by making informed suggestions for<br />

recurrent material that has to be translated.<br />

As human translators can no longer satisfy the constantly<br />

rising demand for translation, important questions that need to

be investigated are:<br />

1) How good is MT quality today, especially for translation<br />

from and to German?<br />

2) Which paradigm is the most promising one?<br />

3) Can MT aid human translators and can it help to<br />

reduce translation costs without sacrificing quality?<br />

These questions are not easy to answer and it is clear that<br />

research on the matter is needed. The quality of MT<br />

output cannot be objectively assessed in a<br />

once-and-for-all measure (see e.g. Callison-Burch et al.,<br />

2006) and it also strongly depends on the nature of the<br />

input material. Various MT paradigms have different<br />

strengths and shortcomings, not only regarding quality.<br />

For example, RBMT allows for a good control of the<br />

overall translation process, but setting up and maintaining<br />

such a system is very costly as it requires trained<br />

specialists. SMT is cheap, but it requires huge amounts of<br />

compute power and training data, which can make it<br />

difficult to include new languages and domains. TMS can<br />

produce human quality, but are limited in coverage due to<br />

their underlying design. Finally, the question of how<br />

human translators can optimally be supported in their<br />

translation workflow has largely remained untouched.

Machine Translation for German The number of<br />

available mono- and bi-lingual resources for German is<br />

quite high. In the “EuroMatrix” 1, which collects resources,
corpora, and systems for a large number of language pairs,
German ranks third behind English and
French. Still, little research has so far focused on MT

for language pairs including German, especially for<br />

translation tasks to and from languages other than English.<br />

1 http://www.euromatrixplus.net/matrix/<br />


This paper reports on taraXÜ 2 , which aims to address<br />

the aforementioned questions in a consortium consisting<br />

of partners from both research and industry. taraXÜ takes<br />

the selection from hybrid MT results including RBMT,<br />

TMS, and SMT as the first part of its analytic process.<br />

Then a self-calibration 3<br />

component is applied, extended by

controlled language technology and human<br />

post-processing to match real-world translation concerns.<br />

A novelty in this project is that human translators are<br />

integrated into the development process from the very<br />

beginning: Within several human evaluation rounds, the<br />

automatic selection and calibration mechanisms will be<br />

refined and iteratively improved. This paper focuses on<br />

hybrid translation (Section 2) and the large-scale human<br />

evaluation rounds in taraXÜ (Section 3). In the conclusion<br />

and outlook (Section 4), ongoing and future research<br />

is sketched.<br />


2. Hybrid Machine Translation<br />

Hybrid MT is a recent trend (e.g. Federmann et al., 2009;<br />

Chen et al., 2009) for improving the quality of MT. Based

on the observation that different MT systems often have<br />

complementary strengths and weaknesses, different methods<br />

for hybridization are investigated that aim to “fuse”<br />

an improved translation out of the good parts of several<br />

translation candidates.<br />

2 http://taraxu.dfki.de/<br />

3 Due to limited space, this won’t be discussed herein.<br />

Figure 1: Error classification interface used within taraXÜ.

Complementary Errors Typical difficulties for SMT<br />

are morphology, sentence structure, long-range<br />

re-ordering, and missing words, while strengths are<br />

disambiguation and lexical choice.<br />

RBMT systems are typically strong in morphology and sentence

structure, have the ability to handle long-range<br />

phenomena, and also ensure completeness of the resulting<br />

translation. Weaknesses arise from parsing errors and<br />

wrong lexical choice. The following examples illustrate<br />

the complementary nature of such systems’ errors.<br />

1) Source: Then, in the afternoon, the visit will<br />

culminate in a grand ceremony, at which Obama will<br />

receive the prestigious award.<br />

2) RBMT 4: Dann wird der Besuch am Nachmittag in

einer großartigen Zeremonie gipfeln, an der Obama<br />

die berühmte Belohnung bekommen wird.<br />

3) SMT 5: Dann am Nachmittag des Besuchs in

beeindruckende Zeremonie mündet, wo Obama den<br />

angesehenen Preis erhalten werden.<br />

As can be seen in the translation of Example 1), the

RBMT system generated a complete sentence, yet with a<br />

wrong lexical choice for award. The SMT system on the<br />

other hand generated the right reading, but made morphological<br />

errors and did not generate a complete German<br />

sentence. In the translation of Example 4), a parsing<br />

error in the analysis phase of the RBMT system led to an<br />

almost unreadable result while the SMT decoder gener-<br />

4 System used: Lucy MT (Alonso & Thurmair, 2003)

5 System used: phrase-based Moses (Koehn et al., 2007)



ated a generally intelligible translation, yet with stylistic<br />

and formal deficits.<br />

4) Source: Right after hearing about it, he described it<br />

as a “challenge to take action.”<br />

5) RBMT: Nachdem er richtig davon gehört hatte,<br />

bezeichnete er es als eine “Herausforderung, um<br />

Aktion auszuführen.”<br />

6) SMT: Gleich nach Anhörung darüber, beschrieb er<br />

es als eine “Herausforderung, Maßnahmen zu<br />

ergreifen.”<br />

Hybrid combination can hence lead to better overall<br />

translations.<br />

A Human-centric Hybrid Approach In contrast to
other hybrid approaches, taraXÜ is in the first place
designed to support human post-editing, e.g., in a translation
agency. Two different modes have to be handled by
the project’s selection mechanism:
• Human post-editing: Select the sentence that is
easiest to post-edit and have the user edit it.
• Standalone MT: Select the overall best translation
and present it to the user.

For the translation of 4), the best selection in Standalone<br />

MT mode would probably be 6), which is a useful<br />

translation, e.g., for information gisting. In Human<br />

post-editing mode, 5) would be a better selection as it can<br />

relatively quickly be transformed into 7), which is a<br />

human-quality translation.<br />

7) Human edit of 5): Gleich, nachdem er davon gehört<br />

hatte, bezeichnete er es als eine “Herausforderung,<br />

zu handeln.”<br />

One goal of taraXÜ is the design and implementation of<br />

such a novel selection mechanism; however, this is still
work in progress and will be described elsewhere. Apart
from properties of the source sentence (domain, complexity,
etc.) and the different translations (grammatical
correctness, sentence length, etc.), the selection mechanism
will also take into account “metadata” of the
various systems involved, such as runtime, number of
out-of-vocabulary warnings, number of different readings
generated, etc.
Figure 2: Post-editing interface used within taraXÜ.
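The concrete selection model is still being developed within the project; purely as an illustration of how such per-system metadata could be combined into a decision, the following minimal sketch scores translation candidates with a weighted feature function and picks the best one per mode. All feature names, weights, and the two mode profiles are hypothetical assumptions and are not taken from taraXÜ.

# Illustrative sketch only: weighted scoring over per-system "metadata"
# features (runtime, OOV warnings, number of readings, length ratio).
# Feature names, weights, and mode profiles are hypothetical.
from dataclasses import dataclass

@dataclass
class Candidate:
    system: str            # e.g. "RBMT", "SMT", "TMS"
    translation: str
    runtime: float         # decoding time in seconds
    oov_warnings: int      # out-of-vocabulary warnings reported by the system
    n_readings: int        # number of different readings generated
    length_ratio: float    # target/source sentence length ratio

# Hypothetical weights; a real system would calibrate these on human feedback.
WEIGHTS = {
    "post-editing": {"oov": -2.0, "runtime": -0.1, "readings": -0.5, "length": -1.0},
    "standalone":   {"oov": -1.0, "runtime":  0.0, "readings": -0.2, "length": -2.0},
}

def score(c: Candidate, mode: str) -> float:
    w = WEIGHTS[mode]
    return (w["oov"] * c.oov_warnings
            + w["runtime"] * c.runtime
            + w["readings"] * c.n_readings
            + w["length"] * abs(1.0 - c.length_ratio))

def select(candidates, mode):
    """Return the candidate with the highest score for the given mode."""
    return max(candidates, key=lambda c: score(c, mode))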

One industry partner in the project consortium provides<br />

modules for language checking that will not only be used<br />

in the selection mechanism, but also in pre-processing of<br />

the input. Starting from the observation that many translation<br />

problems arise from problematic input, another<br />

goal of taraXÜ is to develop automatic methods for<br />

pre-processing the input before it is sent to the MT

engines.<br />

3. Large-Scale Human Evaluation<br />

Several large-scale human evaluation rounds are foreseen

within the duration of taraXÜ, mainly for the calibration<br />

of both the selection mechanism and the

pre-editing steps, but also for measuring the time needed<br />

for post-editing, and for getting a detailed error classification<br />

on the translation output from the various MT<br />

systems under investigation. The evaluation rounds are<br />

performed by external Language Service Providers that<br />

usually offer human translation services and hence are<br />

considered to act as non-biased experts.<br />

Evaluation Procedure The language pairs that will be<br />

implemented and tested during the runtime of taraXÜ are<br />

listed in Table 1.<br />

German ⇔ English, French, Japanese, Russian, Spanish
English ⇔ Chinese, Czech
Table 1: Language pairs treated in taraXÜ.


We use an extended version of the browser-based evalu-<br />

ation tool Appraise (Federmann, 2010) to collect human<br />

judgments on the translation quality of the various sys-<br />

tems under investigation in taraXÜ. A screenshot of the
error classification interface can be seen in Figure 1; the

post-editing view is presented in Figure 2.<br />

Pilot Evaluation Round The first (pilot) evaluation<br />

round of taraXÜ includes the language pairs EN→DE,<br />

DE→EN, and ES→DE. The corpus size per language<br />

pair is about 2,000 sentences, with the data taken mainly from

previous WMT shared tasks, but also extracted from<br />

freely available technical documentation. Two evaluation<br />

tasks will be performed by the human annotators, mirroring<br />

the two modes of our selection mechanism:

1) In the first task, the annotators have to rank the<br />

output of four different MT systems according to

their translation quality. In a subsequent step, they<br />

are asked to classify the two main types of errors (if<br />

any) of the chosen best translation. We use a subset<br />

of the error types suggested by (Vilar et al., 2006), as<br />

shown in Figure 1.<br />

2) The second task for the human annotators in the first<br />

evaluation round is to select the translation that is
easiest to post-edit and to perform the editing. Only
minimal post-editing should be performed.

Some first results of the ongoing analysis of the

first human evaluation round are shown in Table 2. The<br />

top of the table shows the overall ranking among the four
listed systems; bold face indicates the best system. Below

are the results for translation from Spanish and English<br />

into German, respectively. At the bottom of the table,

overall results on selected corpora are shown from the<br />

news domain (1,030 sentences from the WMT-2010<br />

news test set of Callison-Burch et al. (2010), sub-sampled<br />

proportionally from each of its documents) and from the

technical documentation of the OpenOffice project.<br />

One observation is that the systems’ ranks are comparatively

close except for Trados, which is not a proper MT<br />

system. The very good result of Trados on the news<br />

corpora requires further investigation. A noticeable result<br />

is that Google performs worst on the WMT corpus although<br />

the data should—in principle—have been available<br />

online for training; this will also require some more<br />

detailed inspection. The latter might, however, explain<br />

the good performance of the web-based system on the<br />

OpenOffice corpus.<br />


Lucy Moses Trados Google<br />

Overall 2.00 2.38 3.74 1.86<br />

DE-EN 2.01 2.46 3.80 1.73<br />

ES-DE 1.85 2.42 3.72 1.99<br />

EN-DE 2.12 2.28 3.71 1.89<br />

WMT10 2.52 2.59 2.21 2.69<br />

OpenOffice 1.72 2.77 3.95 1.56<br />

Table 2: First human ranking results, as the average<br />

rank of each system in each task.<br />
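The figures in Table 2 are average ranks, so lower numbers are better. Purely as a minimal sketch of this aggregation (the example data below are invented and the code is not the project's evaluation tooling), the average rank per system can be computed as follows:

# Average rank per system from per-sentence rankings (invented example data).
from collections import defaultdict

# Each entry gives, for one sentence, the rank (1 = best) assigned to each system.
rankings = [
    {"Lucy": 2, "Moses": 3, "Trados": 4, "Google": 1},
    {"Lucy": 1, "Moses": 2, "Trados": 4, "Google": 3},
]

totals, counts = defaultdict(float), defaultdict(int)
for sentence_ranks in rankings:
    for system, rank in sentence_ranks.items():
        totals[system] += rank
        counts[system] += 1

average_rank = {system: totals[system] / counts[system] for system in totals}
print(average_rank)  # {'Lucy': 1.5, 'Moses': 2.5, 'Trados': 4.0, 'Google': 2.0}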

4. Conclusions and Outlook<br />

In this paper, we have argued and shown evidence that a<br />

human-centric hybrid approach to Machine Translation is<br />

a promising way of integrating this technology into industrial<br />

translation workflows. Even at this early stage,

taraXÜ has generated positive feedback and raised interest,<br />

especially on the side of the industry partners. We<br />

reported early results from the first (pilot) evaluation of<br />

taraXÜ, including language pairs EN→DE, DE→EN,<br />

and ES→DE. After analyzing the results of this pilot,<br />

further evaluation rounds will iteratively extend the<br />

numbers of languages covered and include questions<br />

related to topics such as controlled language, error types,<br />

and the effect of different subject domains. In the presentation<br />

of this paper, we will include a more detailed<br />

discussion of the first evaluation results.<br />

5. Acknowledgements<br />

This work has partly been developed within the taraXÜ<br />

project financed by TSB Technologiestiftung Berlin –<br />

Zukunftsfonds Berlin, co-financed by the European Union<br />

– European Fund for Regional Development. This work

was also supported by the EuroMatrixPlus project<br />

(IST-231720) that is funded by the European Community<br />

under the Seventh Framework Programme for Research<br />

and Technological Development.



6. References<br />

Alonso, J. A., Thurmair, G. (2003): The Comprendium

translator system. In Proceedings of the Ninth Machine<br />

Translation Summit.<br />

Callison-Burch, C., Koehn, P., Monz, C., Peterson, K.,<br />

Przybocki, M., Zaidan, O. (2010): Findings of the 2010

joint workshop on statistical machine translation and<br />

metrics for machine translation. In Proceedings of the<br />

Joint Fifth Workshop on Statistical Machine Translation<br />

and MetricsMATR, pp. 17–53, Uppsala, Sweden.<br />

Association for Computational Linguistics. Revised<br />

August 2010.<br />

Callison-Burch, C., Osborne, M., Koehn, P. (2006):<br />

Re-evaluating the role of BLEU in machine translation

research. In Proceedings of the 11th Conference of the<br />

European Chapter of the Association for Computational<br />

Linguistics, pp. 249–256.<br />

Chen, Y., Jellinghaus, M., Eisele, A., Zhang, Y., Hunsicker,<br />

S., Theison, S., Federmann, C., Uszkoreit, H.<br />

(2009): Combining multi-engine translations with<br />

Moses. In Proceedings of the Fourth Workshop on<br />

Statistical Machine Translation, pp. 42–46, Athens,<br />

Greece. Association for Computational Linguistics.<br />

Federmann, C. (2010): Appraise: An open-source toolkit<br />

for manual phrase-based evaluation of translations. In<br />

Proceedings of the Seventh Conference on International
Language Resources and Evaluation. European

Language Resources Association (ELRA).<br />

Federmann, C., Theison, S., Eisele, A., Uszkoreit, H.,<br />

Chen, Y., Jellinghaus, M., Hunsicker, S. (2009):<br />

Translation combination using factored word substitution.<br />

In Proceedings of the Fourth Workshop on<br />

Statistical Machine Translation, pp. 70–74, Athens,<br />

Greece. Association for Computational Linguistics.<br />

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C.,<br />

Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran,<br />

C., Zens, R., Dyer, C. J., Bojar, O., Constantin, A.,<br />

Herbst, E. (2007): Moses: Open source toolkit for statistical<br />

machine translation. In Proceedings of the 45th<br />

Annual Meeting of the Association for Computational<br />

Linguistics Companion Volume Proceedings of the<br />

Demo and Poster Sessions, pp. 177–180, Prague,<br />

Czech Republic. Association for Computational Linguistics.<br />

Li, Z., Callison-Burch, C., Dyer, C., Ganitkevitch, J.,<br />

Irvine, A., Khudanpur, S., Schwartz, L., Thornton, W.,<br />

Wang, Z., Weese, J., Zaidan, O. (2010): Joshua 2.0: A<br />

toolkit for parsing-based machine translation with<br />

syntax, semirings, discriminative training and other<br />

goodies. In Proceedings of the Joint Fifth Workshop on<br />

Statistical Machine Translation and MetricsMATR,<br />

pp. 133–137, Uppsala, Sweden. Association for<br />

Computational Linguistics.<br />

Papineni, K., Roukos, S., Ward, T., Zhu, W.-J. (2001):<br />

Bleu: a method for automatic evaluation of machine<br />

translation. IBM Research Report RC22176<br />

(W0109-022), IBM.<br />

Schwall, U., Thurmair, G. (1997): From METAL to T1:

systems and components for machine translation applications.<br />

In Proceedings of the Sixth Machine<br />

Translation Summit, pp. 180–190.

Vilar, D., Xu, J., D’Haro, L. F., and Ney, H. (2006): Error<br />

Analysis of Machine Translation Output. In International<br />

Conference on Language Resources and Evaluation,<br />

pp. 697–702, Genoa, Italy.


Annotation of Explicit and Implicit Discourse Relations<br />

in the TüBa-D/Z Treebank<br />

Anna Gastel, Sabrina Schulze, Yannick Versley, Erhard Hinrichs<br />

SFB 833, Universität Tübingen

E-mail: (yannick.versley|erhard.hinrichs|sabrina.schulze)@uni-tuebingen.de, anna.gastel@student.uni-tuebingen.de,<br />

Abstract<br />

We report on an effort to add annotation for discourse relations, discourse structure, and topic segmentation to a<br />

subset of the texts of the Tübingen Treebank of Written German (TüBa-D/Z), which will allow the study of discourse<br />

relations and discourse structure in the context of the other information currently present in the corpus (including<br />

syntax, referential annotation, and named entities). This paper motivates the design decisions taken in the context of<br />

existing annotation schemes for RST, SDRT or the Penn Discourse Treebank, provides an overview of the
annotation scheme, and presents the results of an agreement study. In the agreement study, we use the notion of inter-

adjudicator agreement to show that the task of discourse annotation, while challenging in principle, can be<br />

successfully solved when using appropriate heuristics.<br />

Keywords: discourse, annotation, text segmentation, agreement<br />

1. Introduction<br />

Discourse information has proven useful for a

number of tasks, including summarization (Schilder,<br />

2002) and information extraction (Somasundaran et al.,<br />

2009). While coreference corpora exist for many<br />

languages, and in large and very large sizes (frequently<br />

over one million words), the annotation of discourse<br />

structure and discourse relations has only recently<br />

gained the interest of the community at large.<br />

Many of the existing corpora containing discourse<br />

structure and/or discourse relations are tightly bound to<br />

existing discourse theories such as Rhetorical Structure<br />

Theory (RST, Mann & Thompson, 1988) or Segmented<br />

Discourse Representation Theory (Asher, 1993), or<br />

subscribe to a fundament of coherence relations while<br />

avoiding assumptions about discourse structure (Hobbs,<br />

1985; Wolf & Gibson, 2005).<br />

While annotation guidelines for corpora such as the RST<br />

Discourse Treebank (Carlson et al., 2003; see Stede<br />

2004, and van der Vlieth et al., 2011 for German and

Dutch corpora, respectively, following these guidelines),<br />

an SDRT corpus (Hunter et al., 2007), or the Penn<br />

Discourse Treebank (PDTB, Prasad et al., 2007; see Al-<br />

Saif & Markert, 2010 for an effort towards an Arabic<br />

counterpart) generally agree on the idea of discourse<br />

relations between discourse segments, they do differ in<br />

other important aspects: RST (in particular, Carlson &<br />

Marcu, 2001) and the SDRT guidelines of (Reese et al.,<br />

2007) start from elementary discourse units (EDUs)<br />

that form the lowest level of a hierarchical structure; the<br />

PDTB's guidelines avoid the notion of discourse units,<br />

elementary or not, by asking annotators to mark<br />

connective arguments which may, but do not have to,<br />

coincide with syntactic or larger units, and do not need<br />

to form a hierarchy.<br />

In terms of the relation inventory, the most important<br />

desideratum consists in reconciling descriptive adequacy<br />

for the linguistic phenomena involved with an inventory<br />

size that can still be annotated reliably. This problem is<br />

solved in different ways: The RST guidelines contain a<br />

coarse level of 16 relation classes, which are further<br />

specified into 78 relations which are organized by<br />

nuclearity (where mononuclear relations put greater<br />

weight on one of the units, the nucleus, whereas<br />


multinuclear relations connect units that are equally
important); Reese et al.'s guidelines for SDRT annotation
do not posit any larger categories among their 14
relations, but organize them by a distinction between
coordinating and subordinating relations (cf. Asher &
Vieu, 2005; this distinction vaguely corresponds to
RST's notion of nuclearity), as well as by veridicality
(where a relation is veridical if the larger unit containing
it cannot be asserted without also asserting the truth of
the relation arguments). The PDTB, in contrast, contains
30 relations which are organized into a taxonomy with
16 relations at the middle level and 4 relatively coarse
top-level classes (Temporal, Contingency, Comparison,
Expansion).

CONTINGENCY [28.8%]
Causal [20.5%]
(c)Result-Cause (5.9%)
(c)Result-Enable (4.7%)
(c)Result-Epistemic (0.4%)
(c)Result-Speechact (0.4%)
(s)Explanation-Cause (6.6%)
(s)Explanation-Enable (1.2%)
(s)Explanation-Epistemic (1.1%)
(s)Explanation-Speechact (0.6%)
Conditional [3.0%]
(c)Consequence (2.1%)
(c)Alternation (0.5%)
(c)Condition (0.5%)
Denial [5.6%]
(c)ConcessionC (4.0%)
(s)Concession (2.0%)
(s)Anti-Explanation (0.5%)
EXPANSION [43.6%]
Elaboration [23.6%]
(s)Restatement (10.9%)
(s)Instance (3.4%)
(s)InstanceV (1.0%)
(s)Background (9.1%)
Interpretation [4.2%]
(s)Summary (1.0%)
(s)Commentary (3.3%)
Continuation [6.8%]
(c)Continuation (6.4%)
TEMPORAL [14.35%]
(c)Narration (9.3%)
(s)Precondition (2.4%)
COMPARISON [11,.%]
(c)Parallel (3.3%)
(c)ParallelV (1.1%)
(c)Contrast (7.0%)
REPORTING [9.5%]
(s)Attribution (4.2%)
(s)Source (6.0%)
Table 1: Taxonomy of discourse relations with corpus frequencies

For someone aiming to annotate a corpus with discourse
structure, the choice is not easy: The Penn Discourse
Treebank carefully avoids any strong commitments to
the ideas it uses as a backdrop (such as Webber 2004;
Knott et al., 2001), treating the annotation more like a
collection of examples that can be mined to verify
aspects of the theory; Al-Saif and Markert (2010), for
their work on PDTB-style annotation of Arabic
discourse, found it necessary to drastically simplify the
annotation scheme (from 30 to 12 relations) in order to
yield a feasible scheme for their annotation of explicit
discourse connectives.

Rhetorical Structure Theory, the most mature of the
models for an annotation scheme, has also drawn a
commensurate amount of (oftentimes valid) criticism:

The most important one is that RST defines its relations<br />

in terms of speaker intentions, which yields good<br />

descriptive adequacy (given an appropriate inventory of<br />

relations), but fares less well for cognitive plausibility<br />

(cf. the overview of critiques in Taboada & Mann,<br />

2006), with Sanders and Spooren (1999) claiming that<br />

RST lacks a separation between intentions, which are<br />

defined in terms of speaker and hearer, and their goals<br />

(as is customary in RST), and coherence relations,<br />

which connect two propositions. In a similar vein, Stede<br />

(2008) puts forward the claim that RST's notion of<br />

nuclearity encompasses criteria on different linguistic<br />

levels that are not always in agreement with each other.<br />

Despite SDRT's focus on coherence relations and its<br />

strong theoretical commitment to coherence relations
and their role in structuring the text, attempts to realize
these principles in a general scheme for the discourse
annotation of text have been few and far between,
with the unpublished corpus of Hunter et al. (2007) being

the most notable example.<br />

Hierarchical structuring of discourse is a well-established

concept, not only because it reflects the<br />

principles that have been successful in structural<br />

accounts of syntax (see Polanyi & Scha, 1983; Grosz &<br />

Sidner, 1986, or Webber, 1991, inter alia), but also<br />

because it allows us to formulate well-formedness<br />

(coherence) constraints, as well as accessibility (Webber,<br />

1991) in terms of local configurations.



While such a tree structure is classically motivated<br />

through intentional notions (the discourse segment<br />

purposes of Grosz & Sidner, 1986), the notion of<br />

question under discussion has been used in information<br />

structure to explain intonational focus in terms of (a<br />

hierarchy of) question under discussion (van Kuppevelt,<br />

1995; Roberts, 1996; Büring 2003; also Polanyi et al.,
2003 for a related proposal). It also allows one to couch

well-formedness in terms of valid sub-questions (for<br />

subordination) or being (non-exhaustive) answers to a<br />

common question (for coordination; cf. Txurruka, 2003).<br />

Hence, we have, in addition to object-level relations<br />

(part-of, causality), an additional level of relations such<br />

as Contrast which are explainable in terms of<br />

information-structural notions, and which yet fulfill the<br />

intuition (made explicit by Roberts, 1996) that at any

given point in discourse, interlocutors have a common<br />

notion of the discourse structure. This level is distinct<br />

from the upper-level structure that is the result of<br />

conscious structuring of the writer (possibly following<br />

genre-specific rules). As an example, some of the very<br />

general RST relations such as Motivation or<br />

Preparation are only explainable in terms of writer<br />

intentions and conscious text structuring, which may or<br />

may not be transparent to the average recipient.<br />

Our own annotation scheme reflects van Kuppevelt's<br />

and Roberts' intuitions about a shared structure in<br />

discourse: We found it important to keep a backbone of<br />

explicit hierarchical structure, as in RST's annotation<br />

scheme, but also to avoid vague relations between large<br />

text segments, which are often genre-specific or the<br />

(sometimes idiosyncratic) result of intentional text<br />

structuring by the author. The PDTB successfully uses<br />

the metaphor of implicit connectives to limit discourse<br />

relations to connective-argument-sized pieces; in our<br />

case, we reconcile an explicit notion of (shallow)<br />

hierarchy with a focus on coherence relations by<br />

dividing the text into topically coherent stretches (as<br />

discussed, e.g., by Hearst, 1997), which we call topic<br />

segments, and annotate hierarchical discourse structure<br />

(using SDRT's notion of co- and subordinating discourse<br />

relations) inside these topic segments.<br />
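Purely as an illustration of the annotation layers just described (the class and field names below are hypothetical and do not reflect the project's actual file format), a text can be represented as a sequence of topic segments, each holding its EDUs and the coordinating or subordinating relations between them:

# Hypothetical data model: topic segments containing EDUs and
# co-/subordinating discourse relations between them.
from dataclasses import dataclass, field

@dataclass
class EDU:
    edu_id: int
    text: str

@dataclass
class DiscourseRelation:
    label: str            # e.g. "Explanation-Cause"
    arg1: int             # EDU id of the first argument
    arg2: int             # EDU id of the second argument
    subordinating: bool   # True for (s) relations, False for (c) relations

@dataclass
class TopicSegment:
    edus: list = field(default_factory=list)
    relations: list = field(default_factory=list)

# The two EDUs below correspond to example (2) discussed later in the paper.
segment = TopicSegment(
    edus=[EDU(1, "Taxifahrer sind als Kolumnenthema eigentlich tabu,"),
          EDU(2, 'weil sie als "weiche Angriffsziele" gelten.')],
    relations=[DiscourseRelation("Explanation-Cause", 1, 2, subordinating=True)],
)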

In the following text, section 2 gives more details on the<br />

corpus and on the annotation scheme, whereas section 3<br />

presents an experiment to establish the reliability of our<br />

scheme using an inter-annotator agreement study.<br />

Section 4 presents and summarizes our findings.<br />

2. Corpus and Annotation Scheme<br />

As a textual basis for the corpus, we selected newspaper<br />

articles from the syntactically and referentially<br />

annotated TüBa-D/Z corpus (Telljohann et al., 2009),<br />

with the current version totalling 919 sentences in 31<br />

articles, or about 29.6 sentences/article (against 20.6<br />

sentences/article on average in the complete TüBa-D/Z,<br />

which also includes very brief newswire-style reports),<br />

and altogether 1159 discourse relations and 103 topic<br />

segments (or about 9 sentences per topic segment).<br />

The relation inventory, and the distribution of different<br />

relation types, is presented in Table 1. From the starting<br />

point of the coordinating and subordinating discourse<br />

relations in Reese et al., we found it necessary to<br />

introduce finer distinctions in some places, both to ensure
consistency with a related effort on annotating
explicit connectives (adding new relations such as
Result-Enable, which corresponds to the Weak-Result
relation proposed by Bras et al., 2006, for SDRT), and
to capture the distinction between Contrast and Concession,
which is found in both the Penn Discourse Treebank and
the RST annotation guidelines, but not in Reese et al.'s
proposal.

The resulting 28 relations can be grouped into 8<br />

medium-level and 5 upper-level relation types by<br />

considering properties such as basic operation (causal<br />

vs. additive vs. temporal, with referential as a new group<br />

to account for elaborative relations) and symmetry as<br />

proposed by Sanders et al. (1992); the resulting higher-level

types of discourse relations have much in common<br />

with the top-level taxonomic categories of the Penn<br />

Discourse Treebank with a small number of exceptions<br />

(the PDTB subsumes the non-symmetrical Concession<br />

relation under the label Comparison whereas we follow<br />

Sanders et al. in assuming a causal source of coherence<br />

for Concession and an additive source of coherence for<br />

the symmetrical Contrast relation; our Reporting group

includes the Attribution and Source relations that Hunter<br />

et al. use in accounting for reported facts, whereas the<br />

Penn Discourse Treebank, unlike RST and SDRT, treats<br />

attribution as an issue that is orthogonal to discourse<br />

structure).<br />


The hierarchical organization of relations according to<br />

basic operation does not differentiate between additional<br />

properties such as coordination/subordination or<br />

veridicality. Examples (1) and (2) serve to illustrate this<br />

distinction: 1<br />

(1) a) Private Unternehmen dürfen die Telefonbücher<br />


der Telekom-Tochter DeTeMedien nicht ohne<br />

deren Erlaubnis zur Herstellung einer<br />

Telefonauskunfts-CDs verwenden.<br />

b) Die beklagten Unternehmen müssen den Vertrieb<br />

der Info-CDs sofort einstellen.<br />

Result-Cause(1a,1b)<br />

(2) a) Taxifahrer sind als Kolumnenthema eigentlich<br />

tabu,<br />

b) weil sie als "weiche Angriffsziele" gelten.<br />

Explanation-Cause(2a,2b)<br />

When the situation specified in Arg1(1a) is interpreted<br />

as the cause of the situation specified in Arg2 (1b), the<br />

relation between those two arguments is labeled Result-<br />

Cause. Both arguments are necessary for coherence, so<br />

they are coordinated. The second example is labeled<br />

Explanation-Cause, because the situation specified in<br />

Arg1(2a) is interpreted as the result of the situation<br />

specified in Arg2 (2b). The situation in (2a) contains the<br />

main information while the situation in (2b) contributes<br />

background information. With subordinating relations,<br />

Arg2 ('further information') is always subordinated to<br />

Arg1 ('main information'), independently of surface<br />

order, as can be seen in the following two examples:

(3) a) Zwei Ex-Mafiosi behaupten zudem,<br />

b) von dem Mordauftrag Andreottis gewußt zu<br />

haben.<br />

Attribution(3a,3b)<br />

(4) a) Nach Angaben von Polizeipräsident Hagen<br />

Saberschinsky<br />

b) haben Polizeibeamte einen ihrer Kollegen<br />

angezeigt.<br />

Source(4b,4a)<br />

In example (3) the main information is situated in Arg1:<br />

It is relevant for the coherence of the text to know that<br />

two mobsters testified to knowing about the murder

contract of Andreotti, which makes them important<br />

witnesses in the murder charges against Andreotti.<br />

1 TüBa-D/Z sentences 2563/2564, 7482/7483<br />

Therefore Arg2 is subordinated to Arg1. In example (4)<br />

the main information, namely that police officers press<br />

charges against one of their colleagues, is given by (4b).<br />

Therefore, 4b is the Arg1 of a Source relation, as it is<br />

more important to know about the complaint itself than<br />

to know where the information came from, and 4a is<br />

subordinated under 4b (cf. Hunter et al., 2007).<br />

Table 1 contains all discourse relations. Numbers in<br />

square brackets represent the distribution of the overall<br />

class. Numbers in parentheses represent the distribution<br />

of the single relation.<br />

In the table, coordinating relations are marked with a<br />

small 'c' in front of the relation and subordinating<br />

relations are marked with a small 's'.<br />

3. An experiment on inter-annotator and<br />

inter-adjudicator agreement<br />

For any annotation scheme that ventures into the domain<br />

of semantic and/or pragmatic distinctions, reliability is<br />

an issue that needs to be addressed explicitly in order to<br />

maintain the predictability of the annotated data (or,<br />

equivalently, the predictive power of conclusions from<br />

that data).<br />

Regarding the agreement on discourse relations, Marcu<br />

et al. (1999) determined κ values between κ=0.54<br />

(Brown corpus) and κ=0.62 (MUC) for fine-grained<br />

RST relations and between κ=0.59 (Brown) and κ=0.66

(MUC) for coarser-grained relations. In their reliability<br />

study with the Penn Discourse Treebank, Prasad<br />

et al. (2008) determined agreement values between 80%<br />

(finest level) and 94% (coarsest level with 4 relation<br />

types), but did not report any chance-corrected values.<br />

Al-Saif and Markert (2010) report values of κ=0.57 for<br />

their PDTB-inspired connective scheme, saying that<br />

most disagreements are due to highly ambiguous<br />

connectives such as w/and, which can receive one of<br />

several relations. In a study on their Dutch RST corpus,<br />

van der Vlieth et al. (<strong>2011</strong>) found an inter-annotator<br />

agreement of κ=0.57. To the best of our knowledge, no<br />

agreement figures have been published on the RST-based

Potsdam Commentary Corpus (Stede, 2004) or<br />

any other German corpus with discourse relation<br />

annotation.<br />

In the regular annotation process of our corpus, two<br />

annotators create EDU segmentation, topic segments,



and discourse relations independently from each other;<br />

in a second step, the results from both annotators are<br />

compared and a coherent gold-standard annotation is<br />

created after discussing the goodness-of-fit of respective<br />

partial analyses to the text and the applicability of<br />

linguistic tests. In order to account for the complete<br />

annotation process including the revision step, we<br />

follow Burchardt et al. (2006) and separately report<br />

inter-annotator agreement, which is determined after the<br />

initial annotation, and inter-adjudicator agreement,<br />

which is determined after an additional adjudication<br />

step. The adjudication step is carried out by two<br />

adjudicators based on the original set of annotations, but<br />

is performed by each adjudicator independently from the<br />

other.<br />

In the case where multiple relations were annotated<br />

between the same EDU ranges (for example, a temporal<br />

Narration relation in addition to a Result-Cause relation<br />

from the Contingency group), we counted the<br />

annotations as matching whenever the complete set of<br />

relations (i.e. {Narration, Result-Cause} in the example)<br />

is the same across annotators.<br />
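To make this comparison concrete, the following minimal sketch pairs the two annotators' relation labels on identical EDU ranges, treats multiple relations on one span as a set (as described above), and computes Cohen's κ over the matched pairs. The data structures are invented for illustration; only the matching and the chance-corrected agreement computation are shown.

# Sketch: match annotated spans, compare relation sets, compute Cohen's kappa.
from collections import Counter

def cohen_kappa(pairs):
    """Cohen's kappa for a list of (label_a, label_b) pairs."""
    n = len(pairs)
    p_observed = sum(a == b for a, b in pairs) / n
    freq_a = Counter(a for a, _ in pairs)
    freq_b = Counter(b for _, b in pairs)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical annotations: {(start_edu, end_edu): set of relation labels}
annotator_a = {(1, 2): {"Narration", "Result-Cause"}, (3, 4): {"Contrast"}}
annotator_b = {(1, 2): {"Narration", "Result-Cause"}, (3, 4): {"Concession"}}

shared_spans = annotator_a.keys() & annotator_b.keys()
pairs = [(frozenset(annotator_a[s]), frozenset(annotator_b[s])) for s in shared_spans]
print(cohen_kappa(pairs))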

In a sample of three documents that we used for our<br />

agreement study, we found that annotators agreed on 49<br />

relations spans, with the comparison yielding an<br />

agreement value of κ=0.55 for individual relations, and<br />

κ=0.65 for the middle level of the taxonomy (eight<br />

relation types).<br />

For the inter-adjudicator task, we found an agreement on<br />

82 relation spans, among which relation agreement was<br />

at κ=0.83 for individual relations, and κ=0.85 for the<br />

middle level of the taxonomy, or a reduction of<br />

disagreements of about 57%.<br />

4. Discussion and Conclusion<br />

In this article, we have presented the annotation scheme<br />

we use to annotate discourse relations of complete texts<br />

in a subset of the TüBa-D/Z corpus, and reported the<br />

results of an agreement study using these guidelines and<br />

relation inventory. While the raw inter-annotator<br />

agreement is on a similar level as other annotation<br />

efforts with a similar scope, we found that a subsequent<br />

adjudication step introduces a rather substantial<br />

reduction in disagreements (between adjudicated<br />

versions that were obtained independently of each<br />

other), which suggests that a large part of the (raw)<br />

disagreement is due to the sheer complexity of the task<br />

and should not be taken as indicating the infeasibility of<br />

discourse structure (and discourse relation) annotation in<br />

general.<br />

The public availability of a corpus with discourse<br />

relation annotation in combination with the syntactic<br />

and referential annotation from the main TüBa-D/Z<br />

corpus will also make it possible to provide an empirical

evaluation of theories concerning the interface between<br />

syntax and discourse, such as D-LTAG (Webber, 2004)<br />

or D-STAG (Danlos, 2009) as well as those that predict<br />

interactions between referential and discourse structure<br />

(Grosz & Sidner 1986; Cristea et al., 1998; Webber,<br />

1991; Chiarcos & Krasavina, 2005, inter alia).<br />

5. References<br />

Al-Saif, A., Markert, K. (2010): Annotating discourse<br />

connectives for Arabic. In Proc. LREC 2010.<br />

Asher, N. (1993): Reference to Abstract Objects in Discourse.

Kluwer, Dordrecht.<br />

Asher, N., Lascarides, A. (2003): Logics of Conversation.<br />

Cambridge University Press, Cambridge.<br />

Asher, N., Vieu, L. (2005): Subordinating and coordinating<br />

discourse relations. Lingua 115, 591-610.<br />

Bras, M., Le Draoulec, A., Asher, N. (2006): Evidence for a<br />

Scalar Analysis of Result in SDRT from a Study of the<br />

French Temporal Connective 'alors'. In: SPRIK<br />

Conference ”Explicit and Implicit Information in Text -<br />

Information Structure across Languages”.<br />

Burchardt, A., Erk, K., Frank, A., Kowalski, A., Padó, S.,<br />

Pinkal, M. (2006): The SALSA Corpus: a German<br />

Corpus Resource for Lexical Semantics. In Proceedings<br />

of LREC 2006.<br />

Büring, D. (2003): On D-Trees, Beans, and B-Accents.<br />

Linguistics and Philosophy 26(5), pp. 511-545.<br />

Carlson, L., Marcu, D. (2001): Discourse Tagging Manual.<br />

ISI Tech Report ISI-TR-545.<br />

Carlson, L., Marcu, D., Okurowski, M. E. (2003): Building<br />

a Discourse-Tagged Corpus in the Framework of<br />

Rhetorical Structure Theory. In: Current Directions in<br />

Discourse and Dialogue, Kluwer.<br />

Chiarcos, C., Krasavina, O. (2005): Rhetorical Distance<br />


Revisited: A Parametrized Approach. In Workshop on<br />

Constraints in Discourse (CID 2005).<br />

Cristea, D., Ide, N., Romary, L. (1998): Veins Theory: A<br />

Model of Global Discourse Cohesion and Coherence. In<br />

Proc. CoLing 1998.<br />

Danlos L. (2009): D-STAG : Un formalisme d'analyse<br />

automatique de discours basé sur les TAG synchrones.<br />

Revue TAL 50 (1), pp. 111-143.<br />

Grosz, B., Sidner, C. (1986): Attention, Intentions, and the<br />

structure of discourse. Computational Linguistics 12(3),<br />

pp. 175-204.<br />

Hearst, M. (1997): TextTiling: Segmenting Text into Multi-<br />

Paragraph Subtopic Passages, Computational<br />

Linguistics, 23 (1), pp. 33-64.<br />

Hobbs, J. (1985): On the Coherence and Structure of<br />

Discourse, Report No. CSLI-85-37, Center for the Study<br />

of Language and Information, Stanford University.<br />

Hunter, J., Baldridge, J., N. Asher (2007): Annotation for<br />

and Robust Parsing of Discourse Structure on<br />

Unrestricted Texts. Zeitschrift für Sprachwissenschaft

26, pp. 213-239.<br />

Knott, A., Oberlander, J., O'Donnell, M., Mellish, C.<br />

(2001): Beyond Elaboration: The interaction of relations<br />

and focus in coherent text. In: Sanders, Schilperoord,<br />

Spooren (eds.), Text representation: linguistic and<br />

psycholinguistic aspects. John Benjamins.<br />

Mann, W. C., Thompson, S. A. (1988): Rhetorical Structure

Theory: Toward a functional theory of text organization.<br />

Text 8, pp. 243-281.<br />

Marcu, D., Amorrortu, E., Romera, M. (1999):<br />

Experiments in Constructing a Corpus of Discourse<br />

Trees. ACL Workshop on Standards and Tools for<br />

Discourse Tagging.<br />

Polanyi, L., Scha. R. (1983): On the Recursive Structure of<br />

Discourse. In K. Ehlich & H. Van Riemsdijk (Eds.),<br />

Connectedness in sentence, discourse and text,<br />

pp. 141–178. Tilburg: Tilburg University<br />

Prasad, R., Miltsakaki, M., Dinesh, N., Lee, A., Joshi, A.,<br />

Robaldo, L., Webber, B. (2007): The Penn Discourse<br />

Treebank 2.0 Annotation Manual. Technical Report,<br />

University of Pennsylvania.<br />

Reese, B., Denis, P., Asher, N., Baldridge, J., Hunter, J.<br />

(2007): Reference Manual for the Analysis and<br />

Annotation of Rhetorical Structure. Technical Report,<br />

University of Texas at Austin.<br />

Roberts, C. (1996): Information Structure in Discourse:

Towards an Integrated Formal Theory of Pragmatics. In<br />

Yoon, Kathol (eds.), OSU Working Papers in Linguistics

49: Papers in Semantics, pp. 91-136.<br />

Sanders, T. J. M., Spooren, W. P. M., Noordman, L. G. M.<br />

(1992): Toward a Taxonomy of Coherence Relations.<br />

Discourse Processes 15, pp. 1-35.<br />

Sanders, T. J. M., Spooren, W. P. M. (1999):<br />

Communicative intentions and coherence relations. In<br />

Bublitz, Lenk, Ventola (eds.) Coherence in Text and<br />

Discourse, pp. 235-250. John Benjamins, Amsterdam.<br />

Schilder, F. (2002): Robust discourse parsing via discourse<br />

markers, topicality and position. Natural Language<br />

Engineering 8(2), pp. 235-255.<br />

Somasundaran, S., Namata, G., Wiebe, J., Getoor, L.<br />

(2009): Supervised and Unsupervised Methods in<br />

Employing Discourse Relations for Improving Opinion<br />

Polarity Classification. In Proc. EMNLP 2009.<br />

Stede, M. (2004): The Potsdam Commentary Corpus. In<br />

Proc. ACL Workshop on Discourse Annotation.<br />

Telljohann, H., Hinrichs, E. W., Kübler, S., Zinsmeister, H.,<br />

Beck, K. (2009): Stylebook for the Tübingen Treebank<br />

of Written German (TüBa-D/Z). Technical Report,<br />

Seminar für Sprachwissenschaft, Universität Tübingen.

Txurruka, I. G. (2003): The Natural Language Conjunction<br />

And. Linguistics and Philosophy 26(3), pp. 255-285.<br />

van der Vlieth, N., Berzlanovich, I., Bouma, G., Egg, M.,
Redeker, G. (2011): Building a Discourse-Annotated

Dutch Text Corpus. In Proceedings of the DGfS<br />

Workshop “Beyond Semantics”, Bochumer<br />

Linguistische Arbeitsberichte 3.<br />

van Kuppevelt, J. (1995): Discourse Structure, Topicality<br />

and Questioning. Linguistics 31, pp. 109-147.<br />

Webber, B. (1991): Structure and Ostension in the<br />

Interpretation of Discourse Deixis. Natural Language<br />

and Cognitive Processes 6(2), pp. 107-135.<br />

Webber, B. (2004): DLTAG: Extending Lexicalized TAG<br />

to Discourse. Cognitive Science 28, pp. 751-779.



Devil’s Advocate on Metadata in Science<br />

Christina Hoppermann, Thorsten Trippel, Claus Zinn<br />

General and Computational Linguistics, University of Tübingen<br />

Wilhelmstraße 19, D-72074 Tübingen<br />

E-mail: christina.hoppermann@uni-tuebingen.de, thorsten.trippel@uni-tuebingen.de, claus.zinn@uni-tuebingen.de<br />

Abstract<br />

This paper uses a devil’s advocate position to highlight the benefits of metadata creation for linguistic resources. It provides an<br />

overview of the required metadata infrastructure and shows that this infrastructure has meanwhile been developed by various projects

and hence can be deployed by those working with linguistic resources and archiving. Possible caveats of metadata creation are<br />

mentioned, starting with user requirements and backgrounds, the contribution to researchers’ academic merits, and standardisation.

These are answered with existing technologies and procedures, referring to the Component Metadata Infrastructure (CMDI). CMDI<br />

provides an infrastructure and methods for adapting metadata to the requirements of specific classes of resources, using central<br />

registries for data categories, and metadata schemas. These registries allow for the definition of metadata schemas per resource type<br />

while reusing groups of data categories also used by other schemas. In summary, rules of best practice for the creation of metadata are<br />

given.<br />

Keywords: metadata, Component Metadata Infrastructure (CMDI), infrastructure, sustainable archives<br />

1. Introduction<br />

The creation of primary research data and its analysis constitutes a

large share of a researcher’s workload. In linguistics,<br />

research data comprises many different types: there are<br />

resources such as corpora, lexicons, and grammars; there<br />

are various kinds of experimental data resulting, for<br />

example, from perception and production studies with<br />

sensor data originating from eye-tracking and MRI<br />

(magnetic resonance imaging) devices. There is data in<br />

the form of speech recordings, written text, and videotaped

gestures, which, in part, is annotated or transcribed along<br />

many different layers; there is audio and video data of<br />

other forms of human-human communication such as<br />

cultural or religious songs or dances; and there is also a<br />

large variety of software tools for the manipulation,<br />

analysis and interpretation of all these types of data<br />

sources.<br />

Once a study of research data yields statistically and<br />

scientifically significant results, it is documented and<br />

published, usually complementing a description of<br />

research methodology, interpretations of results, etc.,<br />

with a depiction of the underlying research data.<br />

Reputable journals are archived so that their articles
remain accessible for a long time. Access to articles is

usually facilitated via Dublin Core (DC) metadata<br />

categories such as ”author”, “title”, “journal”,<br />

“publisher” or “publication year”. In general, however,<br />

there is no infrastructure in place to access the research<br />

data underlying a reported study, although some<br />

researchers make such data available via their webpage<br />

or institution, and some conferences or journals ask<br />

authors to supplement their article with primary data,<br />

which is then also made public. 1<br />

So far, it is not the<br />

general rule to describe research data with metadata for<br />

indexing or cataloguing, whether by the researchers themselves or by others. In part,

this is due to caveats about the provision of metadata held

by large parts of the scientific community. In this paper,<br />

the Devil’s Advocate (DA) will articulate some of these<br />

caveats. We will aim at rebutting each of them, given the<br />

recent advances in metadata management, in particular

in the area of linguistics.<br />
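For illustration, the following minimal sketch builds the kind of bibliographic Dublin Core record mentioned above for a journal article; the bibliographic values are invented, and mapping the journal name to dc:source is only one common convention rather than a prescribed rule of the DC element set.

# Sketch: a minimal Dublin Core record for a journal article (invented values).
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

record = ET.Element("record")
fields = {
    "creator": "Doe, Jane",                           # "author"
    "title": "A Study of Example Phenomena",
    "source": "Journal of Illustrative Linguistics",  # "journal" (one common mapping)
    "publisher": "Example Press",
    "date": "2011",                                   # "publication year"
}
for name, value in fields.items():
    ET.SubElement(record, "{%s}%s" % (DC, name)).text = value

print(ET.tostring(record, encoding="unicode"))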

2. Playing Devil’s Advocate<br />

DA: There is little if any scientific merit to be gained<br />

from resource and metadata provision.<br />

This is a view mentioned in a recent statement by the Wissenschaftsrat 2, which says that infrastructure hardly provides for an increased scholarly reputation (Wissenschaftsrat, 2011:23).

1<br />

For example, Interspeech 2011 invited authors to submit

supporting data files to be included on the Proceedings<br />

CD-ROM in case of paper acceptance.<br />

2<br />

The German Wissenschaftsrat is a joint council of German Research Foundation officials and researchers appointed by the government to advise it on research-related issues.


Though this might be true
for a restricted notion of scientific merit, that is, the merit
defined by the number of published journal articles
and books, it is not true in a less restricted sense.
Furthermore, the Wissenschaftsrat (Wissenschaftsrat,
2011:23) points out that infrastructural projects offer the
opportunity for methodological innovations, generate new
research questions, and help attract new researchers.

If new researchers, methods and research questions are<br />

part of the scientific merit, the claim that there is no<br />

scientific merit in metadata provision is thus not true.<br />

There are even more reasons for arguing that additional<br />

scientific merits are gained, at least in three overlapping<br />

areas: (1) by providing a complete overview of the

field, (2) by fostering interoperability and providing<br />

reproducible, non-arbitrary results, and (3) by increasing<br />

the pace of gaining research results.<br />

First of all, in an ideal case, a metadata-driven resource<br />

inventory gives an accurate picture of a scientific<br />

landscape by containing all resource types such as<br />

corpora, lexical databases, or experiments. By having<br />

access to all these resources, in principle, nothing is<br />

gained because it is too time-consuming to analyse and<br />

reproduce research questions from the data. But as soon<br />

as resources are described by metadata, it is possible to<br />

classify, sort, and provide an overview of them using

the descriptions as such. Though descriptions contain<br />

generalisations, they are still sufficient to provide an<br />

outline of resources. This also serves the purpose of<br />

providing essential background for steering research<br />

activities and funding projects as well as to discover<br />

trends and gaps, all of which helps to increase the researcher’s

reputation and merit.<br />

Second, the metadata-based publicity fosters<br />

communication between researchers, for example,<br />

because contact information is required to gain access to

resources, comparable data structures are needed to be<br />

reusable by other methods, or because selections of<br />

resources (e.g. subcorpora) have to be created. Resources<br />

can be merged and cross-evaluated to discover which<br />

results are reproducible. This helps to avoid fraud and<br />

plagiarism. At the same time, the investigation of<br />

research questions different from the original ones can be<br />


applied to existing resources. In all cases, good scientific<br />

practice will credit the resource creator, and thus add to<br />

his or her reputation when a publication makes reference<br />

to its underlying research data, which is possible on the<br />

basis of appropriate metadata. The references pointing to<br />

the resources can be indexed by others and are<br />

consequently added to the scientific map.<br />

Third, more and faster results can be created. By<br />

providing metadata, researchers new to a discipline gain<br />

a faster overview of the research questions and

activities of a discipline as well as easier access to<br />

existing linguistic resources and tools. Moreover,<br />

accurate metadata descriptions can help avoid the

duplication of research work by providing insights and<br />

access to existing work. Hence, researchers who are<br />

applying new methods do not always have to recreate<br />

resources but can rely on existing ones, providing a<br />

jumpstart. At the same time, the resources as such
provide added benefit by being more widely used,

thereby also increasing the reputation of the creator.<br />

DA: Expert knowledge on metadata is required to<br />

properly describe research data. Thus metadata<br />

experts rather than researchers are called for duty.<br />

The library sciences, with their long tradition and<br />

expertise in metadata, have many different classification<br />

systems in place to organise collections. But is it realistic<br />

to ask researchers, such as linguists, to properly describe<br />

language resources and tools with metadata, given their<br />

lack of knowledge in metadata provision, the variety and<br />

complexity of research data, and the lack of a dominant
metadata scheme in the field? On the other hand, it

seems clear that metadata provision cannot be done<br />

properly without the researchers’ involvement. It is<br />

unrealistic to assume that some research data can be just<br />

given to a librarian with expertise in linguistics (or a<br />

linguist with expertise in archiving methodology) with<br />

the task of assigning proper metadata to it. There needs to be

considerable involvement of the resource creator in<br />

describing the resource in formal (where possible) and<br />

informal terms (possibly by filling out a questionnaire).<br />

The “librarian” can then enter the provided information<br />

into a formal schema, ensuring that, at least, obligatory<br />

descriptors are properly provided. In sum, to put a proper<br />

metadata-based infrastructure in place, some minimal<br />

researcher training in metadata provision is needed. This



needs to be complemented with infrastructure personnel,<br />

or, if possible, with user-friendly metadata editors that<br />

trained researchers can learn to use.<br />

DA: There is little if any consensus on the set of

metadata descriptors or metadata schemes to be used<br />

in describing language resources and tools.<br />

It is clear that a common vocabulary for metadata<br />

provision is required. Otherwise it will be hard to offer<br />

effective metadata-based search and retrieval services. It<br />

is also evident that established metadata standards such<br />

as Dublin Core are insufficient, as they do not include<br />

every data category (DatCat) needed for describing<br />

specific types of resources. However, given the<br />

complexity of the research field in linguistics with its<br />

many different resource types, it is naïve to assume that<br />

established metadata schemas can be reused without<br />

losing descriptive power. For example, resource types

need to be indicated and for different resource types<br />

additional descriptive categories need to be defined. For<br />

lexical resources it is common to describe the lexical<br />

structures, for annotations the annotation tag sets, for<br />

experiments the size of the samples and the free and<br />

bound variables. Each of these data categories is only<br />

relevant for a particular type of resource, but for that type it can be more essential than categories such as "title" and "author". As this list of data categories may require additions as new resource types become available, it needs to be treated as an open list.

In recent times, some consensus on the procedure of<br />

creating elementary field names for the description of<br />

linguistic research data has been achieved in order to<br />

allow for a standardisation of data categories. It is<br />

formally captured by the ISOcat data category registry<br />

for the description of language resources and tools (ISO<br />

12620; International Organization of Standardization,<br />

2009; http://www.isocat.org). ISOcat (Figure 1) is an<br />

open web-based registry of data categories into which<br />

everybody can insert their own data categories with

(human-readable) definitions of their intended use. This<br />

is done in a private space with limited access that can be<br />

used by researchers to include new data categories not yet<br />

intended or not ready for standardisation. For private use,<br />

these data categories can already be referenced via<br />

persistent identifiers (PIDs) but they can also be stored in<br />

a public space with unrestricted access and be proposed<br />

as standard data categories. If the data categories are<br />

submitted for standardisation, a standardisation process<br />

involving domain experts is initiated, with

community consensus building, quality assurance, voting<br />

and maintenance cycles.<br />

Figure 1: Relation between ISOcat, Component<br />

Registry and metadata instances<br />

The registry provides a solid base to start from, but the<br />

sheer size of available DatCats may overwhelm untrained<br />

users. Additional structures are needed to minimise cases<br />

where different users may apply different descriptors to<br />

provide similar resources with metadata. For this purpose,<br />

the Component Registry for metadata (Figure 1; Broeder<br />

et al., 2010; http://catalog.clarin.eu/ds/ComponentRegistry/#)<br />

contains a rich set of prefabricated metadata building<br />

blocks that aggregate elementary blocks of data<br />

categories into larger compounds. Researchers can select<br />

and combine existing building blocks – or define new

ones – in a schema, which can then be instantiated to<br />

describe a given resource with the help of a so-called<br />

metadata instance (Figure 1). The concept of reusing<br />

building blocks is part of the Component Metadata<br />

Infrastructure (CMDI, http://www.clarin.eu/cmdi). For<br />

many resource types the registry already contains<br />

prefabricated schemas that can be re-used by researchers.<br />

Moreover, there exists at least one fully functional<br />

metadata editor (http://www.lat-mpi.eu/tools/arbil/) with<br />

interfaces to both ISOcat and the Component Registry. It<br />

is freely available and support is provided by the<br />

programmers to facilitate the use of the editor for<br />

non-expert users who otherwise might be overwhelmed<br />

by the total range of functions the editor offers. There are<br />

also other XML editors supporting the schemas. Once a<br />

schema is defined with these tools, these off-the-shelf<br />


XML editors are available to describe resources with<br />

metadata according to the metadata schema. These<br />

schemas can then be used to validate the metadata<br />

instances with the help of syntactical parsers to ensure the<br />

adherence to syntactic structures and controlled<br />

vocabulary.<br />
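To make this validation step concrete, here is a minimal sketch (not part of the paper) of checking a metadata instance against an XML schema with Python's lxml; the file names metadata_instance.xml and cmdi_profile.xsd are hypothetical placeholders.

```python
# Minimal sketch of syntactic metadata validation (hypothetical file names).
from lxml import etree

def validate_instance(instance_path, schema_path):
    """Return True if the metadata instance conforms to the schema (XSD)."""
    schema = etree.XMLSchema(etree.parse(schema_path))
    instance = etree.parse(instance_path)
    if schema.validate(instance):
        return True
    # Report violations, e.g. missing obligatory descriptors or values
    # outside a controlled vocabulary modelled as an XSD enumeration.
    for error in schema.error_log:
        print(f"line {error.line}: {error.message}")
    return False

if __name__ == "__main__":
    validate_instance("metadata_instance.xml", "cmdi_profile.xsd")
```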

DA: There is rarely a right time to make a resource<br />

public (via metadata description).<br />

Research rarely follows a fully planned path. A resource<br />

such as a corpus or a lexicon is adjusted, additional layers<br />

of annotation or transcription are added, data may get<br />

re-annotated with different coders, lexical entries may get<br />

revised or extended to reflect new insights, etc.<br />

Nevertheless, the moment publications are created and<br />

project reports are written, it should be good scientific

practice to archive the underlying research data and to<br />

assign and publish metadata about the resource. Here, the<br />

current status of the resource can be marked with<br />

metadata about, for instance, the resource’s life cycle or<br />

versioning information.<br />

There has also been a policy change among the funding agencies. The German Research Foundation (DFG), for instance, stipulates that resources ought to be maintained by the

originating institution; researchers are responsible for the<br />

proper documentation of resources, and procedures need<br />

to be defined for the case when they leave an institution<br />

(Deutsche Forschungsgemeinschaft, 1998:13). A proper<br />

documentation of resources has to include their<br />

description in terms of metadata to facilitate their<br />

archival and future retrieval.<br />

Therefore, metadata should be provided (or revised) at the latest at the end of a research project, preferably by the

researchers who have created the resource. Ideally, the<br />

life cycle stage at archiving time is already defined in the<br />

project work plan. Even if the desired final state was not<br />

accomplished, the primary data needs to be archived by<br />

the end of the project with proper metadata assigned to it.<br />

DA: Without a central metadata agency, all the added<br />

values advertised will not materialise.<br />

Added values such as searchability and citation of<br />

resources require some point of access to the metadata. It<br />

is true that there is no single central metadata agency, but there are various interconnected agencies providing metadata services to the community.


For instance, the German NaLiDa project<br />

(http://www.sfs.uni-tuebingen.de/nalida/) serves as a<br />

metadata centre for resources and tools created in<br />

Germany. The project as such does not claim exclusive<br />

representation, but aims at cooperating with other<br />

archives in providing a service to the community for<br />

accessing metadata in the form of catalogues and<br />

allowing easy access to resources. It harvests metadata<br />

from participating institutions and also provides<br />

metadata management support for German research<br />

institutions (Barkey et al., 2011). Within the project, a faceted search interface was developed, complemented by a full-text search engine (http://www.sfs.uni-tuebingen.de/nalida/katalog), which currently provides access to more than 10,000 metadata records of

language resources and tools. Though the NaLiDa project<br />

could be seen as a central metadata agency, its<br />

implementation has a rather decentralised flavour.<br />

Metadata is harvested from various sources and then<br />

aggregated and indexed into a single database. To<br />

kick-start or increase the inflow of data, participating<br />

institutions receive help both in setting up an OAI-PMH-based 3 data provision service and in other

aspects of metadata creation and maintenance. Once the<br />

local metadata providers – the primary research data<br />

remains with the institutions – are set up, parties other than NaLiDa are free to crawl their data sets and to provide services based on them.
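As an illustration of such harvesting, the following hedged Python sketch walks the ListRecords responses of an OAI-PMH provider; the endpoint URL is a placeholder, and the code is not part of the NaLiDa or CLARIN tooling.

```python
# Sketch of harvesting metadata records via OAI-PMH (endpoint URL is hypothetical).
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def list_records(base_url, metadata_prefix="oai_dc"):
    """Yield the <metadata> element of every record offered by the provider."""
    token = None
    while True:
        url = f"{base_url}?verb=ListRecords&" + (
            f"resumptionToken={token}" if token else f"metadataPrefix={metadata_prefix}"
        )
        with urllib.request.urlopen(url) as response:
            tree = ET.parse(response)
        for record in tree.iter(f"{OAI}record"):
            metadata = record.find(f"{OAI}metadata")
            if metadata is not None:
                yield metadata
        # Continue until the provider stops sending a resumption token.
        token_elem = tree.find(f".//{OAI}resumptionToken")
        token = token_elem.text if token_elem is not None and token_elem.text else None
        if not token:
            break

for md in list_records("http://example.org/oai"):
    pass  # aggregate and index the harvested records here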

At the European level, the CLARIN project<br />

(http://www.clarin.eu) has also devised such a crawler,<br />

and is likewise offering a faceted search interface for<br />

language resources and tools (CLARIN Virtual Language<br />

Observatory, http://www.clarin.eu/vlo/). Since both (and<br />

other) parties work towards the realisation of a common<br />

infrastructure, with different foci but similar goals, there<br />

is much to be gained for the scientific community from healthy competition and an exchange of ideas.

3. Summary<br />

Given the recent advances in linguistics with regard to<br />

metadata provision for linguistic resources and tools,<br />

there are few excuses left for not using the existing infrastructure. In general, this results in the

3 Open Archives Initiative Protocol for Metadata Harvesting



following rules of best practice for the documentation of<br />

resources:<br />

1) One of the best strategies for preserving research data is to publish it in repositories and networks. This way, multiple archives serve as

backup. Additionally, it allows for an easier sharing<br />

and spreading of resources, contributing to the<br />

academic merits of resource providers.<br />

2) Archived data is more easily accessible if the data is

sufficiently described. As flexible metadata schemas<br />

can adapt for various types of resources, it is possible<br />

to create such descriptions as required by the type of<br />

a resource. Metadata can then be used to make<br />

resources public, in order for others to use (harvest)<br />

them.<br />

3) Data categories are best defined in central (standardised) registries, such as ISOcat, that allow for references via persistent identifiers. No data categories should be used that are not centrally defined, to avoid fragmentation of the resource community.
4) For interoperability purposes, components as collections of data categories should be reused where adequate or defined as new entries in the Component Registry for reuse by others.

5) The flexibility of the framework helps to avoid tag<br />

abuse if data providers adhere to data category<br />

definitions or, if not available, define their own<br />

modified categories. This will contribute to the<br />

consistency and reusability of data.<br />

6) Syntactic evaluation of metadata should always be performed to ensure harvesting, usability of applications and consistency. By checking for content models, tag abuse can be avoided further.
7) When using research data, it should be cited as stated in the data's metadata.

8) Resource creators might need some training and<br />

assistance, which is provided by various projects.<br />

Time for this work should therefore be planned.

4. Acknowledgements<br />

Work on this paper was conducted within the Centre for<br />

Sustainability of Linguistic Data (Zentrum für Nachhaltigkeit Linguistischer Daten, NaLiDa), which is

funded by the German Research Foundation (DFG) in the<br />

Scientific Library Services and Information Systems<br />

(LIS) framework, and within the infrastructure project<br />

Heterogeneous Primary Research Data: Representation<br />

and Processing of the Collaborative Research Centre The<br />

Construction of Meaning: the Dynamics and Adaptivity<br />

of Linguistic Structures (SFB 833), which is also funded<br />

by the DFG.<br />

5. References<br />

Barkey, R., Hinrichs, E., Hoppermann, C., Trippel, T., Zinn, C. (2011): Komponenten-basierte Metadatenschemata und Facetten-basierte Suche - Ein flexibler und universeller Ansatz. In J. Griesbaum, T. Mandl & C. Womser-Hacker (eds.), Information und Wissen: global, sozial und frei? Internationales Symposium der Informationswissenschaft (Hildesheim). Boizenburg: Verlag Werner Hülsbusch (vwh), pp. 62-73.

Broeder, D., Kemps-Snijders, M., Van Uytvanck, D., Windhouwer, M., Withers, P., Wittenburg, P., Zinn, C. (2010): A Data Category Registry- and Component-based Metadata Framework. In Proceedings of the 7th Conference on International Language Resources and Evaluation, 19-21 May 2010, European Language Resources Association.

Deutsche Forschungsgemeinschaft (1998): Vorschläge zur Sicherung guter wissenschaftlicher Praxis: Empfehlungen der Kommission „Selbstkontrolle in der Wissenschaft", Denkschrift. Weinheim: Wiley-VCH. See http://www.dfg.de/download/pdf/dfg_im_profil/reden_stellungnahmen/download/empfehlung_wiss_praxis_0198.pdf (retrieved March 31, 2011).

International Organization of Standardization (2009): Terminology and other language and content resources - Specification of data categories and management of a Data Category Registry for language resources (ISO-12620-2009), Geneva. Go to www.isocat.org to access the registry.

Wissenschaftsrat (2011): Empfehlung zu Forschungsinfrastrukturen in den Geistes- und Sozialwissenschaften. Berlin: 28/01/2011. See http://www.wissenschaftsrat.de/download/archiv/10465-11.pdf (retrieved March 31, 2011).


Improving an Existing RBMT System by Stochastic Analysis<br />

Christian Federmann, Sabine Hunsicker<br />

DFKI – Language Technology Lab<br />

Stuhlsatzenhausweg 3, D-66123 Saarbrücken, GERMANY<br />

E-mail: {cfedermann,sabine.hunsicker}@dfki.de<br />

Abstract<br />

In this paper we describe how an existing, rule-based machine translation (RBMT) system that follows a transfer-based translation<br />

approach can be improved by integrating stochastic knowledge into its analysis phase. First, we investigate how often the rule-based<br />

system selects the wrong analysis tree to determine the potential benefit from an improved selection method. Afterwards we describe<br />

an extended architecture that allows integrating an external stochastic parser into the analysis phase of the RBMT system. We report<br />

on the results of both automatic metrics and human evaluation and also give some examples that show the improvements that can be<br />

obtained by such a hybrid machine translation setup. While the work reported on in this paper is a dedicated extension of a specific<br />

rule-based machine translation system, the overall approach can be used with any transfer-based RBMT system. The addition of<br />

stochastic knowledge to an existing rule-based machine translation system represents an example of a successful, hybrid<br />

combination of different MT paradigms into a joint system.<br />

Keywords: Machine Translation, Hybrid Machine Translation, Stochastic Parsing, System Combination<br />

1. Introduction<br />

Rule-based machine translation (RBMT) systems that<br />

employ a transfer-based translation approach highly depend on the quality of their analysis phase, as it provides the basis for the later processing phases, namely

transfer and generation. Any parse failures encountered<br />

in the initial analysis phase will proliferate and cause<br />

further errors in the following phases. Very often, bad<br />

translation results can be traced back to incorrect analysis<br />

trees that have been computed for the respective input<br />

sentences. Consequently, any improvements that can be<br />

achieved for the analysis phase of an RBMT system

lead to improved translation output, which makes this an<br />

interesting topic in the context of hybrid machine<br />

translation.<br />

In this paper we describe how a stochastic parser can<br />

supplement the rule-based analysis phase of a<br />

commercial RBMT system. The system in question is the<br />

rule-based engine Lucy LT. This engine uses a<br />

sophisticated RBMT transfer approach with a long<br />

research history, as explained in detail in (Wolf et al.,<br />

2010). The output of its analysis phase is a forest<br />

containing a small number of tree structures. For this<br />

study we investigated if the existing rule base of the Lucy<br />

LT system chooses the best tree from the analysis forest<br />

and how the selection of this best tree out of the set of<br />

candidates can be improved by adding stochastic<br />

knowledge to the RBMT system.<br />

The paper is structured in the following way: in Section 2<br />

we describe the Lucy RBMT system and its<br />

transfer-based architecture. Afterwards, in Section 3, we<br />

provide details on the integration of a stochastic parser<br />

into the Lucy analysis phase of this rule-based system.<br />

Section 4 describes the experiments we performed and<br />

reports the results of both automated metrics and human<br />

evaluation efforts before Section 5 discusses some<br />

examples that show how the proposed approach has<br />

improved or degraded machine translation quality.<br />

Finally, in Section 6, we conclude and provide an outlook<br />

on future work in this area.<br />

2. Lucy System Architecture<br />

The Lucy LT engine is a renowned RBMT system that

follows a classical, transfer-based translation approach.<br />

The system first analyses the given source sentence<br />

resulting in a forest of several analysis trees. One of these<br />

trees is then selected (as “best” analysis) and transformed<br />

in the transfer phase into a tree structure from which the<br />

target text can be generated.<br />

It is clear that any errors that occur during the initial<br />


analysis phase proliferate and cause negative side effects<br />

on the quality of the resulting translation. As the analysis<br />

phase is of special importance, we describe it in more<br />

detail. The Lucy LT analysis consists of several phases:<br />

1) The input is tokenised with regard to the source

language lexicon.<br />

2) The resulting tokens then undergo a morphological<br />

analysis, which identifies possible combinations of<br />

allomorphs for a token.<br />

3) This leads to a chart which forms the basis for the<br />

actual parsing, using a head-driven strategy. Special<br />

treatment is performed for the analysis of multi-word<br />

expressions and also for verbal framing.<br />

At the end of the analysis, there is an extra phase named<br />

phrasal analysis that is called whenever the grammar was<br />

not able to construct a legal constituent from all the<br />

elements of the input. This happens in several different<br />

scenarios:<br />

• The input is ungrammatical according to the LT analysis grammar.
• The category of the derived constituent is not one of the allowed categories.
• A grammatical phenomenon in the source sentence is not covered.
• There are missing lexical entries for the input sentence.

During the phrasal analysis, the LT engine collects all<br />

partial trees and greedily constructs an overall<br />

interpretation of the chart. Based on our findings from<br />

experiments with the Lucy LT engine, phrasal analyses<br />

are performed for more than 40% of the sentences from<br />

our test sets and very often result in bad translations.<br />

Each resulting analysis tree, independent of whether it is<br />

a grammatical or phrasal analysis, is also assigned an<br />

integer score by the grammar. The tree with the highest<br />

score is then handed over to the transfer phase, thus<br />

pre-defining the final translation output.<br />


3. Adding Stochastic Analysis<br />

An initial, manual evaluation of the translation quality<br />

based on the tree selection of the analysis phase showed<br />

that there is potential for improvement. For this, we<br />

changed the RBMT system to produce translations for all<br />

its analysis trees and ranked them according to their<br />

quality. In many cases, one of the alternative trees would<br />

have led to a better translation.

Next to the assigned score, we examined the significance<br />

of two other features:<br />

1) The size of the analysis trees themselves, and<br />

2) The tree edit distance of each analysis candidate to a<br />

stochastic parse tree.<br />

An advantage of stochastic parsing lies in the fact that<br />

parsers from this class can deal very well even with<br />

ungrammatical or unknown input, which, as we have seen, is problematic for a rule-based parser. We decided to make

use of the Stanford Parser as described in<br />

(Klein & Manning, 2003), which uses an unlexicalised,<br />

probabilistic context-free grammar trained on the Penn<br />

Treebank. We parse the original source sentence with this<br />

PCFG grammar to get a stochastic parse tree that can be<br />

compared to the trees from the Lucy analysis forest.<br />

In our experiments, we compare the stochastic parse tree<br />

with the alternatives given by Lucy LT. Tree comparison<br />

is implemented based on the Tree Edit Distance, as<br />

originally defined in (Zhang & Shasha, 1989). In

analogy to the Word Edit or Levenshtein Distance, the<br />

distance between two trees is the number of editing<br />

actions that are required to transform the first tree into the<br />

second tree. The Tree Edit Distance knows three actions:<br />

• Insertion
• Deletion
• Renaming (substitution in Levenshtein Distance)

We use a normalised version of the Tree Edit Distance to<br />

estimate the quality of the trees from the Lucy analysis<br />

forest. The integration of the stochastic selection has<br />

been possible by using an adapted version of the<br />

rule-based system, which allowed performing the<br />

selection of the analysis tree from an external process.<br />
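The following Python sketch gives one possible reading of this selection step; it is not the authors' implementation. It assumes a tree_edit_distance() helper in the spirit of Zhang & Shasha (1989) and hypothetical candidate objects carrying a size and the LT-internal score, and it uses one plausible normalisation (by combined tree size), which the paper does not spell out.

```python
# Simplified sketch of selecting the Lucy analysis tree via a normalised tree
# edit distance to the stochastic parse; all object attributes are hypothetical.

def normalised_distance(candidate, stochastic_tree, tree_edit_distance):
    # One possible normalisation: divide the raw distance by the combined size.
    raw = tree_edit_distance(candidate, stochastic_tree)
    return raw / max(candidate.size + stochastic_tree.size, 1)

def select_tree(candidates, stochastic_tree, tree_edit_distance):
    # Prefer the smallest distance; break ties by the LT-internal score and,
    # as a later refinement, by preferring larger trees.
    def key(tree):
        return (normalised_distance(tree, stochastic_tree, tree_edit_distance),
                -tree.internal_score,
                -tree.size)
    return min(candidates, key=key)
```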

4. Experiments<br />

Two test sets were used in our experiments. The first test<br />

set was taken from the WMT shared task 2008, consisting<br />

of a section of data from Europarl (Koehn, 2005). The<br />

second test set, which was taken from the WMT shared<br />

task 2010 contained news text. Phrasal analyses caused<br />

by unknown lexical items occurred more often in the<br />

news text, as that text type tends to use colloquial expressions more often. In our experiments, we translated

from English→German; evaluation was performed using<br />

both automated metrics and human evaluation using an<br />

annotation tool similar to e.g. Appraise (Federmann,<br />

2010).



First, only the Tree Edit Distance and the internal score from the Lucy analysis phase were used, and we selected the tree with the lowest edit distance. If the lowest distance held for two or more trees, the tree with the highest LT-internal score was chosen. Later we added the size of the candidate

trees as an additional feature, with a bias to prefer larger<br />

trees as they proved to create better translations in our<br />

experiments. Results from automatic scoring using<br />

BLEU (Papineni et al., 2001) and the derived NIST score<br />

are reported in Table 1 and Table 2 for test set #1 and test<br />

set #2, respectively. The BLEU scores for the new<br />

translations are a little bit worse, but still comparable to<br />

the quality of the original translations. The difference is<br />

not statistically significant.<br />

Test set #1            BLEU     NIST
Baseline               0.1100   4.4059
Stochastic Selection   0.1096   4.3946

Table 1: Automatic scores for test set #1.

Test set #2            BLEU     NIST
Baseline               0.1529   5.5725
Stochastic Selection   0.1514   5.5469
Selection+Size         0.1511   5.5341

Table 2: Automatic scores for test set #2.

We also manually evaluated a sample of 100 sentences.<br />

For this, we created all possible translations for each<br />

phrasal analysis and had human annotators judge on their<br />

quality. Then, we checked whether our stochastic<br />

selection mechanism returned a tree that led to the best<br />

translation. In case it did not, we investigated the reasons<br />

for this. Sentences for which all trees created the same<br />

translation were skipped.<br />

Table 3 shows the error rate of our stochastic analysis<br />

component that chose the optimal tree for 56% of the<br />

sentences, while Table 4 shows the selection reasons that<br />

resulted in the selection of a non-optimal tree. We also<br />

see that the minimal tree edit distance seems to be a good<br />

feature to use for comparisons, as it holds for 71% of the<br />

trees, including those examples where the best tree was<br />

not scored highest by the LT engine. This also means that<br />

additional features for choosing the tree out of the group<br />

of trees with the minimal edit distance are required.<br />

Best translation? Yes (56%) No (44%)<br />

Minimal distance? Yes (71%) No (29%)<br />

Table 3: Error rate of the stochastic analysis.<br />

More than 50 tokens in source 36.4%<br />

Time-out before best tree is reached 29.5%<br />

Chosen tree had minimal distance 34.1%<br />

Table 4: Reasons for erroneous tree selection.<br />

Even for the 29% of sentences in which the optimal tree

was not chosen, little quality was lost: in 75.86% of those<br />

cases, the translations didn't change at all (obviously the<br />

trees resulted in equal translation output). In the<br />

remaining cases the translations were divided evenly<br />

between slight degradations and equal quality.<br />

In cases when the best tree was not chosen, the first tree<br />

(which is the default tree) was selected in 70.45% of the cases. This

is due to a combination of robustness factors that are<br />

implemented in the RBMT system and have been beyond<br />

our control in the experiments. The LT engine has several<br />

different indicators that may each throw a time-out<br />

exception, if, for example, the analysis phase takes too<br />

long to produce a result. To avoid getting time-out errors,<br />

only sentences with up to 50 tokens are treated by our<br />

stochastic selection mechanism. Additionally, the<br />

component itself checks the processing time and returns<br />

intermediate results, if this limit is reached. We are<br />

currently working on eliminating this time-out issue as it<br />

prevents us from driving our approach to its full potential.<br />

As with the internal score, we see that the Tree Edit<br />

Distance on its own is a good indicator of the quality of<br />

the analysis, but that additional features are required to<br />

prevent suboptimal decisions from being taken. As such, we

included the size of the trees. Here the bigger trees are<br />

preferred to smaller ones as experimental results have<br />

confirmed that these are more likely to produce better<br />

translations.<br />

The manual evaluation shows results that are similar to<br />

the automated metrics. We are currently investigating in<br />

more detail what happened in the cases of degradation in order to remedy this behaviour. It seems as if additional

features might be needed to more broadly improve the<br />

rule-based machine translation engine using our<br />

stochastic selection mechanism.<br />


5. Examples<br />

We now provide some examples from our experiments<br />

that illustrate how the stochastic selection mechanism

changed the translation output of the rule-based system.<br />

For example, the analysis of the following sentence is<br />

now correct:<br />

Source: “They were also protesting against bad pay<br />

conditions and alleged persecution.”<br />

Translation A: “Sie protestierten auch gegen schlechte<br />

Soldbedingungen und behaupteten Verfolgung.”<br />

Translation B: “Sie protestierten auch gegen schlechte<br />

Soldbedingungen und angebliche Verfolgung.”<br />

Translation A is the default translation. The analysis tree<br />

associated with this translation contains a node for the<br />

adjective “alleged” which is wrongly parsed as a verb.<br />

The next example shows how an incorrect word order<br />

problem is fixed:<br />

Source: “If the finance minister can't find the money<br />

elsewhere, the project will have to be aborted and<br />

sanctions will be imposed, warns Janota.”<br />

Translation A: “Wenn der Finanzminister das Geld nicht<br />

anderswo finden kann, das Projekt abgebrochen<br />

werden müssen wird und Sanktionen auferlegt<br />

werden werden, warnt Janota.”<br />

Translation B: “Wenn der Finanzminister das Geld nicht<br />

anderswo finden kann, wird das Projekt abgebrochen<br />

werden müssen und Sanktionen werden auferlegt<br />

werden, warnt Janota.”<br />

Lexical items are associated with a domain area in the<br />

lexicon of the rule-based system. Items that are contained<br />

within a different domain area than the input text are still<br />

accessible, but items in the same domain are preferred. In<br />

the following example, this leads to an incorrect<br />

disambiguation of multi-word expressions:<br />

Source: “Apparently the engine blew up in the rocket's<br />

third phase.”<br />

Translation A: “Offenbar blies der Motor hinauf die<br />

dritte Phase der Rakete in.”<br />

Translation B: “Offenbar flog der Motor in der dritten<br />

Phase der Rakete in die Luft.”<br />

Again, the stochastic selection allows choosing a better<br />

tree, which leads to the correct idiomatic translation.<br />

Something similar happens in the following case:<br />

Source: “As of January, they should be paid for by the<br />

insurance companies and not compulsory.”<br />

Translation A: “Ab Januar sollten sie für von den

Versicherungsgesellschaften und nicht obligatorisch<br />

bezahlt werden.”<br />

Translation B: “Ab Januar sollten sie von den<br />

Versicherungsgesellschaften und nicht obligatorisch<br />

gezahlt werden.”<br />

These changes remain at a rather local scope, but we also<br />

have observed instances where the sentence improves<br />

globally:<br />

Source: “In his new book, ‘After the Ice’, Alun Anderson,<br />

a former editor of New Scientist, offers a clear and<br />

chilling account of the science of the Arctic and a<br />

gripping glimpse of how the future may turn out<br />

there.”<br />

Translation A: “In seinem neuen Buch bietet Alun<br />

Anderson, ein früherer Redakteur von Neuem<br />

Wissenschaftler, ‘Nach dem Eis’ einen klaren und<br />

kalten Bericht über die Wissenschaft der Arktis und<br />

einen spannenden Blick davon an, wie die Zukunft<br />

sich hinaus dort drehen kann.”<br />

Translation B: “In seinem neuen Buch, ‘Nach dem Eis’,<br />

bietet Alun Anderson, ein früherer Redakteur von<br />

Neuem Wissenschaftler, einen klaren und kalten<br />

Bericht über die Wissenschaft der Arktis und einen<br />

spannenden Blick davon an, wie die Zukunft sich<br />

hinaus dort drehen kann.”<br />

In translation A, the name of the book, “After the Ice”,<br />

has been moved to an entirely different place in the<br />

sentence, removing it from its original context.<br />

6. Conclusion and Outlook<br />

The analysis phase proves to be crucial for the quality of<br />

the translation in rule-based machine translation systems.<br />

In this paper, we have shown that it is possible to improve<br />

the analysis results of such a rule-based engine by<br />

introducing a better selection method for the trees created<br />

by the grammar. Our experiments show that the selection<br />

itself is not a trivial task and requires fine-grained<br />

selection criteria.<br />

While the work reported on in this paper is a dedicated<br />

extension of a specific rule-based machine translation<br />

system, the overall approach can be used with any<br />

transfer-based RBMT system. Future work will<br />

concentrate on circumventing, e.g., the time-out errors that prevented better performance of the

stochastic selection mechanism. Also, we will more<br />

closely investigate the issue of decreased translation



quality and experiment with additional decision factors<br />

that may help to alleviate the negative effects.<br />

The addition of stochastic knowledge to an existing<br />

rule-based machine translation system represents an<br />

example of a successful, hybrid combination of different<br />

MT paradigms into a joint system.<br />

7. Acknowledgements<br />

This work was also supported by the EuroMatrixPlus<br />

project (IST-231720) that is funded by the European<br />

Community under the Seventh Framework Programme<br />

for Research and Technological Development.<br />

8. References<br />

Federmann, C. (2010). Appraise: An open-source toolkit<br />

for manual phrase-based evaluation of translations. In<br />

Proceedings of the Seventh conference on<br />

International Language Resources and Evaluation.<br />

European Language Resources Association (ELRA).<br />

Klein, D., Manning, C. D. (2003). Accurate unlexicalized<br />

parsing. In Proceedings of the 41st Annual Meeting of<br />

the ACL, pp. 423–430.<br />

Koehn, P. (2005). Europarl: A parallel corpus for<br />

statistical machine translation. In Proceedings of the<br />

MT Summit 2005.<br />

Papineni, K., Roukos, S., Ward, T., Zhu, W.-J. (2001).<br />

Bleu: a method for automatic evaluation of machine<br />

translation. IBM Research Report RC22176<br />

(W0109-022), IBM.<br />

Wolf, P., Alonso, J., Bernardi, U., Llorens, A. (2010).<br />

EuroMatrixPlus WP2.2: Study of Example- Based<br />

Modules for LT Transfer.<br />

Zhang, K., Shasha, D. (1989). Simple fast algorithms for<br />

the editing distance between trees and related problems.<br />

SIAM J. Comput., 18, pp. 1245–1262.<br />


Terminology extraction and term variation patterns:<br />

a study of French and German data<br />

Marion Weller a , Helena Blancafort b , Anita Gojun a , Ulrich Heid a<br />

a Institut für maschinelle Sprachverarbeitung, Universität Stuttgart

b Syllabs, Paris<br />

E-mail: {wellermn|gojunaa|heid}@ims.uni-stuttgart.de, blancafort@syllabs.com<br />

Abstract<br />

The terminology of many technical domains, especially new and evolving ones, is not fully fixed and shows considerable<br />

variation. The purpose of the work described in this paper is to capture term variation. For term extraction, we apply hand-crafted<br />

POS patterns on tagged corpora, and we use rules to relate morphological and syntactic variants. We discuss some French and<br />

German variation patterns, and we present first experimental results from our tools. It is not always easy to distinguish (near)<br />

synonyms from variants that have a slightly different meaning from the original term; we discuss ways of operating such a<br />

distinction. Our tools are based on POS tagging and an approximation of derivation and compounding; however, we also propose a<br />

non-symbolic, statistics-based line of development. We discuss general issues of evaluating variant detection and present a small-scale precision evaluation.

Keywords: terminology, term variation, comparable corpora, pattern-based term extraction, compound nouns<br />

1. Introduction<br />

The objective of the EU-funded project TTC 1<br />

(Terminology Extraction, Translation Tools and<br />

Comparable Corpora) is the extraction of terminology<br />

from comparable corpora. The tools under development<br />

within the project address the issues of compiling corpus<br />

collections, monolingual term extraction and the<br />

alignment of terms into pairs of multilingual<br />

equivalence candidates, as well as the management and<br />

the export of the resulting terminological data towards<br />

CAT and MT tools.<br />

Since parallel corpora of specialized domains are scarce<br />

and not necessarily available for a broad range of<br />

languages (TTC deals with English (EN), Spanish (ES),<br />

German (DE), French (FR), Latvian (LV), Russian<br />

(RU), Chinese (ZH)), comparable corpora are used<br />

instead: textual material from specialized domains is<br />

accessible for many languages, either on the Internet or<br />

in publications of companies.<br />

1 http://www.ttc-project.eu<br />

The research leading to these results has received funding from<br />

the European Community's Seventh Framework Programme<br />

(FP7/2007-2013) under Grant Agreement n. 248005.<br />

In technical domains which are rapidly evolving,<br />

documents published on the Internet are often the most<br />

recent sources of data. In such domains, terminology<br />

typically has not yet been standardized, and thus<br />

numerous variants co-exist in published documents.<br />

Tools which support the extraction, identification and<br />

interrelating of term variants are thus necessary to<br />

capture the full range of expressions used in the<br />

respective domain. End users may then decide (e.g. on<br />

the basis of variant frequency and sources of variants)<br />

which expression to prefer.<br />

A second, more technical motivation for term variant<br />

extraction is provided by the procedures for term<br />

alignment (either lexical or statistical strategies), for<br />

which data sparseness is a problem. In order to reduce<br />

the complexity of term alignment, TTC intends to gather<br />

monolingual variants into sets of related terms.<br />

Particularly for this application, we do not only allow<br />

for (quasi) synonyms, but also for variants with a slight<br />

difference in meaning as shown in 1.<br />

1) production d'électricité ↔ électricité produite<br />

(production of electricity ↔ produced electricity)<br />

Terms may be of different forms (single-word vs. multi-word terms) in different languages: this is a challenge


for term alignment. For example, compound nouns play<br />

an important role in German terminology, but have no<br />

equivalents of the same morpho-syntactic structure in<br />

many other languages. Grouping equivalent terms of<br />

different syntactic structures can help to deal with such<br />

cases, as illustrated in 2:<br />

2) Energieproduktion ↔ Produktion von Energie ↔<br />

production d'électricité<br />

(energy production ↔ production of energy)<br />

2. Methodology<br />

The steps required for term extraction and for variant<br />

identification follow a simple pipeline architecture: first,<br />

a corpus collection is compiled, which then undergoes<br />

linguistic pre-processing. Following these steps,<br />

monolingual term candidates are extracted. As not all<br />

extracted items are domain relevant, we apply statistical<br />

filtering. Since we intend to detect term variation on a<br />

morpho-syntactic level, this last step requires<br />

morphological processing in order to model<br />

derivational relationships between word classes.<br />

2.1. Compiling a corpus and pre-processing<br />

To collect corpus data, we use the focused Web crawler<br />

Babouk (de Groc, 2011), which has been developed

within the TTC project. Babouk starts with a set of seed<br />

terms or URLs given by the user which are combined<br />

into queries and submitted to a search engine. Babouk<br />

scores the relevance of the retrieved web pages using a<br />

weighted-lexicon-based thematic filter. Based on the<br />

content of relevant retrieved pages, the lexicon is<br />

extended and new search queries are combined.<br />

One objective of the TTC project is to rely on flat<br />

linguistic analysis that is available for all languages.<br />

One strand of research thus goes towards the<br />

development of knowledge-poor strategies, such as<br />

using a pseudo part-of-speech tagger (Clark, 2003) as a<br />

basis for probabilistic NP-extraction (Guégan & Loupy,<br />

<strong>2011</strong>). A knowledge-rich approach is term extraction<br />

based on hand-crafted part-of-speech (POS) patterns,<br />

which is the method we chose for the present work.<br />

Pre-processing of our data collection consists of<br />

tokenizing, POS-tagging and lemmatization using<br />

TreeTagger (Schmid, 1994). For efficiency reasons, with<br />

German and French being morphologically rich<br />


languages, we work with lemmas rather than inflected<br />

forms.<br />

2.2. Term candidate extraction and filtering<br />

Our main focus is on the extraction of nominal phrases<br />

such as [NN NN] or [NN PRP NN] constructions (cf.<br />

tables 2-5), but [V NN] collocations are also of interest 2.

For each language, we identify term candidates by using<br />

hand-crafted POS patterns. In contrast to nominal<br />

phrases, which are relatively easy to capture by POS<br />

patterns, the identification of [V NN] collocations is<br />

more challenging, as verbs and their object nouns do not<br />

necessarily occur in adjacent positions, depending on the<br />

general structure of the sentence. This applies<br />

particularly to German where constituent order is rather<br />

flexible and allows for long distances between verbs and<br />

their objects.<br />
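As a rough illustration of such pattern matching (not the project's actual code), the sketch below scans a lemmatised, POS-tagged sentence for a few hand-crafted patterns; the tag names are simplified placeholders rather than the real tagset.

```python
# Sketch of hand-crafted POS-pattern matching over a tagged, lemmatised sentence;
# the tag names are simplified placeholders, not the actual TreeTagger tagset.

PATTERNS = [
    ("NN", "NN"),          # noun + noun
    ("ADJ", "NN"),         # adjective + noun
    ("NN", "PRP", "NN"),   # noun + preposition + noun
]

def extract_candidates(sentence):
    """sentence: list of (lemma, pos) pairs; yields matched lemma sequences."""
    lemmas = [lemma for lemma, _ in sentence]
    tags = [pos for _, pos in sentence]
    for pattern in PATTERNS:
        n = len(pattern)
        for i in range(len(sentence) - n + 1):
            if tuple(tags[i:i + n]) == pattern:
                yield " ".join(lemmas[i:i + n])

# Example with an invented tagged fragment:
print(list(extract_candidates([("production", "NN"), ("de", "PRP"), ("chaleur", "NN")])))
```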

In order to reduce the extracted term candidates to a set<br />

of domain-relevant items, we estimate their domain<br />

specificity by comparing them with terms extracted<br />

from general language corpora (Ahmad et al, 1992). The<br />

underlying idea of this procedure is the assumption that<br />

terms which occur in both domain-specific and general<br />

language corpora are not domain-relevant, whereas<br />

terms occurring only or predominantly in the domainspecific<br />

data can be considered as specialized terms. We<br />

use the quotient q of a term's relative frequency in the<br />

specialized data and in the general language corpus as<br />

an indicator for its domain relevance (see table 1).<br />

term candidate                  f domain   f general   q
Gleichstrom (direct current)    128        4           22362,7
Jahr (year)                     2157       221.213     1,2

Table 1: Domain-specific vs. general language
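A minimal sketch of this quotient, with invented corpus sizes and a smoothing constant that the paper does not specify, could look as follows.

```python
# Sketch of the domain-specificity quotient q: relative frequency in the domain
# corpus divided by relative frequency in a general-language corpus.
# Corpus sizes and the smoothing constant below are invented for illustration.

def specificity(term, domain_freq, domain_size, general_freq, general_size):
    rel_domain = domain_freq.get(term, 0) / domain_size
    # Smooth unseen general-language terms so domain-only terms get a large q.
    rel_general = max(general_freq.get(term, 0), 0.5) / general_size
    return rel_domain / rel_general

domain_freq = {"Gleichstrom": 128, "Jahr": 2157}
general_freq = {"Gleichstrom": 4, "Jahr": 221213}
print(specificity("Gleichstrom", domain_freq, 1_290_000, general_freq, 200_000_000))
```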

2.3. Term variation<br />

In TTC we define a term variant as “an utterance which<br />

is semantically and conceptually related to an original<br />

term” (Daille, 2005). Thus, term variants are bound to<br />

texts (“utterance”) and require the presence of an<br />

“original term” identified e.g. by means of a morpho-syntactic term pattern.

2 NN: noun, PRP: preposition, V: verb, VPART: participle


Multilingual Resources and Multilingual Applications - Regular Papers<br />

The relationship between term variant and original term<br />

is supposed to mainly be one of (quasi-) synonymy or of<br />

controlled modification (e.g. by attributive adjectives,<br />

NPs or PPs). We formalize this by explicitly classifying<br />

relationships between patterns.<br />

We distinguish the following types of variants:<br />

• graphical: air flow ↔ airflow
• morphological (derivation, compounding): Energieproduktion ↔ Produktion von Energie (production of energy); solare Energie ↔ Solarenergie (solar energy)
• paradigmatic, e.g. omissions: les énergies renouvelables ↔ les renouvelables (the renewable energies ↔ the renewables)
• abbreviations, acronyms: Windenergieanlage ↔ WEA (wind energy plant)
• syntactic variants 3: consommation d’énergie ↔ consommation annuelle d’énergie (energy consumption ↔ yearly energy consumption)

Assuming that German technical texts contain many<br />

domain-specific compounds, we focus in this work on<br />

compound nouns and their variant [NN PRP NN] as<br />

illustrated above (morphological variants).<br />

For French, we choose a similar pattern [NN de NN] ↔<br />

[NN VPART]. In our current work, we restrict this<br />

pattern to nouns ending in -tion. The addition of French<br />

morphology tools is planned to widen the scope of these<br />

patterns.<br />

2.4. Morphological processing<br />

In order to identify morphological variants of German<br />

compounds, we need to split compounds into their<br />

components: in the present work, we opt for a statistical<br />

compound splitter; the implementation is based on<br />

(Koehn & Knight, 2003).<br />

Searching for the most probable split of a given word,<br />

the basic idea is that the components of a compound also<br />

appear as single words and consequently should occur in<br />

corpus data. A word frequency list serves as training<br />

data, supplemented with a hand-crafted set of rules to<br />

model transitional elements, such as the s in<br />

Produktions|kosten (production costs).<br />
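The following simplified sketch illustrates the idea for two-part splits only; the frequency list, the list of linking elements and the length threshold are illustrative assumptions, not the actual implementation.

```python
# Simplified sketch of a frequency-based compound splitter in the spirit of
# Koehn & Knight (2003): try two-part splits, allow a hand-crafted linking
# element (such as the "s" in Produktions|kosten), and keep the split whose
# parts have the highest geometric mean of corpus frequencies.
from math import sqrt

LINKERS = ("s", "es", "n", "en")

def split_compound(word, freq, min_len=4):
    best, best_score = [word], float(freq.get(word, 0))
    for i in range(min_len, len(word) - min_len + 1):
        modifier, head = word[:i], word[i:]
        variants = [modifier] + [modifier[:-len(l)] for l in LINKERS if modifier.endswith(l)]
        for mod in variants:
            score = sqrt(freq.get(mod, 0) * freq.get(head.capitalize(), 0))
            if score > best_score:
                best, best_score = [mod, head.capitalize()], score
    return best

freq = {"Produktionskosten": 25, "Produktion": 410, "Kosten": 380}
print(split_compound("Produktionskosten", freq))  # -> ['Produktion', 'Kosten']
```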

3 This last type of variants is not necessarily synonymous with<br />

the original term.<br />

For French, we created a set of rules to model the<br />

relationship between nouns ending in -tion and the<br />

respective verbs:<br />

� production → produire (production → produce)<br />

� évolution → évoluer (evolution → evolve)<br />

� condition → conditionner (condition → condition)<br />

� protection → protéger (protection → protect)<br />

Similar rules can be formulated, e.g. for nouns ending in<br />

-ment or -eur, e.g. chargement (nominalized action) →<br />

charger (verb), as well as convertisseur (nominalized<br />

tool name) → convertir (verb). Similarly, terms<br />

containing adjectives ending in -able, such as utilisable<br />

→ utiliser (cf. table 5) or relational adjectives<br />

(prototypique → prototype) are under study. A further<br />

type of pattern that could be added are rules to handle<br />

prefixation (e.g. anti-corrosion → corrosion).<br />
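A toy version of such rules, with an exception lexicon and a default suffix rule that are both invented for illustration, might look like this.

```python
# Illustrative sketch of relating French nouns in -tion to verb lemmas: a small
# exception lexicon plus one default rewrite rule (both invented here).
EXCEPTIONS = {
    "production": "produire",
    "évolution": "évoluer",
    "protection": "protéger",
}

def tion_noun_to_verb(noun):
    """Map a noun ending in -tion to a verb lemma, or None if no rule applies."""
    if noun in EXCEPTIONS:
        return EXCEPTIONS[noun]
    if noun.endswith("tion"):
        return noun[:-4] + "tionner"   # default: condition -> conditionner
    return None

print(tion_noun_to_verb("condition"), tion_noun_to_verb("production"))
```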

2.5. Processing formally related items<br />

A very common form of graphic variation is<br />

hyphenation, e.g. Luftwärmepumpe vs. Luft-Wärmepumpe<br />

(air-source heat pump). This type of variation is<br />

dealt with by the splitting program, which uses hyphens as splitting points. Hyphenated and non-hyphenated forms are treated as one term.

To a certain extent, our variant detection tools also deal<br />

with alternating transitional elements (Kraftwerkbetrieb<br />

vs. Kraftwerksbetrieb). This is modeled by hand-crafted<br />

rules which allow for several realizations. Additionally,<br />

there are relatively regular forms of spelling variation,<br />

e.g. the new/old orthography in German, resulting in<br />

e.g. ph/f variation. This can be dealt with either by rules<br />

or using a method based on string-distance.<br />

3. Experiments and examples of results<br />

Our experiments are based on comparable corpora<br />

crawled from the Web. While they are generally easy to<br />

obtain with a focused crawler, such corpora might be<br />

inhomogeneous with respect to domain coverage or<br />

types of sources. When working with several languages,<br />

the degree of comparability may also vary.<br />

We use a collection of 1000 documents each for French<br />

and German, with a total size of 1.55 M tokens (FR) and<br />

1.29 M tokens (DE) of the domain of wind energy.<br />

When looking at the extracted German data, we find that<br />


Nutzenergie useful energy 89<br />

nutzbar Energie usable energy 24<br />

genutzt Energie used energy 5<br />

nutzbar Energieform usable energy form 9<br />

genutzt Energieform used energy form 4<br />

nutzbar Energiegehalt usable energy content 3<br />

Nutzenergie-Anteil proportion of useful 1<br />

120<br />

Abgabe von Wärme 1 Wärmeabgabe 18 release of warmth<br />

Beleuchtung von Straße 1 Straßenbeleuchtung 49 street lighting<br />

Erzeugung von Strom 32 Stromerzeugung 569 power generation<br />

Produktion von Strom 4 Stromproduktion 72 power production<br />

Speicherung von Energie 7 Energiespeicherung 37 energy storage<br />

Verbrauch an Primärenergie 1 Primärenergieverbrauch 114 primary energy consumption<br />

Versorgung mit Fernwärme 2 Fernwärmeversorgung 13 district heating<br />

Nutzung von Biomasse 8 Biomassenutzung 7 biomass utilization<br />

energy<br />

nutzbar Energiemenge usable amount of<br />

energy<br />

Table 4: Variants of the compound Nutzenergie.<br />

the realization of a term as a compound is often more<br />

frequent than the alternative structures [NN PRP NN]<br />

or [NN ARTgen NNgen], as illustrated in table 2. This<br />

does not only apply to common words like Strom-<br />

erzeugung (power generation), but also to comparative-<br />

ly long and more complex words like Fernwärmever-<br />

sorgung (lit. long-distance heat supply: district heating).<br />

We consider this as evidence that the respective<br />

compound nouns are established as terms in the domain<br />

or even in general language. The degree of preference<br />

varies, up to the point of there not being an alternative<br />

realization, as is the case with Windgeschwindigkeit<br />

(wind speed, freq=149), for which one could imagine a<br />

construction like *Geschwindigkeit des Windes (speed<br />

of the wind), which does not occur in our corpus.<br />

In contrast to the German structures, the French terms<br />

Table 2: Prepositional phrases vs. compound nouns<br />

consommation d'électricité electricity consumption 28 électricité consommée consumed electricity 15<br />

consommation d'énergie energy consumption 66 énergie consommée consumed energy 22<br />

importation de pétrole import of petroleum 9 pétrole importé imported petroleum 1<br />

production d'électricité electricity production 225 électricité produite produced electricity 95<br />

production de chaleur heat production 26 chaleur produite produced heat 21<br />

installation d'éolienne wind turbine installation 5 éolienne installée installed wind turbine 16<br />

installation de puissance installation of power 1 puissance installée installed power 69<br />

utilisation d'énergie use of energy 5 énergie utilisée used energy 19<br />

Table 3: Related French terms: prepositional phrases vs. noun-participle constructions.<br />

1<br />

énergie utilisée used energy 19<br />

énergie utile useful energy 14<br />

énergie utilisable usable energy 14<br />

forme d'énergie utile useful energy form 2<br />

form d'énergie form of useable 2<br />

utilisable<br />

energy<br />

source d'énergie source of usable 1<br />

utilisable<br />

energy<br />

Table 5: Different combinations of the components<br />

energie and utile.<br />

of the pattern pair 4 [NN de NN] ↔ [NN VPART] in<br />

table 3 are not (near) synonyms, but could rather be<br />

considered as related. While some terms seem to prefer<br />

one of the two patterns, the overall tendency for<br />

preference is less clear than for the German examples.<br />

The difference in meaning (i.e. action vs. situation) does<br />

not allow for full interchangeability of related terms, and<br />

the use of the different forms of realization is context<br />

dependent. Some terms from the pairs contained in<br />

table 3 have different meanings, as is the case with<br />

puissance installée vs. installations de puissance élevée<br />

in example (3).<br />

3) Par contre, le coût et la complexité des installations les réservent le plus souvent à des installations de puissance élevée pour bénéficier d’économies d’échelle.
However, due to the cost and complexity of the installations, they are mostly restricted to installations of high power in order to benefit from the scaling effects.

4 Note that the extracted lemma of the participle is its infinitive; we show the inflected form for better readability, i.e. consommée instead of consommer.

In other cases, grammatical and/or stylistic constraints<br />

may lead authors to use one variant rather than another.<br />

For example, compounds in enumerations tend to be split in order to facilitate the combination with other

nouns, e.g. Meeresboden vs. Boden von Meeren in<br />

example (4).<br />

4) Methanhydrat bildet sich am Boden von Meeren bzw. tiefen Seen<br />

Methane hydrate develops at the ground of the sea or deep lakes<br />

In table 4, we show examples of variants in a wider<br />

sense: starting with the compound Nutzenergie (useful<br />

energy), we find the synonym nutzbare Energie (usable<br />

energy) and the related form genutzte Energie (used<br />

energy). In the entries in the lower part of the table (grey<br />

background), the component Energie is part of a<br />

compound noun while still preserving the (basic)<br />

meaning of the term Nutzenergie (useful energy).<br />

The French examples in table 5 correspond to the<br />

German ones (table 4), with related terms consisting of<br />

the basic components in the upper part of the table, and<br />

terms expanded by an additional component in the lower<br />

part of the table (gray background). The forms nutzbar<br />

and utilisable (usable) in tables 4 and 5 illustrate one of the above-mentioned variation patterns for adjectives.

4. Evaluation and discussion<br />

4.1. Issues in measuring precision and recall<br />

While it is relatively easy to measure the precision of<br />

identified (near) synonyms (such as the compound ↔<br />

[NN PRP NN] pairs), it is comparatively difficult to<br />

determine the precision of related terms like the ones in<br />

tables 4 and 5, as it is often difficult to decide on the<br />

degree of relatedness.<br />

Even more difficult is the evaluation of recall, which<br />

largely depends on the set of term variation patterns, but<br />

also on the patterns used for term candidate extraction.<br />

In order to avoid noise, term candidate extraction is<br />

restricted to productive patterns; this implies that not all<br />

term variants might be extracted and consequently, that<br />

some may not be available for variant grouping. The<br />

same applies to the set of rules used to group variants.<br />

For example, the French pattern [NN PRP NN] is<br />

restricted to the prepositions de and à. While there might<br />

be valid terms containing other prepositions, they are<br />

excluded from being extracted. Similarly, the large<br />

number of potential paraphrases of German compounds<br />

cannot be captured.<br />
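As an illustration of how such a pattern restriction operates (a minimal sketch of our own, not the extraction component used in this work; the tagset labels and the toy input are assumptions), [NN PRP NN] candidates limited to de and à could be collected from POS-tagged tokens as follows:

    # Minimal sketch: collect French [NN PRP NN] term candidates, restricted
    # to the prepositions "de" and "à" (hypothetical tagset and toy input).
    ALLOWED_PREPS = {"de", "d'", "à"}

    def nn_prp_nn_candidates(tagged_tokens):
        """tagged_tokens: list of (word, pos) pairs, e.g. ("énergie", "NN")."""
        candidates = []
        for i in range(len(tagged_tokens) - 2):
            (w1, t1), (w2, t2), (w3, t3) = tagged_tokens[i:i + 3]
            if t1 == "NN" and t2 == "PRP" and t3 == "NN" and w2.lower() in ALLOWED_PREPS:
                candidates.append(f"{w1} {w2} {w3}")
        return candidates

    tagged = [("consommation", "NN"), ("de", "PRP"), ("chaleur", "NN"),
              ("production", "NN"), ("par", "PRP"), ("pays", "NN")]
    print(nn_prp_nn_candidates(tagged))  # ['consommation de chaleur']; "par" is excluded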

The examples in tables 4 and 5 illustrate the wide range<br />

of possible types of variation and thus the difficulty of

capturing and relating the different types of variation. In

addition to the problem of pattern coverage, another<br />

factor is the quality of the morphological tools used to<br />

model the relationship between word classes.<br />

4.2. Evaluation of precision<br />

In a small experiment, we measured the precision of the

variants proposed for the 100 most frequent German

compound nouns: 74 of the variants are valid. Most of

the 26 invalid variants are due to bad PP-attachment, as<br />

illustrated by the following example:<br />

5) Stromkunde (energy customer) → *Kunde mit<br />

Strom (customer with energy)<br />

which is part of the verbal phrase Kunden mit Strom<br />

versorgen (supply customers with energy). This kind of

error can rather be considered a problem of the<br />

extraction step than of the variant detection.<br />

However, in the examined set of 100 items, there was<br />

one term-variant pair whose derivation is technically<br />

correct, but the meaning is not related:<br />

6) Grundwasser (ground water) → Wasser am Grund<br />

eines Sees (water on the ground of a lake)<br />

4.3. Symbolic vs. non-symbolic approach<br />

By relying on a fixed set of rules for extraction, we<br />

clearly favour precision at the cost of recall.<br />

In order to extract terms without a set of patterns, we<br />

present a knowledge-poor approach for term extraction<br />

using a probabilistic NP extractor and string-level term<br />

variation detection. First, we apply a probabilistic NP<br />

extractor trained on a small corpus annotated manually<br />

with NPs (300 to 600 sentences): this tool has been<br />

described in Guégan & Loupy (2011) for the extraction

of NP chunks and uses a pseudo part-of-speech tagger<br />

(Clark, 2003).<br />

A further non-symbolic procedure consists in relating<br />


extracted terms without relying on a predefined set of<br />

variation patterns. We experimented with comparing<br />

NPs on a string level (using the Levenshtein distance ratio)

and grouping terms by similarity. The resulting term<br />

groups also provide a basis for the automatic derivation<br />

of term variation patterns, which can be used as an input<br />

to the symbolic method.<br />
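A minimal sketch of such string-level grouping (our illustration; the similarity threshold and the greedy grouping strategy are assumptions rather than the exact procedure used in the experiments) could look like this:

    # Sketch of string-level variant grouping via a Levenshtein distance ratio.
    def levenshtein(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def ratio(a: str, b: str) -> float:
        return 1.0 if not (a or b) else 1.0 - levenshtein(a, b) / max(len(a), len(b))

    def group_variants(terms, threshold=0.8):
        # Greedy grouping: a term joins the first group containing a similar term.
        groups = []
        for term in terms:
            for group in groups:
                if any(ratio(term, t) >= threshold for t in group):
                    group.append(term)
                    break
            else:
                groups.append([term])
        return groups

    print(group_variants(["énergie utile", "énergie utilisée", "consommation annuelle"]))
    # [['énergie utile', 'énergie utilisée'], ['consommation annuelle']]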

4.4. Relatedness of term candidates<br />

Using a predefined set of term variation patterns<br />

facilitates the decision whether terms are (near)<br />

synonyms or related. As synonyms, we consider for<br />

example the type [compound noun] ↔ [NN PRP NN].<br />

Structures involving relational adjectives ([ADJ NN]<br />

(DE), [NN ADJ] (FR)) can be expressed by

prepositional phrases, e.g. production énergétique ↔<br />

production d'énergie (energy production ↔ production<br />

of energy).<br />

Similarly, patterns can also help to specify the degree of<br />

relatedness: by explicitly formulating term variation<br />

rules we can differentiate between merely related terms<br />

(e.g. consumption vs. annual consumption) and term<br />

variants where we assume quasi synonymy (cf.<br />

compound nouns in table 2).<br />

A difficult task is the identification of (neoclassical)<br />

synonyms: without additional information (e.g. a<br />

dictionary), it is impossible to relate terms like<br />

Sonnenenergie ↔ Solarenergie (solar energy), as the<br />

relation between Sonne and solar is not known to the<br />

system and cannot be derived by morphological means.<br />

While the terms in the example above are synonyms,<br />

there can be some slight difference in meaning between<br />

neoclassical compounds and their native form: the term<br />

hydroélectricité (hydroelectricity) is more precise than<br />

énergie de l'eau (water energy), and not necessarily a<br />

synonym.<br />


5. Conclusion and next steps<br />

We presented a method for terminology extraction and<br />

for the identification of a certain type of term variation.<br />

Preliminary results show that there are preferences for a<br />

certain type of realization, especially when considering<br />

German compound nouns.<br />

Since our current work only deals with a small part of<br />

variation possibilities, we intend to enlarge our<br />

inventory by exploring more variation patterns. We<br />

particularly plan to include high-quality morphological<br />

tools, e.g. SMOR (Schmid et al., 2004) for German, and<br />

DériF (Namer, 2009) for French. SMOR has proven to<br />

outperform our statistical splitter.<br />

Another strand of research is the exploration of term<br />

variation across languages, e.g. relations between term<br />

variants that are similar within different language pairs.<br />

References<br />

Ahmad, K., Davies, A., Fulford, H., Rogers, M. (1992):

What is a Term? The semi-automatic extraction of<br />

terms from text. In Translation Studies - an<br />

Interdiscipline. John Benjamins Publishing Company.<br />

Clark, A. (2003): Combining distributional and<br />

morphological information for part of speech<br />

induction. In Proceedings of the 10th conference of<br />

the European chapter of the Association for<br />

Computational Linguistics. Budapest, Hungary.<br />

Daille, B. (2005): Variants and application-oriented<br />

terminology engineering. In Terminology, volume 1.

Guégan, M., de Loupy, C. (2011): Knowledge-Poor

Approach to Shallow Parsing: Contribution of<br />

Unsupervised Part-of-Speech Induction. RANLP 2011

- Recent Advances in Natural Language Processing.<br />

de Groc, C. (2011): Babouk: Focused web crawling for

corpus compilation and automatic terminology<br />

extraction. In Proceedings of the IEEE/WIC/ACM<br />

International Conferences on Web Intelligence and<br />

Intelligent Agent Technology. Lyon, France.<br />

Koehn, P., Knight, K. (2003): Empirical Methods for

Compound Splitting. In Proceedings of the 10th<br />

conference of the European chapter of the Association<br />

for Computational Linguistics. Budapest, Hungary.<br />

Namer, F. (2009): Morphologie, Lexique et Traitement<br />

Automatique des Langues - Le système DériF.<br />

Hermès – Lavoisier Publishers.<br />

Schmid, H. (1994): Probabilistic part-of-speech tagging<br />

using decision trees. In Proceedings of the<br />

international conference on new methods in language<br />

processing. Manchester, UK.<br />

Schmid, H., Fitschen, A., Heid, U. (2004): SMOR: A

German computational morphology covering<br />

derivation, composition and inflection. In Proceedings<br />

of LREC '04. Lisbon, Portugal.



Ansätze zur Verbesserung der Retrieval-Leistung<br />

kommerzieller Translation-Memory-Systeme<br />

Dino Azzano a , Uwe Reinke b , Melanie Sauer b<br />

a itl AG, b Fachhochschule Köln<br />

a Elsenheimerstr. 65, 80687 München<br />

b Gustav-Heinemann-Ufer 54, 50968 Köln

E-mail: dino.azzano@gmail.com, uwe.reinke@fh-koeln.de, melanie.sauer@fh-koeln.de<br />

Abstract<br />

Translation-Memory-Systeme (TM-Systeme) zählen zweifelsohne zu den wichtigsten und am weitesten verbreiteten Werkzeugen<br />

der computergestützten Übersetzung. Kommerzielle Systeme sind inzwischen seit über zwei Jahrzehnten auf dem Markt. Im Hin-<br />

blick auf die Erkennung semantischer Ähnlichkeiten wurde ihre Retrieval-Leistung bislang jedoch nicht entscheidend verbessert.<br />

Demgegenüber stellt die Computerlinguistik seit langem robuste Verfahren bereit, die zu diesem Zweck sinnvoll eingesetzt werden<br />

könnten. Ausgehend von den derzeitigen Grenzen der Retrieval-Leistung kommerzieller TM-Systeme zeigt der vorliegende Beitrag<br />

mögliche Ansätze zur Retrieval-Optimierung auf. Dabei wird zwischen Ansätzen mit und ohne Nutzung von linguistischem Wissen<br />

unterschieden. So genannte platzierbare und lokalisierbare Elemente können ohne linguistisches Wissen effizient behandelt werden.<br />

Im Gegensatz zu normalem Fließtext sind diese Elemente grundsätzlich eindeutig erkennbar und bleiben in der Übersetzung unverändert<br />

oder sie werden gemäß vorgegebenen Regeln angepasst. Die Erkennung mancher Elemente kann durch reguläre Ausdrücke<br />

erzielt werden und – basierend auf der Erkennung – verbessert eine optimierte Ähnlichkeitsberechnung das Retrieval der Segmente,<br />

in denen die Elemente vorkommen. Für die Optimierung des Retrievals von Paraphrasen und Subsegmenten (Phrasen,<br />

Teilsätzen) sowie für eine Verbesserung der Terminologieerkennung sind demgegenüber linguistische Verfahren erforderlich. Im

Rahmen eines Forschungsprojekts wird an der Fachhochschule Köln derzeit versucht, vorhandene computerlinguistische Verfahren<br />

in kommerzielle TM-Systeme zu integrieren.<br />

Keywords: computergestützte Übersetzung, Translation-Memory-Systeme, Retrieval-Optimierung, Fuzzy-Matching, platzierbare<br />

und lokalisierbare Elemente<br />

1. Translation-Memory-Systeme<br />

TM-Systeme sind Software-Applikationen, die den Übersetzungsprozess<br />

unterstützen und seit Jahren für alle am

Übersetzungsprozess Beteiligten ein wichtiges computergestütztes<br />

Werkzeug darstellen. Ihr Hauptzweck ist die<br />

Wiederverwendung bereits übersetzten Textmaterials<br />

(Trujillo, 1999; Reinke, 2005). Unter den professionellen<br />

Übersetzern arbeitet die Mehrheit regelmäßig mit einem<br />

oder mehreren TM-Systemen (Massion, 2005; Lagoudaki,<br />

2006). Zu den bekanntesten kommerziellen Produkten<br />

zählen Across, Déjà Vu, memoQ, MultiTrans,<br />

SDL Trados, Similis, Transit und Wordfast. Als nicht<br />

kommerzielles Produkt sei Omega-T erwähnt.<br />

Kernstück eines TM-Systems ist das Translation-Memory<br />

(TM), eine Datenbank oder eine Kollektion<br />

von Dateien, welche Einzelsegmente – die in der Regel<br />

einem Satz entsprechen – in der Ausgangssprache sowie<br />

in mindestens einer Zielsprache enthält. Zwischen den<br />

ausgangssprachlichen und den zielsprachlichen Einträgen<br />

besteht eine feste Zuordnung. TMs stellen daher<br />

alignierte parallele Textkorpora dar, die Metainformationen<br />

(wie Anlagedatum, Erzeuger usw.) enthalten, aber<br />

nicht linguistisch annotiert sind (Kenning, 2010; Zinsmeister,<br />

2010). Weitere Komponenten von TM-Systemen<br />

wie Terminologiedatenbank, Editor, Filter zur Konvertierung<br />

von Dateiformaten sowie Projektmanagementwerkzeuge<br />

seien hier nur aus Gründen der Vollständigkeit<br />

erwähnt.<br />

TM-Systeme generieren keine eigenen Texte. Sie sind<br />

daher klar von Systemen zur maschinellen Übersetzung<br />

(MÜ) zu unterscheiden, wobei hybride Lösungen exis-<br />


tieren, die TM und MÜ integrieren. Kernaufgabe eines<br />

TM-Systems ist das Nachschlagen und Auffinden von<br />

Treffern im TM (Reinke, 2004; Jekat & Volk, 2010). Ein<br />

TM-System ist somit in erster Linie ein (monolinguales)<br />

Information-Retrieval-System. Die Suche erfolgt<br />

zunächst auf Segmentebene. Während der Übersetzung<br />

wird der zu bearbeitende ausgangssprachliche Text<br />

segmentweise mit dem TM verglichen. Wird ein ausgangssprachlicher<br />

Treffer gefunden, kann die zugeordnete<br />

zielsprachliche Entsprechung zur Weiterverarbeitung<br />

verwendet werden. Dabei ist die Suche unscharf,<br />

so dass auch ähnliche Segmente (Fuzzy-Treffer) gefunden<br />

werden können (Sikes, 2007). Die Ähnlichkeit<br />

zwischen Suchanfrage und Treffer wird durch einen<br />

Prozentwert quantifiziert. Beim Übersetzen werden dem<br />

TM die neu erstellten Segmentpaare hinzugefügt, so dass<br />

dessen Umfang kontinuierlich zunimmt.<br />

Moderne TM-Systeme bieten darüber hinaus Funktionen,<br />

um die Suche ggf. auf die Subsegmentebene auszudehnen.<br />

Hierbei werden ausgangs- und zielsprachliche Subsegmente<br />

einander mit Hilfe statistischer Verfahren zugeordnet<br />

(Macken, 2009; Chama, 2010) und während der<br />

Übersetzung vorgeschlagen (z.B. mit Hilfe einer Autovervollständigen-Funktion).<br />

Obwohl TM-Systeme inzwischen seit über zwei Jahrzehnten<br />

auf dem Markt sind, wurde ihre Retrieval-Leistung<br />

auf Segmentebene bislang qualitativ und<br />

quantitativ nicht entscheidend verbessert. Selbst Ansätze,<br />

die ohne linguistisches Wissen auskommen und somit auf<br />

recht einfache Weise Verbesserungen erzielen könnten,<br />

haben in kommerziellen TM-Systemen bislang wenig<br />

Beachtung gefunden. Diese werden zunächst im 2. Abschnitt<br />

des Beitrags behandelt. Der 3. Abschnitt geht<br />

dann auf Möglichkeiten zur linguistischen Optimierung<br />

der Retrieval-Leistung ein.<br />


2. Retrieval-Optimierung<br />

ohne linguistisches Wissen<br />

Bei den bisherigen Evaluierungen der Retrieval-Leistung<br />

von TM-Systemen wurde der Schwerpunkt auf den<br />

Fließtext gelegt (Reinke, 2004; Sikes, 2007; Baldwin,<br />

2010). Das ist berechtigt, birgt jedoch die Gefahr, andere<br />

Textelemente, so genannte platzierbare und lokalisierbare<br />

Elemente, außer Acht zu lassen, die im Übersetzungsprozess<br />

eine beachtliche Rolle spielen.<br />

2.1. Platzierbare und lokalisierbare Elemente<br />

Platzierbare Elemente wie Tags, Inline-Grafiken und<br />

Felder bestehen nicht oder nur teilweise aus reinem Text<br />

und können häufig unverändert in den Zieltext übernommen<br />

werden. Tags sind Auszeichnungselemente in<br />

HTML- und XML-Dateien. XML-Formate haben in den<br />

letzten Jahren im Bereich der technischen Dokumentation<br />

– auch als Austauschformat – stark an Bedeutung<br />

gewonnen (Reinke, 2008; Anastasiou, 2010; Pelster,<br />

<strong>2011</strong>) und spielen daher auch im Übersetzungsprozess<br />

eine wichtige Rolle. Inline-Grafiken und Felder sind<br />

typische Elemente in Desktop-Publishing-Formaten sowie<br />

in Formaten aus MS Word 1<br />

, die im Alltag der meisten<br />

Übersetzer von zentraler Bedeutung sind (Lagoudaki,<br />

2006:12).<br />

Lokalisierbare Elemente wie Zahlen, Datumsangaben,<br />

Eigennamen mit eindeutiger Oberflächenstruktur, URLs<br />

und E-Mail-Adressen sind hingegen Elemente aus reinem<br />

Text, die meist ohne linguistisches Wissen erkennbar<br />

sind und deren Lokalisierung – im Unterschied zum<br />

normalen Fließtext – vorgegebenen Regeln obliegt und<br />

häufig keine Auswirkung auf den restlichen Text hat.<br />

2.2. Untersuchung<br />

Im Rahmen einer Promotionsarbeit (Azzano, 2011) wur-

de untersucht, welchen Einfluss platzierbare und lokali-<br />

sierbare Elemente auf das Retrieval kommerzieller<br />

TM-Systeme ausüben. Zu diesem Zweck wurden acht<br />

kommerzielle TM-Systeme verglichen: Across, Déjà Vu,<br />

Heartsome, memoQ, MultiTrans, SDL Trados, Transit<br />

und Wordfast. Aus unterschiedlichen Korpora wurden<br />

Segmente extrahiert, in denen platzierbare und lokalisierbare<br />

Elemente vorkamen, wobei möglichst viele<br />

Variationsmuster berücksichtigt wurden. Diese Segmente<br />

wurden anschließend mit den TM-Systemen bearbeitet,<br />

um die Erkennung der Elemente sowie die vorgeschlagenen<br />

Ähnlichkeitswerte zu prüfen. 2<br />

Die Hauptergebnisse<br />

der vergleichenden Analyse werden im Folgenden<br />

zusammengefasst.

1 Mit DOCX verwendet MS Word zwar ein XML-basiertes<br />

Format, seine Betrachtung und Bearbeitung im Übersetzungsprozess<br />

unterscheidet sich jedoch von üblichen XML-Dokumenten.<br />

2 Eine nähere Erläuterung der Testmethoden und Testdaten ist<br />

in diesem Beitrag aus Platzgründen nicht möglich; siehe Azzano<br />

(2011) für weitere Einzelheiten.



2.2.1. Recall<br />

Prinzipiell sollten in einem TM gespeicherte ausgangs-<br />

sprachliche Segmente auch dann gefunden werden, wenn<br />

sie sich vom aktuell zu übersetzenden ausgangssprachlichen<br />

Segment nur durch die oben genannten Elemente<br />

unterscheiden. 3<br />

Die Tests zeigten jedoch, dass das Retrieval<br />

in solchen Fällen fehlschlagen kann oder dass es,<br />

wie im folgenden Beispiel, trotz minimalen Unterschieds<br />

zu sehr hohen Abzügen kommt:<br />

• Armstrong stepped off Eagle’s footpad […]

• Armstrong stepped off Eagle’s footpad […]

Die meisten TM-Systeme bieten Ähnlichkeitswerte zwi-<br />

schen 91% und 99%; eines bietet aber 85% und eines<br />

nur 46%.<br />

Auf Grund der kommerziellen Natur der untersuchten<br />

TM-Systeme und der dadurch bedingten Black-Box-<br />

Evaluierung lassen sich die Ursachen <strong>für</strong> diese Fehler<br />

nicht eindeutig identifizieren. Dennoch sind einige Rückschlüsse<br />

aus den Testergebnissen möglich.<br />

Bei platzierbaren Elementen liegt die Ursache <strong>für</strong> hohe<br />

Abzüge manchmal darin, dass diese Elemente bei der<br />

Ermittlung der Segmentlänge wie übliche Wörter aus dem<br />

Fließtext gewichtet und damit überbewertet werden. 4<br />

Hingegen stellt ein fester Abzug <strong>für</strong> Unterschiede bei<br />

platzierbaren Elementen eine gute Lösung dar und wird<br />

von vier der getesteten TM-Systeme i.d.R. auch angewendet.<br />

Im Unterschied zu Fließtext kann die Art der<br />

Änderung (Hinzufügung, Löschung, Ersetzung oder Umstellung)<br />

bei der Gewichtung von Abzügen ignoriert werden.<br />

Ein fester Abzug sowie dessen Unabhängigkeit von<br />

der Art der Änderung dürften keine allzu großen Anpassungen<br />

der Algorithmen zur Ermittlung des Ähnlichkeitswertes<br />

in den kommerziellen TM-Systemen darstellen.<br />
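Wie ein solcher fester Abzug in eine Ähnlichkeitsberechnung eingehen könnte, deutet die folgende Skizze an (eigene Illustration; die Tag-Erkennung, das grobe wortbasierte Fließtextmaß und der Abzugswert sind Annahmen und bilden keines der getesteten Systeme ab):

    import re

    # Vereinfachte Erkennung platzierbarer Elemente über Tags (Annahme).
    TAG = re.compile(r"<[^>]+>")

    def aehnlichkeit(tm_segment: str, neues_segment: str, fester_abzug: float = 0.01) -> float:
        """Skizze: grobes wortbasiertes Fließtextmaß plus fester Abzug,
        falls sich die platzierbaren Elemente (Tags) unterscheiden."""
        def zerlege(s):
            return TAG.findall(s), TAG.sub(" ", s).split()
        tags_a, woerter_a = zerlege(tm_segment)
        tags_b, woerter_b = zerlege(neues_segment)
        gemeinsam = len(set(woerter_a) & set(woerter_b))
        basis = gemeinsam / max(len(woerter_a), len(woerter_b), 1)
        if tags_a != tags_b:
            basis -= fester_abzug  # fester Abzug, unabhängig von Art und Anzahl der Änderungen
        return max(basis, 0.0)

    # frei erfundenes Beispielpaar: identischer Fließtext, abweichendes Tag
    print(aehnlichkeit("Siehe Abschnitt <b>3</b>", "Siehe Abschnitt 3"))  # 0.99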

Bei lokalisierbaren Elementen kann die Retrieval-Leistung<br />

ebenfalls verbessert werden, wenn diese als<br />

„Sonderelemente“ und nicht als normaler Fließtext erkannt<br />

werden. Es können prinzipiell dieselben Strategien<br />

wie <strong>für</strong> platzierbare Elemente angewendet werden (z.B.<br />

fester Abzug). Zur Erkennung solcher Elemente, die bes-<br />

3 Eine mögliche Ausnahme bilden hier solche platzierbare Elemente,<br />

die als Attribut- oder Feldwerte längere zu übersetzende<br />

Fragmente enthalten, wie z.B. .<br />

4 Die Segmentlänge, d.h. die Anzahl der Token (Wortanzahl) im<br />

Segment, wird als Normalisierungsfaktor zur korrekten Ermittlung<br />

des Änderungsumfangs im Verhältnis zum Segmentumfang<br />

verwendet (Trujillo, 1999; Manning & Raghavan<br />

& Schütze, 2008).<br />

timmten Mustern folgen, bewähren sich reguläre<br />

Ausdrücke. Aktuell zeigen kommerzielle TM-Systeme<br />

jedoch noch Schwächen. Zum einen sind Mechanismen<br />

zur Erkennung zwar prinzipiell vorhanden, aber sie<br />

schlagen oft fehl, beispielsweise bei Zahlen. Tabelle 1<br />

führt die Erkennungsrate der untersuchten TM-Systeme<br />

bei Zahlentoken auf. 5<br />

TM-System Erkennungsrate<br />

Wordfast 0,99<br />

memoQ 0,99<br />

Across 0,96

Transit 0,90<br />

Déjà Vu 0,89<br />

SDL Trados 0,71<br />

Tabelle 1: Erkennungsrate von Zahlen<br />

Zum anderen werden einige lokalisierbare Elemente<br />

völlig ignoriert. Beispielsweise sind zuverlässige reguläre<br />

Ausdrücke zur Erkennung von URLs im reinen Text<br />

bereits ohne Weiteres verfügbar (Goyvaerts & Levithan,<br />

2009), aber nur ein TM-System implementiert sie. Daher<br />

wurden reguläre Ausdrücke zur Erkennung der jeweiligen<br />

lokalisierbaren Elemente präsentiert bzw. entwickelt.<br />
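Eine stark vereinfachte Variante solcher Ersetzungen zeigt die folgende Skizze (eigene Annahmen; die Muster sind bewusst grob gehalten und ersetzen keine produktionsreifen regulären Ausdrücke):

    import re

    # Bewusst grobe Muster (Annahme); robuste Ausdrücke siehe Goyvaerts & Levithan (2009).
    MUSTER = {
        "URL": re.compile(r"https?://[^\s,;]+"),
        "ZAHL": re.compile(r"\d+(?:[.,:]\d+)*"),
    }

    def mit_platzhaltern(segment: str) -> str:
        """Ersetzt erkannte lokalisierbare Elemente durch typisierte Platzhalter,
        damit sie beim Retrieval nicht wie normaler Fließtext gewichtet werden."""
        for name, muster in MUSTER.items():
            segment = muster.sub(f"<{name}/>", segment)
        return segment

    print(mit_platzhaltern("Details unter http://example.org, Stand 26.07.2011, 22:55 Uhr."))
    # -> Details unter <URL/>, Stand <ZAHL/>, <ZAHL/> Uhr.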

Die Vorteile einer geeigneten Behandlung platzierbarer<br />

und lokalisierbarer Elemente gehen über das reine Retrieval<br />

hinaus. Solche Elemente können häufig automatisch<br />

ersetzt oder gelöscht werden, wobei der Rest des<br />

Fließtextes gleich bleibt. Diese – zum Teil in den TM-Systemen<br />

bereits angewendeten – automatischen Anpassungen<br />

können zum einen Zeit bei der Übersetzung sparen,<br />

zum anderen erhöhen sie den Ähnlichkeitswert.<br />

2.2.2. Precision<br />

Unter 2.2.1 wurden Beispiele präsentiert, bei denen die<br />

Ähnlichkeitswerte zu niedrig ausfallen. Allerdings tritt<br />

auch der umgekehrte Fall ein.<br />

Nicht erkannt werden häufig Unterschiede zwischen den<br />

zu übersetzenden und den im TM gefundenen Segmenten,<br />

wenn sich die Position bzw. die Reihenfolge der platzierbaren<br />

Elemente unterscheidet. Im folgenden Beispiel bieten<br />

drei TM-Systeme <strong>für</strong> das zweite Segment einen<br />

100%-Treffer, obwohl sich die Position der Tags geändert<br />

5 Insgesamt wurden 79 Zahlentoken getestet, wobei jedes Token<br />

ein einmaliges Muster aufweist. Für weitere Informationen<br />

zu den Einzeltests und den getesteten Versionen siehe Azzano<br />

(2011).


hat und folglich auch in der Fremdsprache angepasst wer-<br />

den müsste.<br />

• This statement is true only when […]

• This statement is true only when […]

Der Fehler liegt vermutlich darin, dass diese Elemente<br />

lediglich in ungeordneter Reihenfolge berücksichtigt bzw.<br />

vor der Auswertung durch inhaltsleere nicht-positionale<br />

Platzhalter ersetzt werden. Darüber hinaus wird die An-<br />

zahl der Änderungen teilweise nicht berücksichtigt, so

dass die Ähnlichkeitswerte zu positiv ausfallen. Im fol-<br />

genden Beispiel bieten vier TM-Systeme <strong>für</strong> beide Varia-<br />

tionen des ersten Segments den gleichen Ähnlichkeitswert,<br />

obwohl im dritten das Tag zweimal hinzugefügt<br />

worden ist.<br />

• Last transmission February 6, 1966, 22:55 UTC.

• Last transmission February 6, 1966, 22:55 UTC.

• Last transmission February 6, 1966, 22:55 UTC.

All diese Unzulänglichkeiten dürften mit wenigen Ein-<br />

griffen in die Retrieval-Algorithmen der TM-Systeme<br />

beseitigt werden können.<br />

3. Retrieval-Optimierung

mit linguistischem Wissen<br />

3.1. Aktuelle Ansätze<br />

Grundsätzlich lassen sich bei der Optimierung der Retrie-<br />

val-Ergebnisse von TM-Systemen zwei Zielsetzungen<br />

unterscheiden:<br />

1) Die Verbesserung von Recall und Precision des (monolingualen)<br />

Retrievals (Optimierung der Treffermenge<br />

und des Rankings der Treffer)<br />

a. auf Segmentebene<br />

b. auf Subsegmentebene (Retrieval von ‚Chunks’,<br />

(komplexen) Phrasen, Teilsätzen)<br />

2) Die Anpassung der gefundenen Treffer zur Optimierung<br />

ihrer Wiederverwendbarkeit.<br />

In der Forschung sind derzeit in erster Linie Bemühungen<br />

festzustellen, die Wiederverwendbarkeit von Fuzzy-Treffern<br />

durch Verfahren der statistischen maschinellen Übersetzung<br />

zu erhöhen (Biçici & Dymetman, 2008; Zhechev<br />

& van Genabith, 2010; Koehn & Senellart, 2010). Dabei<br />

werden solche Fragmente, die den Unterschied zwischen<br />

einem zu übersetzenden Segment und einem in der<br />

TM-Datenbank gefundenen Fuzzy-Treffer ausmachen,<br />

mit Hilfe statistischer Übersetzungsverfahren so bearbeitet,<br />

dass die Anpassung der im Translation-Memory gefundenen<br />

Übersetzung an den aktuellen Kontext <strong>für</strong> den<br />

Übersetzer idealerweise keinen zusätzlichen Posteditionsaufwand<br />

bedeutet. Welche Auswirkung eine solche<br />

‚Verschmelzung’ von Humanübersetzung und maschineller<br />

Übersetzung auf Segmentebene tatsächlich auf die<br />

Postedition von Fuzzy-Treffern und somit auf Produktivität<br />

und Textqualität hat, müsste aber in jedem Fall empirisch<br />

untersucht werden.<br />

Unter dem Aspekt einer effizienten Einbindung vorhandener<br />

linguistischer Verfahren in kommerzielle TM-Systeme<br />

scheint es zunächst durchaus lohnenswert, eine Optimierung<br />

von Recall und Precision zu verfolgen. Eines<br />

der wenigen kommerziellen TM-Systeme, das zur Optimierung<br />

der Retrieval–Leistung nicht nur zeichenkettenbasierte,<br />

sondern (einfache) computerlinguistische<br />

Verfahren anwendet, ist das Programm Similis der<br />

französischen Firma Lingua et Machina, das morphosyntaktische<br />

Analysen und flache Parsing-Verfahren einsetzt,<br />

um Fragmente unterhalb der Segmentebene zu identifizieren<br />

(Planas, 2005).<br />

Neben der Identifikation von Subsegmenten ist vor allem<br />

auch eine Verbesserung des Retrievals solcher ausgangssprachlicher<br />

Segmente erforderlich, die lediglich Paraphrasen<br />

bereits übersetzter Sätze darstellen und somit auf<br />

zielsprachlicher Seite häufig keinerlei Veränderung erfordern.<br />

Durch morphosyntaktische, lexikalische und/<br />

oder syntaktische Variation gekennzeichnete Paraphrasen<br />

machen einen nicht zu unterschätzenden Anteil in solchen<br />

Fachtexten aus, die ständig aktualisiert, modifiziert und<br />

wiederverwendet werden.<br />

3.2. Zielsetzungen und Ansätze im Rahmen<br />

des Projekts iMEM<br />

Möglichkeiten, vorhandene computerlinguistische Verfahren<br />

in kommerzielle TM-Systeme zu integrieren, werden<br />

derzeit an der Fachhochschule Köln im Rahmen des Forschungsprojekts<br />

„Intelligente Translation Memories<br />

durch computerlinguistische Optimierung (iMEM)“ untersucht.<br />

iMEM zielt auf eine Optimierung der Retrieval-Leistung<br />

von TM-Systemen sowohl im Hinblick auf die<br />

bessere Erkennung von Fragmenten unterhalb der Segmentebene<br />

als auch hinsichtlich einer Optimierung der<br />

Verfahren zur Terminologieerkennung und -prüfung. Dabei<br />

sollen robuste Verfahren zur morphosyntaktischen



Analyse sowie zur regelbasierten Satzsegmentierung zum<br />

Einsatz kommen. Ziel ist die Entwicklung von<br />

Schnittstellenmodellen und prototypischen Schnittstellen<br />

zwischen kommerziellen TM-Systemen und „Lingware“.<br />

Hierbei werden exemplarisch das TM-System SDL Tra-<br />

dos Studio 2009 und das morphosyntaktische Analyse-<br />

werkzeug MPRO (Maas & Rösener & Theofilidis, 2009)<br />

eingesetzt. Ausgehend von den Sprachen Deutsch und<br />

Englisch sollen Erfahrungen <strong>für</strong> die Entwicklung weiterer<br />

Sprachmodule sowie <strong>für</strong> die Übertragung der Ergebnisse<br />

auf andere TM-Systeme gewonnen werden.<br />

Für die Einbindung morphosyntaktischer Informationen<br />

in das TM-System wurde eine eigenständige SQL-Daten-<br />

bank konzipiert, die aus den Daten der TM-Datenbank als<br />

paralleles „linguistisches TM“ aufgebaut wird und über<br />

entsprechende IDs mit der TM-Datenbank verknüpft ist.<br />

Das „linguistische TM“ enthält neben den Token der Tex-<br />

toberfläche derzeit im Wesentlichen Ergebnisse der mit<br />

MPRO durchgeführten Kompositaanalyse, wobei die<br />

Daten zur Beschleunigung des Retrievals in Form von<br />

SuffixArrays (Aluru, 2004) vorgehalten werden.<br />

In der Retrieval-Phase wird das aktuell zu übersetzende<br />

Segment zunächst analog zum „linguistischen TM“ lin-<br />

guistisch analysiert und annotiert, so dass ein Abgleich<br />

stattfinden kann. Die Abfrage des TM erfolgt in zwei un-<br />

abhängigen Teilprozessen, bei denen einerseits die Token-<br />

ketten und andererseits die Ergebnisse der Kompositazer-<br />

legungen des zu übersetzenden Segments mit den<br />

entsprechenden Daten der im „linguistischen TM“ ges-<br />

peicherten ausgangssprachlichen Segmente verglichen<br />

werden. Dabei werden <strong>für</strong> alle Ergebnisse der beiden Ab-<br />

fragen unter Verwendung von Generalized Suffix Arrays<br />

(GSA) (Rieck & Laskov & Sonnenburg, 2007) die<br />

längsten gemeinsamen Zeichenketten (Longest Common<br />

Substring, LCS) von zu übersetzendem Segment und im<br />

„linguistischen TM“ gefundenem Segment ermittelt. Für<br />

das Ranking der gefundenen Treffer ist noch eine Formel<br />

zu entwickeln, die die Ergebnisse beider Teilsuchen kom-<br />

biniert und gewichtet, wobei jeweils u.a. Anzahl und<br />

Länge der LCS sowie deren Position in den zu verglei-<br />

chenden Segmenten berücksichtigt werden sollte (vgl.<br />

auch Hawkins & Giraud-Carrier, 2009).<br />
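Das Grundprinzip der LCS-Bestimmung auf Tokenebene lässt sich vereinfacht wie folgt skizzieren (eigene Illustration mit dynamischer Programmierung; im Projekt kommen hierfür Generalized Suffix Arrays zum Einsatz, und die Beispielsegmente sind frei erfunden):

    def laengste_gemeinsame_tokenfolge(a, b):
        """Skizze: längste gemeinsame zusammenhängende Tokenfolge (LCS) zweier Segmente."""
        best_len, best_end = 0, 0
        prev = [0] * (len(b) + 1)  # Längen gemeinsamer Suffixe (dynamische Programmierung)
        for i in range(1, len(a) + 1):
            cur = [0] * (len(b) + 1)
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    cur[j] = prev[j - 1] + 1
                    if cur[j] > best_len:
                        best_len, best_end = cur[j], i
            prev = cur
        return a[best_end - best_len:best_end]

    seg_neu = "die Pumpe vor der Inbetriebnahme entlüften".split()
    seg_tm = "vor der Inbetriebnahme die Pumpe prüfen".split()
    print(laengste_gemeinsame_tokenfolge(seg_neu, seg_tm))  # ['vor', 'der', 'Inbetriebnahme']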

Im weiteren Verlauf des Projekts soll untersucht werden,<br />

inwieweit das bisherige Verfahren durch satzsyntaktische<br />

Analysen so erweitert werden kann, dass unterhalb der<br />

Segmentebene nicht nur vergleichsweise einfache Phra-<br />

sen, sondern vor allem auch Übersetzungseinheiten wie<br />

Teilsätze und komplexe Phrasen gefunden werden, die <strong>für</strong><br />

das computergestützte Humanübersetzen mit TM-Syste-<br />

men relevant sind.<br />

4. Fazit und Ausblick

Zusammenfassend kann festgestellt werden, dass <strong>für</strong> die<br />

Lösung der Retrieval-Probleme kommerzieller TM-Sys-<br />

teme aus Sicht der Computerlinguistik bewährte Verfah-<br />

ren zur Verfügung stehen, die aber bisher kaum oder nur<br />

vereinzelt Eingang in die TM-Systeme gefunden haben.<br />

Ein sprachunabhängiger, rein zeichenkettenbasierter An-<br />

satz ohne Nutzung linguistischen Wissens, wie er derzeit<br />

bei fast allen kommerziellen TM-Systemen verfolgt wird,<br />

liefert ungeachtet seiner offensichtlichen Vorteile hinsich-<br />

tlich der Sprachabdeckung keine optimalen Precision-<br />

und Recall-Werte. Es liegt daher nahe, einen differenzier-<br />

ten Ansatz zu verfolgen und <strong>für</strong> die vom Übersetzungsvo-<br />

lumen her ‘großen’ Sprachen damit zu beginnen, vorhan-<br />

dene robuste Verfahren der linguistischen Datenverarbei-<br />

tung in kommerzielle TM-Systeme zu integrieren. Bei<br />

Sprachen, <strong>für</strong> die entsprechende Verfahren nicht zur<br />

Verfügung stehen oder nicht ausreichend robust sind, kann<br />

zunächst weiter mit den herkömmlichen Retriev-<br />

al-Mechanismen gearbeitet werden, wobei Verbesserun-<br />

gen in der Handhabung von platzierbaren und lokalisier-<br />

baren Elementen möglich sind.<br />

5. Danksagung<br />

Das Projekt „iMEM – Intelligente Translation Memories<br />

durch computerlinguistische Optimierung“ wird vom<br />

Bundesministerium <strong>für</strong> Bildung und Forschung im Rah-<br />

men des Programms „Forschung an Fachhochschu-<br />

len“ gefördert.<br />

6. Literatur<br />

Aluru, S. (2004): „Suffix Trees and Suffix Arrays“. In<br />

Mehta, D. P. und Sahni, S. (Eds.), Handbook of Data<br />

Structures and Applications. Boca Raton: Chapman &

Hall/CRC.<br />

Azzano, D. (2011): Placeable and localizable elements in

translation memory systems. Dissertation. Ludwig-<br />

Maximilians-<strong>Universität</strong> München.<br />

Anastasiou, D. (2010): Survey on the Use of XLIFF in<br />

Localisation Industry and Academia. In Proceedings of<br />

the 7th International Conference on Language Re-<br />


sources and Evaluation.<br />

Baldwin, T. (2010): The hare and the tortoise: speed and<br />

accuracy in translation retrieval. Machine Translation,<br />

23(4), pp. 195-240.<br />

Biçici, E. und Dymetman, M. (2008): Dynamic Transla-<br />

tion Memory: Using Statistical Machine Translation to<br />

improve Translation Memory Fuzzy Matches. In Gel-<br />

bukh, A. F. (Ed.), Computational Linguistics and Intel-<br />

ligent Text Processing, 9th International Conference,<br />

Proceedings. Lecture Notes in Computer Science 4919.<br />

Berlin, Heidelberg: Springer, pp. 454-465.<br />

Chama, Z. (2010): Vom Segment zum Kontext. techni-<br />

sche kommunikation, 32(2), pp. 21-25.<br />

Goyvaerts, J. und Levithan, S. (2009): Regular expres-<br />

sions cookbook. Sebastopol, O’Reilly.<br />

Hawkins, B. und Giraud-Carrier, C. (2009): „Ranking<br />

search results for translated content“. In Zhang, K. und<br />

Alhajj, R. (Eds.), IRI'09 - Proceedings of the 10th IEEE<br />

international conference on Information Reuse & Inte-<br />

gration. Piscataway, NJ: IEEE Press, pp. 242-245.<br />

Jekat, S. und Volk, M. (2010): Maschinelle und computer-<br />

gestützte Übersetzung. In Carstensen, K.-U. et al. (Eds.),<br />

Computerlinguistik und Sprachtechnologie: eine Ein-<br />

führung. Heidelberg: Spektrum, pp. 642-658.<br />

Kenning, M.-M. (2010): What are parallel and compara-<br />

ble corpora and how can we use them? In O’Keeffe, A.<br />

und McCarthy, M. (Eds.), The Routledge Handbook of<br />

Corpus Linguistics. New York: Routledge, pp. 487-500.

Koehn, Ph. und Senellart, J. (2010): Convergence of<br />

Translation Memory and Statistical Machine Transla-<br />

tion. In Zhechev, V. (Ed.), Proceedings of the Second<br />

Joint EM+/CNGL Workshop “Bringing MT to the User:

Research on Integrating MT in the Translation Indus-<br />

try”.<br />

Lagoudaki, E. (2006): Translation Memory systems: Enlight-<br />

ening users' perspective. http://www3.imperial.ac.uk/<br />

portal/pls/portallive/docs/1/7307707.PDF (26.07.2011).

Maas, H. D., Rösener, Ch. und Theofilidis, A. (2009):<br />

„Morphosyntactic and semantic analysis of text: The<br />

MPRO tagging procedure“. In Mahlow, C. und<br />

Piotrowski, M. (Eds.), State of the Art in Computational<br />

Morphology: Workshop on Systems and Frameworks<br />

for Computational Morphology. Proceedings. Berlin et<br />

al.: Springer, pp. 76-87.<br />

Macken, L. (2009): In search of the recurrent units of<br />

translation. In Daelemans, W. und Hoste, V. (Eds.),<br />

Evaluation of Translation Technology. Brussels: Aca-<br />

demic and Scientific Publishers, pp. 195-212.<br />

Manning, C., Raghavan, P., Schütze, H. (2008): Introduc-<br />

tion to Information Retrieval. Cambridge et al.: Cam-<br />

bridge University Press.<br />

Massion, F. (2005): Translation-Memory-Systeme im<br />

Vergleich. Reutlingen: Doculine.<br />

Pelster, U. (2011): XML für den passenden Zweck. techni-

sche kommunikation, 33(1), pp. 54-57.<br />

Planas, E. (2005): SIMILIS: Second-generation transla-<br />

tion memory software. In Translating and the Computer<br />

27: Proceedings of the Twenty-seventh International<br />

Conference on Translating and the Computer. London:<br />

Aslib.<br />

Reinke, U. (2004): Translation Memories: Systeme –<br />

Konzepte – linguistische Optimierung. Frankfurt am<br />

Main: Lang.<br />

Reinke, U. (2006): Translation Memories. In Brown, K.<br />

(Ed.), Encyclopedia of Language and Linguistics. Ox-<br />

ford: Elsevier, pp. 61-65.<br />

Reinke, U. (2008): XML-Unterstützung in Transla-<br />

tion-Memory-Systemen. In tekom Jahrestagung 2008,<br />

Zusammenfassung der Referate. Stuttgart: Gesellschaft<br />

<strong>für</strong> technische Kommunikation e.V.<br />

Rieck, K., Laskov, P. und Sonnenburg, S. (2007): Compu-<br />

tation of Similarity Measures for Sequential Data using<br />

Generalized Suffix Trees. In Schölkopf, B., Platt, J. und<br />

T. Hoffman (Eds.), Advances in Neural Information<br />

Processing Systems 19. Cambridge, MA: MIT Press,<br />

pp. 1177-1184.<br />

Sikes, R. (2007): Fuzzy matching in theory and practice.<br />

MultiLingual, 18(6), pp. 39-43.<br />

Trujillo, A. (1999): Translation Engines: Techniques for<br />

Machine Translation. London: Springer.<br />

Zinsmeister, H. (2010): Korpora. In Carstensen, K.-U. et<br />

al. (Eds.), Computerlinguistik und Sprachtechnologie:<br />

eine Einführung. Heidelberg: Spektrum, pp. 482-491.<br />

Zhechev, V. und van Genabith, J. (2010): Maximising TM<br />

Performance through Sub-Tree Alignment and SMT. In<br />

Proceedings of the Ninth Conference of the Association<br />

for Machine Translation in the Americas.



WikiWarsDE: A German Corpus of Narratives Annotated<br />

with Temporal Expressions<br />

Jannik Strötgen, Michael Gertz<br />

Institute of Computer Science, Heidelberg University<br />

Im Neuenheimer Feld 348, 69120 Heidelberg, Germany<br />

E-mail: stroetgen@uni-hd.de, gertz@uni-hd.de<br />

Abstract<br />

Temporal information plays an important role in many natural language processing and understanding tasks. Therefore, the<br />

extraction and normalization of temporal expressions from documents are crucial preprocessing steps in these research areas, and<br />

several temporal taggers have been developed in the past. The quality of such temporal taggers is usually evaluated using<br />

annotated corpora as gold standards. However, existing annotated corpora only contain documents of the news domain, i.e., short<br />

documents with only few temporal expressions. A remarkable exception is the recently published corpus WikiWars, which is the<br />

first temporal annotated English corpus containing long narratives that are rich in temporal expressions. Following this example, in<br />

this paper, we describe the development and the characteristics of WikiWarsDE, a new temporal annotated corpus for German.<br />

Additionally, we present evaluation results of our temporal tagger HeidelTime on WikiWarsDE and compare them with results<br />

achieved on other corpora. Both WikiWarsDE and our temporal tagger HeidelTime are publicly available.

Keywords: temporal expression, TIMEX2, corpus annotation, temporal information extraction<br />

1. Introduction and Related Work<br />

In the last decades, the extraction and normalization of<br />

temporal expressions have become hot topics in<br />

computational linguistics. In many research areas,<br />

temporal information plays an important role, e.g., in<br />

information extraction, document summarization, and<br />

question answering (Mani et al., 2005). In addition,<br />

temporal information is valuable in information retrieval<br />

and can be used to improve search and exploration tasks<br />

(Alonso et al., 2011). However, the tasks of extracting

and normalizing temporal expressions are challenging<br />

due to the fact that there are many different ways to<br />

express temporal information in documents and that<br />

temporal expressions may be ambiguous.<br />

Besides explicit expressions (e.g., “April 10, 2005”) that<br />

can directly be normalized to some standard format,<br />

relative and underspecified expressions are very<br />

common in many types of documents. To determine the<br />

semantics of such expressions, context information is<br />

required. For example, to normalize the expression<br />

“Monday” in phrases like “on Monday”, a reference<br />

time and the relation to the reference time have to be<br />

identified. Depending on the domain of the documents<br />

that are to be processed, this reference time can either be<br />

the document creation time or another temporal<br />

expression in the document. While the document<br />

creation time plays an important role in news<br />

documents, it is almost irrelevant in narrative style<br />

documents, e.g., documents about history or<br />

biographies. Despite these challenges, all applications<br />

using temporal information mentioned in documents<br />

rely on high quality temporal taggers, which correctly<br />

extract and normalize temporal expressions from<br />

documents.<br />
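As a minimal illustration of the role of the reference time (our own sketch, not tied to any particular tagger; the relation handling is deliberately simplistic), resolving an underspecified weekday requires both a reference date and the relation to it:

    from datetime import date, timedelta

    def resolve_weekday(name: str, reference: date, relation: str = "before") -> date:
        """Toy resolution of an underspecified weekday: walk from the reference
        time towards the past or the future until the weekday matches."""
        weekdays = ["monday", "tuesday", "wednesday", "thursday",
                    "friday", "saturday", "sunday"]
        target = weekdays.index(name.lower())
        step = timedelta(days=-1 if relation == "before" else 1)
        day = reference
        while day.weekday() != target:
            day += step
        return day

    # "on Monday" with a (hypothetical) document creation time of 2001-11-14
    print(resolve_weekday("Monday", date(2001, 11, 14), relation="before"))  # 2001-11-12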

Due to the importance of temporal tagging, there have<br />

been significant efforts in the area of temporal<br />

annotation of text documents. Annotation standards such<br />

as TIDES TIMEX2 (Ferro et al., 2005) and TimeML<br />

(Pustejovsky et al., 2003b; Pustejovsky et al., 2005)<br />

were defined and temporal annotated corpora like<br />

TimeBank (Pustejovsky et al., 2003a) were developed –<br />

although most of the corpora contain English documents<br />


only. Furthermore, research challenges were organized<br />

where temporal taggers were evaluated. The ACE<br />

(Automatic Content Extraction) time expression and<br />

normalization (TERN) challenges were organized in<br />

2004, 2005, and 2007. 1 In 2010, temporal tagging was<br />

one task in the TempEval-2 challenge (Verhagen et al.,<br />

2010). However, so far, research was limited to the news<br />

domain, i.e., the documents of the annotated corpora are<br />

short with only a few temporal expressions. The<br />

temporal discourse structure is thus usually easy to<br />

follow. Only recently, a first corpus containing<br />

narratives was developed (Mazur & Dale, 2010). This<br />

corpus, called WikiWars, consists of Wikipedia articles<br />

about famous wars in history. The documents are much<br />

longer than news documents and contain many temporal<br />

expressions. As the developers point out, normalizing<br />

the temporal expressions in such documents is more<br />

challenging due to the rich temporal discourse structure<br />

of the documents.<br />

Motivated by this observation and by the fact that no<br />

temporal annotated corpus for German was publicly<br />

available so far, we created the WikiWarsDE corpus2 ,<br />

which we present in this paper. WikiWarsDE contains<br />

the corresponding German articles of the documents of<br />

the English WikiWars corpus. For the annotation<br />

process, we followed the suggestions of the WikiWars<br />

developers, i.e., annotated the temporal expressions<br />

according to the TIDES TIMEX2 annotation standard<br />

using the annotation tool Callisto3 . To be able to use<br />

publicly available evaluation scripts, the format of the<br />

ACE TERN corpus was selected. Thus, evaluating a<br />

temporal tagger on the WikiWarsDE corpus is<br />

straightforward and evaluation results of different<br />

taggers can be compared easily.<br />

The remainder of the paper is structured as follows. In<br />

Section 2, we describe the annotation schema and the<br />

corpus creation process. Then, in Section 3, we present<br />

detailed information about the corpus such as statistics<br />

on the length of the documents and the number of<br />

temporal expressions. In addition, evaluation results of<br />

1 The 2004 and 2005 training sets and the 2004 evaluation set<br />

are released by the LDC as is the TimeBank corpus; see<br />

http://www.ldc.upenn.edu/<br />

2 WikiWarsDE is publicly available on http://dbs.ifi.uni-heidelberg.de/temporal_tagging/

3 http://callisto.mitre.org/<br />


our own temporal tagger on the WikiWarsDE corpus are<br />

presented. Finally, we conclude our paper in Section 4.<br />

Temporal Expression     Value of the VAL attribute
November 12, 2001       2001-11-12
9:30 p.m.               2001-11-12T21:30 (see footnote 4)
24 months               P24M
daily                   XXXX-XX-XX

Table 1: Normalization examples (VAL) of temporal expressions of the types date, time, duration, and set.

2. Annotation Schema and Corpus Creation<br />

In Section 2.1, we describe the annotation schema,<br />

which we used for the annotation of temporal<br />

expressions in our newly created corpus. Furthermore,<br />

we explain the task of normalizing temporal expressions<br />

using some examples. Then, in Section 2.2, we detail the<br />

corpus creation process and explain the format, in which<br />

WikiWarsDE is publicly available.<br />

2.1. Annotation Schema<br />

Following the approach of Mazur and Dale (2010), we<br />

use TIDES TIMEX2 as annotation schema to annotate<br />

the temporal expressions in our corpus. The TIDES<br />

TIMEX2 annotation guidelines (Ferro et al., 2005)<br />

describe how to determine the extents of temporal<br />

expressions and their normalizations. In addition to date<br />

and time expressions, such as “November 12, 2001” and<br />

“9:30 p.m.”, temporal expressions describing durations<br />

and sets are to be annotated as well. Examples for<br />

expressions of the types duration and set are “24<br />

months” and “daily”, respectively.<br />

The normalization of temporal expressions is based on<br />

the ISO 8601 standard for temporal information with<br />

some extensions. The following five features can be<br />

used to normalize a temporal expression:<br />

• VAL (value)

• MOD (modifier)

• ANCHOR_VAL (anchor value)

• ANCHOR_DIR (anchor direction)

• SET

The most important feature of a TIMEX2 annotation is<br />

the “VAL” (value) feature. For the four examples above,<br />

4 Assuming that “9:30 p.m.” refers to 9:30 p.m. on November<br />

12, 2001.



the values of VAL are given in Table 1. Furthermore,<br />

“MOD” (modifier) is used, for instance, for expressions<br />

such as “the end of November 2001”, where MOD is set<br />

to “END”, i.e., to capture additional specifications not<br />

captured by VAL. ANCHOR_VAL and ANCHOR_DIR<br />

are used to anchor a duration to a specific date, using the<br />

value information of the date and specifying whether the<br />

duration starts or ends on this date. Finally, SET is used<br />

to identify set expressions.<br />
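For illustration, the inline TIMEX2 annotations corresponding to the examples from Table 1 might look roughly as follows (attribute selection kept to a minimum; this is a simplified rendering, not an excerpt from the corpus):

    <TIMEX2 VAL="2001-11-12">November 12, 2001</TIMEX2>
    <TIMEX2 VAL="2001-11-12T21:30">9:30 p.m.</TIMEX2>
    <TIMEX2 VAL="P24M">24 months</TIMEX2>
    <TIMEX2 VAL="XXXX-XX-XX" SET="YES">daily</TIMEX2>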

Often, for example in the TempEval-2 challenge, the<br />

normalization quality of temporal taggers is evaluated<br />

based on the VAL (value) feature, only. This fact points<br />

out the importance of this feature and was the<br />

motivation to evaluate the normalization quality of our<br />

temporal tagger based on this feature as described in<br />

Section 3.<br />

2.2. Corpus Creation<br />

For the creation of the corpus, we followed Mazur and<br />

Dale (2010), the developers of the English WikiWars<br />

corpus. We selected the 22 corresponding German<br />

Wikipedia articles and manually copied sections<br />

describing the course of the wars. 5 All pictures, cross-page

references, and citations were removed. All text<br />

files were then converted into SGML files, the format of<br />

the ACE TERN corpora containing “DOC”, “DOCID”,<br />

“DOCTYPE”, “DATETIME”, and “TEXT” tags. The<br />

document creation time was set to the time of<br />

downloading the articles from Wikipedia. The “TEXT”<br />

tag surrounds the text that is to be annotated.<br />
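A document skeleton in this format might look as follows (the DOCID, DOCTYPE, and DATETIME values and the text snippet are invented for illustration; only the tag inventory is taken from the description above):

    <DOC>
    <DOCID>wikiwarsDE_22</DOCID>
    <DOCTYPE>NARRATIVE</DOCTYPE>
    <DATETIME>2010-12-02T12:00:00</DATETIME>
    <TEXT>
    Der Zweite Weltkrieg begann am 1. September 1939 ...
    </TEXT>
    </DOC>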

Similar to Mazur and Dale (2010), we used our own<br />

temporal tagger, which is described in Section 3.2,<br />

containing a rule set for German as a first-pass<br />

annotation tool. The output of the tagger can then be<br />

imported to the annotation tool Callisto for manual<br />

correction of the annotations. Although this fact has to<br />

be taken into account when comparing the evaluation<br />

results on WikiWarsDE of our temporal tagger with<br />

other taggers, this procedure is motivated by the fact that<br />

“annotator blindness”, i.e., the risk that annotators

miss temporal expressions, is reduced to a minimum. Furthermore, the

annotation effort is reduced significantly since one does<br />

5 Due to the shortness of the Wikipedia article about the Punic<br />

Wars in general, we used sections of three separate articles<br />

about the 1st, 2nd, and 3rd Punic Wars.<br />

not have to create a TIMEX2 tag for the expressions<br />

already identified by the tagger.<br />

At the second annotation stage, the documents were<br />

examined for temporal expressions missed by the<br />

temporal tagger and annotations created by the tagger<br />

were manually corrected. This task was performed by<br />

two annotators – although Annotator 2 only annotated<br />

the extents of temporal expressions. The more difficult<br />

task of normalizing the temporal expressions was<br />

performed by Annotator 1 only, since a lot of experience<br />

in temporal annotation is required for this task. At the<br />

third annotation stage, the results of both annotators<br />

were merged and in cases of disagreement the extents<br />

and normalizations were rechecked and corrected by<br />

Annotator 1.<br />

To compare our inter-annotator agreement for the<br />

determination of the extents of temporal expressions to<br />

others, we calculated the same measures as the<br />

developers of the TimeBank-1.2 corpus. They calculated<br />

the average of precision and recall with one annotator's<br />

data as the key and the other's as the response. Using a<br />

subset of ten documents, they report inter-annotator<br />

agreement of 96% and 83% for partial match (lenient)

and exact match (strict), respectively. 6 Our scores for<br />

lenient and exact match on the whole corpus are 96.7%

and 81.3%, respectively.<br />
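A minimal sketch of this agreement computation (our illustration; the span representation and the matching details are simplifying assumptions) could look as follows:

    def agreement(key_spans, response_spans, strict=True):
        """Average of precision and recall over annotation extents, with
        exact-match (strict) or overlap-based (lenient) span comparison."""
        def matches(a, b):
            return a == b if strict else not (a[1] <= b[0] or b[1] <= a[0])
        hits_response = sum(any(matches(k, r) for k in key_spans) for r in response_spans)
        hits_key = sum(any(matches(k, r) for r in response_spans) for k in key_spans)
        precision = hits_response / len(response_spans)
        recall = hits_key / len(key_spans)
        return (precision + recall) / 2

    # character offsets of temporal expressions marked by two annotators (toy data)
    annotator1 = [(0, 17), (42, 50)]
    annotator2 = [(0, 17), (40, 50), (80, 86)]
    print(agreement(annotator1, annotator2, strict=True))   # ~0.42 (exact match)
    print(agreement(annotator1, annotator2, strict=False))  # ~0.83 (partial match)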

Finally, the annotated files, which contain inline<br />

annotations, were transformed into the ACE APF XML<br />

format, a stand-off markup format used by the ACE<br />

evaluations. Thus, the WikiWarsDE corpus is available<br />

in the same two formats as the WikiWars corpus, and the<br />

evaluation tools of the ACE TERN evaluations can be<br />

used with this German corpus as well.<br />

3. Corpus Statistics and Evaluation Results<br />

In this section, we first present some statistical<br />

information about the WikiWarsDE corpus, such as the<br />

length of the documents and the number of temporal<br />

expressions in the documents (Section 3.1). Then, in<br />

Section 3.2, we shortly introduce our own temporal<br />

tagger HeidelTime, present its evaluation results on<br />

WikiWarsDE, and compare them with results achieved<br />

on other corpora.<br />

6 For more information on TimeBank, see http://timeml.org/site/timebank/documentation-1.2.html.


Corpus               Docs   Token     Timex   Token/Timex   Timex/Document
ACE 04 en train      863    306,463   8,938   34.3          10.4
TimeBank 1.2         183    78,444    1,414   55.5          7.7
TempEval2 en train   162    53,450    1,052   50.8          6.5
TempEval2 en eval    9      4,849     81      59.9          9.0
WikiWars             22     119,468   2,671   44.7          121.4
WikiWarsDE           22     95,604    2,240   42.7          101.8

Table 2: Statistics of the WikiWarsDE corpus and other publicly available or released corpora.

3.1. Corpus Statistics<br />

The WikiWarsDE corpus contains 22 documents with a<br />

total of more than 95,000 tokens and 2,240 temporal<br />

expressions. Note that the fact that the WikiWars corpus

contains almost 25,000 more tokens than WikiWarsDE can

partly be explained by differences between the two

languages. In German, compounds are very frequent, e.g.,

the three English tokens "course of war" correspond to

just one token in German (“Kriegsverlauf”).

In Table 2, we present some statistics of the corpus in<br />

comparison to other publicly available corpora. On the<br />

one hand, the density of temporal expressions<br />

(Token/Timex) is similar among the documents of all the<br />

corpora. In WikiWarsDE, one temporal expression<br />

occurs every 42.7 tokens on average.<br />

On the other hand, one can easily see that the documents<br />

of the WikiWarsDE and the WikiWars corpora are much<br />

longer and contain many more temporal expressions<br />

than the documents of the news corpora. While<br />

WikiWars and WikiWarsDE contain 121.4 and 101.8<br />

temporal expressions per document on average, the<br />

number of temporal expressions in the news corpora

ranges between 6.5 and 10.4 temporal expressions only.<br />

Thus, the temporal discourse structure is much more<br />

complex for the narrative-style documents in WikiWars<br />

and WikiWarsDE. Further statistics on the single<br />

documents of WikiWarsDE are published with the<br />

corpus.<br />

3.2. Evaluation Results<br />

After the development of the corpus, we evaluated our<br />

temporal tagger HeidelTime on the corpus. HeidelTime<br />

is a multilingual, rule-based temporal tagger. Currently,<br />

two languages are supported (English and German), but<br />

due to the strict separation between the source code and<br />

the resources (rules, extraction patterns, normalization<br />

information), HeidelTime can be easily adapted to<br />

further languages. In the TempEval-2 challenge,<br />

HeidelTime achieved the best results for the extraction<br />

and normalization of temporal expressions from English<br />

documents (Strötgen & Gertz, 2010; Verhagen et al.,<br />

2010). Since HeidelTime uses different normalization<br />

strategies depending on the type of the documents that<br />

are to be processed (news- or narrative-style<br />

documents), we were able to show that HeidelTime<br />

achieves high quality results on both kinds of documents<br />

for English. 7<br />

With the development of WikiWarsDE, we are now able<br />

to evaluate HeidelTime on a German corpus as well. For<br />

this, we use the well-known evaluation measures of<br />

precision, recall, and f-score. In addition, we distinguish<br />

between lenient (overlapping match) and strict (exact<br />

match) measures. For the normalization, one can<br />

calculate the measures for all expressions that were<br />

correctly extracted by the system (value). This approach<br />

is used by the ACE TERN evaluations. However, similar<br />

to Ahn et al. (2005) and Mazur and Dale (2010), we<br />

argue that it is more meaningful to combine the<br />

extraction with the normalization tasks, i.e., to calculate<br />

the measures for all expressions in the corpus<br />

(lenient+value and strict+value).<br />

7 More information on HeidelTime, its evaluation results on<br />

several corpora, as well as download links and an online demo<br />

can be found at http://dbs.ifi.uni-heidelberg.de/heideltime/.


Corpus         lenient (P / R / F)   strict (P / R / F)    value (P / R / F)     lenient+value (P / R / F)   strict+value (P / R / F)
TimeBank 1.2   90.5 / 91.4 / 90.9    83.5 / 84.3 / 83.9    86.2 / 86.2 / 86.2    78.0 / 78.8 / 78.4          73.2 / 74.0 / 73.6
WikiWars       93.9 / 82.4 / 87.8    86.0 / 75.4 / 80.4    89.5 / 90.1 / 89.8    84.1 / 73.8 / 78.6          79.6 / 69.8 / 74.4
WikiWarsDE     98.5 / 85.0 / 91.3    92.6 / 79.9 / 85.8    87.0 / 87.0 / 87.0    85.7 / 74.0 / 79.4          82.5 / 71.2 / 76.5

Table 3: Evaluation results of our temporal tagger on an English news corpus (TimeBank 1.2), an English narratives corpus (WikiWars), and our newly created German narratives corpus WikiWarsDE (P = precision, R = recall, F = f-score).

On WikiWarsDE, HeidelTime achieves f-scores of 91.3<br />

and 85.8 for the extraction (lenient and strict,<br />

respectively) and 79.4 and 76.5 for the normalization<br />

(lenient + value and strict + value, respectively).<br />

For comparison, we present the results of HeidelTime on<br />

some English corpora. As shown in Table 3, our<br />

temporal tagger achieves equally good results on both

the narratives corpora (WikiWars and WikiWarsDE) and<br />

the news corpus (TimeBank). Note that our temporal<br />

tagger uses different normalization strategies depending<br />

on the type of the corpus that is to be processed. This<br />

might be the main reason why HeidelTime clearly<br />

outperforms the temporal tagger of the WikiWars<br />

developers. For the WikiWars corpus, Mazur and Dale<br />

(2010) report f-scores for the normalization of only 59.0 and 58.0 (lenient + value and strict + value,

respectively). Compared to these values, HeidelTime<br />

achieves much higher f-scores, namely 78.6 and 74.4,<br />

respectively.<br />

4. Conclusions<br />

In this paper, we described WikiWarsDE, a temporally annotated corpus containing German narrative-style

documents. After presenting the creation process and<br />

statistics of WikiWarsDE, we used the corpus to<br />

evaluate our temporal tagger HeidelTime. While Mazur<br />

and Dale (2010) report lower evaluation results of their<br />

temporal tagger on narratives than on news documents,<br />

HeidelTime achieves similar results on both types of<br />

corpora. Nevertheless, we share their opinion that the<br />

normalization of temporal expressions on narratives is<br />

challenging. However, by using different normalization strategies for different types of documents (news-style and narrative-style documents), this problem can be tackled.

By making available WikiWarsDE and HeidelTime, we<br />

provide useful contributions to the community in<br />

support of developing and evaluating temporal taggers<br />

and of improving temporal information extraction.<br />

5. Acknowledgements<br />

We thank the anonymous reviewers for their valuable<br />

suggestions to improve the paper.<br />

6. References<br />

Ahn, D., Adafre, S.F., de Rijke, M. (2005): Towards<br />

Task-Based Temporal Extraction and Recognition. In<br />

G. Katz, J. Pustejovsky, F. Schilder (Eds.), Extracting<br />

and Reasoning about Time and Events. Dagstuhl,<br />

Germany: Dagstuhl Seminar Proceedings.<br />

Alonso, O., Strötgen, J., Baeza-Yates, R., Gertz, M.<br />

(2011): Temporal Information Retrieval: Challenges

and Opportunities. In Proceedings of the 1st<br />

International Temporal Web Analytics Workshop<br />

(TWAW), pp. 1–8.<br />

Ferro, L., Gerber, L., Mani, I., Sundheim, B., Wilson, G.<br />

(2005): TIDES 2005 Standard for the Annotation of<br />

Temporal Expressions. Technical report, The MITRE<br />

Corporation.<br />

Mani, I., Pustejovsky, J., Gaizauskas, R.J. (2005): The<br />

Language of Time: A Reader. Oxford University<br />

Press.<br />

Mazur, P., Dale, R. (2010): WikiWars: A New Corpus<br />

for Research on Temporal Expressions. In<br />

Proceedings of the 2010 Conference on Empirical<br />

Methods in Natural Language Processing (EMNLP),<br />

pp. 913-922.<br />

Pustejovsky, J., Hanks, P., Sauri, R., See, A.,<br />

Gaizauskas, R.J., Setzer, A., Radev, D., Sundheim, B.,<br />

Day, D., Ferro, L., Lazo, M. (2003a): The<br />


TIMEBANK Corpus. In Proceedings of Corpus<br />

Linguistics 2003, pp. 647–656.<br />

Pustejovsky, J., Castaño, J.M., Ingria, R., Sauri, R.,<br />

Gaizauskas, R.J., Setzer, A., Katz, G. (2003b):<br />

TimeML: Robust Specification of Event and<br />

Temporal Expressions in Text. In New Directions in<br />

Question Answering, pp 28–34.<br />

Pustejovsky, J., Knippen, R., Littman, J., Sauri, R.<br />

(2005): Temporal and Event Information in Natural<br />

Language Text. Language Resources and Evaluation,<br />

39(2-3):123–164.<br />

Strötgen, J., Gertz, M. (2010): HeidelTime: High<br />

Quality Rule-based Extraction and Normalization of<br />

Temporal Expressions. In Proceedings of the 5th<br />

International Workshop on Semantic Evaluation<br />

(SemEval), pp. 321–324.<br />

Verhagen, M., Sauri, R., Caselli, T., Pustejovsky, J.<br />

(2010): SemEval-2010 Task 13: TempEval-2. In<br />

Proceedings of the 5th International Workshop on<br />

Semantic Evaluation (SemEval), pp. 57–62.



Translation and Language Change with Reference to Popular Science Articles:<br />

The Interplay of Diachronic and Synchronic Corpus-Based Studies<br />

Sofia Malamatidou<br />

University of Manchester<br />

Oxford Road, M13 9PL<br />

E-mail: sofia.malamatidou@manchester.ac.uk<br />

Abstract<br />

Although a number of scholars have adopted a corpus-based approach to the investigation of translation as a form of language<br />

contact and its impact on the target language (Steiner, 2008; House, 2004; 2008; Baumgarten et al. 2004), no sustained corpus-based

study of translation involving Modern Greek has so far been attempted and very few diachronic corpus-based studies (Amouzadeh &<br />

House, 2010) have been undertaken in the field of translation. This study aims to combine synchronic and diachronic corpus-based<br />

approaches, as well as parallel and comparable corpora for the analysis of the linguistic features of translated texts and their impact<br />

on non-translated ones. The corpus created captures a twenty-year period (1990-2010) and is divided into three sections, including<br />

non-translated and translated Modern Greek popular science articles published in different years, as well as the source texts of the<br />

translations. Unlike most studies employing comparable corpora, which focus on revealing recurrent features of translated language<br />

independently of the source and target language, this study approaches texts with the intention of revealing features that are<br />

dependent on the specific language pair involved in the translation process.<br />

Keywords: corpus-based translation studies, language change, diachronic corpora, Modern Greek, passive voice<br />

1. Introduction<br />

Translation as a language contact phenomenon has been addressed in depth by neither linguistics nor translation studies. However, in the era of the

information society, the translation of popular science<br />

texts tends to be very much a unidirectional process from<br />

the dominant lingua franca, which is English, into less<br />

widely spoken languages such as Modern Greek. This<br />

process is likely to encourage changes in the<br />

communicative conventions of the target language.<br />

Given the fact that the genre of popular science was<br />

developed in Greece mainly through translations from<br />

Anglophone sources in the last two decades, it is<br />

interesting to examine whether and how the translations<br />

from English encouraged the dissemination of particular<br />

linguistic features in the target language in the discourse<br />

of this particular genre. A number of scholars, mostly<br />

within the English-German context, have taken interest in<br />

investigating translation as a form of language contact<br />

and its effects on the target language. Steiner (2008) has<br />

investigated grammatical and syntactic features of<br />

explicitness as a result of the contact between English<br />

and German, which however did not involve diachronic<br />

analyses of corpora. Most importantly, House and a<br />

group of scholars have investigated how translation from<br />

English affects German, but also Spanish and French<br />

(House, 2004; 2008; Baumgarten et al. 2004; Becher et al.<br />

2009). However, these studies mainly involved manual<br />

analyses of texts, that is, they were not corpus-based<br />

studies as they are understood by Baker (1995), i.e. they<br />

did not involve the automatic or semi-automatic analysis<br />

of machine-readable texts. Diachronic corpus-based<br />

approaches to translation are limited (Amouzadeh &<br />

House, 2010) and in terms of Modern Greek, no similar<br />

study has ever been conducted.<br />

This study aims to examine whether and how translation<br />

can encourage linguistic changes in the target language<br />

by investigating a diachronic corpus of non-translated<br />

and translated Modern Greek popular science articles,<br />

along with their source texts, in order to examine how<br />

translation can be understood as a language contact<br />

phenomenon. The linguistic change that is examined is<br />


the frequency of the passive voice, since it has been<br />

claimed to be found more frequently in translated<br />

Modern Greek texts (Apostolou-Panara, 1991),<br />

especially those translated from English.<br />

This paper first presents the theoretical model that<br />

informs the study, namely the Code-Copying Framework<br />

(Johanson, 1993; 1999; 2002). Then the research<br />

methodology is presented in detail and data analysis<br />

techniques are analysed. Finally, some preliminary<br />

findings are discussed. It must be mentioned that this is

still an ongoing project and for that reason the results are<br />

limited to a number of small sample studies.<br />


2. The Code-Copying Framework<br />

The Code-Copying Framework is a widely applicable<br />

linguistic model that is suitable for the description of<br />

phenomena that have consistently been neglected, such<br />

as translation as a form of language contact and a<br />

propagator of change. Some of its concepts have recently<br />

been used by translation scholars to describe similar<br />

phenomena (Steiner, 2008), suggesting that it is a<br />

conceptual model suitable for analysing diverse cases of<br />

language contact, in particular cases where translation<br />

plays a central role in the dissemination of linguistic<br />

features.<br />

The Code-Copying Framework was developed by<br />

Johanson (1993; 1999; 2002) who is critical of the<br />

terminology, especially that of borrowing, used in the<br />

field of language change studies and it is this critique that<br />

serves as a point of departure towards developing a new<br />

explanatory framework of language contact, where<br />

‘copying’ replaces traditional terms and provides a<br />

different vantage point from which to analyse the

phenomenon. Johanson (1999:39) argues that in any<br />

situation of code-interaction, that is, in a situation where<br />

two or more codes interact, two linguistic systems, i.e.<br />

two codes are employed. The Model Code is the source<br />

code, whereas the Basic Code is the recipient code which<br />

also provides the necessary morphosyntactic and other<br />

information for inserting and adapting the copied<br />

material (Johanson, 2008:62). Although there are

different directions of copying, this study focuses on the<br />

case of ‘adoption’ which involves elements being<br />

inserted from the Model Code into the Basic Code and<br />

views translation as a language contact situation where<br />

translators are likely to copy elements from the source<br />

language, i.e. the Model Code, when translating into the<br />

target language, which is the Basic Code.<br />

Two types of copying are possible within this model:<br />

global and selective copying. The linguistic properties<br />

that can be copied are material (i.e. phonic), semantic,<br />

combinational (i.e. collocations and syntax) and<br />

frequential properties, namely the frequency of particular<br />

linguistic units. In the case of global copying, a linguistic<br />

item is copied along with all its aforementioned<br />

properties. In the case of selective copying, one or more<br />

properties are copied resulting in distinct types of<br />

copying. Thus, there is material (M), semantic (S),<br />

combinational (C) and frequential (F) copying.<br />

Figure 1: The Code-Copying Framework<br />

(Johanson, 2006:5)<br />

During the process of translation, selective copying is<br />

more probable than global copying (Verschik, 2008:133).<br />

For that reason, the type of copying that is dealt with in<br />

this study is selective copying and in particular<br />

frequential copying, which results in a change in the<br />

frequency patterns of an existing linguistic unit.<br />

Apostolou-Panara (1991) notes that passive constructions are used more frequently in Modern Greek than they once were. Traditionally, it has been argued that passive voice structures are used in Modern Greek, though not as often as in English (Warburton, 1975:576), where the passive voice is quite frequent, especially in informative texts such as popular science articles.

As far as translation is concerned, different frequencies



and proportionalities of native patterns often result in<br />

texts having a ‘non-native’ feeling (Steiner, 2008:322).<br />

The frequent translation of source text patterns with<br />

grammatical, yet marginal, target language linguistic<br />

patterns may ultimately override prevailing patterns and<br />

result in new communicative preferences in the target<br />

language (Baumgarten & Özçetin 2008:294).<br />

Copies usually begin as momentary code-copies, that is,<br />

occasional instances of copying. When copies start being<br />

frequently and regularly used by a group of individuals or<br />

by a particular speech community, they become<br />

habitualised code-copies. Copies may also become<br />

conventionalised and become integrated and widely<br />

accepted by a speech community. The final stage is for<br />

copies to become monolingual, i.e. when copies are used<br />

by monolinguals and do not presuppose any bilingual<br />

ability (Johanson, 1999:48). Since momentary copies are<br />

difficult to trace (Csató, 2002:326), emphasis in this<br />

study is placed on habitualised code-copies. Translators<br />

are considered as part of a particular speech community<br />

and copies are regarded as habitualised when they are<br />

frequently and regularly used by translators.<br />

Conventionalised copies are not examined in this study,<br />

since they presuppose measuring social evaluation that is<br />

outside the scope of this research. However, it is safe to<br />

assume that if a copy is monolingualised, that is, it is used<br />

in non-translated texts; it is also in general terms socially<br />

approved.<br />

Translation in this study is understood as a social<br />

circumstance facilitating copying. It is not considered as<br />

a cause of change, but rather as an instance of contact<br />

during which copying may occur and change may<br />

proliferate through language, since translated texts,<br />

especially newspaper and magazine articles, are widely<br />

circulating texts that are likely to exert a powerful<br />

linguistic impact on a large audience. The main factors of<br />

copying are considered to be extra-linguistic, especially<br />

the cultural dominance of English in relation to Modern<br />

Greek, as far as the production of scientific texts is<br />

concerned, and the prestige that English enjoys as a<br />

prominent language and culture, both in the general sense<br />

of a lingua franca and in terms of scientific research.<br />

3. Data and Methodology<br />

3.1. Corpus design<br />

Based on the availability of data and the research aims of<br />

this thesis, an approximately 500,000 corpus of Modern<br />

Greek non-translated and translated popular science<br />

articles, along with their source texts was created. The<br />

corpus is named TROY (TRanslation Over the Years) and<br />

covers a 20-year period (1990-2010), which is considered<br />

to be an adequate time span for language change to occur<br />

and is amenable to being systematically observed.<br />

Newspapers and magazines dedicated to scientific issues<br />

are the two main sources of popular science

articles. The corpus is specialised in terms of both genre<br />

and domain, i.e. it involves popular science articles from<br />

the domain of technology and life sciences. These<br />

domains were chosen due to the fact that the majority of<br />

articles, especially translations, seem to belong to either<br />

one of the two domains. This in turn indicates that<br />

interest is expressed for these domains from the general<br />

public, which consequently suggests that a high number<br />

of people will read articles belonging to the domains of<br />

technology and life sciences, a fact that is likely to result<br />

in a powerful linguistic impact on a large audience.<br />

The TROY corpus is divided into three subcorpora. The<br />

first subcorpus consists of non-translated Modern Greek<br />

popular science articles published in 1990-1991. The<br />

second subcorpus consists of non-translated and<br />

translated Modern Greek popular science articles<br />

published in 2003-2004, as well as the source texts of the<br />

translations. The years 2003-2004 were selected because<br />

translations of popular science texts started circulating<br />

more widely in Greece during that period than in<br />

previous years. The third subcorpus includes<br />

non-translated as well as translated texts and their source<br />

texts, all published in 2009-2010. The subcorpora are<br />

evenly balanced, both in terms of their overall size and<br />

between the two domains.<br />

3.2. Corpus Methodology<br />

The corpus methodology employed in this study has three<br />

aims. Firstly, it aims to investigate whether certain<br />

features have changed over time in Modern Greek.<br />

Secondly, it aims to examine whether this change is<br />


related to or mirrored in the process of translation. Finally, it

aims to investigate whether influence can be traced back<br />

to the English source texts. Ultimately, this methodology<br />

aims at combining most corpus-based methodologies<br />

under one research aim. Thus, synchronic and diachronic<br />

corpus-based approaches, as well as parallel and<br />

comparable corpora are employed in order to illustrate<br />

the way in which combined methodologies can assist in<br />

the analysis of the linguistic features of translated texts<br />

and their impact on non-translated ones.<br />

Firstly, the corpus methodology aims at examining<br />

language change in Modern Greek and in particular to<br />

investigate whether the frequency of the passive voice<br />

has changed over time. This involves a longitudinal<br />

corpus-based study, during which a comparable corpus is<br />

analysed diachronically. For the purposes of this study,<br />

the non-translated articles published in 1990-1991 will be<br />

compared to the non-translated articles published in<br />

2009-2010.<br />

The second aim of this corpus-based methodology is to<br />

examine the role of translation in this language change<br />

phenomenon. This involves a comparable corpus-based<br />

analysis where translated and non-translated Modern<br />

Greek popular science articles are analysed<br />

synchronically. First, the non-translated articles<br />

published in 2003-2004 will be compared to the<br />

translated articles published during the same years. Then,<br />

the same type of analysis will be conducted for articles<br />

published in 2009-2010. Two separate phases of analysis<br />

are included in order to investigate the extent to which<br />

the linguistic features in the translated texts differ from<br />

those of the non-translated ones at different time periods.<br />

More particularly, the first phase of analysis focuses on a<br />

period of time when the influence from English<br />

translations of popular science articles was at its initial<br />

stage. The second phase of analysis focuses on a later<br />

stage of the contact between English and Modern Greek<br />

through translation, as far as the particular genre or<br />

popular science is concerned.<br />

Finally, this corpus-based methodology aims to<br />

investigate the role of the source texts in this language<br />

contact situation. This involves the synchronic analysis<br />

of a parallel corpus of translated articles and their<br />


originals, which consists of two phases of analysis, i.e.<br />

the translated popular science articles that were published<br />

in 2003-2004 will be compared to their source texts and<br />

the same analysis will be conducted for the articles<br />

published in 2009-2010.<br />

The analyses will be conducted with the help of the<br />

Concordance tool of WordSmith Tools 5.0 and will be<br />

based on semi-automatic methods, since at points where<br />

a closer examination of the texts is required, they will be<br />

analysed manually. The verb form is considered to be the<br />

unit of analysis and auxiliary verbs are excluded from the<br />

counts, since they do not provide any lexical information.<br />

For the sample studies discussed below, a part-of-speech<br />

(POS) tagger is not being used due to the fact that<br />

available Modern Greek POS taggers score relatively low<br />

on accuracy and Modern Greek verbs can be quite<br />

accurately identified from their suffixes with the use of<br />

wildcards.<br />
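As an illustration of this suffix-based identification, the following sketch matches verb forms against a small set of Modern Greek mediopassive endings. The suffix list is purely illustrative and is not the wildcard set actually used in the study.

import re

# Illustrative Modern Greek mediopassive endings (present and aorist); this is
# NOT the author's actual WordSmith wildcard list, only a sketch of the idea.
PASSIVE_SUFFIXES = ["ται", "νται", "μαι", "σαι", "μαστε", "σαστε",
                    "θηκε", "θηκα", "θηκαν", "θούν", "θεί"]
PASSIVE_RE = re.compile("(" + "|".join(PASSIVE_SUFFIXES) + r")$")

def count_passives(verb_forms):
    """Count verb forms whose suffix matches one of the passive endings."""
    hits = [v for v in verb_forms if PASSIVE_RE.search(v)]
    return len(hits), hits

verbs = ["χρησιμοποιείται", "ανακαλύφθηκε", "έγραψε"]  # 'is used', 'was discovered', 'wrote'
print(count_passives(verbs))  # -> (2, ['χρησιμοποιείται', 'ανακαλύφθηκε'])

Candidates found this way would still need the manual inspection mentioned above, since deponent verbs and ambiguous endings can inflate the counts.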

4. Preliminary Results<br />

Although this is still an ongoing project, a number of<br />

sample studies indicate that a corpus-based methodology<br />

that combines synchronic and diachronic corpus-based<br />

approaches, as well as parallel and comparable corpora<br />

can considerably assist in the analysis of the linguistic<br />

features of translated texts and their impact on<br />

non-translated ones. Articles for the sample studies are<br />

taken from the newspaper Βήμα (The Tribune), which<br />

includes a section dedicated to scientific issues.<br />

4.1. Language Change in Modern Greek<br />

In terms of the first aim of this corpus-based<br />

methodology, that is, the examination of language change<br />

in Modern Greek, a sample study of popular science<br />

articles published in 1991 and 2010 involving 4,000<br />

words was conducted in order to examine changes in the<br />

frequency of the passive voice. Although this is a very<br />

small sample study, it was found that the passive voice<br />

has become more frequent in Modern Greek in the last 20<br />

years, at least in terms of the specific genre of popular<br />

science articles. In particular, in the articles published in<br />

1991, 273 verb forms were found, 42 of which involved<br />

passive verb forms. In the articles published in 2010, 217<br />

instances of verb forms were identified, 42 of which were<br />

passive. In relative terms, the share of passive forms thus rose from 15.4% (42/273) in 1991 to 19.4% (42/217) in 2010. This means that there is an approximately 5% increase in the frequency of the distribution of passive

voice constructions in Modern Greek. However, this 5%<br />

increase may be attributed to a number of factors that are<br />

unrelated to contact-induced language change, i.e. it

may be a result of internal language changes. An analysis<br />

of translated texts is necessary in order to establish the<br />

extent to which contact through translation has<br />

encouraged a frequential copying of passive voice<br />

structures from English.<br />

Figure 2: Change in the frequency of the passive voice in<br />

Modern Greek (1991-2010)<br />

4.2. The Role of the Translations<br />

A second sample study was conducted in order to<br />

examine the role of the translation in this language<br />

change situation. In particular, a small corpus of 20,000<br />

words taken from translated and non-translated Modern<br />

Greek popular science articles published in 2010 was<br />

analysed. The analysis revealed that the frequency of the<br />

passive voice in the translated and non-translated articles<br />

is very similar, i.e. approximately 20%. In the<br />

non-translated articles, 1,081 verb instances were<br />

identified, 215 of which were passive, whereas the<br />

translated articles included 1,234 verb forms, out of<br />

which 243 involved passive voice occurrences.<br />

Figure 3: Frequency of the passive voice in translated and<br />

non-translated articles published in 2010<br />

This similarity in terms of the proportions of the passive<br />

voice suggests that the translated texts at least mirror the<br />

changes in the frequency of the passive voice that is<br />

attested in Modern Greek. This sample study focuses on a<br />

later stage of contact between English and Modern Greek<br />

in terms of popular science publications and it is assumed<br />

that this later stage indicates more established instances<br />

of copying, if we accept that some kind of copying has<br />

taken place. Although a comparable analysis of articles<br />

published in 2003-2004, when the influence from<br />

Anglophone source texts was at its initial stage, has not<br />

yet been attempted, such an analysis is likely to reveal a<br />

different patterning than the one discussed above, i.e. that<br />

the frequency of the passive voice is higher in translated<br />

texts than in non-translated ones. This would indicate that

the frequential copying of passive voice gradually<br />

habitualised in the context of translation.<br />

4.3. The Role of the Source Texts<br />

Finally, in terms of the last aim of this corpus-based<br />

methodology, namely the investigation of the role of the<br />

English source texts in this language change<br />

phenomenon, it should be mentioned that although a<br />

sample study is not available at the moment for this type<br />

of analysis, it can be predicted based on the previous<br />

sample study that translated texts are likely to follow the<br />

patterns of the source texts. Corpus studies (Biber et al.<br />

1999:476) suggest that the English passives account for<br />

approximately 25% of all finite verbs in academic prose<br />

and for 15% in news. Popular science articles are<br />

considered to be somewhere in between these two genres,<br />

since they present scientific issues using a journalistic<br />

language. Thus, the frequency of the passive voice in<br />

English popular science articles can be expected to be<br />

somewhere between these two percentages, i.e. 20%. The<br />

distribution of the frequency of the passive voice in the<br />

previous sample study represents exactly this proportion.<br />

If this prediction is confirmed, it will suggest that the<br />

translation of popular science articles from Anglophone<br />

sources tends to encourage the frequential copying of the<br />

passive voice in Modern Greek. In that case, Modern<br />

Greek, being the Basic Code, copied the frequency of the

passive voice patterns from the Model Code, which is<br />

English. The copies first habitualised in the discourse of<br />

the translation and then spread into the general linguistic<br />

community and became monolingual copies.<br />


5. Conclusion<br />

Although the results are only preliminary, the importance<br />

of this corpus-based study lies in a number of factors.<br />

Firstly, it is one of the first diachronic corpus-based<br />

studies ever to be attempted within the field of translation<br />

studies and it raises collective awareness of how<br />

translation can encourage the dissemination of particular<br />

source language linguistic features. If this scholarly<br />

strand is to be consolidated, more research across a wider<br />

range of language pairs and linguistic features has to be<br />

conducted. Secondly, it is one of the first sustained<br />

corpus-based studies ever to be conducted in the Modern<br />

Greek context within the field of translation studies,<br />

which aims at analysing systematically and in depth the<br />

Modern Greek linguistic features of translated texts.<br />

Finally, this study combines all corpus-based<br />

methodologies, i.e. diachronic, synchronic, comparable<br />

and parallel, under one research aim: the investigation of<br />

translation as a language contact phenomenon. This is<br />

probably the most important aspect of this study since it<br />

stresses the numerous advantages of collaborative<br />

techniques and engages them in a mutually profitable<br />

dialogue.<br />

6. References<br />

Amouzadeh, M., House, J. (2010): Translation and<br />

Language Contact: The case of English and Persian.<br />

Languages in Contrast, 10(1), pp. 54-75.<br />

Apostolou-Panara, A. (1991): English Loanwords in<br />

Modern Greek: An overview. Terminologie et<br />

Traduction, 1(1), pp. 45-60.<br />

Baker, M. (1995): Corpora in Translation Studies: An<br />

overview and some suggestions for future research.<br />

Target, 7(2), pp. 223-243.<br />

Baumgarten, N., House, J., Probst, J. (2004): English as a<br />

Lingua Franca in Covert Translation Processes. The<br />

Translator, 10(1), pp. 83-108.<br />

Baumgarten, N., Özçetin, D. (2008): Linguistic Variation<br />

through Language Contact in Translation. In E.<br />

Siemund & N. Kintana (Eds.), Language Contact and<br />

Contact Languages. Amsterdam: John Benjamins, pp.<br />

293-316.<br />

Becher, V., House, J., Kranich, S. (2009): Convergence

and Divergence of Communicative Norms through<br />

Language Contact in Translation. In K. Braunmüller &<br />

J. House (Eds.), Convergence and Divergence in<br />

Language Contact Situations. Amsterdam: John<br />

Benjamins, pp. 125-152.<br />

Biber, D., Johansson, S., Leech, G., Conrad, S., Finegan,<br />

E. (1999): Longman Grammar of Spoken and Written<br />

English. Harlow: Longman.<br />

Csató, É.Á. (2002): Karaim: A high-copying language. In<br />

M.C. Jones & E. Esch (Eds.), Language Change: The<br />

interplay of internal, external and extra-linguistic<br />

factors. Berlin: Mouton de Gruyter, pp. 315-327.<br />

Johanson, L. (1993): Code-Copying in Immigrant<br />

Turkish. In G. Extra & L. Verhoeven (Eds.), Immigrant<br />

Languages in Europe. Clevedon, Philadelphia and<br />

Adelaide: Multilingual Matters, pp. 197-221.<br />

Johanson, L. (1999): The Dynamics of Code-Copying in<br />

Language Encounters. In B. Brendemoen, E. Lanza &<br />

E. Ryen (Eds.), Language Encounters across Time and<br />

Space. Oslo: Novus Press, pp. 37-62.<br />

Johanson, L. (2002): Structural Factors in Turkic<br />

language Contacts. London: Curzon.<br />

Johanson, L. (2008): Remodelling Grammar: Copying,<br />

conventionalisation, grammaticalisation. In E.<br />

Siemund & N. Kintana (Eds.), Language Contact and<br />

Contact Languages. Amsterdam: John Benjamins,<br />

pp. 61-79.<br />

House, J. (2004): English as Lingua Franca and its<br />

Influence on Other European Languages. In J.M.<br />

Bravo (Ed.), A New Spectrum of Translation Studies.<br />

Valladolid: Universidad de Valladolid, pp. 49-62.<br />

House, J. (2008): Global English and the Destruction of<br />

Identity?. In P. Nikolaou & M.V. Kyritsi (Eds.),<br />

Translating Selves: Experience and identity between<br />

languages and literatures. London and New York:<br />

Continuum, pp. 87-107.<br />

Steiner, E. (2008): Empirical Studies of Translations as a<br />

Mode of Language Contact: ‘Explicitness’ of<br />

lexicogrammatical encoding as a relevant dimension.<br />

In E. Siemund & N. Kintana (Eds.), Language Contact<br />

and Contact Languages. Amsterdam: John Benjamins,<br />

pp. 317-346.<br />

Verschik, A. (2008): Emerging Bilingual Speech: From<br />

Monolingualism to Code-Copying. London:<br />

Continuum.<br />

Warburton, I. (1975): The Passive in English and Greek.<br />

Foundations of Language, 13(4), pp. 563-578.



A Comparable Wikipedia Corpus: From Wiki Syntax to POS Tagged XML<br />

Noah Bubenhofer, Stefanie Haupt, Horst Schwinn<br />

Institut <strong>für</strong> Deutsche Sprache IDS<br />

Mannheim<br />

E-mail: bubenhofer@ids-mannheim.de, st.haupt@gmail.com, schwinn@ids-mannheim.de<br />

Abstract<br />

To build a comparable Wikipedia corpus of German, French, Italian, Norwegian, Polish and Hungarian for contrastive grammar<br />

research, we used a set of XSLT stylesheets to transform the mediawiki annotations to XML. Furthermore, the data has been<br />

annotated with word class information using different taggers. The outcome is a corpus with rich meta data and linguistic annotation<br />

that can be used for multilingual research in various linguistic topics.<br />

Keywords: Wikipedia, Comparable Corpus, Multilingual Corpus, POS-Tagging, XSLT<br />

1. Background<br />

The project EuroGr@mm 1<br />

aims at describing German<br />

grammar from a multi-lingual perspective. Therefore, an<br />

international research team consisting of members from<br />

Germany, France, Italy, Norway, Poland and Hungary collaborates in bringing their respective language knowledge to bear on a contrastive description of German. The

grammatical topics that have been tackled so far are<br />

morphology, word classes, tense, word order and phrases.<br />

A corpus-based approach is used to compare the<br />

grammatical means of the languages in focus. But so far,<br />

no comparable corpus of the chosen languages was at the<br />

project’s disposal. Of course, for all the languages big<br />

corpora are available, but they consist of different text<br />

types and are in different states of preparation regarding<br />

linguistic markup.<br />

Hence we wanted to build our own corpus of comparable<br />

data in the different languages. The Wikipedia is a<br />

suitable source for building such a corpus. The<br />

disadvantage of the Wikipedia is its limitations regarding<br />

text types: The articles are (or are at least intended to be)<br />

very uniform in their linguistic structure. To overcome<br />

this problem we decided to also include the discussions of the articles in our corpus, which can at least slightly broaden the text type diversity.

In this paper we describe how the Wikipedia was

converted to an XML format and part-of-speech-tagged.<br />

1 See<br />

http://www.ids-mannheim.de/gra/eurogr@mm.html.<br />

2. Wikipedia conversion to XCES<br />

To be able to integrate the linguistically annotated version of

the Wikipedia into our existing corpus repository, the<br />

data has to be in the XML format XCES. 2 There are

already some attempts to convert the Wikipedia to a<br />

data source that is usable for corpus linguistics (Fuchs, 2010:136).

But they offer either only the data of a specific language<br />

version of the Wikipedia in an XML format (Wikipedia<br />

XML Corpus, Denoyer & Gallinari, 2006; SW1, Atserias<br />

et al., 2008), the format isn’t suitable for our needs<br />

(WikiPrep, Gabrilovich & Markovitch, 2006; WikIDF,<br />

Krizhanovsky, 2008; Java Wikipedia Library, Zesch et al.<br />

2008) or the conversion tool does not work anymore with<br />

the current mediawiki engine (WikiXML Collection;<br />

Wiki2TEI, Desgraupes & Loiseau 2007). To have a<br />

lasting solution, the conversion routines need to be<br />

usable in the future as well, which would allow us to get a new version of the Wikipedia from time to time.

Therefore we developed our own solution of XSLT<br />

transformations to get an XCES version of the data.<br />

All Wikipedia articles and their discussions are available<br />

as mediawiki database dumps in XML (Extensible<br />

Markup Language, Bray et al., 1998). These database<br />

dumps contain different kinds of annotation: article metadata is represented in XML, while the article text itself is in mediawiki markup. We convert these documents into XCES format

using XSLT 2.0 transformations to ease research.<br />

2 http://www.xces.org/<br />


This process is divided into two steps:

1) The conversion from mediawiki language to XML<br />

2) The conversion from the generated XML to XCES<br />

format<br />

The mediawiki language consists of a variety of special<br />

signs for special annotations. E.g. to describe a level 2<br />

header the line displays as text wrapped into two equal<br />

signs on each side, like this:<br />

== head ==<br />

Likewise lists display as a chain of hash or asterisk signs,<br />

according to the level, e.g. a level 3 list entry:<br />

### list entry<br />

During the first conversion we process the paragraphs<br />

according to their type and detect headers, lists, tables<br />

and usual paragraphs. We convert these signs into clean<br />

XML: == head == is turned into an XML header element that records its level, and ### list entry into an XML list-item element that records its nesting depth.
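The following sketch illustrates this block-level conversion. The project itself uses XSLT 2.0 stylesheets; the element names head, item and p here are only illustrative and are not the actual target vocabulary of the corpus.

import re
from xml.sax.saxutils import escape

def wiki_line_to_xml(line):
    """Convert one mediawiki block-level line into an (illustrative) XML element."""
    header = re.match(r"^(={2,6})\s*(.*?)\s*\1\s*$", line)   # e.g. "== head =="
    if header:
        level = len(header.group(1))
        return f'<head level="{level}">{escape(header.group(2))}</head>'
    listitem = re.match(r"^([#*]+)\s*(.*)$", line)           # e.g. "### list entry"
    if listitem:
        level = len(listitem.group(1))
        return f'<item level="{level}">{escape(listitem.group(2))}</item>'
    return f"<p>{escape(line)}</p>"                          # ordinary paragraph

print(wiki_line_to_xml("== head =="))      # <head level="2">head</head>
print(wiki_line_to_xml("### list entry"))  # <item level="3">list entry</item>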

Of course inside the paragraphs there may be<br />

text-highlighting markup. We access the paragraphs and<br />

convert these mediawiki annotations to XML, too. Here

we follow a certain pattern to detect text-highlighting<br />

signs.<br />

Still the document’s hierarchy is flat. In the next step we<br />

add structure to the lists. We group the list items<br />

according to their level to highlight the structure. In a<br />

later step we group all articles into sections depending on<br />

the occurrence of head elements. Whenever we add<br />

structure we need to take care of possible errors in the<br />

mediawiki syntax.<br />

Now the articles need to be transformed into the XCES<br />

structure. Here we sort the articles into alphanumerical<br />

sections. We transform the corpus and enrich every<br />

article with meta data. We provide a unique id for every<br />

article and discussion so that they can easily be<br />

referenced. Also the actual article text can be<br />

distinguished from the discussion part of the article,<br />

which is important because they are different text types.<br />

These conversion routines should work for all the<br />

language versions of the Wikipedia, but have so far only<br />


been tested with the languages necessary for the project:<br />

German, French, Italian, Norwegian (Bokmål), Polish<br />

and Hungarian.<br />

3. POS-Tagging<br />

To enable searching for word class information in the<br />

corpus, the data needs to be part-of-speech tagged. This

task has not been finished yet, but preliminary tests have<br />

been done already. Not having any additional resources,<br />

we have to rely on ready-to-use taggers and cannot make any

improvements or adjustments of the taggers. 3<br />

We are<br />

using the following taggers:<br />

German: TreeTagger (Schmid, 1994) with the available training library for German (STTS-Tagset, Schiller et al., 1995)
French: TreeTagger with the available training library for French
Italian: TreeTagger with the available training library for Italian
Polish: TaKIPI (Piasecki, 2007), based on Morfeusz SIaT (Saloni et al., 2010)
Hungarian: the system developed by the Hungarian National Corpus team (Váradi, 2002), based on TnT (Brants, 2000)
Norwegian (Bokmål): Oslo-Bergen Tagger 4 (Hagen et al., 2000)

The input for the taggers consists of raw text files without any XML mark-up, containing only those parts of the Wikipedia that need to be tagged. So all meta information is ignored.

A Perl script is used to send the input data in manageable<br />

chunks to the tagger. The script also transfers the output<br />

of the tagger to an XML file that contains, for each token, a character position reference to the original data file.
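The idea of stand-off annotation with character-position references can be sketched as follows. The project uses a Perl script for this step; the element and attribute names in the Python sketch below are illustrative only.

def standoff(text, tagged_tokens):
    """tagged_tokens: list of (token, pos_tag) pairs in document order.
    Returns stand-off XML lines with character offsets into `text`."""
    lines, cursor = [], 0
    for token, pos in tagged_tokens:
        start = text.index(token, cursor)   # locate the token in the original data
        end = start + len(token)
        lines.append(f'<tok from="{start}" to="{end}" pos="{pos}"/>')
        cursor = end
    return "\n".join(lines)

text = "Der Kurs fällt."
print(standoff(text, [("Der", "ART"), ("Kurs", "NN"), ("fällt", "VVFIN"), (".", "$.")]))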

Because of the size of the Wikipedia, the tagging process<br />

is very time consuming. E.g. the XCES file of the<br />

German Wikipedia holds about 15.4 GB of data<br />

(785,791,766 tokens). The size of the stand-off file

containing the linguistic mark-up produced by the<br />

3 Nevertheless we get support from the developers of the taggers,

which we greatly appreciate.<br />

4 See http://tekstlab.uio.no/obt-ny/english/<br />

history.html for the newest developments of the tagger.



TreeTagger (POS information to each token) is about<br />

157.9 GB. It took about 30 hours on a standard dual-core PC to process this file.

4. Corpus Query System<br />

Our existing corpus management software COSMAS II 5<br />

is used as corpus query system. COSMAS II is currently<br />

used to manage the DeReKo (German Reference Corpus,<br />

see Kupietz et al., 2010), which contains about 4 billion<br />

tokens. Therefore COSMAS II is also able to cope with<br />

the Wikipedia data.<br />

To be able to build new versions of our corpus from time to time based on the latest Wikipedia, we can rely on the same version control mechanisms as DeReKo.

For technical reasons, COSMAS II cannot handle UTF-8<br />

encoding. Therefore the encoding of the XCES files has

to be changed to ISO-8859-1 and characters outside this<br />

range converted to numeric character references referring<br />

to the Unicode code point.<br />
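In Python, for instance, this substitution corresponds to encoding with the xmlcharrefreplace error handler; this is a minimal sketch, and the project's actual conversion step may be implemented differently.

text = "Kurs: 7 € am 29. Oktober – „Tiefststand“"
# Encode to ISO-8859-1; characters outside the range become numeric
# character references pointing at their Unicode code points.
latin1 = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(latin1.decode("iso-8859-1"))
# Kurs: 7 &#8364; am 29. Oktober &#8211; &#8222;Tiefststand&#8220;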

At the end of this process, the Wikipedias in the XCES<br />

and the tagged format will be made publicly available to<br />

the scientific community.<br />

5. Conclusion<br />

While the Wikipedia is an often used and attractive source for various NLP and corpus linguistic tasks, it is not easy to get a lasting XML conversion routine which produces proper XML versions of the data. It was our attempt to find such a solution using XSLT stylesheets.

After the part-of-speech tagging of the six language<br />

versions of the Wikipedia (German, French, Italian,<br />

Polish, Hungarian, Norwegian) we are able to build a<br />

multilingual comparable corpus for contrastive grammar<br />

research in our project.<br />

For future investigations, the advantage of an XML

version of the Wikipedia is clearly visible: The XML<br />

structure holds all the meta information available in the<br />

mediawiki code and can therefore be used to differentiate<br />

findings of grammatical structures: Are there variants of<br />

specific constructions in different text types (lexicon<br />

entry vs. user discussion)? Or does the usage of the<br />

constructions depend on topic domains? And how do<br />

5 See http://www.ids-mannheim.de/cosmas2/.<br />

these observations change in the light of inter-lingual<br />

comparisons?<br />

6. References<br />

Atserias, J., Zaragoza, H., Ciaramita, M., Attardi, G.<br />

(2008): Semantically Annotated Snapshot of the<br />

English Wikipedia. In Proceedings of the Sixth<br />

International Language Resources and Evaluation<br />

(LREC 08), Marrakech, pp. 2313–2316.<br />

Brants, T. (2000): TnT – A Statistical Part-of-Speech<br />

Tagger. In Proceedings of the Sixth Conference on<br />

Applied Natural Language Processing (ANLP),<br />

Seattle, WA.<br />

Bray, T., Paoli, J., Sperberg-McQueen, C. M. (1998):<br />

Extensible Markup Language (XML) 1.0. W3C<br />

Recommendation<br />

.<br />

Denoyer, L., Gallinari, P. (2006): The Wikipedia XML<br />

Corpus. In SIGIR Forum.<br />

Desgraupes, B., Loiseau, S. (2007): Wiki to TEI 1.0<br />

project .<br />

Fuchs, M. (2010): Aufbau eines linguistischen Korpus<br />

aus den Daten der englischen Wikipedia. In Semantic<br />

Approaches in Natural Language Processing.<br />

Proceedings of the Conference on Natural Language<br />

Processing 2010 (KONVENS 10), Saarbrücken:<br />

<strong>Universität</strong>sverlag des Saarlandes, pp. 135–139.<br />

Gabrilovich, E., Markovitch, S. (2006): Overcoming the<br />

Brittleness Bottleneck using Wikipedia: Enhancing<br />

Text Categorization with Encyclopedic Knowledge.<br />

In Proceedings of The 21st National Conference<br />

on Artificial Intelligence (AAAI), Boston,<br />

pp. 1301–1306.<br />

Hagen, K., Johannessen, J. B., Nøklestad, A. (2000): A<br />

Constraint-based Tagger for Norwegian. In 17th<br />

Scandinavian Conference of Linguistics, Lund,<br />

Odense: University of Southern Denmark, 19,<br />

pp. 31–48 (Odense Working Papers in Language and<br />

Communication).<br />

Krizhanovsky, A. A. (2008): Index wiki database: design<br />

and experiments. In CoRR abs/0808.1753.<br />

Kupietz, M., Belica, C., Keibel, H., Witt, A. (2010): The<br />

German Reference Corpus DeReKo: A primordial<br />

sample for linguistic research. In Proceedings of the<br />

7th conference on International Language Resources<br />


and Evaluation, Valletta, Malta: European Language<br />

Resources Association (ELRA), pp. 1848-1854.<br />

Piasecki, M. (2007): Polish Tagger TaKIPI: Rule Based<br />

Construction and Optimisation. In Task Quarterly<br />

11(1–2), pp. 151–167.<br />

Saloni, Z., Gruszczyński, W., Woliński, M., Wołosz, R.<br />

(2010): Analizator morfologiczny Morfeusz<br />

.<br />

Schiller, A., Teufel, S., Thielen, C. (1995): Guidelines <strong>für</strong><br />

das Tagging deutscher Textcorpora mit STTS.<br />

<strong>Universität</strong> Stuttgart, Institut <strong>für</strong> maschinelle<br />

Sprachverarbeitung; <strong>Universität</strong> Tübingen, Seminar<br />

<strong>für</strong> Sprachwissenschaft, Stuttgart<br />

.<br />

Schmid, H. (1994): Probabilistic Part-of-Speech Tagging<br />

Using Decision Trees<br />

.<br />

Váradi, T. (2002): The Hungarian National Corpus. In<br />

Proceedings of the 3rd LREC Conference, Las Palmas,<br />

Spain, pp. 385–389

.<br />

Zesch, T., Müller, C., Gurevych, I. (2008): Extracting<br />

Lexical Semantic Knowledge from Wikipedia and<br />

Wiktionary. In Proceedings of the Sixth International<br />

Language Resources and Evaluation (LREC 08),<br />

Marrakech, pp. 1646–1652<br />

.<br />


A German Grammar for Generation in OpenCCG<br />

Jean Vancoppenolle * Eric Tabbert * Gerlof Bouma + Manfred Stede *<br />

* Dept of Linguistics, University of Potsdam, + Dept of Swedish, University of Gothenburg<br />

E-mail: * {vancoppenolle,tabbert,stede}@uni-potsdam.de + gerlof.bouma@gu.se<br />

Abstract<br />

We present a freely available CCG fragment for German that is being developed for natural language generation tasks in the<br />

domain of share price statistics. It is implemented in OpenCCG, an open source Java implementation of the computationally<br />

attractive CCG formalism. Since generation requires lexical categories to have semantic representations, so that possible<br />

realizations can be produced, the underlying grammar needs to define semantics. Hybrid Logic Dependency Semantics, a logic<br />

calculus especially suited for encoding linguistic meaning, is used to declare the semantics layer. To our knowledge, related work<br />

on German CCG development has not yet focused on the semantics layer. In terms of syntax, we concentrate on aspects of German<br />

as a partially free constituent order language. Special attention is paid to scrambling, where we employ CCG's type-changing

mechanism in a manner that is somewhat unusual, but that allows us to a) minimize the amount of syntactic categories that are<br />

needed to model scrambling, compared to providing categories for all possible argument orders, and b) retain enough control to<br />

impose restrictions on scrambling.<br />

Keywords: CCG, Generation, Scrambling, German<br />

Introduction<br />

“Der Kurs der Post ist vom 13. September bis 29.<br />

Oktober stetig gefallen und dann bis zum 15. November<br />

wieder leicht angestiegen.<br />

Zwischen dem 13. und dem 29. September schwankte<br />

der Kurs leicht zwischen 15 und 16 Euro. Anschließend<br />

fiel er um mehr als die Hälfte ab und erreichte am 29.<br />

Oktober seinen Tiefststand bei 7 Euro. Bis zum 15.<br />

November stieg der Kurs nach einigen Schwankungen<br />

auf seinen Schlusswert von 10 Euro.”<br />

Consider the graph depicting the development of a share<br />

price. Undoubtedly, a human could interpret the<br />

mathematical properties of that graph and quite easily<br />

describe this information in prose. He would probably<br />

produce a text more or less similar to the one presented<br />

above. In computational linguistics (or, more general,<br />

artificial intelligence), people attempt to go one step<br />

further and let the computer do that work for us.<br />

Basically, it will have to perform the same steps that a<br />

human would need to in order to accomplish this task:<br />

determine the mathematical properties of interest and<br />

generate a text that is faithful to the input and easy to<br />

read. The present paper addresses the latter sub task –<br />

i.e., the text generation.<br />

Our goal is to develop a freely available fragment of a<br />

German grammar in OpenCCG that is suitable for<br />

natural language generation tasks in the domain of share<br />

prices. Related work on German in OpenCCG includes<br />

Hockenmaier (2006) and Hockenmaier and Young<br />

(2008), who employ grammar induction algorithms to<br />

induce CCG grammars automatically from treebanks<br />

(e.g. TiGERCorpus). To our knowledge, however, very<br />

few resources are actually freely available. In

particular, the coverage of a part of the semantic layer is<br />

a novel contribution of the grammar that we present<br />

here.<br />

1. CCG<br />

CCG (Combinatory Categorial Grammar, Steedman<br />

2000, Steedman & Baldridge <strong>2011</strong>) is a lexicalized<br />

grammar formalism in which all constituents, lexical<br />


ones included, are assigned a syntactic category that<br />

describes their combinatory possibilities. These categories

may be atomic or complex. Complex categories are<br />

functions from one category into another, with<br />

specification of the relative position of the function and<br />

its argument. For instance, the notation s∖np describes a complex category that can be combined with an np on its left (direction of the slash) to yield an s.

Category combination always applies to adjacent<br />

constituents and is governed by a set of combinatory<br />

rules, of which the simplest is function application. In<br />

the example in Fig. 1, we build a sentence (category s) around a transitive verb ((s∖np)/np). There are two

versions of function application used in the derivation:<br />

backward (<) and forward (>), depending on which

constituent is the argument and which is the function.<br />

An overview of other derivation rules is given in Table<br />

1.<br />


Figure 1: A basic CCG derivation.<br />
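As a toy illustration of function application (not OpenCCG's implementation), the following sketch derives the category s for a clause built around a transitive verb, using the category (s∖np)/np from the text and the word order of example (2) below.

def split_outer(cat):
    """Split a complex category at its outermost slash, e.g. '(s\\np)/np' -> ('s\\np', '/', 'np')."""
    depth = 0
    for i in range(len(cat) - 1, -1, -1):
        if cat[i] == ")":
            depth += 1
        elif cat[i] == "(":
            depth -= 1
        elif cat[i] in "/\\" and depth == 0:
            return cat[:i].strip("()"), cat[i], cat[i + 1:]
    return cat, None, None

def forward(fn, arg):
    """X/Y  Y  =>  X   (forward application, '>')"""
    result, slash, wanted = split_outer(fn)
    return result if slash == "/" and wanted == arg else None

def backward(arg, fn):
    """Y  X\\Y  =>  X   (backward application, '<')"""
    result, slash, wanted = split_outer(fn)
    return result if slash == "\\" and wanted == arg else None

# der Kurs        erreicht          seinen Höchststand
#   np          (s\np)/np                  np
vp = forward("(s\\np)/np", "np")  # -> 's\np'
print(backward("np", vp))         # -> 's'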

The atomic categories in CCG come from a very<br />

restricted set. They may be enriched with features to<br />

handle case, agreement, clause types, etc. In addition, a<br />

grammar writer may choose to handle language-specific<br />

phenomena with unary type-changing rules. Finally, the<br />

grammar presented uses multi-modal CCG (henceforth<br />

MMCCG), which gives extended lexical control over<br />

derivation possibilities by adding modalities to the<br />

slashes in complex categories (see Baldridge 2002;<br />

Steedman & Baldridge <strong>2011</strong>, for introduction and<br />

overview).<br />

In its basic form, CCG has mildly context-sensitive<br />

generative power and is thus able to account for non-context-free

natural language phenomena (Steedman,<br />

2000). Its attractiveness is due to the linguistic<br />

expressiveness on the one hand and the fact that it is<br />

efficiently parsable in theory (Shanker & Weir, 1990), as<br />

well as in practice (Clark & Curran, 2007).<br />

Table 1: Overview of the derivation rules (forward and backward function application, type-raising (T), and further combinatory rules).



objects. Finally, the satisfaction operator @ states that<br />

the formula p in @_i p holds at world i.

OpenCCG implements a flexible surface realizer that<br />

when given a logical form (LF) like in Fig. 1 returns one<br />

or more realizations of it, based on the underlying<br />

grammar. Both the number of realizations and their<br />

surface forms depend on how much information a LF<br />

specifies, thereby allowing to either enhance or restrain<br />

non-determinism of the realization process. For<br />

example, given that the LF in Fig. 2 does not specify<br />

which of the two arguments is fronted, the following<br />

two surface forms are possible:<br />

2) Der Kurs erreicht seinen Höchststand.<br />

the share-price reaches its peak<br />

3) Seinen Höchststand erreicht der Kurs.<br />

its peak reaches the share-price
'The share-price reaches its peak'

2. Coverage<br />

Our current work focuses on different aspects of<br />

German as a partially free constituent order language,<br />

including basic constituent order and scrambling in<br />

particular, but also on complex nominal phrases, clausal<br />

subordination, and coordination. In the next two<br />

Figure 3: NP fronting<br />

sections, we first give a brief overview of how<br />

topicalization is modeled in our grammar, followed by<br />

an approach to scrambling that we are currently<br />

investigating and that, as far as we know, is new in<br />

CCG.<br />

2.1. Topicalization<br />

The finite verb can occupy three different positions that<br />

depend on the clause type and determine the sentence<br />

mood: matrix clauses are either verb-initial (declarative<br />

or yes/no-interrogative), or verb-second (declarative or<br />

wh-interrogative), and subordinate clauses are always<br />

verb-final (declarative or interrogative).<br />

Following Steedman (2000), Hockenmaier (2006) and<br />

Hockenmaier and Young (2008), we implemented a<br />

topicalization rule that systematically derives verb-second

order from verb-initial order by fronting an<br />

argument of the verb, e.g. an NP, a PP, or a clause. This<br />

also covers partial fronting (see Fig. 3 for examples):<br />

4) T ⇒ s_v2/(s_v1/T), T = {np, pp, s_to-inf$∖np, ...}

Sentence modifiers (e.g. heute in heute fällt der Kurs<br />

'today, the share price is falling') are analyzed as s_v2/s_v1

and can thus form verb-second clauses on their own.<br />



Multilingual Resources and Multilingual Applications - Regular Papers<br />

2.2. Scrambling<br />

Much of the constituent order freedom in German is due<br />

to the fact that it allows for permutation of verbal<br />

arguments within a clause (local scrambling, 5) and<br />

'extraction' of arguments of an arbitrarily deeply<br />

embedded infinite clause (long-distance scrambling, 6):<br />

5) dass [dem Unternehmen]2 [das Richtige]3<br />


that the enterprise the right-thing<br />

[der Berater]1 empfiehlt.<br />

the counselor advises<br />

'that the counselor advises the enterprise the right<br />

thing'<br />

6) dass [dem Unternehmen]2 [das Richtige]3<br />

that the enterprise the right-thing<br />

[der Berater]1 [_ _ zu empfehlen hofft].<br />

the counselor to advise hopes<br />

'that the counselor hopes to advise the enterprise the right<br />

thing'<br />

Different proposals have been made in MMCCG to<br />

account for constituent order freedom in general. To our<br />

knowledge, the two most common approaches are to<br />

provide separate categories for each possible order<br />

(Hockenmaier, 2006; Hockenmaier & Young, 2008) or<br />

to allow lexical underspecification of argument order<br />

through multi-sets (Hoffman, 1992; Steedman &<br />

Baldridge, 2003).<br />

We are investigating an approach to local scrambling<br />

that aims at combining the advantages of both methods,<br />

namely having fine-grained control over argument<br />

permutation on the one hand, and requiring as few<br />

categories as possible on the other. It is based on a set of<br />

type-changing rules that change categories 'on the fly'.<br />

(7) shows a simplified rule that allows plural NPs to be derived from plural nouns, reflecting the optionality of determiners in German plural NPs (e.g. sie isst Kartoffeln 'she eats potatoes'):

7) n_pl ⇒ np_pl

Type-changing rules can also be used to swap two consecutive argument NPs (i and j denote indexes):

8) s/np⟨i⟩+base/np⟨j⟩-pron ⇒ s/np⟨j⟩/np⟨~i⟩-base
9) s$∖np⟨i⟩-pron∖np⟨j⟩+base,-pron ⇒ s$∖np⟨~j⟩-base,-pron∖np⟨i⟩

This essentially emulates the behavior of multi-sets and<br />

at the same time reduces the number of categories to a<br />

minimum, thereby enhancing the maintainability of the<br />

grammar. The advantage over multi-sets is that<br />

restrictions on scrambling can be formulated<br />

straightforwardly, such as that full NPs should not<br />

scramble over pronouns (i.e. NPs having the -pron(oun)<br />

feature) (see Uszkoreit (1987) for an overview of<br />

scrambling regularities in German).<br />

Rules like (8) and (9) require special caution, though.<br />

Type-changing rules are supposed to actually change the<br />

type of the argument category as they could otherwise<br />

apply over and over again, causing an infinite recursion.<br />

This is where the ±base feature comes into play. It<br />

indicates whether an NP occupies its base position or<br />

has already been scrambled, restricting the application<br />

of (8) and (9) to the former case and thereby preventing<br />

infinite recursion. The so-called dollar variable $ in<br />

(9) ranges over complex categories that have the same<br />

functor (here: s), such as s∖np. It is not crucial to our

scrambling rules but generalizes (9) to apply to both<br />

transitive and ditransitive verbs.<br />

Four more rules are sufficient to capture all possible<br />

local permutations and also some of the long-distance<br />

permutations, as the one in (6).<br />

Figure 4: Parse of an infinite clause.<br />

The derivation in Fig. 4 contains the derivation of the<br />

complex verb cluster of example (6). The composed<br />

category s∖np∖np∖np corresponds to the one of an

ordinary ditransitive verb, so although (6) is an instance<br />

of long-distance scrambling, it can be derived by means<br />

of our local scrambling rules (8) and (9).<br />
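To make the mechanics of (8) and (9) more concrete, the following fragment is a minimal sketch in plain Python (our own illustration, not OpenCCG's rule syntax): a swap is only licensed while the scrambling NP still carries +base, and the swapped NP is marked -base, so the rule cannot reapply and no infinite recursion arises. As stated above, the two rules alone license only part of the local permutations; the additional rules are needed for the rest.

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class NP:
    index: str          # co-indexation with the LF (1, 2, 3, ...)
    base: bool = True   # True = NP still occupies its base position
    pron: bool = False  # pronominal NPs must not be scrambled over

def scramble(args):
    """Yield the argument orders reachable by moving an NP that is still in
    its base position over an adjacent non-pronominal NP (cf. rules 8/9)."""
    yield tuple(args)
    for k in range(len(args) - 1):
        left, right = args[k], args[k + 1]
        if right.base and not left.pron:              # restriction as in (8)/(9)
            swapped = list(args)
            swapped[k] = replace(right, base=False)   # scrambled NP leaves its base position
            swapped[k + 1] = left
            yield from scramble(swapped)

# dass [der Berater]1 [dem Unternehmen]2 [das Richtige]3 empfiehlt
base_order = (NP("1"), NP("2"), NP("3"))
orders = {tuple(np.index for np in order) for order in scramble(base_order)}
for order in sorted(orders):
    print(order)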

3. Lexicon<br />

The grammar is intended for use in the domain of the<br />

stock market, thus providing the means to describe the<br />

development of share prices. Since the expansion and<br />

proper implementation of a lexical database is a full-fledged

task of its own and the focus of our current work<br />

is to extend the grammar, our current lexicon is still<br />

quite limited in its scope.<br />

At a later point, one might consider making use of the


CCGbank lexicon (Hockenmaier, 2006).<br />

3.1. Nouns<br />


Our lexicon currently contains approximately 125<br />

nouns. For the different inflectional paradigms we made<br />

use of inflection tables presented on the free online<br />

service canoo.net. 2 For each of these paradigms we<br />

wrote an 'expansion'. OpenCCG's expansions provide a<br />

means to define inflectional paradigms as an applicable<br />

rule and link lexical information to them, so that<br />

OpenCCG generates the different tokens of a word and<br />

its syntactic and semantic properties as interpretable<br />

lexical entries. Thus a typical noun entry is a one-liner<br />

like this:<br />

10) # Höchststand<br />

noun_infl_1(Höchststand, Höchstständ, masc,
peak, graph_point_definition)

The first two arguments contain the singular and plural<br />

stem, to which the inflection endings will be attached by<br />

the expansion. The following arguments are gender (for<br />

agreement), a predicate (as semantic reference) and a<br />

semantic type from the ontology. While seemingly plain<br />

English, these semantic predicates should be thought of<br />

as a grossly simplified metalanguage, which guarantees a

unique and unambiguous semantic representation.<br />
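As a rough illustration of what such an expansion does, the following Python sketch (hypothetical; OpenCCG's actual expansion syntax differs, and the endings shown are a simplified paradigm) combines the two stems of entry (10) with the endings of one inflectional paradigm and attaches gender, predicate and semantic type to every generated form.

# Rough sketch (hypothetical, not OpenCCG's actual expansion syntax) of what an
# expansion like noun_infl_1 does: it combines the two stems with the endings of
# one (simplified) inflectional paradigm and attaches gender, predicate and type.

PARADIGM_1 = {
    ("sg", "nom"): "", ("sg", "gen"): "es", ("sg", "dat"): "", ("sg", "acc"): "",
    ("pl", "nom"): "e", ("pl", "gen"): "e", ("pl", "dat"): "en", ("pl", "acc"): "e",
}

def noun_infl_1(sg_stem, pl_stem, gender, pred, sem_type):
    """Expand one lexicon one-liner into inflected lexical entries."""
    entries = []
    for (num, case), ending in PARADIGM_1.items():
        stem = sg_stem if num == "sg" else pl_stem
        entries.append({
            "form": stem + ending,
            "syn": {"num": num, "case": case, "gender": gender},
            "sem": {"pred": pred, "type": sem_type},
        })
    return entries

for e in noun_infl_1("Höchststand", "Höchstständ", "masc",
                     "peak", "graph_point_definition"):
    print(e["form"], e["syn"], e["sem"])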

3.2. Verbs<br />

For the verbs we followed a similar approach, with three<br />

expansions. The first two actually cover the same<br />

inflection paradigm, with the difference that for verbs<br />

ending in -ern like klettern ('to climb') we duplicated the paradigm and made slight adjustments to prevent the word stem kletter from combining with certain inflectional morphs like -en into ungrammatical forms like *(wir) kletteren (instead of klettern). The third expansion covers several modal verbs like können ('can') or müssen ('to have to').

Each of those rules sets the features of the respective<br />

inflection (e.g. fin, 1st, sg, pres) and those for past tense.<br />

Sample entries:<br />

11) regular-vv(schwanken, schwank, schwankte,<br />

fluctuate)<br />

12) regular-vv-ern(klettern, kletter, kletterte, climb)<br />

2 http://www.canoo.net/<br />

4. Generation<br />

We would like to conclude with a brief outline of how<br />

our grammar fits into the generation scenario presented<br />

in the introduction.<br />

The idea is to generate text automatically from share<br />

price graphs, i.e., from collections of data points. Graphs<br />

are analyzed in terms of different mathematical<br />

properties (e.g. extremes and inflection points). These<br />

properties, together with user-provided realization<br />

parameters that allow fine-grained control over the<br />

'specificity' of LFs (and thus over the number of surface<br />

realizations), are input to static LF templates. The filled<br />

LF templates are then fed to the OpenCCG realizer<br />

where our grammar is used to compute the appropriate<br />

surface forms. In the last step, orthographic postprocessing,<br />

the surface forms are normalized with<br />

respect to language-specific orthographic standards (e.g.<br />

number or date formats, etc.). The figure below<br />

illustrates this procedure:<br />

Figure 5: Procedure of the generation process.<br />
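The following Python sketch outlines the same pipeline; all function names, the toy graph analysis and the hard-coded realizer output are placeholders of our own, not the system's actual API.

# Hedged sketch of the generation pipeline described above; all names are
# placeholders, and realize() merely stands in for a call to the OpenCCG realizer.

def analyse_graph(data_points):
    """Derive simple mathematical properties from a share price curve."""
    peak_value = max(p for _, p in data_points)
    return {"event": "reach_peak", "theme": "Kurs", "value": peak_value}

def fill_lf_template(properties, specificity="low"):
    """Instantiate a static LF template; the specificity parameter controls how
    much the LF constrains the realizer (and thus how many surface forms it admits)."""
    lf = {"pred": "erreichen", "arg1": properties["theme"],
          "arg2": "Höchststand", "value": properties["value"]}
    if specificity == "high":
        lf["info-structure"] = {"fronted": "arg1"}   # pin down constituent order
    return lf

def realize(lf):
    """Stand-in for the OpenCCG realizer using our grammar."""
    return ["Der Kurs erreicht seinen Höchststand .",
            "Seinen Höchststand erreicht der Kurs ."]

def postprocess(sentence):
    """Orthographic normalization, e.g. detokenization and number formats."""
    return sentence.replace(" .", ".")

points = [(1, 98.2), (2, 101.7), (3, 99.5)]
lf = fill_lf_template(analyse_graph(points))
print([postprocess(s) for s in realize(lf)])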

5. Summary<br />

We have presented a freely available CCG fragment of a<br />

generation grammar for German that is equipped with a<br />

semantic layer implemented in Hybrid Logic<br />


Dependency Semantics. In terms of syntax, we have<br />

focused on aspects of German as a partially free<br />

constituent order language and investigated an approach<br />

to scrambling by employing OpenCCG's type-changing<br />

rules in a somewhat unconventional manner. In doing<br />

so, we aimed at minimizing the amount of categories<br />

needed to allow different argument orders while<br />

retaining a certain degree of flexibility regarding<br />

argument order restrictions. Future work will<br />

concentrate more on the lexicon, for instance by refining<br />

and extending our expansions for inflectional paradigms<br />

of various word classes. We also hope to use<br />

OpenCCG's interesting regular expression facilities for<br />

derivational morphology.<br />

Our grammar can be downloaded from www.ling.uni-potsdam.de/~stede/AGacl/ressourcen/GerGenGram.

6. Acknowledgements<br />

We would like to thank the other participants of the<br />

course Automatische Textgenerierung (Winter 2010/11)<br />

at the University of Potsdam, and also the GSCL<br />

reviewers for their comments.<br />

7. References<br />

Baldridge, J. (2002): Lexically Specified Derivational<br />

Control in Combinatory Categorial Grammar. PhD<br />

thesis, School of Informatics, University of<br />

Edinburgh.<br />

Baldridge, J., Kruijff, G.-J.M. (2002): Coupling CCG<br />

and Hybrid Logic Dependency Semantics. In<br />

Proceedings of the 40th Annual Meeting of the<br />

Association for Computational Linguistics (ACL<br />

2002).<br />

Baldridge, J., Kruijff, G.-J.M. (2003): Multi-Modal<br />

Combinatory Categorial Grammar. In Proceedings of<br />

the 10th Conference of the European Chapter of the<br />

Association for Computational Linguistics (EACL<br />

2003).<br />

Blackburn, P. (1993): Modal Logic and Attribute Value<br />

Structures. In M. de Rijke, editors, Diamonds and<br />

Defaults, Synthese Language Library, pp. 19–65,<br />

Kluwer Academic Publishers, Dordrecht, 1993.<br />

Blackburn, P. (2000): Representation, Reasoning, and<br />

Relational Structures: a Hybrid Logic Manifesto.<br />

Logic Journal of the IGPL, 8(3), pp. 339-625.<br />


Bozsahin, C., Kruijff, G.-J.M., White, M. (2008):<br />

Specifying Grammars for OpenCCG: A Rough Guide.<br />

http://openccg.sourceforge.net/<br />

Clark, S., Curran, J. R. (2007): Wide-coverage Efficient

Statistical Parsing with CCG and Log-linear Models.<br />

Computational Linguistics, 33(4), pp. 493-552.<br />

Drach, E. (1937): Grundgedanken der deutschen<br />

Satzlehre. Diesterweg.<br />

Hockenmaier, J. (2006): Creating a CCGbank and a<br />

wide-coverage CCG lexicon for German. In<br />

Proceedings of the 21st International Conference on<br />

Computational Linguistics and 44th Annual Meeting<br />

of the ACL.<br />

Hockenmaier, J., Young, P. (2008): Non-local<br />

scrambling: the equivalence of TAG and CCG<br />

revisited. Proceedings of The Ninth International<br />

Workshop on Tree Adjoining Grammars and Related<br />

Formalisms, pp. 41–48, Tübingen, Germany.<br />

Hoffman, B. (1992): A CCG Approach to Free Word<br />

Order Languages. Proceedings of the 30th Annual<br />

Meeting of ACL, pp. 300-302.<br />

Müller, S. (2010): Grammatiktheorie. Stauffenburg<br />

Verlag.<br />

Steedman, M. (2000): The Syntactic Process. MIT Press.<br />

Steedman, M., Baldridge, J. (2011): Combinatory
Categorial Grammar. In Borsley and Börjars (eds),
Non-transformational Syntax: Formal and explicit

models of grammar, Wiley-Blackwell.<br />

Uszkoreit, H. (1987): Word Order and Constituent<br />

Structure in German. CSLI.<br />

Vijay-Shanker, K., Weir, D.J. (1990): Polynomial Time<br />

Parsing of Combinatory Categorial Grammars. Proceedings<br />

of the 28th Annual Meeting of the Association for Computational

Linguistics, pp. 1-8, Pittsburgh, PA, June 1990.<br />

White, M.: OpenCCG Realizer Manual. Documentation<br />

of the OpenCCG Realizer.<br />

White, M. (2004): Efficient Realization of Coordinate<br />

Structures in Combinatory Categorial Grammar.<br />

Research on Language & Computation, 4(1), pp. 39-<br />

75.<br />

White, M., Rajkumar R., Martin, S. (2007): Towards<br />

Broad Coverage Surface Realization with CCG. In<br />

Proceedings of the Workshop on Using Corpora for<br />

NLG: Language Generation and Machine Translation<br />

(UCNLG+MT).



Multilingualism in Ancient Texts: Language Detection by Example of Old High<br />

German and Old Saxon<br />

Zahurul Islam 1 , Roland Mittmann 2 , Alexander Mehler 1<br />

1 AG Texttechnology, Institut für Informatik, Goethe-Universität Frankfurt
2 Institut für Empirische Sprachwissenschaft, Goethe-Universität Frankfurt

E-mail: zahurul, mittmann, mehler@em.uni-frankfurt.de<br />

Abstract<br />

In this paper, we present an approach to language detection in streams of multilingual ancient texts. We introduce a supervised<br />

classifier that detects, amongst others, Old High German (OHG) and Old Saxon (OS). We evaluate our model by means of three<br />

experiments that show that language detection is possible even for dead languages. Finally, we present an experiment in unsupervised<br />

language detection as a tertium comparationis for our supervised classifier.<br />

Keywords: Language identification, Ancient text, n-gram, classification, clustering<br />

1. Introduction<br />

With the rise of the web, we face more and more on-line<br />

resources that mix different languages. This multilingualism of textual resources poses a challenge for many

tasks in Natural Language Processing (NLP). As a consequence,<br />

Language Identification (LI) is now an indispensable<br />

step of preprocessing for many NLP applications.<br />

This includes machine translation, automatic<br />

speech recognition, text-to-speech systems as well as text<br />

classification in multilingual scenarios.<br />

Obviously, LI is a well-established field of application of<br />

NLP. However, if one looks at documents that were<br />

written in low-density languages or documents that mix<br />

several dead languages, adequate models of language<br />

detection are rarely found. In any event, ancient languages are becoming more and more central to approaches in the computational humanities, historical semantics and

studies on language evolution. Thus, we are in need of<br />

models of language detection of dead languages.<br />

In this paper, we present such a model. We introduce a<br />

supervised classifier that detects, amongst others, OHG

and OS. To do so, we extend the model of Waltinger and<br />

Mehler (2009) so that it also accounts for dead languages.<br />

For any segment of the logical document structure of a<br />

text, our task is to detect the corresponding language in<br />

which it was written. This detection at the segment level<br />

rather than at the level of whole texts allows us to make<br />

explicit the multilingualism of ancient documents starting<br />

from the level of words via the level of sentences up<br />

to the level of texts. As a result, language-specific preprocessing<br />

tools can be used in such a way that they focus<br />

on those segments that provide relevant input for them. In<br />

this way, our approach is a first step towards building a<br />

preprocessor of multilingual ancient texts.<br />

The paper is organized as follows: Section 3 describes the<br />

corpus of texts that we have used for our experiments.<br />

Section 4 briefly introduces our approach to supervised<br />

language detection, which is evaluated in Section 5.<br />

Section 6 describes an unsupervised language classifier.

Finally, a conclusion is given in Section 7.<br />

2. Related Work<br />

As we present a model of n-gram-based language detection,<br />

we briefly discuss work in this area.<br />

Cavnar and Trenkle (1994) describe a system of n-gram<br />

based text and language categorization. Basically, they<br />

calculate n-gram profiles for each target category. Categorization<br />

occurs by means of measuring the distances of<br />

the profiles of input documents with those of the target<br />

categories. Regarding language classification, the accuracy<br />

of this system is 99.8%.<br />

The same technique has been applied by Mansur et al.<br />

(2006) for text categorization. In this approach, a corpus<br />

of newspaper articles has been used as input to categorization.<br />

Mansur et al. (2006) show that n-grams of<br />

length 2 and 3 are most efficiently used as features for<br />

text categorization.<br />


Kanaris and Stamatatos (2007) used character level<br />

n-grams to categorize web genres. Their approach is<br />

based on n-grams of characters of variable length that<br />

were combined with information about most frequently<br />

used HTML-tags.<br />

Note that the language detection toolkit of Google<br />

translator may also be considered as related work.

However, at present, this system does not recognize<br />

sentences in OHG. We have tested 10 example sentences.<br />

The toolkit categorized only one of these input sentences<br />

as modern German; other sentences were categorized as<br />

different languages (e.g., Italian, French, English and<br />

Danish).<br />

These approaches basically explore n-grams as features<br />

of language classification. However, they do that for<br />

modern languages. In this paper we present an approach<br />

that fills the gap of ancient language detection.<br />

3. The Corpus<br />

The corpus used consists of 160 complete texts in six<br />

diachronically and diatopically diverging stages of the<br />

German language plus the OS glosses, all collected from<br />

the TITUS 1<br />

online database. High German is the language<br />

variety spoken historically south of a bundle of<br />

isogloss lines stretching from Aachen through Düsseldorf,<br />

Siegen, Kassel and Halle to Frankfurt (Oder) and has<br />

developed into what today constitutes standard German.<br />

Low German was spoken historically north of this line<br />

but has undergone a decline in native speakers to the<br />

point that it is now considered a regional vernacular of<br />

and alongside standard German, despite the fact that Low<br />

German and High German were once distinct languages.<br />

Table 1 shows the historical and geographical varieties of<br />

older German.<br />

New discoveries of texts in the various historical forms<br />

and varieties of German are being made continually. Due<br />

to the steadily increasing number of transmitted texts<br />

from throughout the history of the German language, the<br />

focus of the TITUS corpus is on the older stages: it<br />

comprises the whole OHG corpus (apart from the glosses)<br />

as well as the entire OS corpus, including one mixed<br />

OHG and OS text. Of the younger language stages only<br />

unrepresentative amounts of texts are contained: several<br />

1 Thesaurus of Indo-European Text and Language Materials –<br />

see http://titus.uni-frankfurt.de<br />


dozen Middle High German (MHG) texts, some Middle<br />

Low German (MLG) texts, a sample of Early New High<br />

German (ENHG) texts and one mixed ENHG and Early<br />

New Low German (ENLG) text, all of them varying

considerably in length, from a few words to several tens<br />

of thousands per text.<br />

Language Stage Period of Time<br />

OHG ca. 750 – 1050 CE<br />

MHG ca. 1050 – 1350 CE<br />

ENHG ca. 1350 – 1650 CE<br />

OS ca. 800 – 1200 CE<br />

MLG ca. 1200 – 1600 CE<br />

ENLG ca. 1600 – 1750 CE<br />

Table 1: Historical and geographical varieties<br />

Among the oldest transmissions are interlinear translations<br />

of Latin texts, but also free translations and adaptations<br />

as well as mixed German-Latin texts. Translations<br />

consist mainly of religious literature, prayers, hymns, but<br />

also of ancient authors and scientific writings. These are<br />

later on complemented by epic and lyrical poetry (minnesongs),<br />

prose literature, sermons and other religious<br />

works, specialist books, chronicles, legislative texts and<br />

philosophical treatises. The latest texts of the corpus<br />

cover a biographical and a historical work, a collection of<br />

legal texts for a prince, an experimental re-narration of a<br />

parodistic novel as well as the German parts of two<br />

bilingual texts, a High German-Old Prussian enchiridion<br />

and a mixed High and Low German textbook for learning<br />

Russian.<br />

Language Stage #Texts #Tokens<br />

OHG 101 437,390<br />

MHG 31 1,776,900<br />

ENHG 6 237,432<br />

OS 17 62,706<br />

MLG 4 133,584<br />

ENLG 1 26,679<br />

Total 160 2,674,691<br />

Table 2: Composition of the corpus<br />

The corpus was generated by entering plain text, either<br />

completely by hand or by scanning, performing OCR<br />

recognition and correcting it manually. The texts were<br />

then indexed and provided with information on languages<br />

and subdivisions using the


Word-Cruncher 2<br />


software developed by Brigham Young<br />

University in Provo, Utah. They were then converted into<br />

HTML format and were simultaneously conveyed into<br />

several SQL database files, classified by the words’<br />

language family, to enable the set-up of an on-line search.<br />

4. Approach<br />

In this section, we describe our language detection approach.<br />

We start with describing how we prepared the<br />

corpus from the TITUS database to get input for our classifier

(Section 4.1), introduce our model (Section 4.2)<br />

and describe its system design (Section 4.3).<br />

4.1. Corpus Preparation<br />

The training and test corpora that we used in our experiments<br />

were extracted from the database dump of TITUS<br />

(see Section 3). Each word in this extraction has been<br />

annotated with its corresponding language name (example:<br />

German), sub-language name (example: Old High<br />

German), document number, division number and its<br />

position within the underlying HTML corpus files. TITUS only annotates the boundaries of divisions so that

any division may contain one or more sentences. For any<br />

sub-language (i.e., OHG, OS, MHG, MLG, ENLG and<br />

ENHG), we extracted text as reported in Table 2.<br />

4.2. Language Detection Toolkit<br />

Our approach for language detection is based on Cavnar<br />

and Trenkle (1994) and Waltinger and Mehler (2009). As<br />

in these studies, for every target category we learn an<br />

ordered list of most frequent n-grams that occur in descending<br />

order. The same is done for any input text so that<br />

categorization is done by measuring the distance between<br />

n-gram profiles of the target categories and the n-gram<br />

profiles of the test data.<br />

The idea behind this approach is that the more similar<br />

two texts are, the more they share features that are<br />

equally ordered.<br />

In general, classification is done by using a range of<br />

corpus features as are listed in Waltinger and Mehler<br />

(2009). Predefined information is extracted from the<br />

corpus to build sub-models based on those features. Each<br />

sub-model consists of a ranked frequency distribution of a subset of corpus features. The corresponding n-gram information is extracted for n = 1 to 5. Each n-gram gets its

2 http://wordcruncher.byu.edu<br />

own frequency counter. The normalized frequency distribution of relevant features is calculated according to

f̃_ij = f_ij / max{ f_kj : a_k ∈ L(D_j) } ∈ (0,1],

i.e. f̃_ij is the frequency of feature a_i in D_j, divided by the frequency of the most frequent feature a_k in the feature representation L(D_j) of document D_j (see Waltinger & Mehler, 2009). To categorize any document D_m, it is compared to each category C_n using the distance d of the rank r_mk of feature a_k in the sub-model of D_m with the corresponding rank r_nk of that feature in the representation of C_n:

d(D_m, C_n, a_k) = |r_mk − r_nk| if a_k occurs in both representations, and d(D_m, C_n, a_k) = max otherwise,

where max is the maximum value that the term |r_mk − r_nk| can assume, i.e. d equals max if feature a_k does not belong to the representation of D_m or to the one of category C_n.
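The following Python sketch is our own paraphrase of this procedure, not the toolkit's code: it builds character n-gram profiles for n = 1 to 5, ranks the features by frequency, and assigns a segment to the category with the smallest summed rank distance; the penalty for features missing on one side is left as a free parameter here.

# Small sketch (our own paraphrase, not the actual toolkit) of the ranked n-gram
# comparison described above: build n-gram profiles, rank them by frequency,
# and sum the rank distances d over all features.

from collections import Counter

def ngram_profile(text, n_max=5):
    """Frequencies of character n-grams for n = 1..n_max."""
    counts = Counter()
    for n in range(1, n_max + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return counts

def ranks(profile):
    """Map each feature to its rank in the descending frequency order."""
    ordered = sorted(profile, key=profile.get, reverse=True)
    return {feat: r for r, feat in enumerate(ordered, start=1)}

def distance(doc_profile, cat_profile, max_penalty):
    doc_r, cat_r = ranks(doc_profile), ranks(cat_profile)
    total = 0
    for feat in set(doc_r) | set(cat_r):
        if feat in doc_r and feat in cat_r:
            total += abs(doc_r[feat] - cat_r[feat])   # rank difference d
        else:
            total += max_penalty                       # feature missing on one side
    return total

def detect(segment, category_profiles, max_penalty=1000):
    prof = ngram_profile(segment)
    return min(category_profiles,
               key=lambda lang: distance(prof, category_profiles[lang], max_penalty))

# toy usage with two hypothetical training snippets
profiles = {"OHG": ngram_profile("uuanta thaz uuas giscriban"),
            "OS":  ngram_profile("that uuas an them aldon euua")}
print(detect("thaz uuort uuas giscriban", profiles))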

4.3. System Design<br />

The language detection toolkit (Waltinger & Mehler,<br />

2009) is used to build training models. It creates several<br />

n-gram models for each language which are used by the<br />

same tool for detection. Figure 1 shows the basic system<br />

diagram.<br />

To detect the language of a document, the toolkit traverses<br />

the document sentence by sentence and detects the<br />

language of each sentence. If the document is homogeneous (i.e., all sentences belong to the same language),

then sentence level detection suffices to trigger other<br />

tools for further processing (e.g., Parsing, Tagging and<br />

Morpho-syntactic analysis) of that document, where<br />

language detection is necessary for preprocessing.<br />

In the case that the sentences belong to more than one<br />

language (i.e., in the case of a heterogeneous document),<br />

the toolkit processes the document word by word and detects the language of each token separately. This step is necessary in the case of multilingual documents that contain words from different languages within single sentences.

For example: in a scenario of lemmatization or morphological<br />

analysis of a multilingual document, it is necessary<br />

to trigger language specific tools to avoid errors. Just<br />

one tool needs to be triggered for further processing of a<br />

homogeneous document, whereas for a heterogeneous<br />


document the same kind of tool has to be triggered based<br />

on the word level.<br />
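A minimal sketch of this two-level procedure (our own illustration; detect() stands for a segment-level classifier such as the one sketched above):

# Sketch of the two-level procedure: classify sentence by sentence first and
# fall back to word-level detection only for heterogeneous documents.

def detect_document(sentences, detect):
    sentence_langs = [detect(s) for s in sentences]
    if len(set(sentence_langs)) == 1:                # homogeneous document
        return {"language": sentence_langs[0]}
    # heterogeneous document: detect the language of each token separately
    word_langs = [[(w, detect(w)) for w in s.split()] for s in sentences]
    return {"languages": sorted(set(sentence_langs)), "words": word_langs}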


Figure 1: Basic system diagram<br />

Language Accuracy F-score<br />

OHG 100% 1<br />

OS 100% 1<br />

Table 3: Sentence level evaluation<br />

5. Evaluation<br />

In order to evaluate the language detection system, we<br />

extracted 200 sentences from the OHG corpus and 200<br />

sentences from the OS corpus. These evaluation sets had<br />

not been used for training. There are many evaluation<br />

metrics used to evaluate NLP tools; we decided to use

Accuracy and F-score (Hotho et al., 2005). Table 3 shows<br />

the evaluation result of the sentence level language<br />

detection, where we obtained 100% accuracy for both<br />

test sets. Table 4 shows the evaluation result of the word<br />

level language detection. 153 out of 1,259 words in the<br />

OHG test set were detected as OS and 33 out of 799<br />

words in the OS test set were classified as OHG. The<br />

accuracy of the two test sets was 79.95% and 91.36%, respectively.

The evaluation result shows that the OHG test set<br />

might contain words from other languages, which is<br />

basically true. Petrova et al. (2009) show that the OHG<br />

diachronic corpus contains many Latin words. The<br />

evaluation becomes more effective when the result is<br />

compared with a gold-standard reference set. We came up<br />

with a list of 1,548 words (818 types) where each token is<br />

manually annotated with the name of the language to<br />

which the word belongs. Of 1,548 words, 564 overlapped<br />

with training data. Each word in the gold-standard test set<br />

is detected by the toolkit and the result was compared<br />

with the reference set. We obtained 91.66% accuracy and<br />

an F-score of 95%.<br />

Language Accuracy F-score<br />

OHG 79.95% 0.88<br />

OS 91.36% 0.96

Table 4: Word level evaluation<br />

6. Unsupervised Language Classification<br />

In addition to the classifier presented above, we experimented<br />

with an unsupervised classifier. The reason was<br />

twofold: on the one hand, we wanted to detect the

added-value of an unsupervised classifier in comparison<br />

to its supervised counterpart. On the other hand, we<br />

aimed at extending the number of target languages to be<br />

detected. We collected several documents per target<br />

language, where each document was represented by a<br />

separate feature vector that counts the frequencies of a<br />

selected set of lexical features. As target classes we<br />

referred to six languages (whose cardinalities are displayed<br />

in Table 6): Early New High German (ENHG),<br />

Early New Low German (ENLG), Middle High German<br />

(MHG), Middle Low German (MLG), Old High German<br />

(OHG), and Old Saxon (OS). In order to implement an<br />

unsupervised language classifier, we followed the approach<br />

described in Mehler (2008). That is, we performed<br />

a hierarchical agglomerative clustering together<br />

with a subsequent partitioning that is informed about the<br />

number of target classes. However, other than in Mehler<br />

(2008), we did not perform a genetic search of the best<br />

performing subset of features as in the present case their<br />

number is too large. Table 5 shows the classification<br />

results. Performing a hierarchical-agglomerative clustering<br />

based on the cosine measure as the operative<br />

measure of object distance, we get an F-score of around<br />

78%. This is a promising result as it is accompanied by a<br />

remarkably high accuracy. However, as seen in Table 6,

the target classes perform quite differently: while we fail<br />

to separate ENHG and ENLG (certainly due to the small<br />

number of respective target documents), we separate<br />

MHG, MLG, OHG and OS to a reasonable degree. In this<br />

sense, the unsupervised classifier suggests that even higher F-scores can be expected, provided that better performing features are identified in conjunction with well-trained supervised classifiers. At least, the present study provides a



baseline that can be referred to in future experiments in<br />

this area.<br />
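A minimal sketch of this set-up using SciPy's hierarchical clustering (our own illustration with toy feature vectors; the actual experiments use the lexical feature counts described above):

# Minimal sketch of the unsupervised set-up: hierarchical agglomerative clustering
# (complete linkage, cosine distance) of document feature vectors, cut into as
# many clusters as there are target languages.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_documents(doc_vectors, n_languages):
    dist = pdist(doc_vectors, metric="cosine")        # pairwise object distances
    tree = linkage(dist, method="complete")           # 'weighted', 'average', 'single' were tried too
    return fcluster(tree, t=n_languages, criterion="maxclust")

# toy example: 6 documents, 4-dimensional feature counts, 2 "languages"
docs = np.array([[5, 1, 0, 0], [4, 2, 0, 1], [6, 1, 1, 0],
                 [0, 1, 5, 4], [1, 0, 4, 5], [0, 0, 6, 3]], dtype=float)
print(cluster_documents(docs, n_languages=2))         # e.g. [1 1 1 2 2 2]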

Approach Object Distance F-Score Accuracy<br />

hierarchical/complete cosine 0.78098 0.91134<br />

hierarchical/weighted cosine 0.69325 0.86934<br />

hierarchical/average cosine 0.61763 0.8307<br />

hierarchical/single cosine 0.56675 0.7926<br />

Table 5: F-scores and accuracies of classifying historical<br />

language data in a semi-supervised environment

Language #Texts F-score Recall Precision<br />

ENHG 6 0 0 0<br />

ENLG 1 0 0 0<br />

MHG 31 0.895 1 0.810<br />

MLG 4 0.8 0.8 0.8<br />

OHG 101 0.762 0.615 1<br />

OS 17 0.889 0.889 0.889<br />

Table 6: F-scores, recalls, and precisions differentiated by the target classes<br />

7. Conclusion<br />

Language detection plays an important role in processing<br />

multilingual documents. This is true especially for ancient<br />

documents that, due to their genealogy, mix different<br />

ancient languages. Here, documents need to be<br />

annotated in such a way that preprocessors can activate<br />

language specific routines on a segment by segment basis.<br />

In this paper, we presented an extended version of the<br />

language detection toolkit that allows us to decide when to

activate language specific analyses. Notwithstanding the<br />

low density of training material that is available for these<br />

languages, our classification results are very promising.<br />

At this point one may object that corpora of ancient texts<br />

are essentially so small that language detection can be<br />

done by hand. Actually, this objection is wrong if one<br />

considers corpora like the Patrologia Latina (Jordan,<br />

1995), which mixes classical Latin with medieval Latin<br />

as well as with French and other Romance languages that<br />

are used in commentaries. From the size of this corpus<br />

alone (more than 120 million tokens), it is evident that a<br />

reliable means of automatizing segment-based language<br />

detection needs to be a viable option. We also described<br />

an unsupervised language detector that is evaluated<br />

simultaneously by means of OHG, OS, MHG, MLG,<br />

ENLG and ENHG. Although this unsupervised classifier<br />

does not outperform its supervised counterpart, it shows<br />

that language detection in text streams of ancient languages<br />

comes within reach.

8. Acknowledgements<br />

We would like to thank Ulli Waltinger, Armin Hoenen,<br />

Andy Lücking and Timothy Price for fruitful suggestions<br />

and comments. We also acknowledge funding by the<br />

LOEWE Digital-Humanities project in the<br />

Goethe-Universität Frankfurt.

9. References<br />

Cavnar, W. B., Trenkle, J. M. (1994): N-gram-based text
categorization. In Proceedings of SDAIR-94, 3rd

Annual Symposium on Document Analysis and Information<br />

Retrieval, pp. 161–175.<br />

Hotho, A. Nürnberger, A., Paaß, G. (2005): A Brief<br />

Survey of Text Mining. Journal for Language Technology<br />

and Computational Linguistics (JLCL), 20(1),<br />

pp. 19–62.<br />

Jordan, M. D., editor (1995): Patrologia Latina database.<br />

Chadwyck-Healey, Cambridge.<br />

Kanaris, I., Stamatatos, E. (2007): Webpage genre identification

using variable-length character n-grams. In<br />


Proc. of the 19th IEEE Int. Conf. on Tools with Ar-<br />

tificial Intelligence (ICTAI’07), Washington, DC,<br />

USA. IEEE Computer Society.<br />

Mansur, M., UzZaman, N., Khan, M. (2006): Analysis of<br />

n-gram based text categorization for Bangla in a<br />

newspaper corpus. In Proceedings of the 9th International<br />

Conference on Computer and Information<br />

Technology (ICCIT 2006).<br />

Mehler, A. (2008): Structural similarities of complex<br />

networks: A computational model by example of wiki<br />

graphs. Applied Artificial Intelligence, 22(7&8),<br />

pp. 619–683.<br />

Petrova, S., Solf, M., Ritz, J., Chiarcos, C., Zeldes, A.
(2009): Building and using a richly annotated interlinear
diachronic corpus: The case of Old High German
Tatian. Journal of Traitement automatique des langues

(TAL), 50(2), pp. 47–71.<br />

Waltinger, U., Mehler, A. (2009): The feature difference<br />

coefficient: Classification by means of feature distributions.<br />

In Proceedings of the Conference on Text<br />

Mining Services (TMS 2009), Leipziger Beiträge zur<br />

Informatik: Band XIV, pp. 159–168. Leipzig University,<br />

Leipzig.



Multilinguale Phrasenextraktion mit Hilfe einer lexikonunabhängigen Analysekomponente<br />

am Beispiel von Patentschriften und nutzergenerierten Inhalten<br />

Daniela Becks, Julia Maria Schulz, Christa Womser-Hacker, Thomas Mandl<br />

Universität Hildesheim, Institut für Informationswissenschaft und Sprachtechnologie

Marienburger Platz 22, 31141 Hildesheim<br />

E-mail: {daniela.becks, julia-maria.schulz, womser, mandl}@uni-hildesheim.de<br />

Abstract<br />

Die Extraktion von sinntragenden Phrasen aus Korpora setzt in der Regel eine verhältnismäßig tiefe linguistische Analyse der Texte<br />

voraus. Darüber hinaus ist häufig eine Adaptation der verwendeten Wissensbasen sowie der zugrunde liegenden Modelle notwendig,<br />

was sich meist als zeit- und arbeitsintensiv erweist. Der vorliegende Artikel beschreibt einen neuen sprach- und domänenübergreifenden Ansatz, der Aspekte von Shallow und Deep Parsing kombiniert. Ein Vorteil des vorgestellten Verfahrens besteht darin, dass es

sich mit wenig Aufwand und ohne komplexe Lexika realisieren und auf andere Sprachen und Domänen übertragen lässt. Als Beispiel<br />

fungieren englische und deutsche Dokumente aus zwei sehr unterschiedlichen Korpora: Kundenrezensionen (nutzergenerierte Inhalte)<br />

und Patentschriften.<br />

Keywords: Shallow Parsing, Multilinguale Phrasenextraktion<br />

1. Einleitung<br />

Vor dem Hintergrund einer globalisierten Welt liegen<br />

Informationen häufig in Dokumenten vor, die nicht in der<br />

Muttersprache der Benutzer verfasst sind. Um ihnen<br />

dennoch die Möglichkeit zu bieten, diese aufzufinden,<br />

bedarf es spezieller Methoden. Damit beschäftigt sich das<br />

Crosslinguale Information Retrieval (CLIR) 1 . Die in<br />

diesem Kontext entstehenden Herausforderungen werden<br />

unter anderem bei Evaluierungsinitiativen wie CLEF 2<br />

und NTCIR 3<br />

untersucht.<br />

Eine weitere Entwicklung, die sich seit einiger Zeit im<br />

Bereich des Information Retrieval abzeichnet, liegt in der<br />

zunehmenden Ablösung des klassischen Bag-of-Words-<br />

Ansatzes, der bislang sowohl innerhalb des Indexierungsprozesses<br />

als auch im Rahmen der Anfrageformulierung<br />

Anwendung findet. In der Literatur wird derzeit<br />

1<br />

Im crosslingualen Information Retrieval stimmen die<br />

Sprachen der Anfrage- und der Ergebnisdokumente nicht immer<br />

überein. Mit Hilfe einer deutschsprachigen Anfrage können<br />

beispielsweise auch englischsprachige Dokumente gewonnen<br />

werden.<br />

2 Cross Language Evaluation Forum: http://clef2011.org,

http://www.clef-campaign.org<br />

3 National Institute of Informatics Test Collection for IR Systems:<br />

http://research.nii.ac.jp/ntcir/index-en.html<br />

vermehrt auf die Vorteile von Phrasen gegenüber einfachen<br />

Termen hingewiesen (vgl. z.B. Tseng et al.,<br />

2007:1222). Diese zeigen sich auch anhand eines einfachen<br />

Recherchebeispiels. Eine Suchanfrage zum Thema<br />

Züge der DB liefert auch Dokumente zum Thema Datenbanken,<br />

da DB eine ambige Abkürzung ist, deren

Bedeutung erst im Kontext eindeutig wird. Begreift man<br />

die einzelnen Terme jedoch als zusammengehörige<br />

Phrase, so wird diese Mehrdeutigkeit aufgelöst und es<br />

werden lediglich diejenigen Dokumente ausgewiesen, in<br />

denen die Kombination der Terme auftritt.<br />

Das Extrahieren geeigneter Phrasen stellt jedoch vor<br />

einem multilingualen Hintergrund eine anspruchsvolle<br />

Aufgabe dar, da jedes Korpus unterschiedliche Besonderheiten<br />

aufweist, die es zu berücksichtigen gilt. Darüber<br />

hinaus spielt die Morphologie der einzelnen Sprachen<br />

eine entscheidende Rolle (z.B. abgetrennte Partikel zusammengesetzter<br />

Verben im Deutschen).<br />

Innerhalb dieses Artikels wird ein Ansatz vorgestellt, der<br />

Shallow und Deep Parsing kombiniert und mit nur geringen<br />

Anpassungen sprach- und domänenübergreifend<br />

für die Extraktion von sinntragenden Phrasen verwendet

werden kann. Als Anwendungsbeispiele fungieren Patentschriften<br />

und Kundenrezensionen, die in den Spra-<br />


chen Deutsch und Englisch vorliegen. In Zukunft ist<br />

geplant, Dokumente der Sprachen Spanisch und Französisch<br />

zu untersuchen.<br />

Im Folgenden werden zunächst die beiden Anwendungsbereiche<br />

sowie die zugrunde liegenden Korpora<br />

vorgestellt (2.1). Des Weiteren werden einige Verfahren<br />

der Phrasenextraktion skizziert (3), an die sich die Beschreibung<br />

des sprach- und domänenübergreifenden<br />

Ansatzes anschließt (4). Dieser Artikel schließt mit einer<br />

Beschreibung des verwendeten Evaluierungsansatzes<br />

sowie ersten Ergebnissen ab.<br />


2. Kontext der Forschungen<br />

2.1. Anwendungsbereiche<br />

Als Anwendungsbereiche <strong>für</strong> die entwickelte Phrasenextraktionskomponente<br />

werden in diesem Artikel zwei<br />

Projekte vorgestellt. Das erste Projekt findet in Kooperation<br />

mit dem FIZ Karlsruhe statt und fokussiert die<br />

Patentdomäne. Es zielt darauf ab, den Mehrwert von<br />

Phrasen <strong>für</strong> die Patentrecherche zu evaluieren (vgl. Becks,<br />

2010:423). Das zugrunde liegende Korpus beinhaltet<br />

etwa 105.000 Dokumente der CLEF-IP 4<br />

Testkollektion<br />

2009, welche sich aus ca. 1,6 Millionen Patent- und<br />

Anmeldeschriften des Europäischen Patentamtes zusammensetzt.<br />

Die Kollektion umfasst sowohl Dokumente<br />

in Englisch als auch Patente in Deutsch und<br />

Französisch (vgl. Roda et al., 2010:388).<br />

Die Kundenrezensionen, die als Beispiel <strong>für</strong> nutzergenerierte<br />

Inhalte herangezogen werden, stammen aus einem<br />

Projekt, das sich mit crosslingualem Opinion Mining<br />

befasst, und sich dabei ebenfalls mit der Extraktion von<br />

Phrasen beschäftigt. Das Ziel dieses Projektes besteht<br />

darin, Phrasen zu extrahieren, die Meinungen bezüglich<br />

der rezensierten Produkte und deren Eigenschaften<br />

enthalten. Als Grundlage dient in diesem Fall ein Korpus<br />

aus Kundenrezensionen (vgl. Hu, Liu, 2004, Ding et al.,<br />

2008, Schulz et al., 2010).<br />

Insbesondere im Hinblick auf die Länge der Dokumente<br />

unterscheiden sich beide Korpora signifikant, denn im<br />

4<br />

Cross Language Evaluation Forum, Intellectual Property<br />

Track<br />

Falle von Patentschriften handelt es sich um sehr lange<br />

und komplexe Dokumente (vgl. u.a. Iwayama et al., 2003).<br />

Eine wesentliche Herausforderung besteht somit darin,<br />

dass die sprachübergreifende Phrasenextraktion <strong>für</strong> sehr<br />

unterschiedliche Textsorten und Phrasen unterschiedlicher<br />

Komplexität gleichermaßen effektiv funktionieren<br />

soll.<br />

Eine Phrase wird als eine Kombination von Termen ver-<br />

standen, die zueinander in einer Head-Modifier-Relation<br />

stehen. Diese Beziehung kann in verschiedenen Aus-<br />

prägungen (z.B. Adjektiv-Nomen-Relation, Nomen-<br />

Präpositionalphrasen-Relation) auftreten. Die Phrasen<br />

unterscheiden sich jedoch von Chunks, die nach (Abney,<br />

1991) typischerweise aus einem einzelnen Content Word<br />

bestehen, das von einer Konstellation von Funktionswörtern<br />

und Pre-Modifiern umgeben ist, und einem festen<br />

Template folgen (vgl. Abney, 1991:257). Betrachtet<br />

man das folgende Beispiel, so zeigt sich deutlich, dass<br />

eine Phrase über die Grenzen eines Chunks hinausgehen<br />

kann. Aufgrund der fokussierten Anwendungsbereiche<br />

Information Retrieval und Opinion Mining unterscheidet<br />

sich der hier verwendete Phrasenbegriff von der klassischen<br />

linguistischen Definition. Er umfasst auch Mehrwortgruppen<br />

(z.B. information retrieval system) und<br />

Kombinationen aus Subjekt und Prädikat, die im Deutschen<br />

auch diskontinuierlich sein können. Eine Liste der<br />

erfassten Phrasentypen findet sich in Abschnitt 5.<br />

Beispiel:<br />

[a system] [for information retrieval]<br />

vs.<br />

Chunks<br />

a [system for information retrieval]<br />

Phrase<br />

2.2. Problemstellung und Anforderungen an die<br />

Phrasenextraktion<br />

Die Entwicklung einer geeigneten Extraktionskomponente<br />

wird innerhalb des Projektkontextes durch zwei<br />

wesentliche Zielsetzungen bestimmt:<br />

• Die Phrasenextraktion soll mit geringem Anpassungsaufwand für verschiedene europäische Sprachen realisierbar sein (ressourcenarmer Extraktionsansatz).



• Obgleich linguistische Ressourcen noch nicht flächendeckend verfügbar sind, soll die Phrasenextraktion für mehrere Sprachen möglich sein.

Die Phrasenextraktion muss dem „Unknown Words<br />

Problem“ entgegenwirken. Infolgedessen soll das System<br />

in der Lage sein, Wörter zu bearbeiten, die bislang weder<br />

in den vom System benutzten Korpora noch in Wörterbüchern

erfasst sind (vgl. Uchimoto et al., 2001:91). Dies<br />

spielt insbesondere innerhalb der Patentdomäne eine<br />

bedeutende Rolle.<br />

3. Verwandte Ansätze<br />

Zu den traditionellen Methoden der Phrasenextraktion<br />

zählen unter anderem regelbasierte Verfahren wie das<br />

Begrenzerverfahren von Jaene und Seelbach (vgl. Jaene<br />

& Seelbach, 1975). Die Autoren haben es sich zur Aufgabe<br />

gemacht, <strong>für</strong> die Inhaltserschließung Phrasen in<br />

Form von Mehrwortgruppen, die sie als mehrere eine<br />

syntaktisch-semantische Einheit bildende Wörter definieren<br />

(vgl. Jaene & Seelbach, 1975:9), aus englischen<br />

Fachtexten zu ermitteln. Zu diesem Zweck definieren<br />

Jaene und Seelbach sogenannte Begrenzerpaare, die die<br />

zu extrahierenden Nominalphrasen einschließen (vgl.<br />

Jaene & Seelbach, 1975:7). Ein ähnliches Verfahren <strong>für</strong><br />

die Extraktion von Nominalphrasen maximaler Länge,<br />

die mit dem Ziel der Identifikation von Fachtermini aus<br />

französischen Dokumenten dreier Domänen extrahiert<br />

werden, beschreiben (Bourigault & Jacquemin, 1999). In<br />

diesem Zusammenhang werden die Nominalphrasen in<br />

einem zweiten Schritt in ihre Bestandteile (Head und<br />

Modifier) zerlegt. Für den Extraktionsprozess werden<br />

sowohl Begrenzerpaare als auch die grammatische<br />

Struktur der Phrasen herangezogen. Vergleichbare Ansätze<br />

beschreiben (Tseng et al., 2007) <strong>für</strong> die Patentdomäne.<br />

Phrasen oder Schlüsselwörter werden hier auf<br />

Basis einer Stoppwortliste extrahiert. Als besonders geeignet<br />

erweisen sich dabei die längsten sich wiederholenden<br />

Phrasen (vgl. Tseng et al., 2007:1223). Auch (Guo<br />

et al., 2009) verwenden im Bereich Opinion Mining <strong>für</strong><br />

die Extraktion von Produkteigenschaften aus Satzsegmenten<br />

im semistrukturierten Bereich von Kundenrezensionen<br />

Stoppwörter, ergänzt durch meinungstragende<br />

Wörter (z.B. Adjektive). Anhand dieses kurzen Überblicks<br />

zeigt sich bereits, dass sich die Phrasenextraktion<br />

bislang überwiegend auf die Identifikation einfacher<br />

Nominalstrukturen konzentriert. In diesem Zusammenhang<br />

kommen neben den regelbasierten Ansätzen auch<br />

wörterbuchabhängige Verfahren wie beispielsweise das<br />

Dependenzparsing zum Einsatz. Im Information Retrieval<br />

kommen Dependenzrelationen häufig in Form von<br />

Head/Modifier-Paaren zum Einsatz, welche sich aus<br />

einem Head und einem Modifier zusammensetzen, wobei<br />

letzterer den Head präzisiert (vgl. Koster, 2004:423).<br />

Head/Modifier-Paare bieten den Vorteil, dass sie neben<br />

syntaktischer auch semantische Information beinhalten<br />

(vgl. u.a. Ruge, 1989:9). Infolgedessen kommen sie vor<br />

allem innerhalb des Indexierungsprozesses zum Einsatz<br />

(vgl. Koster, 2004; Ruge, 1995) und erweisen sich in<br />

Form von Tripeln (Term-Relation-Term) im Zusammenhang<br />

mit Klassifikationsaufgaben als vorteilhaft<br />

(vgl. Koster, Beney, 2009).<br />

4. Domänen- und sprachübergreifende<br />

Phrasenextraktion<br />

Dieser Artikel beschreibt eine neue Methode <strong>für</strong> die<br />

Extraktion von Phrasen, der die beiden zuvor genannten<br />

Kategorien vereinigt. Das Ziel des dargestellten Extraktionsverfahrens<br />

besteht im Wesentlichen darin, ein<br />

Werkzeug <strong>für</strong> die Identifikation von Phrasen zur Verfügung<br />

zu stellen, das sich mit geringem Aufwand <strong>für</strong> unterschiedliche<br />

Domänen und Sprachen adaptieren lässt<br />

(z.B. Anpassung einzelner Begrenzerpaare oder der zulässigen<br />

Präpositionen bei Nomen-Genitiv-Phrasen (NG)<br />

bzw. Nomen-Präpositionalphrasen (NP)). Dabei wird auf<br />

den Einsatz von domänenspezifischen Wissensbasen<br />

verzichtet, um die Domänenunabhängigkeit zu gewährleisten.<br />

Die Semantik der extrahierten Phrasen darf dabei<br />

nicht außer Acht gelassen werden. Infolgedessen handelt<br />

es sich um ein Mischverfahren, das die Funktionalität<br />

eines Shallow Parsers aufweist, aber eine flache semantische<br />

Klassifikation aufgrund linguistischer Regeln<br />

gewährleistet (vgl. Becks & Schulz, <strong>2011</strong>).<br />

Innerhalb der Phrasenextraktionskomponente findet ein<br />

regelbasiertes Verfahren Anwendung, das das Begrenzerverfahren<br />

(vgl. Jaene & Seelbach, 1975, Bourigault &<br />

Jacquemin, 1999) und die Grundzüge des Dependenzparsings<br />

(vgl. z.B. Ruge, 1995) aufgreift. Die Extraktion<br />

der Phrasen erfolgt in diesem Fall mit Hilfe verschiede-<br />


ner Regeln, in denen jeweils Paare von Begrenzern definiert sind. Die Begrenzer sind, anders als in bisherigen

Ansätzen, nicht Wörter, sondern morphosyntaktische<br />

Wortklassen (Pos-Tags). An dieser Stelle zeigt sich be-<br />

reits, dass das entwickelte System lediglich auf die Im-<br />

plementierung entsprechender Regeln sowie einen<br />

Part-of-Speech-Tagger angewiesen ist. Es handelt sich<br />

somit um einen ressourcenarmen Ansatz.<br />

Die implementierten Regeln variieren je nach Phrasentyp.<br />

Im Falle einer Adjektiv-Nomen-Relation (AN-R) wird<br />

die Phrase häufig von der Klasse Artikel und einem<br />

Interpunktionszeichen oder einer Präposition eingeschlossen<br />

(siehe Abb. 1). Darüber hinaus muss diese<br />

mindestens ein Adjektiv und ein Nomen enthalten, damit<br />

es sich um eine gültige AN-R handelt. Da die Kategorie<br />

Artikel sowohl die deutschen Artikel der, die, das als<br />

auch das englische Pendant the umfasst, kann diese Regel<br />

auch auf andere Sprachen angewendet werden. Diese<br />

abstrahierte Version des Begrenzerverfahrens ist demnach<br />

generalisierbar. Eine Einbindung komplexer Wortlisten<br />

erübrigt sich.<br />

Wie bereits erwähnt, wurde zudem auf Grundzüge des<br />

Dependenzparsings zurückgegriffen. Daher verfügt jede<br />

der extrahierten Phrasen sowohl über einen Head als auch<br />

einen Modifier, deren Ermittlung ebenfalls regelbasiert<br />

erfolgt. Im Falle der in Abbildung 1 dargestellten Beispiele<br />

befindet sich der Head am Ende der Phrase („stud“;<br />

„front panel button layout“). Der Modifier ist diesem<br />

vorangestellt.<br />

Anhand der Beispiele wird deutlich, dass es sich im<br />

Falle der extrahierten Phrasen nicht unbedingt um<br />

Head/Modifier-Paare handeln muss, sondern auch län-<br />

gere Phrasen mit mehreren Head/Modifier-Relationen<br />

durch dieses Verfahren abgebildet werden können.<br />
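Der folgende Python-Ausschnitt ist eine stark vereinfachte, hypothetische Skizze dieses Vorgehens (nicht der tatsächliche Code der Komponente): Begrenzerpaare werden über Pos-Tags definiert, eine Phrase gilt nur dann als Adjektiv-Nomen-Relation, wenn sie mindestens ein Adjektiv und ein Nomen enthält, und der Head wird vereinfachend als letztes Wort der Phrase bestimmt. Auf dem Beispielsatz aus Abbildung 1 liefert die Skizze die beiden dort gezeigten Phrasen.

# Stark vereinfachte, hypothetische Skizze des Pos-Tag-basierten
# Begrenzerverfahrens für Adjektiv-Nomen-Phrasen (nicht der tatsächliche
# Code der Komponente). Eingabe ist die Ausgabe eines Pos-Taggers.

LEFT_DELIMS = {"DT", "ART"}                  # Artikel (dt./engl.)
RIGHT_DELIMS = {"IN", "APPR", ",", "."}      # Präposition oder Interpunktion

def extract_an_phrases(tagged_sentence):
    """Liefert (Phrase, Head, Modifier) für jede gültige AN-Relation."""
    phrases = []
    for i, (_, tag) in enumerate(tagged_sentence):
        if tag not in LEFT_DELIMS:
            continue
        for j in range(i + 1, len(tagged_sentence)):
            if tagged_sentence[j][1] in RIGHT_DELIMS:
                span = tagged_sentence[i + 1:j]
                tags = [t for _, t in span]
                # gültig nur mit mindestens einem Adjektiv und einem Nomen
                if any(t.startswith("JJ") or t.startswith("ADJ") for t in tags) \
                        and any(t.startswith("NN") for t in tags):
                    words = [w for w, _ in span]
                    phrases.append({"phrase": " ".join(words),
                                    "head": words[-1],        # vereinfachte Head-Bestimmung
                                    "modifier": " ".join(words[:-1])})
                break
    return phrases

sentence = [("a", "DT"), ("shank-like", "JJ"), ("stud", "NN"), ("with", "IN"),
            ("a", "DT"), ("very", "RB"), ("good", "JJ"), ("front", "NN"),
            ("panel", "NN"), ("button", "NN"), ("layout", "NN"), (",", ",")]
print(extract_an_phrases(sentence))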


5. Evaluierung<br />

In der Regel erfolgt die Beurteilung der Qualität des<br />

gewonnenen Outputs anhand eines definierten Goldstandards.

Dieser Ansatz wurde beispielsweise von Verbene

und Kollegen gewählt (vgl. Verbene et al., 2010).<br />

Als Evaluierungsbasis dient eine manuell annotierte<br />

Stichprobe bestehend aus 100 Sätzen. Die Berechnung<br />

der Precision erfolgt auf Basis eines Vergleichs der extrahierten<br />

Phrasen mit der annotierten Stichprobe<br />

(vgl. Becks & Schulz 2011: 391).

Für die Erstellung des hier verwendeten Goldstandards<br />

werden zunächst aus den beiden in Abschnitt 2.1 beschriebenen<br />

Korpora <strong>für</strong> die Sprachen Deutsch und Englisch<br />

zufällig Sätze mit einem jeweiligen Gesamtumfang<br />

von ca. 2000 Tokens ausgewählt. Basis <strong>für</strong> die Berechnung<br />

der Anzahl der Tokens ist der vom Pos-Tagger<br />

generierte Output. Infolgedessen gelten auch Interpunktionszeichen<br />

jeweils als ein Token. Diese werden manuell<br />

jeweils unabhängig von zwei Annotatoren (der erste und<br />

der zweite Autor des Papers) hinsichtlich der folgenden<br />

Phrasentypen annotiert:<br />

• Subjekt-Prädikat (z. B. he thinks)
• Prädikat-Objekt (z. B. extract phrases)
• Verb-Adverb (z. B. extract easily)
• Mehrwortgruppen (z. B. information retrieval system)
• Adjektiv-Nomen (z. B. linguistic phrases)
• Nomen-Präpositionalphrase (z. B. system for retrieval)
• Nomen-Genitiv (z. B. rules of extraction)
• Nomen-Relativsatz bzw. Nomen-Partizip (z. B. phrases extracted by the system)

Abbildung 1: Beispiel einer extrahierten Adjektiv-Nomen-Phrase; links: Patentschrift (EP-1120530-B1): „a shank-like stud with“ (linker Begrenzer: a (DT), rechter Begrenzer: with (IN)); rechts: Kundenrezension (Hu & Liu 2004): „a very good front panel button layout ,“ (linker Begrenzer: a (DT), rechter Begrenzer: ,)

Insgesamt sind <strong>für</strong> die Auswahl der englischen Sätze aus<br />

den Kundenrezensionen 688 und <strong>für</strong> die deutschen Sätze<br />

639 Phrasen annotiert. Für die Patentdomäne liegen im<br />

Englischen 619 und im Deutschen 499 Phrasen vor. Von



den insgesamt 2445 Phrasen im Goldstandard sind ca.<br />

51% unkontrovers, d.h. bei diesen Phrasen stimmen<br />

sowohl die von beiden Annotatoren identifizierten<br />

Phrasengrenzen als auch die annotierten Relationen<br />

überein. Weitere 27% der Phrasen weisen eine identische<br />

syntaktische Relation auf, unterscheiden sich jedoch im<br />

Hinblick auf die annotierten Phrasengrenzen. Diese<br />

Fehlerkategorie umfasst beispielsweise koordinierte<br />

Phrasen. Im Falle der nicht bzw. nur teilweise übereins-<br />

timmenden Phrasengrenzen wurde mittels Diskussion<br />

oder durch Hinzuziehen einer dritten unabhängigen<br />

Meinung eine Einigung herbeigeführt. Die zuvor genannten<br />

Prozentangaben weisen bereits darauf hin, dass<br />

sich die von den Annotatoren identifizierten und klassifizierten<br />

Phrasen sehr häufig decken. Die exakte Übereinstimmungsrate<br />

lässt sich anhand des berechneten<br />

Kappa ablesen. Folgende Formel wurde in diesem Zusammenhang<br />

zugrunde gelegt:

κ = (p_o − p_e) / (1 − p_e)    (aus Cohen, 1960:40)

Dabei bezeichnet p_o die beobachtete und p_e die zufällig erwartete Übereinstimmung.

Gemäß dieser Gleichung ergibt sich ein Kappa von 0.61.<br />

Vor dem Hintergrund, dass es sich bei den betrachteten<br />

Domänen um sehr divergierende Anwendungsfelder<br />

handelt und, dass sehr verschiedenartige, zum Teil diskontinuierliche<br />

Phrasen zu annotieren waren, kann dieser<br />

Wert als gut erachtet werden.<br />

Für die Evaluierung werden die zusammengestellten<br />

Stichproben mit Hilfe der Phrasenextraktionskompo-<br />

nente automatisch annotiert. Der resultierende Output<br />

wird anschließend gegen den Goldstandard evaluiert,<br />

welcher neben der Phrase die syntaktische Relation und<br />

die Angabe der relativen Häufigkeit innerhalb der Stichprobe<br />

beinhaltet. Die Evaluierung geht demzufolge über<br />

einen Vergleich der Zeichenketten hinaus und erfolgt<br />

zusätzlich unter Einbeziehung der folgenden Kriterien:<br />

• Syntaktische Relation
• Häufigkeit

Der Evaluierung liegen somit drei Faktoren zugrunde,<br />

welche als gleichgewichtet betrachtet werden. Es werden<br />

sowohl Exact als auch Partial Matches mit einer Abweichung<br />

von einem Term berücksichtigt. Phrasen, die<br />

im Hinblick auf die Phrasengrenze, die identifizierte<br />

Relation und die Häufigkeit mit der innerhalb des Gold-<br />

standards hinterlegten Phrase übereinstimmen, gelten im<br />

Rahmen der Evaluierung als Exact Matches.<br />
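Der folgende Ausschnitt skizziert diese Match-Klassifikation (hypothetische Vereinfachung unserer Darstellung; insbesondere die Auslegung der Abweichung um einen Term ist hier eine Annahme):

# Hypothetische Skizze der Match-Klassifikation: Exact Match, wenn Phrasengrenzen,
# Relation und Häufigkeit mit dem Goldstandard übereinstimmen; weicht nur die
# Phrasengrenze um höchstens einen Term ab, wird ein Partial Match gezählt.

def classify_match(extracted, gold):
    same_relation = extracted["relation"] == gold["relation"]
    same_frequency = extracted["freq"] == gold["freq"]
    ext_terms, gold_terms = extracted["phrase"].split(), gold["phrase"].split()
    term_diff = len(set(ext_terms) ^ set(gold_terms))   # symmetrische Differenz
    if same_relation and same_frequency and ext_terms == gold_terms:
        return "exact"
    if same_relation and same_frequency and term_diff <= 1:
        return "partial"
    return "no_match"

gold = {"phrase": "front panel button layout", "relation": "AN", "freq": 1}
print(classify_match({"phrase": "panel button layout", "relation": "AN", "freq": 1}, gold))  # partial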

Für das Englische wird domänenübergreifend eine Precision<br />

von ca. 52% erzielt. Dabei werden <strong>für</strong> einige<br />

Phrasentypen deutlich bessere Werte erreicht (AN:

86,5%; NG: 76,7%; NN: 76%, VA: 71,8%). Die<br />

schlechtere Precision im Falle der übrigen Phrasentypen<br />

ist einerseits auf fehlerhafte Pos-Tags (dies gilt besonders<br />

<strong>für</strong> die Patentdomäne) und andererseits auf die Diskontinuität<br />

der Phrasen zurückzuführen, welche die Formalisierung<br />

deutlich erschwert.<br />

I.d.R. zeigt sich, dass sowohl die Precision- als auch die<br />

Recall-Werte im Bereich der nutzergenerierten Inhalte<br />

durchschnittlich 13 bzw. 18 Prozentpunkte über denen im<br />

Patentbereich liegen. Dies unterstreicht die Schwierigkeit<br />

in diesem Kontext und legt die Vermutung nahe, dass es<br />

innerhalb der Patentdomäne gewisser Anpassungen be-<br />

darf. Um dies zu überprüfen, wurden <strong>für</strong> die Patentdo-<br />

mäne einige leichte Modifizierungen, z. B. Erweiterung<br />

der maximalen Phrasenlänge sowie die Berücksichtigung<br />

von Gerundien im Englischen, vorgenommen. Bereits<br />

eine geringe Anpassung der Verbalphrasen erhöht die<br />

Precision insgesamt auf 60,5% (+8,5%).<br />

Im Deutschen zeigt sich für die bislang realisierten Nominalphrasen ein ähnliches Bild. Hier wird domänenübergreifend eine Precision von 63% erreicht. Auch hier schneiden einzelne Phrasentypen deutlich besser ab (z. B. AN: 89,4%).

Insgesamt fällt auf, dass der Recall für beide Sprachen (ca. 38%) nicht sehr hoch ist. Dies lässt sich ebenfalls auf den Anwendungskontext zurückführen, denn für die Phrasenextraktion kommt der Precision in diesem Fall eine deutlich größere Bedeutung zu.
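Zur Einordnung der genannten Werte: Precision und Recall ergeben sich aus den Treffermengen wie in der folgenden, rein illustrativen Python-Skizze; die Beispielphrasen sind frei erfunden.

```python
def precision_recall(gefunden, gold):
    """Berechnet Precision und Recall über Mengen extrahierter bzw. erwarteter Phrasen."""
    treffer = len(gefunden & gold)
    precision = treffer / len(gefunden) if gefunden else 0.0
    recall = treffer / len(gold) if gold else 0.0
    return precision, recall

gefunden = {"optical character recognition", "patent claim", "maximum entropy"}
gold = {"optical character recognition", "patent claim", "noun phrase", "gerund"}
print(precision_recall(gefunden, gold))  # (0.666..., 0.5)
```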

6. Schlussbetrachtung

Dieser Artikel bestätigt, dass sich mit einem ressourcenarmen, sprach- und domänenübergreifenden Ansatz z. T. gute Precision-Werte, die für die Phrasenextraktion im Retrieval-Kontext von vorrangiger Bedeutung sind, erzielen lassen. Allerdings weisen die Ergebnisse darauf hin, dass gewisse Modifikationen (z. B. innerhalb der Patentdomäne) zu einer Steigerung der Ergebnisse führen können.

Zukünftig soll der Ansatz auf weiteren Sprachen (Französisch, Spanisch) getestet und der Einfluss des POS-Tagging-Modells untersucht werden, um die Genauigkeit des Algorithmus weiter zu verbessern.

7. References

Abney, S. P. (1991): Parsing by Chunks. In: Berwick, R. C.; Abney, S. P.; Tenny, C. (Hrsg.): Principle-based parsing. Computation and psycholinguistics. Dordrecht: Kluwer (Studies in linguistics and philosophy, 44), S. 257-278.
Becks, D. (2010): Begriffliche Optimierung von Patentanfragen. In: Information - Wissenschaft & Praxis, Jg. 61, H. 6-7, S. 423.
Becks, D.; Schulz, J. M. (2011): Domänenübergreifende Phrasenextraktion mithilfe einer lexikonunabhängigen Analysekomponente. In: Griesbaum, J.; Mandl, Th.; Womser-Hacker, Ch. (Hrsg.): Information und Wissen: global, sozial und frei? Boizenburg: Werner Hülsbusch (Schriften zur Informationswissenschaft, Bd. 58), S. 388–392.
Bourigault, D.; Jacquemin, Ch. (1999): Term extraction + term clustering: an integrated platform for computer-aided terminology. In: Proceedings of the EACL'99. Stroudsburg, PA, USA: Association for Computational Linguistics, S. 15-22.
Cohen, J. (1960): A Coefficient of Agreement for Nominal Scales. In: Educational and Psychological Measurement 20 (1), S. 37–46.
Ding, X.; Liu, B.; Yu, P. S. (2008): A holistic lexicon-based approach to opinion mining. In: Proceedings of the WSDM'08. Palo Alto, California, USA: ACM, S. 231–240.
Guo, H.; Zhu, H.; Guo, Z.; Zhang, X. X.; Su, Z. (2009): Product feature categorization with multilevel latent semantic association. In: Proceedings of the CIKM'09. Hong Kong, China: ACM, S. 1087–1096.
Hu, M.; Liu, B. (2004): Mining Opinion Features in Customer Reviews. In: McGuinness, D. L.; Ferguson, G. (Hrsg.): AAAI: AAAI Press/The MIT Press, S. 755–760.
Iwayama, M.; Fujii, A.; Kando, N.; Marukawa, Y. (2003): An Empirical Study on Retrieval Models for Different Document Genres: Patents and Newspaper Articles. In: Proceedings of the SIGIR'03. New York, NY, USA: ACM, S. 251-258.
Jaene, H.; Seelbach, D. (1975): Maschinelle Extraktion von zusammengesetzten Ausdrücken aus englischen Fachtexten. Berlin, Köln, Frankfurt (Main): Beuth.
Koster, C. H. A. (2004): Head/Modifier Frames for Information Retrieval. In: Proceedings of the CICLing'04. Seoul, Korea: Springer (LNCS 2945), S. 420–432.
Koster, C. H. A.; Beney, J. G. (2009): Phrase-based document categorization revisited. In: Proceedings of PaIR'09. New York, NY, USA: ACM, S. 49-56.
Roda, G.; Tait, J.; Piroi, F.; Zenz, V. (2010): CLEF-IP 2009: Retrieval Experiments in the Intellectual Property Domain. In: Peters, C.; Di Nunzio, G.; Kurimo, M.; Mandl, Th.; Mostefa, D.; Peñas, A.; Roda, G. (Hrsg.): Multilingual Information Access Evaluation I. Text Retrieval Experiments. Proceedings of CLEF '09. Berlin, Heidelberg: Springer (Lecture Notes in Computer Science, Bd. 6241), S. 385–409.
Ruge, G. (1989): Generierung semantischer Felder auf der Basis von Frei-Texten. In: LDV Forum 6, H. 2, S. 3–17.
Ruge, G. (1995): Wortbedeutung und Termassoziation. Methoden zur automatischen semantischen Klassifikation. Hildesheim, Zürich, New York: Olms.
Schulz, J. M.; Womser-Hacker, Ch.; Mandl, Th. (2010): Multilingual Corpus Development for Opinion Mining. In: Calzolari, N.; Choukri, K.; Maegaard, B.; Mariani, J.; Odijk, J.; Piperidis, S.; Rosner, M.; Tapias, D. (Hrsg.): Proceedings of the LREC'10. Valletta, Malta: European Language Resources Association (ELRA), S. 3409–3412.
Tseng, Y.-H.; Lin, C.-J.; Lin, Y.-I. (2007): Text mining techniques for patent analysis. In: Information Processing and Management, Jg. 43, H. 5, S. 1216-1247.
Uchimoto, K.; Sekine, S.; Isahara, H. (2001): The Unknown Word Problem: a Morphological Analysis of Japanese Using Maximum Entropy Aided by a Dictionary. In: Lee, L.; Harman, D. (Hrsg.): Proceedings of the EMNLP '01: ACL, S. 91–99.
Verberne, S.; D'hondt, E.; Oostdijk, N. (2010): Quantifying the Challenges in Parsing Patent Claims. In: Proceedings of AsPIRe'10. Milton Keynes, S. 14–21.



Die Digitale Rätoromanische Chrestomathie – Werkzeuge und Verfahren für die Korpuserstellung durch kollaborative Volltexterschließung

Claes Neuefeind, Jürgen Rolshoven, Fabian Steeg
Institut für Linguistik, Sprachliche Informationsverarbeitung, Universität zu Köln
Albertus-Magnus-Platz, 50923 Köln
E-Mail: c.neuefeind@uni-koeln.de, rols@spinfo.uni-koeln.de, fabian.steeg@uni-koeln.de

Abstract

Das Paper beschreibt die Entwicklung und den Einsatz von Werkzeugen und Verfahren für die kollaborative Korrektur bei der Erstellung eines rätoromanischen Textkorpus mittels digitaler Tiefenerschließung. Textgrundlage bildet die „Rätoromanische Chrestomathie“ von Caspar Decurtins, die 1891-1919 in der Zeitschrift „Romanische Forschungen“ erschienen ist. Bei dem hier vorgestellten Ansatz werden manuelle und automatische Korrektur unter Einbeziehung von Angehörigen und Interessierten der rätoromanischen Sprachgemeinschaft über eine kollaborative Arbeitsumgebung kombiniert. In dem von uns entwickelten netzbasierten Editor werden die automatisch gelesenen Texte den Digitalisaten gegenübergestellt. Korrekturen, Kommentare und Verweise können nach Wiki-Prinzipien vorgeschlagen und eingebracht werden. Erstmalig wird so die Sprachgemeinschaft einer Kleinsprache aktiv in den Prozess der Dokumentation und Bewahrung ihres eigenen sprachlichen und kulturellen Erbes eingebunden. In diesem Paper wird die konkrete Umsetzung der kollaborativen Arbeitsumgebung beschrieben, von der architektonischen Grundlage und aktuellen technologischen Umsetzung bis hin zu Weiterentwicklungen und Potentialen. Die Entwicklung erfolgt von Beginn an quelloffen unter http://github.com/spinfo/drc.

Keywords: Volltexterschließung, Korpuserstellung, kollaborative Korrektur

1. Einleitung

Für die Digitalisierung von Texten gibt es seitens der nationalen und internationalen Förderinstitutionen eine Vielzahl von Initiativen, Programmen und Projekten. Über die reine Massendigitalisierung hinaus zielen die Maßnahmen auch auf die digitale Tiefenerschließung von Texten. Diese ermöglicht zum einen den Zugriff über Volltextsuche, zum anderen kann sie zur Erstellung von spezialisierten Korpora genutzt werden, etwa auf Grundlage historischer Textsammlungen.

Ein wesentliches Problem der automatischen Volltexterschließung sind Lesefehler bei der optischen Zeichenerkennung (Optical Character Recognition, OCR). Besonders bei älteren Texten machen die unterschiedlichen Verschriftungstraditionen und variierenden Typographien eine fehlerfreie OCR faktisch unmöglich. Im Zuge der hier beschriebenen Digitalisierung der „Rätoromanischen Chrestomathie“ setzen wir deshalb bei der Korrektur der OCR-Fehler auf die Einbindung von Angehörigen und Interessierten der rätoromanischen Sprachgemeinschaft über eine netzbasierte Arbeitsumgebung, in der die OCR-gelesenen Texte den zugrunde liegenden Digitalisaten gegenübergestellt sind.

2. Ähnliche Arbeiten

Die Idee einer kollaborativen Korrektur von OCR-Ergebnissen findet zunehmend auch im Kontext größerer strategischer Digitalisierungsprogramme Beachtung, so z. B. im IMPACT-Projekt¹ der Europäischen Kommission. Die Einschätzung, dass die Einbindung freiwilliger Korrektoren eine realistische Option ist, wird dabei u. a. durch die positiven Erfahrungen des Australian Newspapers Digitisation Program (ANDP)² der National Library of Australia gestützt, das im Zuge der Volltexterschließung der zwischen 1803 und 1954 in Australien erschienenen Zeitungen bereits seit 2008 erfolgreich eine Community-orientierte Fehlerkorrektur umsetzt (Holley, 2009).

Einen vergleichbaren Ansatz plant auch das Deutsche Textarchiv (DTA)³. In der dort bislang nur intern eingesetzten Korrekturumgebung können Fehler allerdings nicht direkt vom Nutzer bearbeitet, sondern lediglich anhand einer differenzierten Fehlertypologie markiert und an die Mitarbeiter des DTA gemeldet werden, die diese anschließend offline korrigieren. Das Konzept der Verknüpfung von Digitalisat und Text in einem Editor wird zudem in dem im Rahmen des Textgrid-Projekts⁴ entwickelten Text-Bild-Link-Editor aufgenommen, der zwar eine kontrollierte Metadaten-Annotation von Bildelementen durch entsprechend qualifizierte Nutzer ermöglicht, jedoch aufgrund der fehlenden Benutzerverwaltung und Versionierung sowie der ausschließlich manuellen Text-Bild-Verknüpfung nicht für eine netzbasierte, kollaborative Korrektur von OCR-Ergebnissen durch interessierte Laien ausgelegt ist. Da die weiteren Ansätze zu Beginn unserer Arbeiten an der Digitalen Rätoromanischen Chrestomathie zum Teil noch nicht vorlagen (IMPACT, DTA) oder aber starke Differenzen im Ausgangsmaterial und damit im Digitalisierungs-Workflow aufweisen (großformatige Zeitungsseiten im ANDP), haben wir uns für eine Eigenentwicklung entschieden, um dadurch auch auf die speziellen Anforderungen einer mehrsprachigen Textbasis und das Fehlen von Korrekturlexika eingehen zu können. Während im DTA wie auch im Textgrid-Projekt der Schwerpunkt auf exakten Metadaten liegt, zielt der hier vorgestellte Ansatz auf die originalgetreue Wiedergabe des Textes anhand der Vorlage, weshalb auf elaborierte Korrekturguidelines verzichtet wurde.

¹ Improving Access To Text; http://www.impact-project.eu.
² Siehe http://www.nla.gov.au/ndp/.
³ Siehe http://www.deutschestextarchiv.de/.
⁴ Siehe http://www.textgrid.de/.

3. Die Digitale Rätoromanische Chrestomathie

Die „Rätoromanische Chrestomathie“ (RC) von Caspar Decurtins, die 1891-1919 in der Zeitschrift „Romanische Forschungen“ (Erlangen: Junge) erschienen ist, gilt als wichtigste Textsammlung des Rätoromanischen (Egloff & Mathieu, 1986:7). Damit ist sie eine hervorragende Basis für die Erstellung eines rätoromanischen Textkorpus. Mit ihren etwa 8000 Seiten Text aus vier Jahrhunderten, ihrer thematischen Vielfalt, unterschiedlichen Textsorten und Genres sowie der Abdeckung der fünf Hauptidiome des Bündnerromanischen ist sie für nahezu alle sprach- und kulturwissenschaftlichen Disziplinen von außerordentlichem Interesse. Sie stimuliert lexikographisches und lexikologisches, morphologisches und syntaktisches, semantisches und textlinguistisches, literaturwissenschaftliches, volkskundliches und historisches Arbeiten. Sie ermöglicht datenbasierte Untersuchungen zu Strukturen und Textsorten und ist aufgrund ihres Varietätenreichtums von hohem Wert für diachrone (über vier Jahrhunderte reichende) und diatopische (fünf Hauptidiome umfassende) Untersuchungen, etwa zu Sprachkontakt, Sprachverwandtschaft und Sprachwandel.

3.1. Digitalisierung und OCR

Ausgangspunkt der Korpuserstellung sind die Digitalisate der RC aus der Zeitschrift „Romanische Forschungen“, die von der Staats- und Universitätsbibliothek Göttingen im Rahmen des Digizeitschriften-Projekts⁵ digitalisiert und zusammen mit den in einem METS-basierten Format⁶ erstellten Metadaten zur Verfügung gestellt wurden. Um die Digitalisate für die textuelle Verarbeitung zugänglich zu machen, werden sie mittels OCR in eine maschinenlesbare Form überführt. Die hohe typographische und orthographische Vielfalt der RC stellt dabei eine besondere Herausforderung für die OCR dar, um so mehr, als der Zeichenerkennung keine angemessenen Korrekturlexika für die verschiedenen Idiome zur Verfügung stehen. Gerade die älteren Texte der Chrestomathie sind orthographisch nicht normiert, weil die Idiome des Bündnerromanischen unterschiedlichen Verschriftungsformen und -traditionen folgen. Auf Grundlage des OCR-Ergebnisses werden PDF-Dateien generiert, bei denen der erkannte Text unter dem Digitalisat positioniert wird. Das generierte PDF enthält damit nicht nur den gesamten Text, sondern auch die Positionskoordinaten der einzelnen Wörter. Die Extraktion der Wörter mitsamt ihrer Positionskoordinaten erfolgte mit der Software-Bibliothek PDFBox⁷. Die ausgelesenen Informationen (Wort und Position) werden in XML-Form abgelegt und stellen die Grundlage für das Highlighting-Feature in der Korrekturumgebung dar.

⁵ Siehe http://www.digizeitschriften.de.
⁶ Siehe http://gdz.sub.unigoettingen.de/entwicklung/standardisierung/.
⁷ Siehe http://pdfbox.apache.org/.
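Zur Illustration, wie solche Wort-Positions-Paare in XML abgelegt werden könnten, eine kleine Python-Skizze; Element- und Attributnamen sind frei gewählt und entsprechen nicht notwendigerweise dem tatsächlich im Projekt verwendeten Format.

```python
import xml.etree.ElementTree as ET

def woerter_als_xml(seite_id, woerter):
    """Serialisiert (Wort, Bounding-Box)-Paare einer Seite als XML.

    woerter: Liste von Tupeln (wort, x, y, breite, hoehe) –
    angenommene Struktur, nur zur Veranschaulichung.
    """
    seite = ET.Element("seite", id=str(seite_id))
    for wort, x, y, breite, hoehe in woerter:
        w = ET.SubElement(seite, "wort", x=str(x), y=str(y),
                          breite=str(breite), hoehe=str(hoehe))
        w.text = wort
    return ET.tostring(seite, encoding="unicode")

print(woerter_als_xml(1, [("Ils", 102, 340, 28, 12),
                          ("paupers", 134, 340, 61, 12)]))
```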

3.2. Der DRC-Editor

Kern des hier beschriebenen Ansatzes ist die Erstellung einer kollaborativen Korrekturumgebung, in der die Digitalisate und die mittels OCR gewonnenen Texte zusammengeführt werden. Mit dem Editor können die elektronisch eingelesenen Texte der RC durchsucht, gelesen und bearbeitet werden.

Abbildung 1: Screenshot des Editors (Beta-Version)

Die Auswahl der zu bearbeitenden Seiten erfolgt über Volltextsuche sowie über die aus dem Digizeitschriften-Projekt übernommenen Metadaten. Ziel der Bearbeitung ist die Erstellung einer fehlerfreien digitalen Textfassung, weshalb zu Vergleichszwecken stets die Originalfassung als digitales Faksimile mit angezeigt wird. Die Bilddarstellung ist dabei mit dem Text gekoppelt: Während man den Text wortweise bearbeitet, wird das jeweils korrespondierende Wort unter Nutzung der bei der OCR gewonnenen Positionskoordinaten auf dem Bild hervorgehoben (siehe Abbildung 1, Bereich Verifitgar). Die Synchronisation von Text und Bildkoordinaten bleibt auch bei Korrekturen bestehen, da die vorgenommenen Änderungen ebenso wie die Positionskoordinaten der ursprünglichen Wortform zugeordnet werden. Über die Tastatur nicht verfügbare Sonderzeichen können über ein Auswahlfenster hinzugefügt werden.

Als zusätzliches Hilfsmittel besteht die Option zur Anzeige von Korrekturvorschlägen (siehe Abbildung 1, Propostas da Correcturas), die auf Grundlage von Wortlisten über die Levenshtein-Distanz, einen Algorithmus für den Stringvergleich, ermittelt werden.
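Eine minimale Python-Skizze, wie sich solche Vorschläge aus einer Wortliste über die Levenshtein-Distanz gewinnen ließen; die tatsächliche Implementierung im DRC-Editor (in Scala) kann davon abweichen, Namen und Schwellenwert sind frei gewählt.

```python
def levenshtein(a, b):
    """Klassische Levenshtein-Distanz per dynamischer Programmierung."""
    zeile = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        neu = [i]
        for j, cb in enumerate(b, 1):
            neu.append(min(zeile[j] + 1,                 # Löschung
                           neu[j - 1] + 1,               # Einfügung
                           zeile[j - 1] + (ca != cb)))   # Ersetzung
        zeile = neu
    return zeile[-1]

def vorschlaege(ocr_wort, wortliste, max_distanz=2):
    """Liefert nach Distanz sortierte Korrekturvorschläge aus einer Wortliste."""
    kandidaten = [(levenshtein(ocr_wort, w), w) for w in wortliste]
    return [w for d, w in sorted(kandidaten) if d <= max_distanz]

print(vorschlaege("chrestomathia", ["chrestomathie", "cristallin", "romontsch"]))
```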

Da solche Wortlisten bzw. Prüfklassen derzeit nur für eines der Idiome, das Surselvische, verfügbar sind, ist zusätzlich ein automatisierter Auf- und Ausbau von Benutzerlexika geplant, indem die manuellen Korrekturen unter Nutzung der Versionierungsmechanismen der Korrekturumgebung aufgezeichnet werden. Hieraus resultiert eine stetig wachsende Liste von als korrekt bestätigten Wörtern, die einerseits als Grundlage für die Berechnung von Korrekturvorschlägen dient und andererseits dazu eingesetzt werden kann, dem Nutzer nach einer vorgenommenen Korrektur Verbesserungsvorschläge für andere, gleiche oder ähnliche Stellen des Textes anzubieten. Sämtliche Bearbeitungen werden unter Angabe von Nutzer und Bearbeitungszeitpunkt protokolliert. Damit verbunden ist ein einfaches Bewertungs- und Wettbewerbssystem, das über die Korrekturen Buch führt.

Die Erfahrungen im laufenden Projekt haben gezeigt, dass über die reine Korrektur hinaus auch die Möglichkeit zu einer Verschlagwortung und Kommentierung nutzerseitig gewünscht ist, da dies neben erweiterten Recherchemöglichkeiten auch die Möglichkeit zur Markierung unklarer oder (im Sinne der kollaborativen Bearbeitung) strittiger Fälle bietet. In der aktuellen Beta-Version können die Daten deshalb auf Seitenebene mittels frei wählbarer Tags oder durch Hinzufügung von Freitext-Kommentaren annotiert werden. Über eine fehlerfreie Dokumentation hinaus erfolgt auf diese Weise auch eine Anreicherung der Texte. Hierbei wird die Textbasis in gewissem Sinne 'aktualisiert', indem das Wissen der Sprecher in Form von Metadaten (Schlagworte, Verweise, Nutzungskontexte) in die Texte zurückfließt.


3.3. Das DRC-Portal

Für den Datenzugriff wurde neben dem Editor eine mehrsprachige Portalseite erstellt, die als zentraler Anlaufpunkt für interessierte Nutzer dient (vgl. Abbildung 2). Über das Portal kann der DRC-Editor heruntergeladen und ein Account für dessen Benutzung angelegt werden.

Abbildung 2: Portalseite der DRC (siehe http://www.crestomazia.ch)

Neben Hilfestellungen und Hinweisen zum Editor bietet die Portalseite erweiterte Recherchemöglichkeiten und enthält Hintergrundinformationen zum Projekt sowie zu ausgewählten Aspekten der bearbeiteten Daten.

3.4. Einbindung der Sprachgemeinschaft

Von zentraler Bedeutung für das hier beschriebene Vorgehen war die Frage, wie die Einbindung von Sprechern in einen kollaborativen Erschließungsprozess erfolgen kann. Um die Beteiligung einer ausreichenden Zahl von Sprechern sicherzustellen, setzten wir auf die Zusammenarbeit mit Partnern vor Ort, die neben der Presse- und Öffentlichkeitsarbeit auch eine Nutzerakquise übernehmen. Das Projekt wurde mit Hilfe der Schweizer Partner über die lokalen und überregionalen Medien propagiert. In Kombination mit einer gezielten Nutzerakquise konnte dadurch bereits für die aktuelle Beta-Version des DRC-Editors eine größere Anzahl an Nutzern gewonnen werden. So waren im August 2011 ca. 100 Nutzer angemeldet; seit dem Schaltungstermin der DRC Anfang Juni 2011 wurde etwa ein Drittel der Texte bearbeitet.

4. Systemarchitektur

Der Natur des Vorhabens wird eine dreischichtige Systemarchitektur gerecht: Gesamtziel ist die kollaborative Produktion annotierter, textueller Daten. Diese Daten sind für alle Benutzer des Systems identisch und können daher zentral gespeichert werden (Datenschicht). Verschiedene Nutzer sollen unabhängig voneinander auf diese Daten zugreifen und diese verändern können, wobei die Integrität der Daten gewährleistet werden muss (Logikschicht). Der Zugriff erfolgt über eine graphische Benutzerschnittstelle (Präsentationsschicht).

Abbildung 3: Grundlegende Systemarchitektur

Die Präsentationsschicht kommuniziert mit der Logikschicht und diese mit der Datenschicht. Da es keine direkte Verbindung zwischen Präsentations- und Datenschicht gibt, ist das System lose gekoppelt und erlaubt Austausch und Wiederverwendung der Schichten, etwa für eine Nutzung der Daten in anderen Kontexten.

4.1. Technologien

Aufgrund des modernen Programmierkonzepts, der hohen Modularität und Wiederverwertbarkeit durch OSGi⁸, der nativen GUI-Technologie sowie der Integration von Webstandards (z. B. CSS zur Gestaltung der GUI) haben wir uns für Eclipse4⁹ als Technologie für die Präsentationsschicht entschieden. Für eine kompakte und zugleich effiziente und kompatible Logikschicht setzen wir auf die JVM-Sprache Scala¹⁰. Für die Datenschicht wird mit eXist¹¹ eine XML-Datenbank eingesetzt. Da eXist über eine eingebaute Serverfunktionalität verfügt, war es zweckmäßig, die Logikschicht als Teil des Clients umzusetzen und so keine eigenen serverseitigen Komponenten implementieren zu müssen. Über den Datenbankserver können die Daten unabhängig von der beschriebenen Infrastruktur über standardisierte REST-Schnittstellen¹² zur Verfügung gestellt werden.

⁸ Open Service Gateway Initiative, siehe http://www.osgi.org/.
⁹ Siehe http://eclipse.org/eclipse4/.
¹⁰ Siehe http://www.scala-lang.org/.
¹¹ Siehe http://exist.sourceforge.net/.
¹² Representational State Transfer, vgl. dazu (Fielding, 2000).
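Wie ein solcher Zugriff über die REST-Schnittstelle des Datenbankservers aussehen könnte, deutet die folgende Python-Skizze an; Host, Kollektions- und Dokumentnamen sind frei erfundene Beispiele und nicht die tatsächlichen Pfade des Projekts.

```python
import urllib.request

# Angenommener eXist-Server mit einer Kollektion für die DRC-Seiten;
# Basis-URL und Dokumentname sind hypothetisch.
BASIS_URL = "http://localhost:8080/exist/rest/db/drc"

def lade_seite(dokument):
    """Holt ein XML-Dokument über die REST-Schnittstelle des Datenbankservers."""
    with urllib.request.urlopen(f"{BASIS_URL}/{dokument}") as antwort:
        return antwort.read().decode("utf-8")

# Beispielaufruf (setzt einen laufenden eXist-Server voraus):
# print(lade_seite("seite_0001.xml"))
```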

4.2. Implementierungen

Die Beta-Version des Editors ist als Eclipse-basierte Desktop-Applikation realisiert, die als Client des Datenbankservers fungiert. Der Editor wurde mit automatischen Aktualisierungen versehen, um neue Funktionalitäten und Fehlerbehebungen in der Software ohne Aufwand seitens der Nutzer bereitzustellen. Damit ergibt sich die folgende technologische Umsetzung der oben skizzierten Architektur:

Abbildung 4: Implementierung der Architektur in der aktuellen Beta-Version

Derzeit arbeiten wir an alternativen Umsetzungen der GUI. Die aktuelle Beta-Version ermöglicht sowohl eine Weiterentwicklung zu einer Offline-fähigen Desktop-Applikation, die ohne Netzzugang verwendet werden kann und bei Bedarf die Daten mit dem Server synchronisiert, als auch die automatische Generierung einer Web-Oberfläche mithilfe des RAP-Frameworks¹³ (vgl. Abbildung 5).

Abbildung 5: Alternative Umsetzungen der Architektur

Die Software-Entwicklung erfolgte von Beginn an quelloffen; der vollständige Programmcode steht ebenso wie die jeweils aktuelle Version des Editors unter https://github.com/spinfo/drc frei zur Verfügung.

¹³ Rich Ajax Platform, siehe http://eclipse.org/rap/.

5. Erweiterungen

Mit der Digitalen Rätoromanischen Chrestomathie wird erstmals der digitale Zugriff auf eine größere rätoromanische Textsammlung geschaffen. Über die reine Dokumentation und Archivierung hinaus kann eine frei zugängliche RC eine Vielzahl neuer Impulse für die wissenschaftliche, mediale, edukative, aber auch private Nutzung geben. Die Möglichkeiten reichen von historischen und genealogischen Recherchen nach Personen und Ortsnamen über die kreative Auseinandersetzung durch Hinzufügung eigener Texte oder Übersetzungen bis hin zur lexikographischen Arbeit mit der RC. Für eine Nutzung jenseits einfacher Suchanfragen ist zudem eine Annotation der Texte mit linguistischen Merkmalen geplant¹⁴. Insbesondere für eine adäquate (computer-)linguistische Nachnutzung bedarf es einer linguistischen Aufbereitung der erschlossenen Texte, da die reine Volltexterschließung nur als ein erster Schritt auf dem Weg zur Bereitstellung von computer- bzw. korpuslinguistisch ausgiebig nutzbaren Ressourcen betrachtet werden kann.

Analog zum hier beschriebenen Vorgehen soll auch die linguistische Annotation durch die Kombination automatischer und manueller Verfahren erfolgen. Um der weitgehend fehlenden orthographischen Normierung der RC zu begegnen, sollen in einem Folgeprojekt zunächst die für die fünf Hauptidiome verfügbaren lexikalischen Ressourcen digital aufbereitet werden. Auf dieser Grundlage automatisch vorgenommene Annotationen können anschließend über das entsprechend erweiterte Editor-Werkzeug durch Muttersprachler und Interessierte kollaborativ überprüft und ggf. korrigiert bzw. ergänzt werden. Das aus lexikalischer und manueller Annotation gewonnene Wissen soll mittels spezialisierter Lernverfahren zur erneuten automatischen Annotation der Texte genutzt werden.

¹⁴ Vgl. dazu bspw. das Vorgehen im Projekt „Text+Berg digital“ (Volk et al., 2010), siehe auch http://www.textberg.ch.


6. Potentiale

Durch den hier vorgestellten Ansatz werden Probleme der OCR gerade bei älteren und typographisch varianten Schriftsystemen abgefangen. Die im Vorhaben eingesetzten Erschließungs- und Auszeichnungstechniken können in der Folge auf weitere Textsammlungen des Bündnerromanischen und anderer kleiner Sprachen angewandt werden. Über das konkrete materielle Ziel der Erstellung eines rätoromanischen Textkorpus hinaus werden damit übertragbare und somit nachhaltige, kompetenzorientierte Verfahren entwickelt, die für die Tiefendigitalisierung des schriftlichen kulturellen Erbes kleinerer Sprachgemeinschaften prototypisch sind. Von besonderem Interesse ist hier auch die Möglichkeit für Mitglieder solcher Sprachgemeinschaften, über Wiki-Technologien den Erhalt des eigenen sprachlichen und kulturellen Erbes aktiv zu unterstützen.

7. Danksagung

Die Digitale Rätoromanische Chrestomathie ist ein gemeinsames Projekt der Sprachlichen Informationsverarbeitung und der Universitäts- und Stadtbibliothek Köln. Für die Organisation und Durchführung in der Schweiz konnten wir mit Dr. Florentin Lutz einen ausgewiesenen Linguisten und sehr gut vernetzten Muttersprachler gewinnen. Das DRC-Projekt wird von der Deutschen Forschungsgemeinschaft gefördert. In der Schweiz erhielt das Projekt zusätzliche finanzielle Unterstützung durch das Legat Anton Cadonau, das Institut für Kulturforschung Graubünden und das Kulturamt des Kantons Graubünden. Auch seitens der rätoromanischen Verbände und Organisationen erfuhr das Projekt regen Zuspruch und weitere Unterstützung, insbesondere durch die Lia Rumantscha¹⁵, den Dachverband der Bündnerromanen, sowie die Societad Retorumantscha¹⁶, den Trägerverein des „Dicziunari Rumantsch Grischun“, eines der vier nationalen Wörterbücher der Schweiz. All diesen Einrichtungen schulden wir unseren herzlichsten Dank.

¹⁵ Siehe http://www.liarumantscha.ch.
¹⁶ Siehe http://www.drg.ch/main.php?l=r&a=srr.

8. Referenzen

Decurtins, C. (1984-1986): Rätoromanische Chrestomathie. Band 1-14. Chur: Octopus-Verlag / Società Retorumantscha.
Egloff, P., Mathieu, J. (1986): Rätoromanische Chrestomathie - Register. In: Rätoromanische Chrestomathie, Band 15. Chur: Octopus-Verlag / Società Retorumantscha.
Fielding, R. (2000): Architectural Styles and the Design of Network-based Software Architectures. Doktorarbeit, University of California, Irvine.
Holley, R. (2009): Many Hands Make Light Work: Public Collaborative OCR Text Correction in Australian Historic Newspapers. National Library of Australia.
Volk, M., Bubenhofer, N., Althaus, A., Bangerter, M., Furrer, L., Ruef, B. (2010): Challenges in Building a Multilingual Alpine Heritage Corpus. In: Seventh International Conference on Language Resources and Evaluation (LREC), Malta 2010.



Ein umgekehrtes Lehnwörterbuch als Internetportal und elektronische Ressource: Lexikographische und technische Grundlagen

Peter Meyer, Stefan Engelberg
Institut für Deutsche Sprache, Mannheim
E-Mail: meyer@ids-mannheim.de, engelberg@ids-mannheim.de

Abstract

Der vorliegende Beitrag stellt einen neuartigen Typ von mehrsprachiger elektronischer Ressource vor, bei dem verschiedene Lehnwörterbücher zu einem ‚umgekehrten Lehnwörterbuch‘ für eine bestimmte Gebersprache zusammengefasst werden. Ein solches Wörterbuch erlaubt es, die zu einem Etymon der Gebersprache gehörigen Lehnwörter in verschiedenen Nehmersprachen zu finden. Die Entwicklung einer solchen Webanwendung, insbesondere der zugrundeliegenden Datenbasis, ist mit zahlreichen konzeptionellen Problemen verbunden, die an der Schnittstelle zwischen lexikographischen und informatischen Themen liegen. Der Beitrag stellt diese Probleme vor dem Hintergrund wünschenswerter Funktionalitäten eines entsprechenden Internetportals dar und diskutiert einen möglichen Lösungsansatz: Die Artikel der Einzelwörterbücher werden als XML-Dokumente vorgehalten und dienen als Grundlage für die gewöhnliche Online-Ansicht dieser Wörterbücher; insbesondere für portalweite Abfragen werden aber grundlegende, standardisierte Informationen zu Lemmata und Etyma aller Portalwörterbücher samt deren Varianten und Wortbildungsprodukten (hier zusammenfassend als ‚Portalinstanzen‘ bezeichnet) sowie die verschiedenartigen Relationen zwischen diesen Portalinstanzen zusätzlich in relationalen Datenbanktabellen abgelegt, die performante und beliebig komplex strukturierte Suchabfragen gestatten.

Keywords: Lehnwörter, elektronische Lexikografie, mehrsprachige Ressource, Internetportal

1. Ein Lehnwörterbuchportal als ‚umgedrehtes Lehnwörterbuch‘

Ziel des vorgestellten Projekts ist ein Internet-Wörterbuchportal für Lehnwörterbücher, die Entlehnungen aus dem Deutschen dokumentieren. Dieses Portal ist dadurch gekennzeichnet, dass zum einen die eingestellten Wörterbücher als Einzelwerke veröffentlicht werden und zum anderen auf Portalebene komplexe Abfragen über sämtliche integrierten Wörterbücher hinweg formuliert werden können, zum Beispiel nach dem Weg einzelner deutscher Quellwörter über Mittlersprachen in die verschiedenen Zielsprachen, nach sämtlichen Lehnwörtern in bestimmten historischen Zeitspannen und geographischen Räumen oder auch nach sämtlichen deutschen Lehnwörtern, die bestimmte Charakteristika aufweisen (z. B. Wortart, semantische Klasse). Das Portal realisiert damit – nicht in den Einzelwörterbüchern, aber in seiner Gesamtheit – als umgekehrtes Lehnwörterbuch das Konzept eines neuen Wörterbuchtyps.¹

Während es in der Sprachkontaktlexikographie – etwa in Fremdwörterbüchern – üblich ist, Entlehnungsprozesse aus der Perspektive der Zielsprache zu beschreiben, erfasst das geplante Portal aus der Perspektive der Quellsprache die Wege, die deutscher Wortschatz in andere Sprachen genommen hat (Engelberg, 2010). Gegenwärtig wird am Institut für Deutsche Sprache (Mannheim) im Rahmen eines über 18 Monate laufenden und vom Beauftragten der Bundesregierung für Kultur und Medien geförderten Pilotprojektes die grundsätzliche Softwarearchitektur des Portals entwickelt und implementiert sowie die Integration dreier Lehnwörterbücher in das Portal vorgenommen, und zwar zu deutschen Entlehnungen im Polnischen (Vincenz & Hentschel, 2010), zu deutschen Entlehnungen im Teschener Dialekt des Polnischen (Menzel & Hentschel, 2005) und zu deutschen Entlehnungen im Slovenischen (Striedter-Temps, 1963).

¹ Wiegand (2001) spricht in diesem Zusammenhang von aktiven bilateralen Sprachkontaktwörterbüchern. Wörterbücher dieses Typs sind extrem selten, vgl. auch (Engelberg, 2010). (Görlach, 2001) ist das einzige nennenswerte Beispiel.

Da das Portal auf Offenheit bezüglich der Integration weiterer Ressourcen hin konzipiert ist, können jederzeit weitere Lehnwörterbücher integriert werden. Entsprechende Wörterbücher zu Entlehnungen aus dem Deutschen existieren zu relativ vielen Sprachen (Englisch, Japanisch, Portugiesisch, Schwedisch, Serbokroatisch, Tok Pisin, Ukrainisch, Usbekisch, …). Hier wären entsprechende Kooperationen anzustreben und Rechtefragen zu klären.²


2. Nutzen eines Lehnwörterbuchportals

Das Portal soll sowohl für Laien wie für Wissenschaftler nutzbar sein. Die Laiennutzung kann über einfache Suchanfragen erfolgen, die wissenschaftliche Nutzung orientiert sich an der Möglichkeit komplexer Suchanfragen und an direkten Schnittstellen (Webservices). Dabei wird sowohl die sprachwissenschaftliche Sprachkontaktforschung wie auch die historisch, soziologisch oder anthropologisch ausgerichtete Kulturkontaktforschung Nutzen aus dem Portal ziehen.

Im Rahmen der wissenschaftlichen Nutzung soll das Portal nicht nur philologisch motivierte, interpretative Einzelstudien unterstützen, sondern durch die in ihm realisierte Kumulation von Daten auch spezifische neuartige, zum Teil quantitativ orientierte Forschungsfragen ermöglichen. Dazu gehören Untersuchungen
• zum Zusammenhang zwischen bestimmten Typen von soziokulturellen Entwicklungen (Herrschaftswechsel, Migration, Technologieschub) und Zeitverlaufstypen der Entlehnungsfrequenzen von Lexemen (wie etwa eine plötzliche oder eine eher graduelle quantitative Zunahme von Entlehnungen),
• zu Faktoren und Prozessen der Etablierung von Lehnwörtern,³
• dazu, ob verschiedene Typen des Sprachkontakts typische quantitative und zeitliche Verteilungsmuster von Lehnwörtern hervorbringen,⁴
• zur Lebensdauer von Lehnwörtern (insoweit die integrierten Wörterbücher auch das Verschwinden oder die Obsoletheit von Entlehnungen verzeichnen), abhängig von onomasiologischen, grammatischen und anderen Faktoren, vgl. etwa (Schenke, 2009; Hentschel, 2009),
• zu Lehnwortketten (z. B. Deutsch > Polnisch > Weißrussisch > Russisch > Usbekisch) im Zusammenhang mit onomasiologischen und quantitativen Faktoren,
• zu „Germanoversalien“, d. h. etwa dazu, ob bestimmte phonologische, morphologische oder semantische Eigenschaften deutscher Lexeme besonders entlehnungsfördernd sind.

² Zum Teil ist die Beschreibungssprache in diesen Wörterbüchern die Quellsprache (z. B. Usbekisch, Portugiesisch), so dass in diesem Fall entsprechende Übersetzungen erforderlich wären.
³ Solche Studien können auf lexikographischer und sprachübergreifender Basis Ergebnisse aus korpusbasierten Arbeiten zum lexikalischen Entrenchment von Entlehnungen komplementieren, vgl. (Chesley & Baayen, 2010).
⁴ Sprachkontakttypen wären etwa (i) langandauernder Kontakt an Bevölkerungsgrenzen (Deutsch – Slowenisch, Deutsch – Polnisch), (ii) Kontakt durch Emigration mit Sprachinselbildung (Deutsch – Rumänisch, Deutsch – Russisch, Deutsch – Amerikanisches Englisch) und (iii) Kontakt durch Elitenaustausch (Deutsch – Japanisch, Deutsch – Russisch, Deutsch – Britisches Englisch, Deutsch – Tok Pisin).

3. Grundsätzliche Überlegungen zur lexikographischen Datenstruktur des Portals

Hinsichtlich der Datenorganisation des Lehnwörterbuchportals lassen sich auf einer konzeptionellen Ebene grob drei Bereiche unterscheiden: (1) Lexikographische Grundlage des Portals sind einzelne Lehnwörterbücher traditionellen Zuschnitts, die nach den fremdsprachigen Lehnwörtern einer bestimmten Nehmersprache lemmatisiert sind. (2) Um sprach- und wörterbuchübergreifende Suchen im Portal zu ermöglichen, muss über diese Datengrundlage eine möglichst dünne Zugriffsstruktur gelegt werden, die von den Idiosynkrasien der Einzelwörterbücher abstrahiert. (3) Für die Etyma der Gebersprache muss eine ‚Metalemmaliste‘ erstellt werden, deren Einträge jeweils über die unter Punkt 2 genannte Abstraktionsschicht untereinander und mit zugehörigen Artikeln der Einzelwörterbücher vernetzt sind.

Die folgenden Unterabschnitte stellen die in den drei genannten Bereichen auftretenden lexikographischen und technischen Anforderungen und Probleme ausführlicher dar, bevor im letzten Abschnitt die technische Umsetzung ihres Zusammenspiels erörtert wird.


3.1. Die Ebene der Einzelwörterbücher

Die zugrundeliegenden Lehnwörterbücher werden im Regelfall bereits existierende Werke sein, die nicht von vornherein für ein Lehnwörterbuchportal des hier diskutierten Typs entwickelt worden sind. Technische Minimalanforderung für die Verwendung im Portal ist, dass die Wörterbücher in geeigneter Form digitalisiert bzw. retrodigitalisiert als XML-Dokumente vorliegen.⁵ Auch eine Bilddigitalisierung ist denkbar, sofern zu jedem Artikel zusätzlich ein XML-Dokument mit den portalrelevanten lexikographischen Daten (und gegebenenfalls Verweisen auf Bildkoordinaten im Digitalisat) vorliegt. Angesichts der enormen Vielfalt möglicher Makro- und Mikrostrukturen in Wörterbüchern ist es nicht praktikabel, für das Portal ein festes XML-Schema vorzugeben, in das sich die XML-Repräsentationen aller Wörterbücher überführen lassen müssen. Es wird jedoch, um weitgehend automatisierte Verarbeitung zu ermöglichen, vom XML-Schema für die Einzelartikel eines jeden Wörterbuchs jeweils verlangt, dass es möglichst weitgehend von Layout- und Präsentationsaspekten abstrahiert, etwa im Sinne der TEI.dictionaries-Richtlinien; vgl. (Burnard & Bauman, 2010). Es gibt gute Gründe, die XML-Digitalisate der Ausgangswörterbücher selber nicht mit portalrelevanten Informationen anzureichern. Abgesehen von urheberrechtlichen Erwägungen und dem angestrebten Erhalt der Wörterbücher als digitalen Einzelpublikationen ist es so möglich, dass an den Einzelwörterbüchern völlig unabhängig von ihrer Nutzung im Lehnwörterbuchportal weiterhin Veränderungen und Erweiterungen von den Autoren des betreffenden Werks vorgenommen werden.

Ähnlich wie bei anderen Portalen können ganze Wörterbuchartikel oder Teile davon (XML-Dokumente bzw. XML-Fragmente) beispielsweise durch XSL-Transformationen in eine geeignete HTML-Präsentation überführt werden. Dies ist die Grundlage für eine wörterbuchspezifische Online-Ansicht der Einzelwörterbuchartikel, vgl. (Engelberg & Müller-Spitzer, 2011) für eine ausführlichere Darstellung.⁶ Die XML-Repräsentation ermöglicht außerdem im Prinzip beliebig komplexe Suchvorgänge auf den Einzelwörterbüchern, da konkrete Informationen über Abfragesprachen wie XPath und XQuery aus den Artikeln ausgelesen werden können. Allerdings sind solche XML-basierten Abfragen häufig datenbankseitig mit hohen Verarbeitungskosten versehen und daher für performante Webanwendungen kaum praktikabel. Dies ist ein wesentlicher Grund, die für wörterbuchspezifische sowie portalweite (wörterbuchübergreifende) Suchen relevanten Informationen zusätzlich in separaten relationalen Datenbanktabellen vorzuhalten. Diese zusätzlichen Tabellen ermöglichen nicht nur ungleich performantere Datenbankanfragen, sie dienen auch, wie im Folgenden ausgeführt wird, dazu, von den Spezifika der Einzelwörterbücher zu abstrahieren.

⁵ Aus expositorischen Gründen wird hier auf der Ebene der Einzelwörterbücher durchgehend von einer XML-basierten Datenhaltung ausgegangen, so wie sie im Projekt selber tatsächlich verwendet wird. Technisch lassen sich die Mikrostrukturen von Wörterbüchern natürlich auch in relationalen Datenbankschemata abbilden, was aus Performanzgründen ratsam sein kann. Andererseits können einige moderne Datenbankmanagementsysteme (z. B. Oracle) XML-Daten mit fester Struktur intern ohnehin relational repräsentieren. Vgl. z. B. (Müller-Spitzer & Schneider, 2009) für das OWID-Portal als ein konkretes Beispiel zur texttechnologischen Umsetzung von XML-Verarbeitung in einem Wörterbuchportal.
⁶ In der skizzierten Weise wird auch bei dem am Institut für Deutsche Sprache entwickelten OWID-Wörterbuchportal verfahren (http://www.owid.de/index.html).
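Zur Veranschaulichung eine kleine Python-Skizze für eine solche XML-basierte Abfrage mit (eingeschränkten) XPath-Ausdrücken; der Beispielartikel und seine Elementnamen sind frei erfunden und stehen nur stellvertretend für ein TEI-nahes Artikelschema.

```python
import xml.etree.ElementTree as ET

# Frei erfundener Miniatur-Artikel in einem TEI-ähnlichen Schema.
artikel_xml = """
<entry>
  <form type="lemma"><orth>ratusz</orth></form>
  <etym><lang>de</lang><mentioned>Rathaus</mentioned></etym>
</entry>
"""

artikel = ET.fromstring(artikel_xml)
lemma = artikel.findtext("form[@type='lemma']/orth")
etymon = artikel.findtext("etym/mentioned")
print(lemma, "<", etymon)  # ratusz < Rathaus
```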

3.2. Wörterbuchübergreifende Abstraktionsschicht

Im Normalfall werden die einzelnen Lehnwörterbücher hinsichtlich ihrer Artikel- und Lemmatisierungsstruktur sowie der für Periodisierung und Lokalisierung der Entlehnung verwendeten Begriffe und Angabeformate nicht vollständig kompatibel sein. Auch hinsichtlich der zugrunde gelegten grammatischen Beschreibungssprache kann es Differenzen geben. Der hier vorzustellende Ansatz zur Lösung dieser Probleme stellt insbesondere für wörterbuchübergreifende Suchen eine eigene, relational aufbereitete Datenschicht bereit, die für das Portal relevante Informationen zu allen vorliegenden lexikalischen Einheiten aus den verschiedenen Wörterbüchern in portaleinheitlicher Weise erfasst. In einer wörterbuchübergreifenden Datenbanktabelle werden daher alle Lemmata, alle in den betreffenden Artikeln genannten (diasystematischen, ggf. auch orthographischen) Ausdrucksvarianten der Lemmata sowie sämtliche in Einzelartikeln aufgeführten Derivate und Komposita der Lemmata als je eigene Entitäten – im Folgenden als ‚Portalinstanzen‘ bezeichnet – behandelt, also in jeweils einer separaten Tabellenzeile aufgeführt. Eine Tabellenzeile spezifiziert außer dem Wörterbuch, aus dem die Instanz (also das gegebene Lemma bzw. die gegebene Ausdrucksvariante, das Derivat oder Kompositum) stammt, u. a. folgende weiteren Informationen (Attribute), sofern das Wörterbuch diese zur Verfügung stellt: (a) eine räumliche, zeitliche und diasystematische Einordnung des Entlehnungsvorganges; (b) grammatische Informationen, insbesondere Wortart; (c) ggf. eine semantische/onomasiologische Kategorisierung. Außerdem muss jeweils angegeben werden, ob es sich bei der betreffenden Instanz um die Lemmavariante des zugehörigen Wörterbuchartikels handelt, so dass sich aus der Tabelle der Instanzen die Lemmalisten der Einzelwörterbücher ableiten lassen. Falls ein verwendetes Lehnwörterbuch innerhalb eines Artikels z. B. Lesarten unterscheidet, für die unterschiedliche Etymologien diskutiert werden, sind diese in je separaten Portalinstanzen zu kodieren, da von makrostrukturellen Eigenheiten der Einzelwörterbücher abstrahiert werden muss.
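Eine mögliche, stark vereinfachte Umsetzung einer solchen Instanzentabelle zeigt die folgende Python/SQLite-Skizze; Spaltennamen, Kürzel und Wertebereiche sind Annahmen zur Illustration und geben nicht das tatsächliche Datenbankschema des Projekts wieder.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE portalinstanz (
    id           INTEGER PRIMARY KEY,
    woerterbuch  TEXT NOT NULL,    -- Quelle der Instanz
    form         TEXT NOT NULL,    -- Lemma, Variante, Derivat, Kompositum ...
    typ          TEXT NOT NULL,    -- z. B. 'lemma', 'variante', 'derivat'
    ist_lemma    INTEGER NOT NULL, -- 1, wenn Lemmavariante des Artikels
    wortart      TEXT,
    zeit_von     INTEGER,          -- standardisiertes Jahresintervall
    zeit_bis     INTEGER,
    artikel_id   TEXT              -- Verweis auf den XML-Artikel
)""")
con.execute("INSERT INTO portalinstanz VALUES (1, 'WB-PL', 'ratusz', 'lemma', 1, "
            "'Substantiv', 1350, 1500, 'artikel_ratusz')")

# Beispielabfrage: Lemmaliste eines Wörterbuchs aus der Instanzentabelle ableiten
for zeile in con.execute(
        "SELECT form FROM portalinstanz WHERE woerterbuch = 'WB-PL' AND ist_lemma = 1"):
    print(zeile[0])
```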

Bei hinreichend komplexer und rigider XML-Kodierung eines Lehnwörterbuchs können die Portalinstanzen weitestgehend automatisiert aus den Originalartikeln extrahiert werden. Die Portalinstanzen sollten keine Informationen aus den Lehnwörterbüchern duplizieren; daher enthalten sie außerdem Verweise auf den zugehörigen Artikel und gegebenenfalls auf das dem relevanten Artikelausschnitt entsprechende XML-Element, so dass sämtliche weiteren für die Instanz relevanten Informationen mechanisch aus dem Ursprungsartikel gewonnen und z. B. für eine HTML-basierte Darstellung aufbereitet werden können.

Damit portalweite, wörterbuchübergreifende Suchvorgänge möglich sind, müssen zur Erstellung der Portalinstanzen die Angaben der Ausgangswörterbücher zur zeitlichen und räumlichen Einordnung des Entlehnungsvorgangs sowie grammatische Informationen in ein einheitliches konzeptuelles Schema überführt werden. Neben komplexen Technologien wie Raum- und Zeitontologien stehen für das Pilotprojekt einfachere Lösungen wie die wörterbuchspezifisch definierte Abbildung von Sprachstufenangaben auf standardisierte Jahresintervalle zur Verfügung. Auch der Einsatz von Georeferenzierungsverfahren ist in einer späteren Ausbaustufe des Projektes denkbar, um kartographische Visualisierungen zu ermöglichen.
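Eine denkbare, bewusst einfache Umsetzung einer solchen Abbildung von Sprachstufenangaben auf Jahresintervalle skizziert das folgende Python-Beispiel; die Intervallgrenzen sind gängige Näherungswerte und hier nur zur Illustration gewählt, nicht die im Projekt verwendete Festlegung.

```python
# Wörterbuchspezifische Abbildung von Sprachstufen-Kürzeln auf
# standardisierte Jahresintervalle (Grenzen sind Näherungswerte).
SPRACHSTUFEN = {
    "ahd.":   (750, 1050),   # Althochdeutsch
    "mhd.":   (1050, 1350),  # Mittelhochdeutsch
    "frnhd.": (1350, 1650),  # Frühneuhochdeutsch
    "nhd.":   (1650, 2000),  # Neuhochdeutsch
}

def jahresintervall(angabe, default=(None, None)):
    """Übersetzt eine Sprachstufenangabe in ein (von, bis)-Intervall."""
    return SPRACHSTUFEN.get(angabe.strip().lower(), default)

print(jahresintervall("mhd."))  # (1050, 1350)
```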

Wichtig ist, dass Portalinstanzen mit Informationen angereichert werden können, die keinerlei Entsprechung im zugrundeliegenden Lehnwörterbuch haben. So kann jede Instanz einem Synset einer WordNet-artigen Ressource zugeordnet oder anderweitig semantisch klassifiziert werden, um Abfragen mit onomasiologischer Komponente zu ermöglichen. Schwierig ist dies sicherlich besonders in Wortschatzbereichen, aus denen intensiv und bis hin in fachsprachliche Details entlehnt wurde (z. B. Bergbau, Chemie, Religion).

Auch die Einführung von zusätzlichen Portalinstanzen kann sinnvoll sein; ist etwa ein deutsches Wort über das Polnische in das Russische gelangt, kann der womöglich im polnischen Lehnwörterbuch des Portals gar nicht verzeichnete polnische ‚Zwischenschritt‘ als eigene Portalinstanz hinzugefügt werden.

3.3. Metalemmaliste und etymologische Information

Die lexikographisch und linguistisch anspruchsvollste und zum Großteil manuell zu erstellende Datenschicht ist die Erarbeitung einer Metalemmaliste der Etyma der Gebersprache. Da Lehnwörterbücher häufig mehrere diasystematische bzw. Wortbildungsvarianten der Etyma angeben (darunter auch bloß rekonstruierte Formen) und verschiedene mögliche Etymologisierungen diskutieren, muss – auch angesichts der Probleme mit verschiedenen Transkriptionen – ein möglichst allgemeiner Ansatz gewählt werden. In der von uns gewählten Lösung werden für die in den Einzelwörterbüchern genannten Etymonformen – als tertia comparationis des umgekehrten Lehnwörterbuchs – jeweils ebenfalls Portalinstanzen angelegt, die in der Datenbanktabelle mit einem speziellen Attribut als (deutsche) Etymonformen gekennzeichnet werden. Im Folgenden bezeichnen wir solche Portalinstanzen kurz als ‚Etymoninstanzen‘. Taucht ein deutsches Lexem in mehreren Wörterbüchern als Herkunftswort auf, wird für jedes Wörterbuch eine eigene Etymoninstanz angelegt, da die Identifikation dieser Instanzen ja erst in einem nachgelagerten lexikographischen Arbeitsschritt auf der Portalebene geschieht. Entscheidend ist daher die Identifizierung von Gruppen „zusammengehöriger“ Etymoninstanzen. In der von uns vorgeschlagenen Datenmodellierung wird für jede solche Gruppe eine wörterbuchunabhängige Etymon-Instanz erstellt, die unter verschiedenen lexikographischen Gesichtspunkten ein besonders geeigneter Kandidat für ein Metalemma ist, also prototypischerweise ein heute noch gebräuchliches, standardsprachliches deutsches Simplex. Dieses ‚Meta-Etymon‘ kann sinnvoll insbesondere in einer Stichwortliste aller deutschen Etyma des Portals verwendet werden. Alle synchronen oder diachronen Varianten, Wortbildungsprodukte/-bestandteile usw. eines Etymons werden dann auf die im folgenden Abschnitt geschilderte Weise mit ihren zugehörigen Meta-Etyma vernetzt. Es kann wünschenswert sein, zusätzliche Meta-Etyma aufzunehmen, etwa, damit der Benutzer zu einem deutschen Simplex auch dann Entlehnungen von daraus gebildeten Komposita findet, wenn dieses Simplex selber in keinem Wörterbuch als Herkunftswort geführt wird.

4. Zur Architektur der Webanwendung

Die Einführung einer Tabelle von Portalinstanzen ermöglicht die saubere Entkopplung der Portalerstellung von der Ebene der Einzelwörterbücher. Typische portalbezogene Suchvorgänge operieren i. a. ausschließlich auf dieser Abstraktionsschicht.

4.1. Kodierung und Verwaltung der Vernetzungen zwischen Portalinstanzen

Portalinstanzen müssen untereinander vernetzt werden, etwa zur Modellierung von Wortbildungsrelationen. Eine besondere Rolle spielen etymologische Angaben, die als Vernetzungen von Portalinstanzen auf Etymoninstanzen kodiert werden können. Der häufigste Fall ist die Vernetzung von Portalinstanzen, die demselben Quellwörterbuch zugeordnet sind. Um Verkettungen von Entlehnungsvorgängen zu modellieren oder ‚Identitätsbeziehungen‘ zwischen Etymoninstanzen sowie zwischen Lemmata in sehr eng verwandten Sprachformen zu formulieren, müssen aber auch Vernetzungen zwischen aus verschiedenen Quellen stammenden Portalinstanzen angesetzt werden.

Zur Modellierung der Vernetzungen zwischen Artikeln und Instanzen könnten im Prinzip standardisierte Repräsentationsformate wie RDF und die dafür entwickelten Speicher- und Zugriffstechnologien verwendet werden, vgl. (Hitzler, Krötzsch & Rudolph, 2009). Da aber die Vernetzungsstruktur des Portals sehr regelmäßig ist, ziehen wir eine einfachere Lösung vor, die alle Vernetzungen in einer separaten relationalen Datenbanktabelle als geordnete Paare aus einer Quell- und einer Zielinstanz repräsentiert. Jede Vernetzung von Portalinstanz P auf Portalinstanz Q wird per Attribut einem bestimmten Typ zugeordnet; unter anderem sind folgende Typen vorgesehen: (i) P ist Variante von Q (dabei können Varianten verschiedenen Typs unterschieden werden, z. B. orthographisch/synchron/diachron); (ii) P ist Derivat von Q; (iii) P ist Kompositum zu Q; (iv) P hat Q als Etymon; (v) P ist dasselbe Lexem / dieselbe Lexemvariante wie Q (wenn in einer Entlehnungskette das Lehnwort P selber wieder als Grundlage eines Entlehnungsprozesses gedient hat, wird für dieses Lehnwort eine zweite Portalinstanz Q angesetzt, die das Wort in seiner Rolle als Ausgangswort für die weitere Entlehnung repräsentiert); (vi) P gehört im jeweiligen Einzelwörterbuch zum Lemma bzw. Meta-Etymon Q.

Weitere Attribute von Vernetzungen sind die Quelle der Vernetzungsinformation sowie eine einfache, ordinalskalierte Kategorisierung der in der Quelle selber angegebenen Verlässlichkeit dieser Information.

Die Vernetzungen bilden einen gerichteten azyklischen Graphen (DAG). Bei typischen Suchvorgängen müssen im DAG Pfade von ggf. vorab nicht bekannter Länge ermittelt werden – etwa, um Entlehnungsketten zu finden oder ausgehend von einem Meta-Etymon E nach Derivaten/Varianten/… von Entlehnungen beliebiger Derivate/Varianten/… von E zu suchen. Um performante SQL-Abfragen auf den Tabellen durchführen zu können, wird in der Vernetzungstabelle der transitive Abschluss der Vernetzungsrelationen oder eine geeignete Teilmenge davon abgebildet, d. h. es werden – zumindest auf der Ebene der Meta-Etyma und Einzelwörterbuch-Lemmata – auch ‚indirekte‘ Vernetzungen gespeichert und als solche etikettiert. Die Verwaltung der Verweisstrukturen zwischen den Datenschichten muss softwaregestützt erfolgen.⁷

⁷ Änderungen an den Einzelwörterbüchern ziehen entsprechende Änderungen in den relationalen Instanzen- und Vernetzungstabellen nach sich, die in den meisten Fällen automatisch durch Datenbanktrigger durchgeführt werden können. Durch solche Trigger können auch Konsistenzprüfungen durchgeführt werden, die manuellen Anpassungsbedarf feststellen und melden.
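Wie sich ein solcher transitiver Abschluss über der Vernetzungstabelle vorab berechnen ließe, skizziert das folgende Python-Beispiel; Kantenform und Beispielkette sind vereinfachte Annahmen, nicht das tatsächliche Schema des Portals.

```python
def transitiver_abschluss(kanten):
    """Berechnet aus direkten Vernetzungen (quelle, ziel) alle erreichbaren Paare.

    kanten: Menge gerichteter Kanten des (azyklischen) Vernetzungsgraphen.
    Rückgabe: Menge aller Paare (p, q), für die ein Pfad von p nach q existiert.
    """
    abschluss = set(kanten)
    veraendert = True
    while veraendert:
        veraendert = False
        neu = {(p, r) for (p, q) in abschluss for (q2, r) in abschluss
               if q == q2 and (p, r) not in abschluss}
        if neu:
            abschluss |= neu
            veraendert = True
    return abschluss

# Beispiel: Entlehnungskette Deutsch > Polnisch > Russisch (vereinfacht)
direkt = {("de:Rathaus", "pl:ratusz"), ("pl:ratusz", "ru:ratusa")}
for paar in sorted(transitiver_abschluss(direkt)):
    print(paar)
# enthält auch die 'indirekte' Vernetzung ('de:Rathaus', 'ru:ratusa')
```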



4.2. Präsentation

Der Benutzer kann die Einzelwörterbücher mit jeweils eigener (neben der Suchformular-/Artikelansicht ausschnittsweise angezeigten) Lemmaliste und Suchfunktionalität nutzen. Die Etymoninstanzen bilden die Grundlage für ein separates umgekehrtes Lehnwörterbuch, also das Portalwörterbuch der deutschen Herkunftswörter, dessen Lemmaliste aus den Meta-Etyma erstellt wird. Suchvorgänge in diesem Portalwörterbuch erzeugen eine Liste von Verweisen auf passende Artikel in den Einzelwörterbüchern.

5. Literatur

Burnard, L., Bauman, S. (2010): TEI P5: Guidelines for Electronic Text Encoding and Interchange. Text Encoding Initiative. Online: http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf.
Chesley, P., Baayen, R.H. (2010): Predicting new words from newer words: Lexical borrowings in French. Linguistics 48 (4), pp. 1343-1374.
Engelberg, S. (2010): An inverted loanword dictionary of German loanwords in the languages of the South Pacific. In A. Dykstra & T. Schoonheim (Eds.), Proceedings of the XIV EURALEX International Congress (Leeuwarden, 6-10 July 2010). Ljouwert (Leeuwarden): Fryske Akademy, pp. 639-647.
Engelberg, S., Müller-Spitzer, C. (erscheint 2011): Dictionary portals. In R. Gouws, U. Heid, W. Schweickard & H.E. Wiegand (Eds.), Wörterbücher / Dictionaries / Dictionnaires. Ein internationales Handbuch zur Lexikographie / An International Encyclopedia of Lexicography / Encyclopédie internationale de lexicographie. Bd. 4. Berlin, New York: de Gruyter.
Görlach, M. (Ed.) (2001): A Dictionary of European Anglicisms: a Usage Dictionary of Anglicisms in Sixteen European Languages. Oxford etc.: Oxford University Press.
Hentschel, G. (2009): Intensität und Extensität deutsch-polnischer Sprachkontakte von den mittelalterlichen Anfängen bis ins 20. Jahrhundert am Beispiel deutscher Lehnwörter im Polnischen. In Stolz, Ch. (Ed.): Unsere sprachlichen Nachbarn in Europa. Die Kontaktbeziehungen zwischen Deutsch und seinen Grenznachbarn. Bochum: Brockmeyer, pp. 155-171.
Hitzler, P., Krötzsch, M., Rudolph, S. (2009): Foundations of Semantic Web Technologies. Boca Raton, FL etc.: Chapman & Hall/CRC Textbooks in Computing.
Menzel, T., Hentschel, G., unter Mitarbeit von P. Jančák und J. Balhar (2005): Wörterbuch der deutschen Lehnwörter im Teschener Dialekt des Polnischen. Studia slavica Oldenburgensia, Band 10 (2003). Oldenburg: BIS-Verlag. 2., ergänzte und korrigierte elektronische Ausgabe. Online: http://www.bkge.de/14451.html.
Müller-Spitzer, C., Schneider, R. (2009): Ein XML-basiertes Datenbanksystem für digitale Wörterbücher. Ein Werkstattbericht aus dem Institut für Deutsche Sprache. it - Information Technology 4/2009, pp. 197-206.
Schenke, M. (2009): Sprachliche Innovation – lokale Ursachen und globale Wirkungen. Das ‚Dynamische Sprachnetz‘. Saarbrücken: Südwestdeutscher Verlag für Hochschulschriften.
Striedter-Temps, H. (1963): Deutsche Lehnwörter im Slovenischen. Wiesbaden: Harrassowitz.
Vincenz, A. de, Hentschel, G. (2010): Wörterbuch der deutschen Lehnwörter in der polnischen Schrift- und Standardsprache. Von den Anfängen des polnischen Schrifttums bis in die Mitte des 20. Jahrhunderts. Studia slavica Oldenburgensia, Band 20. Oldenburg: BIS-Verlag. Online: http://www.bis.uni-oldenburg.de/bis-verlag/wdlp.
Wiegand, H.E. (2001): Sprachkontaktwörterbücher: Typen, Funktionen, Strukturen. In: B. Igla, P. Petkov & H.E. Wiegand (Eds.): Theoretische und praktische Probleme der Lexikographie. 1. Internationales Kolloquium zur Wörterbuchforschung am Institut Germanicum der St. Kliment-Ohridski-Universität Sofia, 7. bis 8. Juli 2000 (= Germanistische Linguistik, 161-162). Hildesheim, Zürich, New York: Georg Olms Verlag, pp. 115-224.



Localizing A Core HPSG-based Grammar for Bulgarian

Petya Osenova
The Sofia University and IICT-BAS
25 A, Acad. G. Bonchev Str., Sofia 1113
E-mail: petya@bultreebank.org

Abstract

The paper presents the main directions in which the localization of an HPSG-based formal core grammar (the Grammar Matrix) has been performed for Bulgarian. On the one hand, the adaptation process took into account the predefined theoretical schemas and their adequacy with respect to the Bulgarian language model. On the other hand, the implementation within a typological framework posed some challenges with respect to the language-specific features. The grammar is being developed further, and it is envisaged to be used extensively for parsing and generation of Bulgarian texts.

Keywords: localization, core grammar, HPSG, Bulgarian

1. Introduction

Recently, a number of successful attempts have been made towards the design and application of wide-coverage grammars which incorporate deep linguistic knowledge and have been tested on several natural languages. Especially active in this area have been the lexicalist frameworks, such as HPSG (Head-driven Phrase Structure Grammar), LFG (Lexical-Functional Grammar) and LTAG (Lexicalized Tree Adjoining Grammar). Many NLP applications have been built on HPSG-based implementations – treebanks (the LinGO Redwoods Treebank, the Polish HPSG Treebank and the Bulgarian HPSG-based Treebank, among others), grammar development tools, parsers, etc.

In HPSG there already exist quite extensive implemented formal grammars – for English (Flickinger, 2000), German (Müller & Kasper, 2000) and Japanese (Siegel, 2000; Siegel & Bender, 2002). They provide semantic analyses in the Minimal Recursion Semantics framework (Copestake et al., 2005). HPSG is the underlying theory of the international initiative LinGO Grammar Matrix (Bender et al., 2010; Bender et al., 2002). At the moment, precise and linguistically motivated grammars, customized on the basis of the Grammar Matrix, have been or are being developed for Norwegian, French, Korean, Italian, Modern Greek, Spanish, Portuguese, etc.¹ The most recent developments in the Grammar Matrix framework also report successful implementations of grammars for endangered languages, such as Wambaya (Bender, 2008).

In addition to the HPSG framework and the Grammar Matrix architecture, there is also an open-source software system which supports grammar and lexicon development – LKB (Linguistic Knowledge Builder) (http://wiki.delph-in.net/moin/LkbTop).²

Our motivation for starting the development of a Bulgarian Resource Grammar in the above-mentioned setting was as follows: there already was an HPSG-based treebank of Bulgarian (BulTreeBank), constructed in a semi-automatic way. The knowledge within the treebank seemed to be sufficient for the construction of a wide-coverage and precise formal grammar that can parse and generate Bulgarian texts. Bulgarian is no longer considered an endangered or less-processed language. However, it still lacks a deep linguistic grammar. Bulgarian is viewed as a “classic and exotic” language, because it combines Slavic features with Balkan Sprachbund peculiarities. These factors make Bulgarian a real challenge for computational modeling.

¹ http://www.delph-in.net/index.php?page=3
² The projects DELPH-IN and Deep Thought are also closely related to the Grammar Matrix initiative.


Our preliminary supporting components were the following: the HPSG theoretical framework for modeling the linguistic phenomena in Bulgarian; a suitable, HPSG-based Bulgarian corpus and supporting pre-processing modules; the LinGO Matrix-based grammar software environment for encoding and integrating the suitable components; and the best practices from the work on other languages. More on the current grammar model and implementation of the Bulgarian grammar can be found in (Osenova, 2010).


2. Grammar Matrix Architecture

The Grammar Matrix (Bender et al., 2002) is intended as a typological core for initiating grammar writing for a specific language. It also provides a customization web interface (Bender et al., 2010). The purpose of such a core is, on the one hand, to ensure a common basis for comparing various language grammars and thus to focus on typological similarities and differences, and, on the other hand, to speed up the process of grammar development. It supplies the skeleton of the grammar – the type hierarchy with basic types and features, as well as the basic inheritance directions. The Grammar Matrix is based on the experience with several languages (predominantly English and Japanese), and it is developed further as new languages are modeled in the framework.

Although it supports all the linguistic levels of representation, the Grammar Matrix aims at the semantic modeling of a language. It introduces referential entities and events; semantic relations; and the semantic encoding and contribution of linguistic phenomena (definiteness/indefiniteness, aspect and tense, among others). For example, verbs, adjectives, adverbs and prepositions are canonically viewed as introducing events, while nouns are considered to introduce referential entities. Such an approach is a challenge for a language like Bulgarian, which grammaticalizes a lot of linguistic phenomena, so that the most natural level of description would be the morphosyntactic rather than the semantic one. Consequently, the balance of represented information between semantics and morphosyntax has to be determined and distributed in an adequate way.

Ideally, one should only inherit from Grammar Matrix types, without changing them. In practice, however, it turns out that each language challenges, and is challenged by, the Matrix model. On the one hand, the Matrix predefines some phenomena too strictly; on the other, it offers possibilities for generalizations. All this is inevitable, since the ideal granularity between specificity and universality is difficult to establish.

The localization goes in several directions. First, the Grammar Matrix is implemented in accordance with a particular version of HPSG theory – thus it implies certain decisions with respect to the possible analyses. However, when adapting the Grammar Matrix to a new language, the grammar developer might want to apply another analysis within the language-specific grammar. This is the case for Portuguese, for example: instead of working with head-specifier and head-adjunct phrases, which are part of standard HPSG94, the grammar adopted the more recent head-functor approach to these phrases. Another direction is the preference among linguistic phenomena. In Portuguese the preferences concern agreement, modification and basic phrase structures, while in Modern Greek the phenomena to start with were cliticization, word order and politeness constructions. In this respect, only a common testset can ensure the implementation of common linguistic phenomena. Such a testset is briefly discussed in Section 3.1. Depending on their preferences, grammar developers might also have to extend and/or change the core grammar – for example, the addition of types for contracted or missing determiners in Modern Greek, since this information influences the semantics.

Last, but not least, it is up to the grammar developer how much information to encode within the grammar, and which steps to handle outside the grammar. For example, the Portuguese grammar uses a morphologically preprocessed input, while in the Modern Greek grammar all the analyses are handled within the system.

3. Localization in Bulgarian

3.1. The Multilingual Testset

The Grammar Matrix is equipped with a testset in English, which has already been translated into a number of other languages. It comprises around 100 sentences, which in the Bulgarian translated set became 178. The grammar development started with the aim of covering this set, since it represents some very important common phenomena. Needless to say, the translated set also incorporated a number of language-specific phenomena, which are discussed in more detail below. Thus, some additional test sentences were incorporated into the common testset, bringing the number of positive sentences to 193. Also, 20 ungrammatical sentences were included, which check agreement, the word order of clitics, definiteness, subject control, etc. The whole set comprises 213 sentences, which is comparable to the testset for Portuguese in the first phase of its grammar development. The common phenomena are as follows: complementation, modification, coordination, agreement, control, quantification, negation, illocutionary force, passivization, nominalization, relative clauses, light verb constructions, etc. The initial grammar contains 297 types. It is expected that they will expand dramatically when the lexicon is enriched further. Let us comment on some localization specificities in the translated set, which made it larger in comparison to the English testset.

First of all, Bulgarian is a pro-drop language. Thus, every sentence also has counterparts with null subjects, and in discourse complements can be omitted in many cases as well. Second, Bulgarian verbs encode aspect lexically. The English sentences have often been translated with verbs in both aspects (perfective and imperfective). When combined with tense, the number of translation counterparts grew even further. For example, the sentence Abrams wondered which dog barked has two possibilities for the matrix verb (imperfect tense, imperfective; and aorist tense, perfective), while the verb in the subordinate clause normally has three possibilities (present tense, imperfective; aorist tense, perfective; and imperfect tense, imperfective).

In some sentences several Bulgarian synonyms have been provided for the English verb. For example, the verb to hand in the sentence Abrams handed the cigarette to Browne can be translated by at least four Bulgarian verbs – дам (give), подам (pass), връча (deliver), предам (hand in).

Next, Bulgarian has clitic counterparts to the complements as well as a clitic reduplication mechanism. Thus, translations with a clitic and with a full-fledged complement have been provided for the single English sentence, where appropriate. Bulgarian polar questions are formed with a special question particle, which also has a focalizing role. Modification is mostly expressed by adjectives – garden dog (en) vs. градинско куче (bg, ‘garden-adjective dog’). Some alternations that are challenging for English are not relevant for Bulgarian. For example, Browne squeezed the cat in and Browne squeezed in the cat are translated in the same way: Браун вмъкна котката (Brown put-inside cat-the). The same holds for the well-known give-alternation: Abrams handed Browne the cigarette and Abrams handed the cigarette to Browne. The Bulgarian translation just ‘swaps’ the complements, but does not change them: Абрамс даде на Браун цигарата (Abrams gave to Brown cigarette-the) and Абрамс даде цигарата на Браун (Abrams gave cigarette-the to Brown). At the same time, the Bulgarian version of the testset provides examples for aspect/tense combinations, clitic behavior, the verbal complex, agreement patterns, etc.

3.2. The Language-Specific Phenomena

Concerning Bulgarian, its rich morphology seems to conflict with the requirements behind the semantic approach. The information therefore often has to be split between the semantic phenomenon and its realization. For example, adjectives, participles and numerals have morphologically definite forms, although the definiteness marker is not a semantic property of these categories. For that reason, the most important design decision in the grammar was to keep syntactic and semantic features separate (for example agreement, which is split into semantic and syntactic agreement in accordance with the ideas in Kathol 1997). In this way, definiteness operates via the MOD(ifier) feature. The event selects for a semantically definite noun:

[SYNSEM.LOCAL.HOOK.INDEX.DEF +],

but a morphologically indefinite one:

[SYNSEM.LOCAL.AGR.DEF -]

As can be seen, the semantic feature ‘definiteness’ lies in the syntactic-semantic area of local features, more precisely within the feature INDEX. The morphosyntactic one follows the same path of locality, but it is within the feature AGR(eement). For example, in the phrase старото куче ‘old-the dog’, the adjective ‘old-the’ selects for a semantically definite, but morphologically indefinite noun ‘dog’. The analysis is linguistically sound, since the definiteness marker is considered a phrasal affix in Bulgarian, not a word-level one.
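As a schematic illustration only (this is not the grammar's actual TDL; the attribute layout is simplified from the feature paths above), the selection can be thought of as two independent boolean features checked on the modified noun:

# Python sketch: 'старото' requires its MOD target to be semantically definite
# (INDEX.DEF +) but morphologically indefinite (AGR.DEF -).
kuche = {"index_def": True,   # semantic definiteness of the phrase
         "agr_def": False}    # morphological definiteness of the noun form

def mod_target_ok(noun):
    return noun["index_def"] and not noun["agr_def"]

print(mod_target_ok(kuche))  # True: 'старото куче' is licensed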


Other examples are the categories of tense, aspect and mood. Tense and mood are currently encoded as features AGR.E.TENSE and AGR.E.MOOD,³ while aspect is a feature of the head, HEAD.TAM.ASPECT.⁴ In these cases, however, there is at the moment no distinct contribution from semantics and morphosyntax. The Grammar Matrix provides several possibilities for obtaining the semantic information: for tense and mood the aggregated encoding (AGR.E) has been chosen in the current version, while for aspect the separated encoding is used. The aggregated way is a better choice for a unified syntactic-semantic analysis, while the separated representation leaves open the possibility of manipulating the syntactic and the semantic contribution differently. Thus, Bulgarian seems to require a systematic balance between the semantic contribution and the morphological marking of the same category within the overall architecture. This fact posed some difficulties for the initial design, since for each category it had to be decided whether to approach it separately on semantic and on morphological grounds or not.

³ E stands for Event.
⁴ TAM stands for an aggregate feature Tense, Aspect, Mood.

Bulgarian has a double negation mechanism (so-called negative concord), similarly to other Slavic languages and in contrast to English. Within the proposed Grammar Matrix architecture, the negation particle has been modeled as a verb, since particles were not provided in the Grammar Matrix and there was no mechanism for introducing their semantic relations. It scopes over the following proposition and introduces a negation relation. At the same time, the negative pronoun in the concord introduces a negative relation.

Another area in which the rich morphology plays a role is the level of type generalization. Very often the generalization cannot be kept at higher levels in Bulgarian, because of the variety of morphosyntactic behaviour types within Bulgarian constructions. One example is the copula constructions. Although adjectives, adverbs and prepositions have an event index, they cannot share the same generalized type. Adjectives structure-share their PNG (person, number and gender) characteristics with the copula's XARG – the subject. Adverbs have to be restricted to intersective modifiers when taken as complements. The common behaviour is that all these heads raise their semantic index to the copula, which is semantically vacuous itself. Nouns, however, have a referential index. In this case the copula behaves like a transitive verb that selects for its complement; no index is raised from the noun complement up to the copula. In this grammar version, eight lexical types are introduced: two for the present and past copula forms, each of which is then divided into four subtypes depending on the complement (present copula – noun, present copula – adjective, present copula – adverb, present copula – PP; past copula – noun, past copula – adjective, past copula – adverb, past copula – PP). The present–past distinction was necessary because the past form can occur in sentence-initial position, while the present form cannot.

The localization also took into account the relatively free word order of Bulgarian. Thus, most of the rules cover all the possible orders, not only the canonical readings. For example, there are rules for head-modifier and modifier-head, for clitic-head and head-clitic, and also for swapping the head's complements. The order combinations result in a proliferation of possible analyses, for whose discrimination an additional mechanism is needed. For the moment, the BulTreeBank resource is used as a discriminative tool, because it comprises the canonical and most preferred analysis per sentence.

By combining the application of the clitic rules, which produce lexical signs, and the complement rules, which produce phrases, the clitic doubling examples have been successfully parsed. The incorporation of Bulgarian argument-related clitics required a new mechanism: the clitics are viewed as lexical projections of the head (i.e. handled by special rules), while the regular forms are treated as head arguments (complements), i.e. handled by head-complement principles. The clitic does not contribute its own separate semantics, because it is not a full-fledged complement. Instead, the verb incorporates the clitic's contribution into its own semantics. Thus, the personal pronoun clitic lexemes have an empty relation list, while the regular pronoun forms have a pronoun relation.

Another localization, which concerns the modeling of the lexicon rather than the type hierarchy, is the representation of the lexical entries. Bulgarian is a richly inflected language, but in contrast to other Slavic languages its richness lies in the verbal system rather than in the nominal one. Thus, two ways of incorporating the morphology were possible. The first is to re-design the whole systematic and unsystematic morphology within the grammar, which would be linguistically sound but time-consuming. Since Bulgarian verbs show a lot of alternations and irregularities across their grammatical categories (conjugation, tense, aspect, finite vs. non-finite forms, other synthetic grammatical categories such as the imperative, etc.), encoding full paradigms per conjugation in the lexical types was abandoned as a generalization strategy. Instead, the inflection classes of the morphological dictionary for Bulgarian (Popov et al., 2003) have been transferred into the grammar. Each verb type was viewed as a combination of the appropriate subparadigms from the given morphological and/or lexical categories, and the set of the respective subparadigms per category was attached to each verb in the lexicon. Thus, the lexicon was also “dressed” with the morphologically specific information for the individual verbs. The transfer of the morphosyntactic paradigms resulted in over 2600 rules for personal verbs alone. Hence, the morphological work has been kept down in favour of syntactic and semantic modeling. Also, in a lexicalist framework such as HPSG, a large lexicon could not operate without the complete set of morphosyntactic types and rules. Compare the morphologically poor and the morphologically rich presentation of a verb in the lexicon in a. and b.:

a.
ima_v1 := v_there-is_le &
 [ STEM < "има" >,
   SYNSEM.LKEYS.KEYREL.PRED "има_v_1_rel" ].

b.
ima_v1 := v_there-is_le &
 [ STEM < "има" >,
   SYNSEM [ LKEYS.KEYREL.PRED "има_v_1_rel",
            LOCAL.CAT.HEAD.MCLASS
              [ FIN-PRESENT finite-present-101,
                FIN-AORIST finite-aorist-080,
                FIN-IMPERF finite-imperf-025,
                PART-IMPERF participle-imperf-024,
                PART-AORIST participle-aorist-095 ] ] ].

In case a. the impersonal verb има ‘there is’ introduces its type, from which it inherits the specific template (v_there-is_le); it then specifies the stem, i.e. the word itself, and the relation. In case b. there is additionally a morphological class (MCLASS), which is augmented with the respective paradigms for the relevant grammatical categories. The second representation is the one maintained in the grammar development.

The evaluation of the current grammar version was done within the system [incr tsdb()] (Oepen, 2001). The coverage of the first version of the grammar is as follows: 213 sentences, of which 193 are grammatical. The average number of distinct analyses is 3.73. The ambiguity of analyses is mainly due to the following factors: 1. morphological homonymy of the word forms; 2. more than one possible word order; 3. more than one possible attachment; 4. competing rules in the grammar (see more in Osenova & Simov, 2010). The first factor concerns forms like the word form of the verb ‘come’, дойде, which is ambiguous between present tense and aorist, 2nd or 3rd person. The second has to do with cases like The dog chases Brownie, where Brownie might also be the subject in some reading. The third concerns the attachment of adjuncts at the verb level as well as at the sentence level. The last factor affects mostly the coordination rules, but also some rules for modification, where a split motivated by the specificities of one head allows for duplicate analyses in the remaining cases. Factor 1 requires an external disambiguation filter, a task typically performed by taggers. Factor 2 also requires an additional filter that picks the most typical reading without excluding the rest. Factor 3 concerns spurious cases and requires a linguistic decision for the various types of adjuncts. Factor 4 needs a change in the grammar architecture by the grammar writer.

The semantic representation through which the Bulgarian Resource Grammar is to remain compatible with the other Grammar Matrix based grammars is MRS.

4. Conclusions and Future Work

The existence of a core grammar proved to be very useful, and not only in the initial steps of grammar writing, since it provides the typological background to start from and maintains compatibility with the other language descriptions. At the same time, depending on the purposes and tasks, the grammar writer has the possibility to re-model or even override some parts of the preliminary structure. Such developments would give feedback to the core grammar developers and would contribute to better generalizations over more languages.

The Bulgarian Resource Grammar, together with the English Resource Grammar, is envisaged to be used for Machine Translation within the context of the European EuroMatrixPlus project. We are using the infrastructure established within DELPH-IN and LOGON. For this task, the MRSes of parallel sentences parsed by both grammars have been aligned at the lexical and phrasal levels, and transfer rules are being defined on the basis of this alignment. For example, in the alignment of the MRSes for the sentence No cat barked, the Bulgarian side contains an additional negation relation, coming from the negated verb; otherwise, the arguments of the first negation relation coincide, as do the argument structures of the intransitive verbs. At the same time, a set of valency frames for 3,000 verbs has been extracted from BulTreeBank and will be added to the grammar lexicon. Additionally, the arguments in the valency frames have been assigned ontological classes. This step will help to select only one possible analysis in cases like John read a book (John as subject and a book as a complement), and to keep the two possible analyses in cases like Abrams chased a dog (Abrams as subject or complement, and the same for a dog).


5. Acknowledgements

The work in this paper has been supported by the Fulbright Foundation and the EU project EuroMatrix+. It profited a lot from the collaboration with Dan Flickinger (Stanford University). The author would like to thank Kiril Simov (IICT-BAS) for his valuable comments on earlier drafts of the paper, and also the two anonymous reviewers for their very useful critical reviews.

6. References

Bender, E. M., Drellishak, S., Fokkens, A., Poulson, L., Saleem, S. (2010): Grammar Customization. In: Research on Language and Computation, vol. 8 (1), pp. 23-72.
Bender, E., Flickinger, D., Good, J., Sag, I. (2004): Montage: Leveraging Advances in Grammar Engineering, Linguistic Ontologies, and Mark-up for the Documentation of Underdescribed Languages. In Proceedings of the Workshop on First Steps for Language Documentation of Minority Languages: Computational Linguistic Tools for Morphology, Lexicon and Corpus Compilation, LREC 2004, Lisbon, Portugal.
Bender, E. (2008): Evaluating a Crosslinguistic Grammar Resource: A Case Study of Wambaya. In Proceedings of ACL08: HLT, Columbus, OH.
Copestake, A., Flickinger, D., Pollard, C., Sag, I. (2005): Minimal Recursion Semantics: An Introduction. In Research on Language and Computation (2005) 3, pp. 281-332.
Flickinger, D. (2000): On building a more efficient grammar by exploiting types. In: Natural Language Engineering, 6 (1) (Special Issue on Efficient Processing with HPSG), pp. 15-28.
Kathol, A. (1997): Agreement and the Syntax-Morphology Interface in HPSG. In R. Levine and G. Green (eds.), Studies in Current Phrase Structure Grammar. Cambridge University Press, pp. 223-274.
Müller, S., Kasper, W. (2000): HPSG analysis of German. In W. Wahlster (ed.), Verbmobil: Foundations of Speech-to-Speech Translation (Artificial Intelligence ed., pp. 238-253). Berlin, Germany: Springer.
Oepen, St. (2001): [incr tsdb()] – competence and performance laboratory. User manual. Technical Report, Saarland University, Saarbruecken, Germany.
Osenova, P. (2010): Bulgarian Resource Grammar. Modeling Bulgarian in HPSG. Verlag Dr. Müller, 71 pp.
Osenova, P., Simov, K. (2010): Using the linguistic knowledge in BulTreeBank for the selection of the correct parses. In: TLT Proceedings, pp. 163-174.
Popov, D., Simov, K., Vidinska, Sv., Osenova, P. (2003): Spelling Dictionary of Bulgarian Language. Nauka i izkustvo. Sofia, 2003. (in Bulgarian)
Siegel, M. (2000): HPSG Analysis of Japanese. In: W. Wahlster (ed.): Verbmobil: Foundations of Speech-to-Speech Translation. Springer Verlag.
Siegel, M., Bender, E. (2002): Efficient deep processing of Japanese. In Proceedings of the 3rd Workshop on Asian Language Resources and International Standardization, Taipei, Taiwan.



Poster Presentations





Autorenunterstützung für die Maschinelle Übersetzung

Melanie Siegel
Acrolinx GmbH
Rosenstr. 2, 10178 Berlin
E-mail: melanie.siegel@acrolinx.com

Abstract

Der Übersetzungsprozess der Technischen Dokumentation wird zunehmend mit Maschineller Übersetzung (MÜ) unterstützt. Wir blicken zunächst auf die Ausgangstexte und erstellen automatisch prüfbare Regeln, mit denen diese Texte so editiert werden können, dass sie optimale Ergebnisse in der MÜ liefern. Diese Regeln basieren auf Forschungsergebnissen zur Übersetzbarkeit, auf Forschungsergebnissen zu Translation Mismatches in der MÜ und auf Experimenten.

Keywords: Machine Translation, Controlled Language

1. Einleitung

Mit der Internationalisierung des Markts für Technologien und Technologieprodukte steigt die Nachfrage nach Übersetzungen der Technischen Dokumentation. Vor allem in der Europäischen Union steigt das Bewusstsein, dass es nicht ausreicht, englischsprachige Dokumentation zu liefern, sondern dass Dokumentation in die Muttersprache der Kunden übersetzt werden muss. Diese Übersetzungen müssen schnell verfügbar, aktualisierbar, in mehreren Sprachen gleichzeitig verfügbar und von hoher Qualität sein. Gleichzeitig gibt es seit einigen Jahren erhebliche technologische Fortschritte in der Maschinellen Übersetzung: Es gibt regelbasierte¹ und statistische Systeme², aber auch hybride Übersetzungsverfahren³.

Diese Situation hat dazu geführt, dass Firmen mehr und mehr versuchen, ihre Übersetzungsanstrengungen mit MÜ zu unterstützen. Dabei tritt allerdings eine Reihe von Problemen auf. Die Nutzer kennen die Möglichkeiten und Grenzen der MÜ nicht gut genug. Sie werden in ihren Erwartungen enttäuscht. Um die Systeme zu testen, werden völlig ungeeignete Texte übersetzt, wie z. B. Prosa.⁴

¹ Z.B. das System Systran (http://www.systran.de/), das aber jetzt auch mit statistischen Verfahren angereichert wird (Callison-Burch et al. 2009).
² Z.B. das System Moses (Koehn, 2009; Koehn et al., 2007) oder Google Translate (translate.google.com).
³ Z.B. Federmann et al., 2010.
⁴ Beispiel hier: Saarbrücker Zeitung vom 6.10.2009: „Vom Leid mit der Übersetzung“, von Michael Brächer. Test hier mit Auszügen aus Goethes „Erlkönig“.

Auch Technische Dokumentation, die an die MÜ geschickt wird, ist oft nicht von ausreichender Qualität, ebenso wenig wie Texte, die an humane Übersetzer geschickt werden. Allerdings können humane Übersetzer diesen Mangel an Qualität im Ausgangsdokument ausgleichen, während MÜ-Systeme dazu nicht in der Lage sind.

Statistische MÜ-Systeme müssen auf parallelen Daten trainiert werden. Oft werden dafür TMX-Dateien verwendet, die aus Translation-Memory-Systemen herausgezogen werden. Da aber diese Daten oft unsauber sind und fehlerhafte und inkonsistente Übersetzungen enthalten, ist auch die Qualität der trainierten Übersetzung schlecht.

Wir haben uns mit der Frage beschäftigt, wie die Autoren Technischer Dokumentation darin unterstützt werden können, Dokumente für die MÜ optimal vorzubereiten, um auf diese Weise optimale Übersetzungsergebnisse zu bekommen. Das Ziel der Untersuchungen ist, die Möglichkeiten und Grenzen der MÜ genauer zu spezifizieren, daraus Handlungsoptionen für Autoren abzuleiten und diese durch automatische Verfahren zu unterstützen. Dabei gehen wir in drei Schritten vor:

1) Wir untersuchen die Schwierigkeiten, die ein humaner Übersetzer hat, darauf, ob sie auf MÜ-Systeme übertragbar sind.
2) Wir experimentieren mit automatisch prüfbaren Regeln der Autorenunterstützung und übersetzen Texte vor und nach der Umformulierung mit MÜ.
3) Wir ziehen Untersuchungen zu „Translation Mismatches“ in der MÜ heran, um Strukturen zu finden, die besonders schwer automatisch übersetzbar sind.

2. Schwierigkeiten von humanen Übersetzern – Schwierigkeiten von MÜ-Systemen

Heizmann (1994:5) erläutert den Übersetzungsprozess für humane Übersetzer: "In our opinion, translation is basically a complex decision process. The translator has to base his or her decisions upon available information, which he or she can get from various sources." Diese Aussage ist auch auf den Übersetzungsprozess in der MÜ übertragbar und verdeutlicht schon, dass es notwendig ist, der Maschine möglichst wenige komplexe Entscheidungsprozesse aufzubürden.

Ausgehend davon, dass ein MÜ-System einem eher unprofessionellen Übersetzer ähnlich ist, dem die Texte für die Übersetzung so vorbereitet werden sollten, dass sie einfacher übersetzbar sind, ziehen wir Parallelen vom unprofessionellen Übersetzer zum MÜ-System. Der Ausgangstext muss für Übersetzer wie für ein MÜ-System so angepasst werden, dass die Probleme möglichst umgangen werden, die der unprofessionelle Übersetzer und das MÜ-System haben:

Die Übersetzung einzelner Wörter, Phrasen und Sätze ohne die Möglichkeit, größere Übersetzungseinheiten in Betracht zu ziehen, erfordert, dass satzübergreifende Bezüge wie z.B. Anaphern vermieden werden. Die Unmöglichkeit der Paraphrasierung erfordert einfache Satzstrukturen ohne Ambiguitäten. Wichtig ist es auch, metaphorische Sprache zu vermeiden, da diese oft nicht einfach übersetzt werden kann, sondern Paraphrasierung erfordert.

Eine Übersetzung ohne Weltwissen führt dazu, dass Wörter mit unterschiedlichen Bedeutungen in verschiedenen Domänen (Homonyme) falsch übersetzt werden. Solche potentiell ambigen Wörter müssen vermieden werden.

Da das Spektrum von Übersetzungsvarianten potentiell größer als bei professionellen Übersetzern ist, ist eine systematische Terminologiearbeit am Ausgangstext hilfreich, die Terminologievarianten im Ausgangstext schon vorab eliminiert.

Da die MÜ ebenso wie der unprofessionelle Übersetzer wenige Hilfsmittel hat, die Hintergrundwissen zum beschriebenen Sachverhalt geben, muss die Beschreibung möglichst klar und verständlich sein. Das erfordert einfache Satzstrukturen.

3. Relevanz von automatisch prüfbaren Regeln der Autorenunterstützung

In einem Experiment haben wir einige Dokumente der technischen Dokumentation mit dem MÜ-System Langenscheidt T1 übersetzen lassen. Danach haben wir die Dokumente mit einer großen Anzahl automatisch prüfbarer Regeln aus Acrolinx IQ geprüft. Die Ergebnisse der Prüfungen haben wir umgesetzt, indem wir die Ausgangstexte umformuliert haben. Diese umformulierten Texte haben wir dann wieder mit Langenscheidt T1 automatisch übersetzt und die Übersetzungen miteinander verglichen. Das Ziel dieses Experiments ist es, herauszufinden, welche Regeln der Autorenunterstützung wichtige Effekte auch für die MÜ haben. Einige dieser Regeln haben wir im vorangegangenen Abschnitt Schwierigkeiten von humanen Übersetzern – Schwierigkeiten von MÜ-Systemen schon vorgestellt. Aufgrund dieser Experimente haben wir ein Regelset zusammengestellt, das wir im nächsten Abschnitt vorstellen.

4. Erste Ergebnisse der Experimente

Rechtschreibung und Grammatik: Das Regelset für die deutschen Ausgangstexte enthält zunächst die Standard-Grammatik- und Rechtschreibregeln. Die Experimente haben klar gezeigt, dass ein MÜ-System keine sinnvollen Ergebnisse liefert, wenn der Eingabetext Rechtschreib- und Grammatikfehler enthält. Wenn ein Wort unbekannt ist, weil es falsch geschrieben ist, dann ist auch keine Übersetzung mit dem MÜ-System möglich. Allerdings führt nicht jeder Rechtschreibfehler auch zu Übersetzungsproblemen: Die Experimente haben gezeigt, dass das untersuchte MÜ-System tolerant gegenüber alter und neuer deutscher Rechtschreibung ist – beide Varianten „muß“ und „muss“ wurden korrekt übersetzt.

Regeln zu Formatierung und Zeichensetzung: Der Gebrauch von Gedankenstrichen führt zu komplexen Sätzen im Deutschen, die Probleme bei der Übersetzung bereiten.


Regeln zum Satzbau: Beim Satzbau geht es zunächst darum, komplexe Satzstrukturen zu vermeiden. Oberstes Gebot ist hier, zu lange Sätze zu vermeiden. Komplexe Satzstrukturen entstehen durch Konstruktionen wie Einschübe, Hauptsatzkoordination, Trennung von Verben, eingeschachtelte Relativsätze, Schachtelsätze, Klammern, Häufung von Präpositionalphrasen, Beschreibung mehrerer Handlungen in einem Satz, umständliche Formulierungen und Bedingungssätze, die nicht mit „wenn“ eingeleitet sind. Ein anderes Problem für die MÜ sind ambige Strukturen, die durch Substantivkonstruktionen und elliptische Konstruktionen entstehen.

Regeln zur Wortwahl: Füllwörter und Floskeln sind deshalb schwierig für die MÜ, weil nicht paraphrasiert werden kann. Das MÜ-System versucht, diese Wörter zu übersetzen, obwohl ein professioneller Übersetzer sie weglassen oder umformulieren würde. Umgangssprache und bildhafte Sprache sind ebenfalls ein großes Problem. Pronomen sind dann schwierig zu übersetzen, wenn der Bezug außerhalb des Satzkontexts liegt und unklar ist. Bei der Verwendung von ambigen Wörtern kann das MÜ-System in vielen Fällen die Ambiguität nicht auflösen. Das passiert zum Beispiel bei der Verwendung von Fragewörtern in anderen Kontexten als einer Frage. Gerade ausdrucksschwache Verben mit ambigem Bedeutungsspektrum sind problematisch. Der Nominalstil, bei dem Verben nominalisiert werden, kann im Englischen zu komplexen und falschen Konstruktionen führen.
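Zur Illustration (stark vereinfachte Skizze; dies sind nicht die tatsächlichen Acrolinx-IQ-Regeln, Schwellwerte und Wortlisten sind frei gewählt) lassen sich solche Prüfungen als einfache, automatisch anwendbare Funktionen über dem Ausgangstext formulieren, etwa für Satzlänge, Gedankenstriche und Füllwörter:

# Python-Skizze: drei automatisch prüfbare Beispielregeln für einen Satz.
import re

MAX_WOERTER = 20                                     # Annahme: Grenze für „zu lange Sätze"
FUELLWOERTER = {"eigentlich", "quasi", "sozusagen"}  # Beispielliste, frei gewählt

def pruefe_satz(satz):
    befunde = []
    woerter = re.findall(r"\w+", satz)
    if len(woerter) > MAX_WOERTER:
        befunde.append("Satz zu lang")
    if "–" in satz or " - " in satz:
        befunde.append("Gedankenstrich vermeiden")
    if any(w.lower() in FUELLWOERTER for w in woerter):
        befunde.append("Füllwort vermeiden")
    return befunde

print(pruefe_satz("Das ist eigentlich – quasi – eine viel zu umständlich formulierte Anleitung."))
# -> ['Gedankenstrich vermeiden', 'Füllwort vermeiden']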

5. Anwendung der Regeln, Umformulierungen und Übersetzungen

Ein wichtiger Teil der Fragestellung war nun, ob die Anwendung der implementierten Regeln zur Autorenunterstützung tatsächlich eine Auswirkung auf die Ergebnisse der MÜ hat. Im oben beschriebenen Experiment haben wir die aufgestellten und implementierten Regeln zur Autorenunterstützung auf zwei Dokumente angewendet und die Texte nach den Empfehlungen der Regeln umformuliert. Anschließend haben wir untersucht, welche der Regeln am häufigsten auftraten und die meisten Effekte für die Qualität der MÜ-Ausgaben hatten. Hier muss jedoch angemerkt werden, dass dieses Experiment bisher nur mit zwei Dokumenten durchgeführt wurde, einer Anleitung zum Ausbau von Zündkerzen am Auto und einer Anleitung zur Installation einer Satellitenschüssel. Ein interessantes Ergebnis: In fast der Hälfte der Fälle konnte der Satz anhand von lexikalisch basierten Regeln so verbessert werden, dass die Maschinelle Übersetzung gute Ergebnisse lieferte.

6. Untersuchungen zu Translation Mismatches und daraus resultierende Empfehlungen

Kameyama et al. (1991) verwendeten den Begriff "Translation Mismatches", um ein Schlüsselproblem der maschinellen Übersetzung zu beschreiben. Bei Translation Mismatches handelt es sich um Information, die in der einen am Übersetzungsprozess beteiligten Sprache explizit nicht vorhanden ist, die aber in der anderen beteiligten Sprache gebraucht wird. Der Effekt ist, dass die Information in der einen Übersetzungsrichtung verloren geht und in der anderen hinzugefügt werden muss. Das hat – wie Kameyama beschreibt – zwei wichtige Konsequenzen:

"First in translating a source language sentence, mismatches can force one to draw upon information not expressed in the sentence - information only inferrable from its context at best. Secondly, mismatches may necessitate making information explicit which is only implicit in the source sentence or its context." (S. 194)

Translation Mismatches sind für die Übersetzung eine große Herausforderung, weil Wissen, das nicht direkt sprachlich kodiert ist, inferiert werden muss. Welche Translation Mismatches relevant sind, hängt aber stark von der Information ab, die in den beteiligten Sprachen kodiert ist. Für das Sprachpaar Deutsch-Englisch konnten wir in den Experimenten die folgenden Translation Mismatches identifizieren:

Lexikalische Mismatches. Die Bedeutung ambiger Wörter in der Ausgangssprache muss in der Zielsprache aufgelöst werden, wie z.B. bei „über“ -> „about“, „above“.

Nominalkomposita. Nach den Regeln der deutschen Rechtschreibung müssen Nominalkomposita entweder zusammen oder mit Bindestrich geschrieben werden. Wenn sie zusammengeschrieben werden, muss die Analyse der MÜ die Teile identifizieren; das ist aber nicht immer eindeutig im Deutschen. Wenn andererseits auch im Deutschen wie im Englischen ein Leerzeichen zwischen den Teilen des Kompositums steht, dann ist die MÜ-Analyse oft überfordert, weil die Beziehung zwischen den Nomen unklar bleibt. Z.B.: „bei den heutzutage verwendeten Longlife Kerzen“ – „at the nowadays used ones“.

Metaphorik. Bildhafte Sprache lässt sich nicht wörtlich übertragen. Ein Beispiel aus den Experimenten: „Man ist daher leicht geneigt“ – „One is therefore slightly only still to“.

Pronomen. Das Pronomen „Sie“ meint im Deutschen sowohl die 3. Person Singular als auch die 2. Person Singular, abhängig von der Großschreibung. Wenn das „Sie“ aber am Satzanfang steht, bleibt unklar, welche Variante gemeint ist. Beispiel: „Sie haben es fast geschafft“ – „her it have created almost“.

7. Zusammenfassung und nächste Schritte

Wir haben ein Regelset für die automatische Autorenunterstützung aufgestellt. Dieses Regelset basiert auf Untersuchungen zu Problemen humaner Übersetzer, auf Experimenten mit MÜ und Umformulierungen und auf Untersuchungen zu Translation Mismatches in der MÜ. In einem nächsten Schritt haben wir das entstandene Regelset in Experimenten mit verschiedenen MÜ-Systemen validiert. Die Übersetzungen wurden dieses Mal von professionellen Übersetzern und Übersetzerinnen validiert. Eine erste Auswertung der Validierungen ergab:

• Umformulierungen durch Regeln hatten keinen Einfluss auf das Ranking der Ergebnisse verschiedener MÜ-Systeme.
• Die Anzahl der klassifizierbaren Fehler der MÜ-Systeme steigt, während die Anzahl der nicht klassifizierbaren Fehler sinkt. Übersetzungen der umformulierten Texte enthalten weniger Grammatikfehler.
• Die Anzahl der korrekten Übersetzungen steigt stark.

Die Regeln für das Pre-Editing können zum Teil automatische Vorschläge für die Umformulierung geben. Wir suchen nach einem Weg, aus diesen Vorschlägen ein automatisches Pre-Editing zu erzeugen.


8. Acknowledgements

Dieses Vorhaben wird durch die TSB Technologiestiftung Berlin aus Mitteln des Zukunftsfonds des Landes Berlin gefördert, kofinanziert von der Europäischen Union – Europäischer Fonds für Regionale Entwicklung. Investition in Ihre Zukunft!

9. Literatur

Callison-Burch, C., Koehn, P., Monz, C., Schroeder, J. (2009): Findings of the 2009 Workshop on Statistical Machine Translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation (WMT09), March.
Drewer, P., Ziegler, W. (2011): Technische Dokumentation. Übersetzungsgerechte Texterstellung und Content-Management. Würzburg: Vogel-Verlag.
Federmann, C., Eisele, A., Uszkoreit, H., Chen, Y., Hunsicker, S., Xu, J. (2010): Further Experiments with Shallow Hybrid MT Systems. In: Callison-Burch, C., Koehn, P., Monz, C., Peterson, K., Zaidan, O. (eds.): Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pp. 77-81, Uppsala, Sweden: Association for Computational Linguistics (ACL), 7/2010.
Heizmann, S. (1994): Human Strategies in Translation and Interpreting – what MT can Learn from Translators. Verbmobil Report 43. Universität Hildesheim.
Kameyama, M., Ochitani, R., Peters, S. (1991): Resolving Translation Mismatches With Information Flow. In: Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, pp. 193-200.
Klausner, K. (2011): Einsatzmöglichkeiten kontrollierter Sprache zur Verbesserung maschineller Übersetzung. BA-Arbeit, Fachhochschule Potsdam, Januar 2011.
Koehn, P. (2009): A Web-Based Interactive Computer Aided Translation Tool. In Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, Suntec, Singapore.
Koehn, P., Hoang, H., Birch, A. (2007): Moses: Open Source Toolkit for Statistical Machine Translation. Paper presented at the Annual Meeting of the Association for Computational Linguistics (ACL), Prague, Czech Republic.
Siegel, M. (1997): Die maschinelle Übersetzung aufgabenorientierter japanisch-deutscher Dialoge. Lösungen für Translation Mismatches. Berlin: Logos.



Experimenting with Corpus-Based MT Approaches

Monica Gavrila
University of Hamburg
Vogt-Kölln Str. 30, 22527 Hamburg, Germany
E-mail: gavrila@informatik.uni-hamburg.de

Abstract

There is no doubt that in the last years corpus-based machine translation (CBMT) approaches have been in focus. Among them, the statistical MT (SMT) approach has been by far the more dominant, although the Workshop on Example-Based MT (EBMT) at the end of 2009 showed a revived interest in the other important CBMT approach: EBMT. In this paper several MT experiments for English and Romanian are presented. In the experimental settings several parameters have been changed: the MT system, the corpus type and size, and the inclusion of additional linguistic information. The results obtained by a Moses-based SMT system are compared with the ones given by Lin-EBMT, a linear EBMT system implemented during the research. Although the SMT system outperforms the EBMT system in all the experiments, different behaviors of the systems have been noticed while changing the parameters in the experimental settings, which can be of interest for further research in the area.

Keywords: Machine Translation, SMT, EBMT, Moses, Lin-EBMT

1. Introduction

There is no doubt that in the last years corpus-based machine translation (CBMT) approaches have been in focus. Among them, the statistical machine translation (SMT) approach has been by far the more dominant. However, the Workshop on Example-Based MT (EBMT) at the end of 2009¹ showed a revived interest in the other important CBMT approach: EBMT.

Between these two MT approaches there has always been a 'competition'. The similar and unclear definitions and the mixture of ideas make the difference between them difficult to distinguish. In order to show the advantages of one or the other method, comparisons between SMT and EBMT (or hybrid) systems are found in the literature. The results, depending on the data type and on the systems considered, seem to be positive for both approaches: (Way & Gough, 2005) and (Smith & Clark, 2009). For English-Romanian as a language pair, results for both SMT and EBMT systems have been reported, although a comparison between the two approaches has not been made. SMT systems are presented in (Cristea, 2009) and (Ignat, 2009); results of an EBMT system are reported in (Irimia, 2009).

¹ http://computing.dcu.ie/~mforcada/ebmt3/ - last accessed on January 2011.

In this paper several MT experiments for English (ENG) and Romanian (RON) are presented. In the experimental settings several parameters have been changed: the MT system (approach), the type and size of the corpus, and the inclusion of additional part-of-speech (POS) information. The results obtained by a Moses-based SMT system are compared with the ones given by Lin-EBMT, a linear EBMT system implemented during the research. The same training and test data have been used for both MT systems.

The following section briefly presents both MT systems. The data used and the translation results are described in Section 3, together with a very brief analysis of the results. The paper ends with conclusions and some ideas about further work.

2. System Description

In this section the two CBMT systems are briefly characterized.

The SMT system used follows the description of the baseline architecture given for the Sixth Workshop on SMT² and is based on Moses (Koehn et al., 2007).

² http://www.statmt.org/wmt11/baseline.html - last accessed on June 2011.



Moses is an SMT system that allows the user to automatically train translation models for the required language pair, provided that the necessary parallel aligned corpus is available. In our experiments we used

SRILM (Stolcke, 2002) for building the language model<br />

and GIZA++ (Och & Ney, 2003) for obtaining the word<br />

alignment. Two changes have been made to the specifications of the Workshop on SMT: the tuning step

was left out and the language model (LM) order was 3,<br />

instead of 5. Leaving out the tuning step was motivated by results we obtained in experiments (not the topic of this paper) comparing different system settings: not all tests in which tuning was involved showed an improvement. We changed the LM order due to results presented in the SMART project 3.

Lin-EBMT is the EBMT system developed during the<br />

research. It is mainly based on surface forms (linear EBMT

system) and uses no additional linguistic resources. Due<br />

to space reasons, the main steps of the Lin-EBMT system<br />

- matching, alignment and recombination - are not<br />

described in detail in this paper. We will just present the<br />

main translation steps.<br />

The test corpus is preprocessed in the same way as specified for the Moses-based SMT system:

tokenization and lowercasing. In order to reduce the<br />

search space a word index is used, a method that is often<br />

encountered in the literature, e.g. (Sumita & Iida, 1991).<br />

The information needed in the translation, such as the word-index 4 or the GIZA++ word alignments, is

extracted prior to the translation process itself.<br />

The main steps in Lin-EBMT, carried out for each of the input sentences of the test data, are enumerated below:

1) The tokens 5 in the input, excluding punctuation, are extracted: {token_1, token_2, ..., token_n}.

2) Using the word-index, the ids of all sentences that contain at least one token from the input are collected: {sentenceId_1, ..., sentenceId_m}. The list of sentence ids contains no duplicates. The word-index is used to reduce the search space for the matching step; the matching procedure is run only after the search space has been reduced by means of this index.

3) Given the preprocessed input sentence and the list of sentence ids {sentenceId_1, ..., sentenceId_m}, the matching between the input and the 'reduced' source language (SL) side of the corpus is done. If the input sentence is found in the corpus, the translation is found and the translation procedure stops. Otherwise, the most similar sentences are extracted using a similarity measure developed during the research. This measure is based on the longest common subsequence algorithm described in (Bergroth et al., 2000).

4) After obtaining the sentences which maximally cover the input, the corresponding word alignments are extracted, considering the longest aligned target language (TL) subsequences possible.

3 www.smart-project.eu – last accessed in June 2011.

4 The word-index is in fact a token index, as it also contains punctuation signs, numbers, etc.

5 A token is a word, a number or a punctuation sign.

5) Using the "bag of TL sequences" obtained from the alignment, the output is generated by making use of a recombination matrix, a new approach for implementing this step.

More details about the Lin-EBMT system can be found in<br />

(Gavrila, 2011).
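To make the role of the word-index and of the LCS-based similarity more concrete, the following is a minimal sketch in Python with hypothetical data structures; it is not the actual Lin-EBMT implementation (which is described in (Gavrila, 2011)), and the ranking shown here is only illustrative, whereas the paper's own measure follows (Bergroth et al., 2000):

from collections import defaultdict

def build_word_index(sl_corpus):
    # Map every token to the ids of the SL sentences containing it.
    index = defaultdict(set)
    for sent_id, tokens in enumerate(sl_corpus):
        for tok in tokens:
            index[tok].add(sent_id)
    return index

def lcs_length(a, b):
    # Length of the longest common subsequence, computed by dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def most_similar(input_tokens, sl_corpus, index, top_n=3):
    # Only sentences sharing at least one token with the input are scored.
    candidates = set()
    for tok in input_tokens:
        candidates |= index.get(tok, set())
    ranked = sorted(candidates,
                    key=lambda i: lcs_length(input_tokens, sl_corpus[i]),
                    reverse=True)
    return ranked[:top_n]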

3. Evaluation<br />

We used two corpora for our evaluation. The first is a sub-part of the JRC-Acquis version 2.2 (Steinberger et al., 2006), a freely available parallel corpus in 22 languages, formed from European Union documents of a mostly legal nature; the second is RoGER, a small technical-manual corpus which was manually created and corrected (Gavrila & Elita, 2006). The same training and test data

has been used for both SMT and EBMT experiments.<br />

In the EBMT system, matching is done on the corpus used for the translation model in the SMT system, and recombination on the one used for the language model. Both corpora had to be stored in the format required by each of the MT systems.

The tests on the JRC-Acquis data have been run on 897<br />

sentences, which were not used for training. Sentences<br />

were automatically removed from different parts of the<br />

corpus to ensure a relevant lexical, syntactic and<br />

semantic coverage. Three sets of 299 sentences represent<br />

the data sets Test 1, Test 2, and Test 3, respectively. Test<br />

1+2+3 is formed from all 897 sentences. The test data has<br />

no sentence length restriction, as the training data (see<br />

Moses specification).<br />

From RoGER, 133 sentences (Test R) have been randomly extracted as the test data, with the remaining 2200 sentences representing the training data. When using

RoGER, POS information was considered for some of<br />

the experiments: data set Test RwithPOS 6.

The obtained translations have been evaluated using two<br />

automatic evaluation metrics: BLEU (Papineni et al.,<br />

2002) and TER (Snover et al., 2006). The choice of metrics is motivated by the available resources (software) and, for comparison reasons, by the results reported in the literature. Due to the lack of data and of further translation possibilities, the comparison was made against only one reference translation.
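As an illustration of this evaluation setup, the following hedged sketch computes a corpus-level BLEU score against a single reference per sentence; it uses NLTK's implementation rather than the exact scoring tools used in the paper, and the toy sentences below are only an example:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_single_reference(hypotheses, references):
    # corpus_bleu expects a list of reference lists per hypothesis,
    # so each single reference is wrapped in a one-element list.
    wrapped_refs = [[ref] for ref in references]
    smooth = SmoothingFunction().method1  # avoids zero scores for short segments
    return corpus_bleu(wrapped_refs, hypotheses, smoothing_function=smooth)

# Toy example (tokenized and lowercased, as in the preprocessing described above):
hyps = [["comitetul", "mixt", "al", "see"]]
refs = [["comitetul", "mixt", "al", "see", ","]]
print(bleu_single_reference(hyps, refs))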

We present the evaluation scores obtained in Tables 2<br />

and 3.<br />

ENG-RON          SMT      Lin-EBMT
Test 1           0.5007   0.8071
Test 2           0.4898   0.6400
Test 3           0.5208   0.7770
Test 1+2+3       0.5023   0.7326
Test R           0.3784   0.5955
Test RwithPOS    0.4748   0.6402

RON-ENG          SMT      Lin-EBMT
Test 1           0.5020   0.7041
Test 2           0.3756   -
Test 3           0.4684   -
Test 1+2+3       0.4457   -
Test R           0.3465   0.5443
Test RwithPOS    0.4000   0.5490

Table 2: Evaluation Results (TER scores)

The lower the TER scores, the better the translation<br />

results. For the BLEU score the relationship between the<br />

scores and the translation quality is the opposite.<br />

While analyzing the behavior of each of the MT systems when changing the test data set for one corpus (i.e. JRC-Acquis), several factors were found to have a direct influence on the results, such as the number of out-of-vocabulary words, the number of test sentences directly found in the training data, sentence length, or the way of extracting the training data: see Test 1 – Test 3.

For a specific data set (Test 2), the BLEU score obtained for the EBMT system is similar 7 to the one presented in (Irimia, 2009), where linguistic resources were used.

6 The POS information has been provided by the text processing web services found at www.racai.ro/webservices/TextProcessing.aspx - last accessed in January 2011.

7 A one-to-one comparison is not possible, as the data is not the same.

Considering the analysis of the behavior of each MT system when changing the corpus (a larger and a smaller corpus, fitting the SMT and the EBMT framework, respectively), i.e. when comparing Test 1+2+3 and Test R, an improvement is found in both cases for the RoGER corpus, although it is usually stated that a large corpus is needed for SMT. In this specific case the result might be due to the data type, which shows the strong influence of the data on empirical approaches.

ENG-RON          SMT      Lin-EBMT
Test 1           0.3997   0.1335
Test 2           0.4179   0.3072
Test 3           0.3797   0.1476
Test 1+2+3       0.4015   0.2125
Test R           0.4396   0.2689
Test RwithPOS    0.3879   0.2942

RON-ENG          SMT      Lin-EBMT
Test 1           0.2545   0.0855
Test 2           0.5628   -
Test 3           0.4271   -
Test 1+2+3       0.4255   -
Test R           0.4765   0.2783
Test RwithPOS    0.4618   0.3624

Table 3: Evaluation Results (BLEU scores)

The results for the data with additional POS information (Test RwithPOS) are not conclusive: when considering the TER scores, worse results are obtained for both MT systems, but when considering the BLEU scores, an improvement is noticed for the EBMT system.

In terms of overall BLEU and TER scores, the EBMT<br />

system is outperformed by the SMT one. Still, there are<br />

cases where the EBMT system provides a better<br />

translation, as in the example below:<br />

Input: The EEA Joint Committee<br />

Reference: Comitetul mixt al SEE,<br />

SMT output: SEE Comitetului mixt,<br />

(* ENG: EEA of the Joint Committee)<br />

Lin-EBMT output: Comitetului mixt SEE<br />

(* ENG: of the EEA Joint Committee)<br />

4. Conclusions and Further Work<br />

In this framework - system configuration and data -, in a<br />

direct comparison, the EBMT system was not able to<br />

match the performance of the SMT system, but there<br />


were examples where its translation was more accurate. The evaluation scores presented in this paper

show how much training and test data influence the<br />

translation results. In this EBMT implementation not all<br />

the power of the approach was used, so there is room for<br />

improvement. As further work, additional information,<br />

e.g. word-order information from the TL sentences, is to

be extracted and used in the recombination step.<br />


5. References<br />

Bergroth, L., Hakonen, H., Raita, T. (2000): A survey of<br />

longest common subsequence algorithms. In Proc. of<br />

the Seventh International Symposium on String<br />

Processing and Information Retrieval - SPIRE 2000,<br />

pp. 39-48, Spain. ISBN: 0-7695-0746-8.<br />

Cristea, D. (2009): Romanian language technology and<br />

resources go to Europe. Presented at the FP7

Language Technology Informative Days. URL:<br />

ftp://ftp.cordis.europe.eu/pub/fp7/ict/docs/language-technologies/cristea en.pdf - last accessed on April 10th, 2009.

Gavrila, M., Elita, N. (2006): Roger - un corpus paralel<br />

aliniat. In Resurse Lingvistice si Instrumente pentru<br />

Prelucrarea Limbii Romane Workshop Proceedings,<br />

pages 63-67. Workshop held in November 2006,<br />

Publisher: Ed. Univ. Alexandru Ioan Cuza, ISBN:<br />

978-973-703-208-9.<br />

Gavrila, M. (<strong>2011</strong>): Constrained recombination in an<br />

example-based machine translation system. In Mikel L. Forcada, Heidi Depraetere, Vincent Vandeghinste, editors, Proceedings of the EAMT-2011 Conference, pages 193-200, Leuven, Belgium, May 2011. ISBN:

9789081486118.<br />

Ignat, C. (2009): Improving Statistical Alignment and<br />

Translation Using Highly Multilingual Corpora. PhD<br />

thesis, INSA - LGeco- LICIA, Strasbourg, France.<br />

URL: http://sites.google.com/site/cameliaignat/home/<br />

phd-thesis (last accessed on August 3rd, 2009).

Irimia, E. (2009): EBMT experiments for the<br />

English-Romanian language pair. In Proceedings of<br />

the Recent Advances in Intelligent Information<br />

Systems, pages 91-102. ISBN 978-83-60434-59-8.<br />

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C.,<br />

Federico, M., Bertoldi, N., Cowan, B., Shen, W.,<br />

Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin,<br />

A., Herbst, E. (2007): Moses: Open source toolkit for<br />

statistical machine translation. In Annual Meeting of<br />

the Association for Computational Linguistics (ACL),<br />

demonstration session, Prague, Czech Republic.<br />

Och, F. J., Ney, H. (2003): A systematic comparison of<br />

various statistical alignment models. Computational<br />

Linguistics, 29(1), pp. 19-51.<br />

Papineni, K., Roukos, S., Ward, T., Zhu, W.-J. (2002):<br />

Bleu: a method for automatic evaluation of machine<br />

translation. In Proceedings of the 40th Annual Meeting<br />

on Association for Computational Linguistics, Session:<br />

Machine translation and evaluation, pp. 311-318,<br />

Philadelphia, Pennsylvania. Publisher: Association for<br />

Computational Linguistics Morristown, NJ, USA.<br />

Smith, J., Clark, S. (2009): EBMT for SMT: A new<br />

EBMT-SMT hybrid. In Forcada, M. L. and Way, A.,<br />

editors, Proceedings of the 3rd International

Workshop on Example-Based Machine Translation,<br />

pp. 3-10, Dublin, Ireland.<br />

Snover, M., Dorr, B., Schwartz, R., Micciulla, L.,<br />

Makhoul, J. (2006): A study of translation edit rate<br />

with targeted human annotation. In Proceedings of<br />

Association for Machine Translation in the Americas.<br />

Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C.,<br />

Erjavec, T., Tufis, D., Varga, D. (2006): The<br />

JRC-Acquis: A multilingual aligned parallel corpus<br />

with 20+ languages. In Proceedings of the 5th<br />

International Conference on Language Resources and<br />

Evaluation (LREC'2006), Genoa, Italy.<br />

Stolcke, A. (2002): SRILM - An extensible language<br />

modeling toolkit. In Proceedings of the International<br />

Conference on Spoken Language Processing,<br />

pp. 901-904, Denver, Colorado.<br />

Sumita, E., Iida, H. (1991): Experiments and prospects of<br />

example-based machine translation. In Proceedings of<br />

the 29th annual meeting on Association for<br />

Computational Linguistics, pp. 185-192, Morristown,<br />

NJ, USA. Association for Computational Linguistics.<br />

Way, A., Gough, N. (2005): Comparing example-based<br />

and statistical machine translation. Natural Language<br />

Engineering, 11, pp. 295-309. Cambridge University<br />

Press.



Method of POS-disambiguation Using Information about Words Co-occurrence<br />

(For Russian)<br />

Edward Klyshinsky 1 , Natalia Kochetkova 2 , Maxim Litvinov 2 , Vadim Maximov 1<br />

1 Keldysh IAM<br />

Moscow, Russia, 125047 Miusskaya sq. 4<br />

2 Moscow State Institute of Electronics and Mathematics<br />

Moscow, Russia, 109029 B. Tryokhsvyatitelsky s. 3<br />

E-mail: klyshinsky@itas.miem.edu.ru, natalia_k_11@mail.ru, promithias@yandex.ru, vadimmax2000@mail.ru<br />

Abstract<br />

The article describes a complex method of part-of-speech disambiguation for texts in Russian. The introduced method is based on information concerning the syntactic co-occurrence of Russian words. The article also discusses the method of building such a corpus. This project is partially funded by RFBR grant 10-01-00805.

Keywords: learning corpora, words co-occurrence base, POS-disambiguation<br />

1. Introduction<br />

Part-of-speech disambiguation is an important problem in automatic text processing. At present there exist many systems which solve this problem. The earliest projects use rule-based methods (see, for example, Tapanainen & Voutilainen, 1994). This approach is based on the following idea: the system is supplied with limiting rules which forbid or allow certain word combinations. However, this method requires a

time-consuming procedure of writing the rules. Besides,<br />

though these rules provide a good result, they often leave<br />

a considerable part of the text uncovered. For this reason, various statistical methods for the automatic generation of such rules have appeared (for example Brill, 1995).

The n-gram method uses the statistical distribution of word combinations in the text. Generally, the n-gram model can be written down as follows:

P(w_i) = argmax P(w_i | w_{i-1}) * ... * P(w_i | w_{i-N}),   (1)

where P(w_i) is the probability of the occurrence of an unknown tag w_i, given that the tags of its neighbours are known.

In order to avoid the problem of rare data and of getting a zero probability for the occurrence of a tag combination, the smoothed probability can be applied for the trigram model. The smoothed trigram model contains a linear combination of trigram, bigram and unigram probabilities:

P_smooth(w_i | w_{i-2} w_{i-1}) = λ3 * P(w_i | w_{i-2} w_{i-1}) + λ2 * P(w_i | w_{i-1}) + λ1 * P(w_i),   (2)

where the sum of coefficients λ1+λ2+λ3 = 1, λ1>0, λ2>0,<br />

λ3>0. The values for λ1, λ2, λ3 are obtained by solving the<br />

system of linear equations.<br />
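As a minimal illustration of this linear interpolation (formula (2)), the following sketch assumes that the raw trigram, bigram and unigram relative frequencies have already been estimated from a tagged corpus and that the lambda coefficients have been fixed; the values used below are invented:

def smoothed_trigram_prob(w, prev1, prev2, p_tri, p_bi, p_uni,
                          lambdas=(0.2, 0.3, 0.5)):
    # P_smooth(w | prev2 prev1) as a linear combination of raw relative
    # frequencies; lambdas = (lambda1, lambda2, lambda3) must sum to 1.
    l1, l2, l3 = lambdas
    return (l3 * p_tri.get((prev2, prev1, w), 0.0)
            + l2 * p_bi.get((prev1, w), 0.0)
            + l1 * p_uni.get(w, 0.0))

# Toy usage with invented relative frequencies over POS tags:
p_uni = {"NOUN": 0.4}
p_bi = {("ADJ", "NOUN"): 0.7}
p_tri = {("PREP", "ADJ", "NOUN"): 0.9}
print(smoothed_trigram_prob("NOUN", "ADJ", "PREP", p_tri, p_bi, p_uni))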

In (Zelenkov, 2005) the authors, in their disambiguation model, defined an unknown tag w_i by involving not only the information on the left neighbours, but also on the right ones. We use a similar approach when our system works with the trigram model. In this case the unknown

tag is defined by involving the left neighbours<br />

(3), the right ones (4), and<br />

both the left and the right ones (5).<br />

However, both the rule-based and trigram models require<br />

large tagged corpora of texts. The trigram rules which do<br />

not contain the information on a lexeme reflect specific<br />

language features, but the trigrams themselves (with<br />

lexemes inside) reflect rather the lexis in use. If the texts<br />

from another knowledge domain are given, the trigrams<br />

may show considerably worse results than for the initial<br />

corpus.<br />

According to Google research, the digital collection of English texts they possess contains 10^12 words. The British (BNC, 2011) and American (ANC, 2011) National Corpora contain about 10^8 tagged words. According to information from January 2008, the Russian National Corpus (RNC, 2011) contains about 5.8*10^6 disambiguated words (and it still remains at roughly this level). At present the process of filling up these corpora is rather frozen than active (unlike the situation in the first years of the projects, when they were being filled up intensively). The task of tagging (even if automated) 10^12 words seems to be economically impracticable, and may even be unnecessary. The realization of practical applications for processing 10^9 trigrams (a quantity estimation for the English language can be found in Google (2006)) would require a considerable amount of computational resources.

At present there are accumulated trigram bases that solve the problem with 94-95% accuracy for Russian (Sokirko, 2004). Additional methods increase the quality of the disambiguation up to 97.3% (Lyashevskaya, 2010). It is worth noting that the application of rule-based methods requires essential time expenses. The application of trigrams demands a well-tagged corpus, which is a costly problem too. Rule creation is also connected with permanent work by linguists. The results of such work are never in vain, as the output remains applicable to many other projects, but such results cannot improve the accuracy immediately. For these reasons we set the goal of developing a new method which would use the results of previous developments accumulated in this field together with information from partial syntactic analysis.


2. Obtaining statistical data on<br />

co-occurrence of words<br />

It is widely acknowledged that a resolution of lexical<br />

ambiguity should be provided before a syntactic analysis.<br />

In this case it is recommended to apply methods like n-grams. However, the n-gram method requires substantial preliminary work to prepare a tagged text corpus. We have decided to develop a disambiguation method which uses syntactic information (obtained automatically) without carrying out full syntactic parsing. In our research we focused on Russian.

As practice has shown, full parsing that would construct the complete tree is not required to remove most of the homonymy (about 90%). It is sufficient to include rules for word collocation in noun and verb phrases, the folding of homogeneous parts of the sentence, agreement of subject and predicate, preposition and case government, and some others, in total not exceeding 20 rules, which can be described by a context-free grammar. A more detailed look at methods for the formal description of language can be found, for instance, in Ermakov (2002).

To solve the problems mentioned above, it is necessary to create a method for obtaining information on the syntactic relationships of words from a non-tagged corpus. Preliminary experiments have shown that in the Russian language approximately 50% of words are part-of-speech unambiguous (up to 80% in conversational texts, compared with less than 40% for news texts in English), i.e. such words have no lexical homonyms. So the probability of finding a group of unambiguous words in a text is rather high.

The analysis of Russian sentence structure allows us to determine some of its characteristic syntactic features.

1) The noun phrase (NP) which follows the sole verb in<br />

the sentence is syntactically dependent on this verb.<br />

2) The sole NP which opens the sentence and is<br />

followed by a verb, is syntactically subordinated to<br />

this verb.<br />

3) The adjectives that are located before the first noun<br />

in the sentence, or between a verb and a noun, are<br />

syntactically subordinated to this noun.<br />

4) Points 1-3 can also be applied to adverbial participles, and participles can be considered instead of adjectives (a simplified sketch of these heuristics is given below).
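The sketch below shows one possible, simplified reading of heuristics 1 and 3 for harvesting dependency pairs from POS-unambiguous tokens; the coarse POS labels and the exact conditions are assumptions made for illustration and do not reproduce the authors' implementation:

def extract_pairs(tokens, pos_of):
    # tokens: POS-unambiguous tokens of one sentence;
    # pos_of(t) returns a coarse tag such as 'V', 'N' or 'ADJ'.
    pairs = []
    verb_positions = [i for i, t in enumerate(tokens) if pos_of(t) == 'V']
    if len(verb_positions) == 1:          # heuristic 1 requires the sole verb of the sentence
        v = verb_positions[0]
        following_nouns = [i for i in range(v + 1, len(tokens)) if pos_of(tokens[i]) == 'N']
        if following_nouns:
            pairs.append(('V+N', tokens[v], tokens[following_nouns[0]]))
    for i, t in enumerate(tokens):        # heuristic 3: an adjective depends on the next noun
        if pos_of(t) == 'ADJ':
            for j in range(i + 1, len(tokens)):
                if pos_of(tokens[j]) == 'N':
                    pairs.append(('N+Adj', tokens[j], t))
                    break
    return pairs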

We applied our method to the processing of several untagged corpora of the Russian language. The total size of these corpora was more than 4.2 billion words. The text sources contain texts on various themes in Russian. The corpora used include the sources given in Table 1.

The morphological tagging was carried out with the help of the morphological analysis module of "Crosslator", developed by our team (Yolkeen, 2003). The volume of the databases obtained is listed in Table 2 below. The numerator shows the total number of detected unambiguous words with the given fixed type of syntactic relation; the denominator shows the number of unique word combinations of the given type.

The analysis of the results (Table 2) has shown that the selected pairs contain 22,200 verbs out of the 26,400 represented in the morphological dictionary, 55,200 nouns out of 83,000 and 27,600 adjectives out of the 45,300 represented in the dictionary. The significant number of verbs can be explained by their low degree of ambiguity compared with other parts of speech. The small number of adjectives can be explained by the fact that, of several adjectives located immediately before a noun, only the first one was entered into the database. It should be noted that when the largest corpus was integrated into the system, the number of lexemes did not change notably, but at the same time the number of detected pairs increased significantly. For example, the number of verbs increased from 21,500 to 22,200, whereas the number of unique combinations of the verb + noun type increased from 8.3 mln to 10.9 mln. Moreover, the number of such combinations that occurred more than twice increased from 2.3 to 4 mln. Thus, it is possible to say that when a corpus contains more than one billion words, the lexis in use reaches its saturation limit, while its usage continues to change.

Source               Amount (mln w/u)    Source               Amount (mln w/u)
WebReading           3049                Lenta.ru             33
Moshkov's Library    680                 Rossiyskaya Gazeta   29
RIA News             156                 PCWeek RE            28
Fiction coll.        120                 RBC                  21
Nezavisimaya Gazeta  89                  Compulenta.ru        9
Total                4214

Table 1: Used corpora

Pair Total, mln >1, mln >2, mln<br />

V+N 243 / 10.89 237 / 5.27 235 / 4<br />

Ger+N 40.8 / 2.76 39.3 / 1.25 38.7 / 0.91<br />

N+Adj 67 / 2.15 66 / 1.13 65.6 / 0.9<br />

Table 2: Obtained results<br />

About 9% of all word occurrences in the corpus were used to build the co-occurrence base. But even this percentage proved to be sufficient to construct a representative sample of word co-occurrence statistics. Our estimations have shown that the extracted word combinations contain no more than 3% errors, mostly caused by improper word order, the neglect of some syntactically acceptable variants of collocations, deviations in projectivity and mistakes in the text. It is necessary to stress that all results were obtained in a very short time without any manual tagging of the corpus. The results could probably be more representative if we used some method of part-of-speech disambiguation. However, the best methods give a 3-5% error rate, which would affect the accuracy of the results, though not noticeably. On the other hand, a sharp increase in corpus volume will allow the false alternatives to be neglected at a higher level of occurrence and by these means preserve the quality level.

3. Complex Method of Disambiguation<br />

After we had collected a sufficiently large co-occurrence base, we had everything necessary to solve the main problem, that is, to create a method of disambiguation for texts in Russian on the basis of information on the syntactic co-occurrence of words. Let us assume that the sentence being parsed contains two words separated by only a few words or by none at all, and that it is known that these two words can be linked by a syntactic relation. In this case, if there are other, less probable variants of tagging these words, it is possible to assume that the variant with such a link will be more probable. The most difficult thing is to collect a representative base of syntactic relations.

In this paper a rule shall be understood as an ordered set <vi, vi+1, vi+2>, where vi = <pw, {pr}> is a short description of a word, pw is the part of speech of the word, and {pr} is a set of lexical parameters of the word. Thus, in such rules the lexemes of the words are not taken into account, in contrast to the lexical characteristics of the words. A rule may be interpreted in different ways and can be written down as an occurrence of vi with regard to its right neighbours, as an occurrence of vi+2 with regard to its left neighbours, or as an occurrence of vi+1 with regard to both of its neighbours. The set of rules has been obtained from the tagged corpus. Following Zelenkov (2005), we tag a word considering its right and left neighbours. In the above-mentioned paper, the tag of a word is defined only with regard to the nearest neighbours of the current word. However, it is not necessary to produce a result that reaches the global maximum. An exhaustive search over word tagging variants is usually avoided, as it takes too much time.

As already noted above, the ratio of unambiguous tokens in Russian is about 50%. There is therefore always a reasonable probability of finding a group of two unambiguous words, and the chance grows as the length of the sentence increases. If such groups are not found while searching for a global maximum, the first word in the sentence will indirectly influence even the last word. If such groups are present, this relationship is cancelled, and the search for the global criterion can be carried out over separate fragments of the sentence. This substantially increases the speed of the algorithm. So the sentence

“Так думал молодой повеса, / Летя в пыли на<br />

почтовых, / Всевышней волею Зевеса / Наследник<br />

всех своих родных.” (Such were a young rake's<br />

meditations – / By will of Zeus, the high and just, / The<br />

legatee of his relations – / As horses whirled him through<br />

the dust.) can be split into three independent parts: “Так<br />

думал молодой повеса, Летя”, “Летя в пыли на<br />

почтовых” and “Всевышней волею Зевеса Наследник<br />

всех своих родных”.<br />

Thus, we no longer consider the problem

P_sent = argmax( ∏_{i=1..ns} P(v_i | v_{i-1}, v_{i-2}) ),

where ns is the number of words in the sentence, but

P_sent = ∏_{i=1..nf} argmax( ∏_{k=1..nf_i} P(v_k | v_{k-1}, v_{k-2}) ),

where nf is the number of fragments and nf_i is the number of words in the i-th fragment. According to formulas (2)-(4), we consider both the left and the right neighbours of the word.
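The following sketch shows a simplified, non-overlapping variant of this fragmentation (in the paper the boundary word may belong to both neighbouring fragments, as in the quoted example); the predicate is_unambiguous is a hypothetical helper that checks whether a token has a single possible tag:

def split_into_fragments(tokens, is_unambiguous):
    # Cut the sentence between two adjacent POS-unambiguous tokens,
    # so that the argmax search can be run on each fragment separately.
    fragments, current = [], []
    prev_unambiguous = False
    for tok in tokens:
        if prev_unambiguous and is_unambiguous(tok) and current:
            fragments.append(current)   # cut between the two unambiguous words
            current = []
        current.append(tok)
        prev_unambiguous = is_unambiguous(tok)
    if current:
        fragments.append(current)
    return fragments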

We seek the optimum from the edges of the fragment towards its center. Obviously, the product of the maximal probability values for each word can give the global maximum. If this is not the case, but the values obtained from the two sides lead to one and the same disambiguation of the word in the middle of the fragment, then we also consider that we have a good enough solution. If the disambiguation variants for the word in the middle of the fragment differ between the two solutions, the optimization is carried out over the accessible variants until they reach one and the same decision. In any case, the optimization is not carried out even over an entire fragment, let alone the whole sentence.

The number of unambiguous fragments can be increased by a preliminary disambiguation using another method. We use the base of syntactic dependencies described above. So, let us have a set {<w1, w2, w3, p>}, where wi = <lw, pw, {pr}> is a complete description of a word, lw is the word's lexeme, w1 is the key word of the word group (for example, the verb in a «verb+noun» pair), w2 is a preposition (if any), w3 is the dependent word, and p is the probability of the word combination w1 + w2 + w3. In this case all rules are searched for every word of the sentence. It should be noted that no word can participate in more than two rules. Thus, for each word it is necessary to calculate argmax(p1 + p2), where p1 and p2 are the probabilities of the rules containing this word in the dominant and in the dependent position.

Actually, when checking the compatibility of words with each other, our system uses the following bigram model: P(w_i) = argmax P(w_i | w_{i-l}), where l is the distance (in number of words) at which the unknown word may stand from the known one. The rule containing a given word is selected in the following way. We take a floating window containing 10 words to the right and to the left. The dependent word must be located within this window; the preposition must be located before the dependent word, with no main word between them; and the adjective must lexically agree with the noun.
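A rough sketch of this selection step is given below; the two probability dictionaries stand in for the co-occurrence base and are assumptions made for illustration only:

def choose_variant(variants, window_words, head_probs, dep_probs):
    # variants: the possible short descriptions of the ambiguous word;
    # window_words: unambiguous words found within 10 tokens to the left and right;
    # head_probs[(v, w)]: probability of a rule with v dominant and w dependent;
    # dep_probs[(w, v)]: probability of a rule with v in the dependent position.
    def score(v):
        p1 = max((head_probs.get((v, w), 0.0) for w in window_words), default=0.0)
        p2 = max((dep_probs.get((w, v), 0.0) for w in window_words), default=0.0)
        return p1 + p2   # argmax(p1 + p2), as described above
    return max(variants, key=score)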

4. Results of experiments and discussion<br />

As a result of our work we have obtained a corpus of syntactic combinations of Russian words. The relations were obtained using untagged corpora of general-lexis texts containing more than 4 bln words. The tagging was carried out "on the fly". About 6 mln authentic unique word combinations were revealed, which had occurred in the texts more than 340 mln times. According to our estimations, the amount of errors in the obtained corpora does not exceed 3%. The number of word combinations can be enlarged by processing texts of a given new domain. However, our investigations have shown that scientific texts use other constructions, which reduces the number of sampled combinations, for example for speech and cognition verbs. Our method extracts about 9% of the tokens from common-lexis texts, but news feeds give us just about 5%, and for scientific texts this number shrinks to 3%. So the method shows different productivity for different domains. Further experiments have shown that the obtained results can also be used for determining the style of texts.

So the suggested method allows information on word compatibility to be obtained almost automatically; this information can further be used, for instance, for parsing or at other stages of text processing. The method is also not strictly tied to texts of a certain domain and has a rather low cost of enlargement.

The efficiency of the system with various parameters was estimated on carefully tokenized corpora containing about 2300 words. The results were checked using Precision and Accuracy measures. Merely involving the information on word compatibility in Russian yielded a Precision of 71.98% and an Accuracy of 96.75%. This result is comparable with the best results in the selected area (Lee, 2010). The advantage of this method is its ability to be adjusted to a new knowledge domain quickly and, most importantly, automatically, in case a sufficiently large text corpus is available. The method gives an acceptable quality of disambiguation, unfortunately with a rather low Precision.

The coverage ratio can be improved by applying trigram rules, which can easily be obtained, for example, from http://aot.ru, or by analyzing a tagged corpus of Russian (for example, http://ruscorpora.ru). The coverage ratio in this case reached 78%, but the accuracy fell to 95.6%. In Sokirko (2004) it is mentioned that the systems Inxight and Trigram provide 94.5% and 94.6% accuracy respectively, which is comparable with the results of our system. A further improvement of the coverage ratio up to 81.3% is possible by improving the optimal-decision search algorithm described above, but this slightly brings down the accuracy. In its current state the method is not able to show absolute coverage, because the part-of-speech list applied in this method was not complete; it contained only the following: verb, verbal adverb, participle, noun, adjective, preposition and adverb. Furthermore, there was no information on some types of relations, for example «noun+noun». Moreover, information on the compatibility of some Russian words conceptually cannot be obtained because of the fundamental homonymy of certain words. For example, the word "white" can be used both as an adjective and as a noun. Our results are applicable to some (but not all) European languages. The highly ambiguous English language does not allow such a word-combination database to be constructed. The method can be applied to German or French, but the rules would have to be completely rewritten. Problems like detachable verbal prefixes in German and inverted word order would have to be taken into account.

5. References<br />

Tapanainen P., Voutilainen A. (1994): Tagging<br />

accurately - don‘t guess if you know. In Proc. of conf.<br />

on applied natural language processing, 1994.<br />

Brill E. (1995): Unsupervised learning of disambiguation<br />

rules for part of speech tagging. In Proceedings of the<br />

Third Workshop on Very Large Corpora, p. 1-13, 1995.<br />

Zelenkov Yu.G, Segalovich Yu.A., Titov V.A. (2005):<br />

Вероятностная модель снятия морфологической<br />

омонимии на основе нормализующих подстановок и<br />

позиций соседних слов. Материалы Международной<br />

конференции «Диалог’2005»<br />

British National Corpus (<strong>2011</strong>):<br />

http://www.natcorp.ox.ac.uk/<br />

American National Corpus (<strong>2011</strong>):<br />

http://americannationalcorpus.org/<br />

Russian National Corpus (<strong>2011</strong>):<br />

http://www.ruscorpora.ru/<br />

Google (2006): All Our N-gram are Belong to You,<br />

Google research blog,<br />

http://googleresearch.blogspot.com/2006/08/all-our-ngram-are-belong-to-you.html<br />

Sokirko A.V., Toldova S.Yu. (2004): Сравнение<br />

эффективности двух методик снятия лексической и<br />

морфологической неоднозначности для русского языка.<br />

Материалы конференции «Корпусная лингвистика’2004»<br />

Lyashevskaya O. et al. (2010): Оценка методов

автоматического анализа текста: морфологические<br />

парсеры русского языка. Материалы Международной<br />

конференции «Диалог’2010»<br />

Ermakov A.E. (2002): Неполный синтаксический анализ<br />

текста в информационно-поисковых системах.<br />

Материалы Международной конференции «Диалог’2002»<br />

Yolkeen S.V., Klyshinsky E.S., Steklyannikov S.E. (2003): Проблемы создания универсального

морфосемантического словаря. Сб. трудов<br />

Международных конференций IEEE AIS’03 и CAD-2003,<br />

том 1, 2003. стр. 159-163.<br />

Lee Y.K., Haghighi A., Barzilay R. (2010) Simple<br />

Type-Level Unsupervised POS Tagging. In Proc. of<br />

EMNLP 2010<br />


Von TMF in Richtung UML: in drei Schritten zu einem Modell<br />

des übersetzungsorientierten Fachwörterbuchs 1<br />

Georg Löckinger<br />

<strong>Universität</strong> Wien und Österreichische Akademie der Wissenschaften<br />

Wien, Österreich<br />

georg.loeckinger@univie.ac.at<br />

Abstract<br />

Fachübersetzer(innen) brauchen <strong>für</strong> ihre Tätigkeit maßgeschneiderte fachsprachliche Informationen. Zwischen ihrem Bedarf und<br />

den verfügbaren fachsprachlichen Ressourcen besteht jedoch eine große Diskrepanz. In meinem Dissertationsprojekt gehe ich der<br />

zentralen Forschungsfrage nach, ob sich das Fachübersetzen mit einem idealen übersetzungsorientierten Fachwörterbuch effizienter<br />

gestalten lässt. Zur Beantwortung der zentralen Forschungsfrage werden zuerst mehrere Thesen aufgestellt. Davon wird ein Modell<br />

des übersetzungsorientierten Fachwörterbuchs in zwei Detaillierungsgraden hergeleitet, das später mit „ProTerm“, einem Werkzeug<br />

<strong>für</strong> Terminologiearbeit und Textanalyse, in der Praxis experimentell erprobt werden soll. Der vorliegende Aufsatz soll einen Überblick<br />

über die bisherige Forschungsarbeit geben. Zuerst werden in knapper Form 15 Thesen vorgestellt, die auf der einschlägigen<br />

wissenschaftlichen Literatur und meiner eigenen Berufserfahrung in Fachübersetzen und Terminologiearbeit beruhen. Im Hauptteil<br />

des Aufsatzes kommt ein Modell des übersetzungsorientierten Fachwörterbuchs zur Sprache. Das Modell dient als Bindeglied<br />

zwischen den konkreten Anforderungen, die mit den 15 Thesen ausgedrückt werden, und der praktischen Umsetzung mit „ProTerm“.<br />

Der Aufsatz schließt mit einem Ausblick auf die nächsten Schritte in meinem Dissertationsprojekt ab.<br />

Keywords: übersetzungsorientiertes Fachwörterbuch, übersetzungsorientierte Terminografie, Fachlexikografie, Fachübersetzen<br />

1. Einleitung 1<br />

Fachübersetzer(innen) hegen seit Langem den Traum von<br />

einem übersetzungsorientierten Nachschlagewerk(zeug),<br />

das ihrem Bedarf in maximalem Umfang Rechnung trägt.<br />

Den historischen Ausgangspunkt <strong>für</strong> die Beschäftigung<br />

mit der einschlägigen wissenschaftlichen Literatur bildet<br />

Tiktin (1910) mit dem klingenden Titel „Wörterbücher<br />

der Zukunft“. Auch einige andere Literaturstellen bringen<br />

die noch nicht erfüllten Träume zum Ausdruck; vgl.<br />

Hartmann (1988), Snell-Hornby (1996), de Schryver

(2003). Die Diskrepanz zwischen den vorhandenen<br />

fachsprachlichen Ressourcen und dem, was Fachübersetzer(innen)<br />

benötigen, hat unter diesen zu einer gewissen<br />

Unzufriedenheit geführt. Infolgedessen begannen<br />

sie, ihre eigenen terminologischen Datenbestände und<br />

Nachschlagewerk(zeug)e zu erstellen. Somit kam zu<br />

ihrer Tätigkeit der Terminologienutzung jene der Terminologieerarbeitung<br />

hinzu.<br />

1 Beim vorliegenden Aufsatz handelt es sich um eine erweiterte<br />

und überarbeitete deutsche Fassung von Löckinger (2011).

2. Anforderungen an das übersetzungsorientierte<br />

Fachwörterbuch: 15 Thesen<br />

Entgegen einer weitverbreiteten Meinung ist das Fachübersetzen<br />

ein komplexer Vorgang; vgl. etwa Wilss<br />

(1997). Daher hat das übersetzungsorientierte Fachwörterbuch<br />

(ü. F.) mannigfaltige Anforderungen zu erfüllen.<br />

Im Folgenden stelle ich diese in Form von 15 Thesen dar,<br />

die sich auf die wissenschaftliche Literatur und/oder<br />

eigene Argumente stützen. Die 15 Thesen leiten sich aus<br />

der empirischen Praxis des Fachübersetzens und der<br />

wissenschaftlichen Beschäftigung mit dieser Praxis ab 2<br />

.<br />

Sie werden einer der Kategorien „methodikbezogen“,<br />

„inhaltsbezogen“ bzw. „Darstellung und Verknüpfung<br />

der Inhalte“ zugeordnet, die sich aber – wie auch die<br />

einzelnen Thesen selbst – ergänzen und zum Teil überschneiden.<br />

2 Eine ausführliche Darstellung der Argumente <strong>für</strong> die einzelnen<br />

Thesen mitsamt den jeweiligen Literaturverweisen würde<br />

den Rahmen dieses Aufsatzes sprengen. Ein Literaturverzeichnis<br />

ist beim Autor erhältlich.<br />


2.1. Methodikbezogene Anforderungen<br />

These 1 (systematische Terminologiearbeit): Das ü. F.<br />

muss nach den Grundsätzen und Methoden der systematischen<br />

Terminologiearbeit erstellt worden sein.<br />

These 2 (Beschreibung der angewandten Methodik):<br />

Das ü. F. muss über die (lexikografische und/oder terminografische)<br />

Methodik Aufschluss geben, die bei seiner<br />

Erstellung zum Einsatz kam.<br />

2.2. Inhaltsbezogene Anforderungen<br />

These 3 (Benennungen und Fachwendungen sowie<br />

ihre Äquivalente): Das ü. F. muss Benennungen,<br />

Fachwendungen und Äquivalente in Ausgangssprache<br />

und Zielsprache(n) enthalten.<br />

These 4 (grammatikalische Informationen): Das ü. F.<br />

muss grammatikalische Informationen zu Benennungen,<br />

Fachwendungen und Äquivalenten bieten.<br />

These 5 (Definitionen): Das ü. F. muss Definitionen der<br />

in ihm beschriebenen Begriffe enthalten.<br />

These 6 (Kontexte): Das ü. F. muss authentische Kontexte<br />

(v. a. in der Zielsprache) bereitstellen.<br />

These 7 (enzyklopädische Informationen): Das ü. F.<br />

muss enzyklopädische Informationen (fachgebietsbezogene<br />

Hintergrundinformationen, z. B. Angaben zur<br />

Verwendung eines bestimmten Gegenstandes) enthalten.<br />

These 8 (multimediale Inhalte): Das ü. F. muss nach<br />

Möglichkeit und Bedarf Gebrauch von multimedialen<br />

Inhalten (Grafiken, Diagrammen, Tondateien usw.) machen.<br />

These 9 (Anmerkungen): Das ü. F. muss mit Anmerkungen<br />

zu der in ihm enthaltenen Terminologie versehen<br />

sein, z. B. mit Hinweisen zu Übersetzungsfehlern.<br />

2.3. Anforderungen an Darstellung und Verknüpfung<br />

der Inhalte<br />

These 10 (elektronische Form): Um den meisten anderen<br />

Anforderungen zu entsprechen, muss das ü. F. in<br />

elektronischer Form vorliegen.<br />

These 11 (begriffssystematische und alphabetische<br />

Ordnung): Das ü. F. muss begriffssystematisch und<br />

alphabetisch geordnet sein, um <strong>für</strong> unterschiedlichste<br />

Übersetzungsprobleme brauchbare Lösungen anzubieten.<br />

These 12 (Darstellung von Begriffsbeziehungen): Das<br />

ü. F. muss aufzeigen, wie die einzelnen Begriffe der<br />

jeweiligen Terminologie zusammenhängen (Begriffsbe-<br />


ziehungen in Abhängigkeit von der Strukturierung des<br />

jeweiligen Fachgebiets, z. B. Abstraktionsbeziehungen<br />

oder sequenzielle Begriffsbeziehungen).<br />

These 13 (Nutzung von Textkorpora): Da authentische<br />

Textkorpora wertvolle fachsprachliche Informationen<br />

beinhalten, muss das ü. F. auf geeigneten Textkorpora<br />

basieren und gleichzeitig einen Zugriff auf diese bieten.<br />

These 14 (Ergänzungen und Anpassungen durch<br />

die/den Fachübersetzer(in)): Das ü. F. muss der/dem<br />

Fachübersetzer(in) bedarfsgerechte Ergänzungen und<br />

Anpassungen ermöglichen.<br />

These 15 (einheitliche Benutzeroberfläche): Es muss<br />

der/dem Fachübersetzer(in) möglich sein, auf die Informationen<br />

im ü. F. von einer einzigen Benutzeroberfläche<br />

aus zuzugreifen.<br />

3. Modell des übersetzungsorientierten<br />

Fachwörterbuchs<br />

Die 15 Thesen sollen nun in ein geeignetes Modell<br />

übergeführt werden. Da die Thesen Anforderungen an<br />

das ü. F. darstellen, die ausnahmslos auch der empirischen<br />

Praxis des Fachübersetzens entstammen, wird im<br />

Folgenden induktiv ein Modell des ü. F. entworfen.<br />

Mit Ausnahme der Thesen 10, 14 und 15, die die Umsetzung<br />

des Modells betreffen, lassen sich sämtliche<br />

Thesen in einem Modell zusammenführen, das das ü. F.<br />

mit allen erforderlichen Inhalten beschreibt. Ausgehend<br />

von dem TMF-Modell in der internationalen Norm<br />

ISO 16642 (2003) wird das Modell des ü. F. in zwei<br />

Detaillierungsgraden vorgestellt (vgl. Budin (2002)).<br />

Insgesamt entspricht dies der Drei-Ebenen-Einteilung<br />

nach Budin und Melby (2000), die beim Projekt<br />

„SALT“ zum Einsatz kam.<br />

Die Modellierung dient hier in zweifacher Hinsicht „als<br />

Bindeglied zwischen Empirie und Theorie“ (Budin,<br />

19<strong>96</strong>:1<strong>96</strong>): Einerseits werden die 15 Thesen induktiv<br />

zum Modell in den zwei genannten Detaillierungsgraden<br />

umformuliert, andererseits soll das Modell wiederum<br />

deduktiv in die empirische Praxis übergeführt und dort<br />

experimentell erprobt werden. Diese schrittweise Vorgangsweise<br />

hat den Vorteil, dass man sich bei der Modellierung<br />

ganz dem zu schaffenden abstrakten und<br />

implementierungsunabhängigen Modell widmen kann,<br />

ohne sich um die Einzelheiten seiner späteren technischen<br />

Umsetzung kümmern zu müssen (vgl. etwa Sager<br />

(1990)).


Nachstehend geht es um das TMF-Modell (3.1.), das<br />

Modell im ersten Detaillierungsgrad einschließlich des<br />

Modells des terminologischen Eintrags (3.2.) und das<br />

Modell im zweiten Detaillierungsgrad (Datenmodell,<br />

3.3.). Im Mittelpunkt steht das Modell im ersten Detail-<br />

lierungsgrad, da dieses bereits in ausgereifter Form vorliegt.<br />

3.1. TMF-Modell gemäß ISO 16642 (2003)<br />

Die internationale Norm ISO 16642 (2003) beschreibt<br />

ein Rahmenmodell <strong>für</strong> die Auszeichnung terminologischer<br />

Daten (TMF), mit dem Auszeichnungssprachen <strong>für</strong><br />

terminologische Daten definiert werden können, die sich<br />

wiederum mit einem generischen Abbildungswerkzeug<br />

aufeinander abbilden lassen. Die ISO 16642 (2003) hat<br />

zum Ziel, die Nutzung und Weiterentwicklung von<br />

Computeranwendungen <strong>für</strong> terminologische Daten zu<br />

fördern und den Austausch terminologischer Daten zu<br />

erleichtern. Im Gegensatz dazu ist die Festlegung von<br />

Datenkategorien nicht Gegenstand dieser Norm; vgl.<br />

dazu ISO 12620 (1999).<br />

Schematisch ergibt das TMF-Modell folgendes Bild:<br />

Bild 1: Schematische Darstellung des TMF-Modells aus<br />

ISO 16642 (2003).<br />

Die oben abgebildeten Bestandteile des Modells lassen<br />

sich wie folgt beschreiben (von oben nach unten, von<br />

links nach rechts; vgl. DIN 2330 (1993), DIN 2342<br />

(2004), ISO 16642 (2003)):<br />

TDC (terminological data collection): oberste Stufe, die<br />

alle zu einem terminologischen Datenbestand gehörenden<br />

Informationen umfasst;<br />

GI (global information): globale Informationen = administrative<br />

und technische Angaben, die sich auf den gesamten<br />

terminologischen Datenbestand beziehen;<br />

CI (complementary information): zusätzliche Informa-<br />


tionen = Angaben, die über jene in den terminologischen<br />

Einträgen hinausgehen und üblicherweise von mehreren<br />

terminologischen Einträgen aus angesprochen werden;<br />

TE (terminological entry): terminologischer Eintrag<br />

(Eintragsebene), d. h., jener Teil eines terminologischen<br />

Datenbestands, der terminologische Daten zu einem<br />

einzigen Begriff oder zu mehreren quasiäquivalenten<br />

Begriffen enthält;<br />

LS (language section): Sprachebene, d. h., jener Teil<br />

eines terminologischen Eintrags, in dem sich terminologische<br />

Daten in einer Sprache befinden;<br />

TS (term section): Benennungsebene, d. h., jener Teil der<br />

Sprachebene, der terminologische Daten zu einer oder<br />

mehreren Benennungen bzw. Fachwendungen umfasst;<br />

TCS (term component section): unterste Stufe, die (nicht)<br />

bedeutungstragende Einheiten von Benennungen bzw.<br />

Fachwendungen beschreibt.<br />
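Zur Veranschaulichung kann die oben beschriebene Hierarchie etwa wie folgt skizziert werden (eine stark vereinfachte, hypothetische Skizze in Python; die Klassen- und Feldnamen sind Annahmen und kein Bestandteil von ISO 16642 (2003)):

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TermSection:                       # TS: Benennungsebene
    term: str
    grammatical_info: Dict[str, str] = field(default_factory=dict)

@dataclass
class LanguageSection:                   # LS: Sprachebene
    language: str
    terms: List[TermSection] = field(default_factory=list)

@dataclass
class TerminologicalEntry:               # TE: Eintragsebene (ein Begriff)
    entry_id: str
    languages: List[LanguageSection] = field(default_factory=list)

@dataclass
class TerminologicalDataCollection:      # TDC mit GI und CI
    global_information: Dict[str, str] = field(default_factory=dict)         # GI
    complementary_information: Dict[str, str] = field(default_factory=dict)  # CI
    entries: List[TerminologicalEntry] = field(default_factory=list)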

3.2. Das Modell des übersetzungsorientierten<br />

Fachwörterbuchs<br />

Als Grundlage dient das Modell eines terminologischen<br />

Eintrags von Mayer (1998). Dieses wird nach den Erfordernissen<br />

meines Dissertationsprojektes so angepasst<br />

und erweitert, dass daraus ein Modell des ü. F. in zwei<br />

Detaillierungsgraden resultiert (Modell im ersten Detaillierungsgrad,<br />

Modell im zweiten Detaillierungsgrad =<br />

Datenmodell).<br />

3.2.1. Das Modell des terminologischen Eintrags<br />

Gemäß dem derzeitigen Stand der Forschung zur terminografischen<br />

Modellierung muss das Modell des terminologischen<br />

Eintrags folgenden fünf Kriterien entsprechen:<br />

Begriffsorientierung (vgl. etwa ISO 16642 (2003)),<br />

Benennungsautonomie (vgl. etwa Schmitz (2001)),<br />

Elementarität (vgl. etwa ISO/PRF 26162 (2010)), Granularität<br />

(vgl. etwa Schmitz (2001)) und Wiederholbarkeit<br />

(vgl. etwa ISO/PRF 26162 (2010)). Von Belang sind<br />

hier ferner die drei oben genannten Ebenen des<br />

TMF-Modells (Eintragsebene, Sprachebene und Benennungsebene).<br />

Die nachstehenden Datenkategorien leiten sich entweder<br />

aus den 15 Thesen oder aus dem derzeitigen Stand der<br />

Forschung zur terminografischen Modellierung ab (vgl.<br />

insbesondere ISO 12620 (1999) und das<br />

ISO-Datenkategorienverzeichnis „ISOcat“ unter<br />

www.isocat.org). Mit einem hochgestellten Pluszeichen<br />

(„ + “) versehene Bezeichnungen beziehen sich auf Da-<br />


tenkategorien, die auf einer oder mehreren der drei oben<br />

genannten Ebenen Datenelemente enthalten können. Ein<br />

hochgestelltes „ W “ zeigt an, dass die jeweilige Datenkategorie<br />

innerhalb der Ebene, auf der sie genannt wird,<br />

wiederholbar sein muss.<br />

Die Eintragsebene umfasst folgende Datenkategorien:<br />

enzyklopädische Informationen + , multimediale Inhalte W ,<br />

Anmerkung +W , Position des Begriffs (wenn nur ein Begriff),<br />

Quellenangabe +W , administrative Angaben +W . Auf<br />

der Sprachebene befinden sich folgende Datenkategorien:<br />

Definition (wenn nur ein Begriff) bzw. Definition W<br />

(wenn mehrere quasiäquivalente Begriffe), enzyklopädische<br />

Informationen + , Anmerkung +W , Position der<br />

Begriffe W (wenn mehrere quasiäquivalente Begriffe),<br />

Quellenangabe +W , administrative Angaben +W . Die Benennungsebene<br />

schließlich besteht aus den Datenkategorien<br />

Benennung/Fachwendung/Äquivalent W , grammatikalische<br />

Informationen W , Kontext W , enzyklopädische<br />

Informationen + , Anmerkung +W , Quellenangabe +W ,<br />

administrative Angaben +W .<br />

3.2.2. Das Modell im ersten Detaillierungsgrad<br />

Das Modell im ersten Detaillierungsgrad, dessen Herzstück<br />

das oben erläuterte Modell des terminologischen<br />

Eintrags bildet, sieht grob so aus:<br />

Bild 2: Überblicksartige schematische Darstellung des<br />

Modells im ersten Detaillierungsgrad.<br />

Zu den drei bereits erwähnten Ebenen (Eintragsebene,<br />

Sprachebene, Benennungsebene) kommen noch die zwei<br />

Bestandteile „globale Informationen“ und „zusätzliche<br />

Informationen“ aus dem TMF-Modell in ISO 16642<br />

(2003) hinzu. Die erforderlichen Datenkategorien leiten<br />

sich erneut entweder aus den 15 Thesen ab oder ergeben<br />

sich aus dem derzeitigen Stand der Forschung zur terminografischen<br />

Modellierung; vgl. insbesondere<br />

ISO 12620 (1999), ISO 16642 (2003), ISO/PRF 26162<br />

(2010), aber auch ISO 1951 (2007). Folglich handelt es<br />

sich bei den globalen Informationen um administrative<br />

und technische Angaben, während zu den zusätzlichen<br />


Informationen Begriffspläne, Meta-Informationen zum<br />

ü. F., multimediale Inhalte, alphabetische Auszüge aus<br />

der terminologischen Datenbasis, bibliografische Angaben,<br />

Textkorpora, Quellenangaben und administrative<br />

Angaben zählen.<br />

Das Modell im ersten Detaillierungsgrad lässt sich im<br />

Einzelnen wie folgt darstellen:<br />

Bild 3: Genauere schematische Darstellung des Modells<br />

im ersten Detaillierungsgrad.<br />

3.2.3. Das Modell im zweiten Detaillierungsgrad<br />

(Datenmodell)<br />

Aus dem oben erörterten und abgebildeten Modell im<br />

ersten Detaillierungsgrad soll ein Datenmodell entwickelt<br />

werden, das später in einer empirischen Untersuchung<br />

mit „ProTerm“ praktisch umgesetzt und experimentell<br />

erprobt wird. Hiebei kommt die objektorientierte<br />

Modellierungssprache „Unified Modeling Language“<br />

(UML) zum Einsatz. Diese wird in den einschlägigen<br />

internationalen Normen verwendet (vgl. ISO 16642<br />

(2003) und ISO/PRF 26162 (2010)) und bietet sich vor<br />

allem dann an, wenn ein Datenmodell in Form einer<br />

relationalen Datenbank umgesetzt werden soll.<br />

UML-Modelle sind jedoch implementierungsunabhängig<br />

und können technisch auch anders umgesetzt werden.


Das UML-Modell befindet sich im Entwurfsstadium und kann daher an dieser Stelle nicht veröffentlicht werden. Der aktuelle Entwurf kann auf Anfrage zur Verfügung gestellt werden.
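Zur Veranschaulichung der oben beschriebenen dreistufigen Eintragsstruktur (Eintrags-, Sprach- und Benennungsebene) sei hier eine minimale, frei erfundene Python-Skizze angedeutet; sie gibt ausdrücklich nicht den unveröffentlichten UML-Entwurf wieder, und alle Klassen- und Feldnamen sind frei gewählte Annahmen, nicht Bestandteil des Modells:

from dataclasses import dataclass, field
from typing import List, Optional

# Frei gewaehlte, hypothetische Skizze der drei Ebenen nach TMF (ISO 16642);
# wiederholbare Datenkategorien (hochgestelltes "W") werden als Listen abgebildet.

@dataclass
class Benennung:                       # Benennungsebene
    benennung: str                     # Benennung/Fachwendung/Aequivalent (W)
    grammatik: List[str] = field(default_factory=list)    # grammatikalische Informationen (W)
    kontexte: List[str] = field(default_factory=list)     # Kontext (W)
    anmerkungen: List[str] = field(default_factory=list)  # Anmerkung (+W)
    quellen: List[str] = field(default_factory=list)      # Quellenangabe (+W)

@dataclass
class Sprachebene:
    sprache: str                       # z.B. "de", "en"
    definitionen: List[str] = field(default_factory=list)  # Definition bzw. Definition (W)
    benennungen: List[Benennung] = field(default_factory=list)
    anmerkungen: List[str] = field(default_factory=list)

@dataclass
class Eintrag:                         # Eintragsebene
    enzyklopaedische_info: Optional[str] = None
    multimediale_inhalte: List[str] = field(default_factory=list)
    sprachen: List[Sprachebene] = field(default_factory=list)
    verwaltung: List[str] = field(default_factory=list)   # administrative Angaben (+W)

Eine solche Struktur ließe sich, wie im Beitrag angedeutet, unmittelbar auf eine relationale Datenbank abbilden (eine Tabelle je Ebene, verknüpft über Fremdschlüssel).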

4. Ausblick

Der nächste Schritt nach einer etwaigen Verfeinerung des Modells im ersten Detaillierungsgrad wird darin bestehen, ein Datenmodell in Form eines UML-Diagramms zu entwerfen, das sich für die Umsetzung mit „ProTerm“ eignet. Eine empirische Untersuchung wird zeigen, ob das Modell dem Bedarf von Fachübersetzerinnen und Fachübersetzern in maximalem Umfang Rechnung tragen und eine Antwort auf die zentrale Forschungsfrage geben kann. Das Modell des ü. F. ist unabhängig von einem bestimmten Fachgebiet oder einer bestimmten Sprachenkombination. Für die empirische Untersuchung wird das Fachgebiet Terrorismus, Terrorismusabwehr und Terrorismusbekämpfung in den Sprachen Deutsch und Englisch herangezogen. Mit der Terminologie dieses Fachgebiets habe ich mich sowohl wissenschaftlich als auch in der Berufspraxis eingehend beschäftigt.

5. Literatur

Budin, G. (1996): Wissensorganisation und Terminologie: Die Komplexität und Dynamik wissenschaftlicher Informations- und Kommunikationsprozesse. Tübingen: Narr.
Budin, G. (2002): Der Zugang zu mehrsprachigen terminologischen Ressourcen – Probleme und Lösungsmöglichkeiten. In K.-D. Schmitz, F. Mayer & J. Zeumer (Hg.), eTerminology. Professionelle Terminologiearbeit im Zeitalter des Internet – Akten des Symposions, Köln, 12.–13. April 2002. Köln: Deutscher Terminologie-Tag e.V., S. 185–200.
Budin, G., Melby, A. (2000): Accessibility of Multilingual Terminological Resources – Current Problems and Prospects for the Future. In A. Zampolli et al. (Hg.), Proceedings of the Second International Conference on Language Resources and Evaluation, volume II. Athens, S. 837–844.
DIN 2342 (2004). Begriffe der Terminologielehre (Entwurf).
DIN 2330 (1993). Begriffe und Benennungen – Allgemeine Grundsätze.
Hartmann, R. R. K. (1988): The Learner's Dictionary: Traum oder Wirklichkeit? In K. Hyldgaard-Jensen & A. Zettersten (Hg.), Symposium on Lexicography III. Proceedings of the Third International Symposium on Lexicography, May 14–16, 1986 at the University of Copenhagen. Tübingen: Niemeyer, S. 215–235.
ISO 12620 (1999). Computer applications in terminology – Data categories.
ISO 16642 (2003). Computer applications in terminology – Terminological markup framework.
ISO 1951 (2007). Presentation/representation of entries in dictionaries – Requirements, recommendations and information.
ISO/PRF 26162 (2010). Systems to manage terminology, knowledge and content – Design, implementation and maintenance of Terminology Management Systems.
Löckinger, G. (2011): User-Oriented Data Modelling in Terminography: State-of-the-Art Research on the Needs of Special Language Translators. In T. Gornostay & A. Vasiļjevs (Hg.), NEALT Proceedings Series Vol. 12. Proceedings of the NODALIDA 2011 workshop, CHAT 2011: Creation, Harmonization and Application of Terminology Resources, May 11, 2011, Riga, Latvia. Northern European Association for Language Technology, S. 44–47.
Mayer, F. (1998): Eintragsmodelle für terminologische Datenbanken. Ein Beitrag zur übersetzungsorientierten Terminographie. Tübingen: Narr.
Sager, J. C. (1990): A Practical Course in Terminology Processing. Amsterdam: Benjamins.
Schmitz, K.-D. (2001): Systeme zur Terminologieverwaltung. Funktionsprinzipien, Systemtypen und Auswahlkriterien (online edition). technische kommunikation, 23(2), S. 34–39.
de Schryver, G.-M. (2003): Lexicographers' Dreams in the Electronic-Dictionary Age. International Journal of Lexicography, 16(2), S. 143–199.
Snell-Hornby, M. (1996): The translator's dictionary – An academic dream? In M. Snell-Hornby (Hg.), Translation und Text. Ausgewählte Vorträge. Wien: WUV-Universitätsverlag, S. 90–96.
Tiktin, H. (1910): Wörterbücher der Zukunft. Germanisch-romanische Monatsschrift, II, S. 243–253.
Wilss, W. (1997): Übersetzen als wissensbasierte Tätigkeit. In G. Budin & E. Oeser (Hg.), Beiträge zur Terminologie und Wissenstechnik. Wien: TermNet, S. 151–168.


Annotating for Precision and Recall in Speech Act Variation: The Case of Directives in the Spoken Turkish Corpus

Şükriye Ruhi (a), Thomas Schmidt (b), Kai Wörner (b), Kerem Eryılmaz (c)
(a, c) Middle East Technical University, (b) Hamburg University
(a) Dept. of Foreign Language Education, Faculty of Education, 06800 Ankara
(b) SFB 538 'Mehrsprachigkeit', Max Brauer-Allee 60, D-22765 Hamburg
(c) Dept. of Cognitive Science, Graduate School of Informatics, 06800 Ankara
E-mail: sukruh@metu.edu.tr, thomas.schmidt@uni-hamburg.de, kai.woerner@uni-hamburg.de, keryilmaz@gmail.com

Abstract

Speech act realizations pose special difficulties in search during annotation and pragmatics research based on corpora, in spite of the fact that their various forms may be relatively formulaic. Focusing on spoken corpora, this paper concerns the generation of discourse analytical annotation schemes that can address not only variation in speech act annotation but also variation in dialog and interaction structure coding. The major arguments in the paper are that (1) enriching the metadata features of corpus design can act as a useful aid in speech act annotation; and that (2) sociopragmatic annotation and corpus-oriented pragmatics research can be enhanced by incorporating (semi-)automated linguistic annotations that rely both on bottom-up discovery procedures and the more top-down, linguistic categorizations based on the literature in traditional approaches to pragmatics research. The paper illustrates implementations of enriched metadata and pragmatic annotation with examples drawn from directives in the demo version of the Spoken Turkish Corpus, and presents a qualitative assessment of the annotation procedures.

Keywords: speech act annotation, variation, spoken Turkish, precision, metadata

1. Speech acts as a challenge for corpus annotation

Speech act realizations are notorious for the special difficulties they pose in search both during annotation and pragmatics research based on corpora, in spite of the fact that their various forms may be relatively formulaic, hence amenable to (semi-)automatic annotation. Sociopragmatic annotation involves significant difficulties in the very process of identifying categories and units of pragmatic phenomena such as variation in manifestations of speech acts and the identification of conversational segments (Archer, Culpeper & Davies, 2008:635). As underscored by Schmidt and Wörner, this makes pragmatics research conducted on corpora “heuristic” in nature in that the relationship between theory and corpus analysis is bi-directional (2009:4). This is all the more so in the identification of speech acts, as function only partially follows form.

To illustrate this with a short excerpt from a naturally occurring speech event, the utterance iki çay ‘two teas’ may be describing the number of cups of tea one has had. But followed by tamam hocam “okay deferential address term”, the noun phrase would achieve the illocutionary force of a request when uttered to a service provider. It goes without saying that the initial utterance can occur with please as a politeness marker, which would certainly increase its chance of being identified as a request. Communications, however, do not always exhibit such pre-fabricated forms. Thus their recall in corpora would require the analyst to increase the number of search expressions infinitely. Even so, that would not guarantee full recall; neither would it filter false cases. This situation goes against the advantage of using corpora for the study of variation and largely limits the derivation of qualitative and quantitative conclusions from corpora.

In this paper we argue that annotation for studying variation in speech act realizations can be improved by (1) enriching metadata coding during the construction stage of a corpus; and (2) by implementing (semi-)automated annotation for sociopragmatic features of communications that rely both on bottom-up discovery procedures and top-down, linguistic categorizations based on traditional approaches to pragmatics research (e.g. annotation of socially and discursively significant verbal and non-verbal phenomena and non-phonological units such as multi-word expressions and changes in tone of voice). The argumentation is based on insights from Multidimensional Analysis (Biber, 1995) and vocabulary-based identification of discourse units (Csomay, Jones & Keck, 2007), and the fact that pragmatic phenomena in conversational management (e.g., illocutionary force indicating devices, address terms, and politeness formulae) tend to form constellations of ‘traces’ in discourse. Annotating such traces can add “precision” and improve “recall” (Jucker et al. 2008) in searching for variation in speech acts. The main thrust of the paper is that speech events and discourse level units exhibit such verbal and non-verbal clusters, and that annotating such units can provide insights for further discursive coding. Below, we explain the procedures for these two approaches to annotation with illustrations from the demo version of the Spoken Turkish Corpus (STC), which currently comprises 44,962 words from a selection of recordings in conversational settings, service encounters, and radio archives (STC employs EXMARaLDA corpus construction tools (Schmidt, 2004), along with a web-based corpus management system).

2. Metadata construction in the transcription and annotation workflow of STC

Besides constructing a metadata system for domain, interactional goal and speaker features, we maintain that the inclusion of speech acts and conversational topics as part of the metadata features of a corpus is a significant tool for tracing variation in speech acts in a systematic manner, as topical variation can impact their performance beyond the influence of domain and setting features. Viewed from another perspective, spoken texts are slippery resources of language in terms of domain and setting categorization such that they are often characterized by shifts in interactional goals. A service encounter in a shop, for example, can easily turn into a chat. Thus, if a communicative event were classified only for its domain of interaction, one would risk losing the chance of tracing subtle differences within the same domain along several dimensions. The simultaneous annotation of topics and speech acts during the compilation of the recordings and during their transcription can address the concern for achieving maximal retrieval of tokens of a speech act. It enables a bottom-up approach to search for variation through control for topic and speech acts, as manifestations of the act may not exhibit structures noted in the literature. It also allows for a corpus-driven categorization of speech acts that may not have been investigated at all in the particular language. The stages in this procedure in the construction of STC are outlined below:

1) Noting of local and global topics, and the communication-related activities, by recorders (e.g. studying for an exam)
2) Checking of topics and additions during the transfer of the recording to the corpus management system
3) Stages in transcription:
   a. Initial step: basic transcription of the recording for verbal and non-verbal events; editing of topics and addition of speech act metadata
   b. First check: checking the transcription for verbal and non-verbal events; editing of topics and speech act metadata
   c. Second check: checking the transcription for verbal and non-verbal events; editing of topics and speech act metadata

To achieve a higher level of reliability in transcription, a different transcriber is responsible for the annotation in each step in (3), and differences in transcription are handled through consultation. Stages (1) and (3a) ideally involve the same person so that the transcriber has an intuitive grasp of the topical content and the affective tone of the communication. This procedure has the added advantage of detecting regional variation with more precision. It also renders possible the construction of sub-corpora for initial pilot annotation not only through control for domain but also for topic and speech act, thus enhancing the likelihood of retrieval of a greater variety of tokens in a more economical manner. Naturally, this workflow taps into native speaker intuitions on speech act performance, but it is a viable methodological procedure in linguistics because it harnesses intuitions in a context-sensitive environment during text processing.

Figure 1 displays a select number of the metadata features of one communication in STC. (Note that topics are written in Turkish, and that the term requests is used instead of directives because the former was a more transparent term for the transcribers in step (3a) above.)

Figure 1: Partial metadata for a communication in STC
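Since Figure 1 cannot be reproduced here, the following purely illustrative Python sketch shows what such communication-level metadata could look like; all field names and values are invented and do not reproduce the actual STC metadata schema:

# Purely illustrative sketch of communication-level metadata (field names and
# values are invented; they do not reproduce the actual STC metadata schema).
communication_metadata = {
    "communication_id": "STC-DEMO-001",
    "domain": "service encounter",
    "setting": "bus station ticket office",
    "topics": ["bilet alma"],          # topics are recorded in Turkish
    "speech_acts": ["requests"],       # 'requests' used instead of 'directives' (step 3a)
    "recording_activity": "buying a ticket",
}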

3. Annotation procedure for speech acts

Speech act annotation in STC is being implemented with Sextant (Wörner, n.d.), which also allows searches to be conducted with EXAKT. The search for tokens of directives employs a snowballing technique in developing regular expressions, and is similar to what Kohnen (2008:21) describes as “structural eclecticism”. The annotation procedure starts off with the identification of forms that have been identified as being representative of directives in Turkish. Regular expressions based on these forms have been developed, and the development of tag sets is done according to the syntactic and/or lexical features of the head act. But instead of tagging only the head act, the full act is further coded by placing opening and closing tags for the relevant head act (see Examples 1 and 2). This will allow further detailed tagging of the act in later stages of annotation.
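Purely by way of illustration (the actual Sextant tag sets and regular expressions used for STC are not reproduced in the paper), a seed search of this kind can be sketched in Python roughly as follows; the pattern, the tag code "RReq" and the example utterance are invented for the sketch:

import re

# Hypothetical, simplified sketch: recall candidate directive head acts with a
# seed regular expression and wrap the full act in opening/closing tags.
SEED_PATTERNS = {
    # invented tag code and pattern; real tag sets follow the syntactic/lexical
    # features of the head act (e.g. imperative or future-in-the-past forms)
    "RReq": re.compile(r"\b(lütfen|rica etsem|verir misiniz)\b", re.IGNORECASE),
}

def tag_head_acts(utterance: str) -> str:
    """Return the utterance with candidate head acts wrapped in open/close tags."""
    for tag, pattern in SEED_PATTERNS.items():
        if pattern.search(utterance):
            return f"{tag}-open {utterance} {tag}-close"
    return utterance

if __name__ == "__main__":
    print(tag_head_acts("iki çay lütfen"))   # invented example utterance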

The regular expressions are enriched based on tokens detected first in the sub-corpora of service encounters, both by examining the larger context of the tokens recalled in initial searches and by manually investigating specific communications that are marked for directives in the corpus metadata. However, this procedure does not allow elliptical directives and hints to be recalled automatically. Based on the idea that a directive is ideally part of an adjacency pair, the search for ‘hidden’ manifestations of the act is conducted through the presence of address terms and a select number of minimal responses, including lexical and non-lexical backchannels (e.g. tamam ‘okay/enough/full’, ha?, hm), which turned out to collocate frequently with directives. Searches were thus conducted separately for these responses, and tokens that did not collocate with directives or form the head act itself were eliminated from the annotation (as is the case with tamam). Example (1) shows the co-occurrence of tamam with an elliptical request (tag code: RNp), which could not be recalled with a regular expression (the head act is marked in bold). It is noteworthy that the sequence manifests the presence of the discourse marker şimdi ‘now’, which marks the speech act boundary, and illustrates how both minimal responses (tamam) and discourse markers collocate with the head act.

(1)
XAM000066: şimdi ((0.3)) RNp-open T.C. kimlik numarası ((0.2)) ve öncelikli olarak ((0.1)) ev adresinizi ((XXX)) RNp-close
           ‘now your Turkish ID number and, first, your home address’
DIL000065: tamam. ((0.3)) ((filling in a form, 10.8))
           ‘okay.’

Example (2) is an illustration from a service encounter. The head act has a verb with the future in the past. In isolation the utterance could be a manifestation of a representative. However, the collocation of the utterance with buyrun ‘lit. command’ (idiomatic equivalent: welcome) disambiguates it as a request.

(2)
MEH000222: ((0.3)) buyrun.
           ‘welcome.’
MED000112: iyi günler!
           ‘good day!’
MEH000222: neresi olacak?
           ‘where is it to be?’ (idiomatic equivalent: where to?)
MED000112: RImpFuI-open Dikili’ye bilet alacaktım. RImpFuI-close
           ‘I was going to get a ticket for Dikili’


Such collocations allow us to form a list of (semi-)formulaic conversational management units, which should be tagged as pragmatic markers for directives. In the demo version of STC, tamam ‘okay’ is the item that exhibits the highest frequency. A search on the occurrence of the item was therefore conducted to check its collocation with directives. The search yielded 298 tokens, 20 of which were related to directives. In 8 instances, the item is a supportive move for the directive head act. In 2 recalls it was the head act itself, used to close off a conversational topic, while the remaining tokens were responses to a verbal or non-verbal request or part of the response to questions asking for advice/opinion. Amongst these we find the supportive function of tamam as a compliance gainer to be especially significant, since the literature on directives in Turkish does not identify this function. Within these recalls, tamam collocates with 6 requests of the kind illustrated in Example (2). This suggests that tamam can function to disambiguate representatives from requests and can be used to retrieve elliptical directives and hints. Although the full description of the pragmatics of tamam needs to be refined, we can say that in its semantically bleached use, it appears in topic closures, it functions as a backchannel to check comprehension, and it is used as an agreement marker or as a pre-sequence to disagreement. In this regard, we can say that tamam is a pragmatic marker in its non-literal use and needs to be tagged accordingly.
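As an illustration of how such a collocation check could be operationalised outside the EXAKT interface (a schematic sketch only, with an invented token/tag representation, not the STC workflow), one can count how often a marker like tamam occurs within a fixed window of a directive tag:

# Illustrative sketch (not the STC/EXAKT implementation): count how often a
# pragmatic marker such as "tamam" occurs near an annotated directive tag.
def collocates_with_directive(tokens, marker="tamam", window=5):
    """tokens: list of (word, tag) pairs; directive head acts carry tags like 'RNp' (assumed)."""
    hits = 0
    for i, (word, _) in enumerate(tokens):
        if word.lower() != marker:
            continue
        nearby = tokens[max(0, i - window): i + window + 1]
        if any(tag.startswith("R") for _, tag in nearby):   # 'R...' = directive codes (assumed)
            hits += 1
    return hits

# invented toy data: both "tamam" tokens fall within the window of a directive,
# so this prints 2
toy = [("şimdi", "O"), ("ev", "RNp"), ("adresinizi", "RNp"), ("tamam", "O"), ("tamam", "O")]
print(collocates_with_directive(toy))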

4. Conclusions

This paper touches only upon the disambiguating capacity of lexical pragmatic markers, but the distribution of tamam supports the claim that discourse segmentation and conversational structure annotation can use the clues provided by such ‘traces’. The functional description of tamam naturally raises the question as to coding principles for such items, including politeness formulae. While non-lexical backchannels may not be too problematic, the classification and coding of pragmatic markers is a fuzzy area. At this stage, we propose that a semantic-based, broad categorization be made to distinguish lexical and non-lexical markers, interjections and discourse markers, and discourse particles.

Our experience in testing the effect of pragmatic markers on recall of speech acts suggests that it is possible to envision generic level schemes for speech act annotation. These would proceed first with a bottom-up approach, in which (multi-word) pragmatic markers, backchannels and non-verbal cues such as a classification of activity types (e.g., handing over money) are tagged. It is likely that such a venture will reveal commonalities between speech acts beyond what may be gleaned from the current pragmatics literature on speech act manifestations.

5. Acknowledgements

This paper was supported by TÜBİTAK, grant no. 108K283, and METU, grant no. BAP-05-03-2011-001.

6. References

Archer, D., Culpeper, J., Davies, M. (2008): Pragmatic annotation. In A. Lüdeling & M. Kytö (Eds.), Corpus Linguistics: An International Handbook, Vol. I. Berlin/New York: Walter de Gruyter, pp. 613-642.
Biber, D. (1995): Dimensions of Register Variation. New York: Cambridge University Press.
Csomay, E., Jones, J.K., Keck, C. (2007): Introduction to the identification and analysis of vocabulary-based discourse units. In D. Biber, U. Connor & T.A. Upton (Eds.), Discourse on the Move. Using Corpus Analysis to Describe Discourse Structure. Amsterdam/Philadelphia: John Benjamins, pp. 155-173.
Jucker, A., Schneider, G., Taavitsainen, I., Breustedt, B. (2008): “Fishing” for compliments. Precision and recall in corpus-linguistic compliment research. In A. Jucker & I. Taavitsainen (Eds.), Speech Acts in the History of English. Amsterdam/Philadelphia: Benjamins, pp. 273-294.
Kohnen, T. (2008): Historical corpus pragmatics: Focus on speech acts and texts. In A. Jucker & I. Taavitsainen (Eds.), Speech Acts in the History of English. Amsterdam/Philadelphia: Benjamins, pp. 13-36.
Schmidt, T. (2004): Transcribing and Annotating Spoken Language with EXMARaLDA. In Proceedings of the LREC-Workshop on XML Based Richly Annotated Corpora, Lisbon 2004. Paris: ELRA, pp. 69-74.
Schmidt, T., Wörner, K. (2009): EXMARaLDA – creating, analysing and sharing spoken language corpora for pragmatic research. Pragmatics, 19(4), pp. 565-582.
Spoken Turkish Corpus. http://std.metu.edu.tr/en/
Wörner, K. (n.d.): Sextant tagger. http://www.exmaralda.org/sextant/sextanttagger.pdf


The SoSaBiEC Corpus: Social Structure and Bilinguality in Everyday Conversation

Veronika Ries (1), Andy Lücking (2)
(1) Universität Bielefeld, BMBF Projekt Linguistic Networks
(2) Goethe-Universität Frankfurt am Main
E-mail: Veronika.Ries@uni-bielefeld.de, Luecking@em.uni-frankfurt.de

Abstract

The SoSaBiEC corpus comprises audio recordings of everyday interactions between familiar subjects. Thus, the material the corpus is based on was not gained in task-oriented dialogue under strict experimental control; rather, it is made up of spontaneous conversations. We describe the raw data and the annotations that constitute the corpus. Speech is transcribed at the level of words. Dialogue-act-oriented codings constitute a functional, qualitative annotation level. The corpus so far provides an empirical basis for studying social aspects of unrestricted language use in a familiar context.

Keywords: bilinguality, social relationships, spontaneous dialogue, annotation

1. Introduction

From the point of view of the methodology of psycholinguistic research on speech production, unconstrained responding behavior of participants is problematic: it is known as “the problem of exuberant responding” and it is to be avoided by means of some sort of controlled elicitation in an experimental setting (Bock 1996:407; see also Pickering & Garrod 2004:169). In addition, elicitations are usually bound up with a certain task the participants of the experimental study have to accomplish. Of course, an experimental set-up that obeys the general “avoid-exuberant-responding” design is appropriate to study and test the conditions underlying speech production in a controlled way. However, when studying human-to-human face-to-face dialogue (or multi-logue, in case of more than two interlocutors), elicited communication behavior hinders the unfolding of spontaneous utterances and task-independent dialogue management. Task-oriented dialogue is known to be plan-based (Litman & Allen, 1987). The domain knowledge the interlocutors have of the task-domain, together with the difference between their current state and the target state (defined in terms of the task to be accomplished), provides a structuring of dialogue states: the way from the current dialogue state to the target state is operationalized as a sequence of sub-tasks, and each of these sub-tasks is part of a plan that has to be worked off sequentially in order to reach the target state. Plan-based accounts of dialogue provide a functional account of dialogue and have been successfully applied in computational dialogue systems for, e.g., timetable enquiries (Young & Proctor, 1989). At least partly due to the neat status of task-oriented conversational settings, respective study designs have been paradigmatic in linguistic research on dialogue.

Task-oriented dialogues, inter alia, pre-determine the following conversational ingredients:

– they define a dialogue goal and thereby a terminal dialogue state;
– they constrain the topics the interlocutors talk about to a high degree (up to move type predictability, modulo repairs etc.);
– they are cooperative rather than competitive;
– the dialogue goal determines the social relationship of the interlocutors (for instance, whether they have equal communicative rights or whether task-knowledge is asymmetrically distributed), and it does so regardless of the actual relationships that might obtain between the interlocutors;
– they are unilingual.

Each of the ingredients above is lacking in spontaneous, everyday conversation. Does this mean that spontaneous, everyday conversations also lack any structure of dialogue management? Answers to this question are in general given on the grounds of armchair theorizing or case studies. The feasibility of empirical approaches is simply hindered by the lack of respective data. The afore-given list can be extended by a further feature, namely the fact that it is easier to gather task-oriented dialogue data in experimental settings than to collect rampant spontaneous dialogue data. We have some spontaneous dialogue data that lack each of the task-based features listed above – see section 2 for a description. We focus on the latter two aspects here, namely social structure and bilingualism. The social dimension of language use, for instance, social deixis, is a well-known fact in pragmatics (Anderson & Keenan, 1985; Levinson, 2008). The influence of social structure on the structure of lexica has also been reported (Mehler, 2008). Yet, there is no account that scales the macroscopic level of language communities down to the microscopic level of dialogue. The data collected in SoSaBiEC aim at exactly this level of granularity of social structure and language structure: how does the social relationship between interlocutors affect the structure of their dialogue lexicon?

A special characteristic of SoSaBiEC is bilingualism. The subjects recorded speak Russian as well as German, and they make use of both languages in one and the same dialogue. Which dialogical functions are performed by the two languages seems to depend at least partly on who the addressees are, that is, on the social relationship between the interlocutors (Ries, to appear). This qualitative observation will be operationalized in terms of quantitative analyses that focus on the relationship-dependent, functional use of languages (cf. the outlook given in section 4).

According to the bi-partition of corpora – primary or raw data are coupled with secondary or annotation data (loosely related to Lemnitzer & Zinsmeister, 2006:40) – the following two sections describe the data material (section 2) and its annotation (section 3) in terms of functional dialogue acts. In the last section, we sketch some research questions we will address by means of SoSaBiEC in the very near future.

2. Primary Data

The primary data are made up of audio recordings of everyday conversations (Ries, to appear). The recorded subjects all know each other; most of them are even related. The observations focus on natural language use, and in particular on bilingual language use. The compiled corpus is authentic because the researcher who recorded the conversations is herself a member of the observed speech community. The speakers gave their consent to being recorded at any time and without prior notice, so the recordings were taken spontaneously and at real events, such as birthday parties. For the recordings, a digital recorder without an external microphone was used, so that recording did not attract too much attention. The recordings include telephone calls and face-to-face conversations. The length of the conversations varies from about three minutes up to three hours. Depending on the topic of the conversation, the number of the involved speakers differs: from two up to four speakers. In sum, there are about 300 minutes of data material covering six participants. Altogether ten conversations have been recorded. Four conversations have been analysed in detail and annotated because the participant constellation is obvious and definite: the participants come under the category parent–child or sibling. The six participants come from two families, not known to each other. As a working basis for the qualitative analysis, the recordings were transcribed. By way of illustration, an excerpt of the transcribed data is given:

01  F: NAME
    A: guten abend.
    F: hallo?
    A: hallo guten abend
05  F: nabend (.) hallo
    A: na wie gehts bei euch?
    F: gut
    A: gut?
    F: ja.
10  A: na что вы смотрели что к чему тама?
    F: ja а что там?

This is a sequence from a telephone call between father F and his daughter A. The conversation starts in German and, initiated by daughter A, there is an alternation into Russian (line 10). The qualitative analysis showed that through this language switch the speaker introduced the first topic of the telephone call and so managed the conversation opening. Results such as the one described are the main content of the annotation.

3. Annotation

The utterances produced by the participants have been transcribed using the Praat tool (http://www.fon.hum.uva.nl/praat/) at the level of orthographic words. That means that no phonetic features like accent or defective pronunciations are coded. However, spoken language exhibits regularities of its own kind, regularities we accounted for in speech transcription. Most prominently, words that are separated in written language might get fused into a phonetic word in spoken language. A common example in German, already part of the standard of the language, is “zum”, a melting of the preposition “zu” and the dative article “dem”. Meltings of this pattern are much more frequent in spoken German than acknowledged in standard German. The English language knows hard-wired combinations like “I’m”, which usually is not resolved to the full-fledged standard form “I am”. The annotation takes care of these demands by providing respective adaptations of annotations to spoken language. In order to reveal the dialogue-related functions performed by the utterances, we employed a dialogue act-based coding of contributions. Here, we follow the ISOCat (www.isocat.org) initiative for standardization of dialogue act annotations outlined by Bunt et al. (2010).

To be able to talk about dialogue-related functions and natural bilingual language use, language alternations were annotated with regard to their functions and roles in the current discourse. The important factor annotated is the function of the involved languages and the observed language alternations: that is, each language switch and its meaning on the level of conversation is annotated, for example the conversation opening. The differentiation by speakers is crucial for the examination of a connection between language use and social structure. The functional annotation labels have been derived from qualitative, ethnomethodological analyses by an expert researcher. The annotations made by this very researcher can be regarded as having the privileged status of a “gold standard”, since part of the expert’s knowledge is not only the pre- and posthistory of the data recorded, but also familiarity with the subjects involved, a kind of knowledge rather exclusive to our expert. However, since the annotations are a compromise between the qualitative and quantitative methods and methodologies that are brought together in this kind of research, we want to assess whether the ethnomethodological, functional annotation can be reproduced to a sufficient degree by other annotators. For this reason, we applied a reliability assessment in terms of inter-rater agreement of two raters’ annotations of a subset (one conversation) of the data. We use the agreement coefficient AC1 developed by Gwet (2001). The annotation of dialogue acts results in an AC1 of 0.61, the rating of functions results in an AC1 of 0.78. Two observations can be made: firstly, the functional dialogue annotation is reproducible – an outcome of 0.78 is regarded as “substantial” by Rietveld and van Hout (1993); secondly, the standardised dialogue act annotation scheme tailored for task-oriented dialogues can be applied with less agreement than the functional scheme custom-built for the more unconstrained everyday conversations. We take this as further evidence for the validity of the distinction of different dialogue types argued for in the introduction.
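For readers unfamiliar with the coefficient, the following minimal sketch (not the authors' implementation) shows how Gwet's AC1 can be computed for two raters over nominal categories; the toy labels are invented:

from collections import Counter

def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 for two raters with nominal categories (minimal sketch)."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    q = len(categories)
    # observed agreement
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # chance agreement in Gwet's sense: based on mean classification probabilities
    counts = Counter(ratings_a) + Counter(ratings_b)
    pi = {c: counts[c] / (2 * n) for c in categories}
    p_e = sum(pi[c] * (1 - pi[c]) for c in categories) / (q - 1)
    return (p_a - p_e) / (1 - p_e)

# invented toy annotations with three dialogue-act labels
a = ["inform", "question", "inform", "answer", "inform", "question"]
b = ["inform", "question", "answer", "answer", "inform", "inform"]
print(round(gwet_ac1(a, b), 2))   # prints 0.52 for this toy data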

4. Outlook

So far, we have finished data collection and annotation of the subset of SoSaBiEC data that interests us first, namely the data that involve parent–child and sibling dialogues. The next step is to test our undirected hypothesis by means of mapping the annotation data onto a variant of the dialogue lexicon model of Mehler, Lücking, and Weiß (2010). This model provides a graph-theoretical framework for classifying dialogue networks according to their structural similarity. Applying such a quantitative measure to mostly qualitative data allows us not only to study whether social structure imprints on language structure in human dialogue, but in particular to measure whether there is a traceable influence at all.

5. Acknowledgments

Funding of this work by the German Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung) is gratefully acknowledged. We also want to thank Barbara Job and Alexander Mehler for discussion and support.


6. References

Anderson, S. R., Keenan, E. L. (1985): “Deixis”. In: Language Typology and Syntactic Description. Ed. by Timothy Shopen. Vol. III. Cambridge: Cambridge University Press. Chap. 5, pp. 259–308.
Bock, K. (1996): “Language Production: Methods and Methodologies”. In: Psychonomic Bulletin & Review 3.4, pp. 395–421.
Bunt, H. et al. (May 21, 2010): “Towards an ISO Standard for Dialogue Act Annotation”. In: Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC’10). Ed. by Nicoletta Calzolari (Conference Chair) et al. Valletta, Malta: European Language Resources Association (ELRA).
Cohen, J. (1960): “A Coefficient of Agreement for Nominal Scales”. In: Educational and Psychological Measurement 20, pp. 37–46.
Gwet, K. (2001): Handbook of Inter-Rater Reliability. Gaithersburg, MD: STATAXIS Publishing Company.
Lemnitzer, L., Zinsmeister, H. (2006): Korpuslinguistik. Eine Einführung. Tübingen: Gunter Narr Verlag.
Levinson, S. C. (2008): “Deixis”. In: The Handbook of Pragmatics. Blackwell Publishing Ltd, pp. 97–121.
Litman, D. J., Allen, J. F. (1987): “A plan recognition model for subdialogues in conversations”. In: Cognitive Science 11.2, pp. 163–200.
Mehler, A. (2008): “On the Impact of Community Structure on Self-Organizing Lexical Networks”. In: Proceedings of the 7th Evolution of Language Conference (Evolang 2008). Ed. by Andrew D. M. Smith, Kenny Smith, and Ramon Ferrer i Cancho. Barcelona: World Scientific, pp. 227–234.
Mehler, A., Lücking, A., Weiß, P. (2010): “A Network Model of Interpersonal Alignment in Dialogue”. In: Entropy 12.6, pp. 1440–1483. doi: 10.3390/e12061440.
Pickering, M. J., Garrod, S. (2004): “Toward a Mechanistic Psychology of Dialogue”. In: Behavioral and Brain Sciences 27.2, pp. 169–190.
Ries, V. (2011): “da=kommt das=so quer rein. Sprachgebrauch und Spracheinstellungen Russlanddeutscher in Deutschland”. PhD thesis. Universität Bielefeld.
Rietveld, T., van Hout, R. (1993): Statistical Techniques for the Study of Language and Language Behavior. Berlin; New York: Mouton de Gruyter.
Young, S. J., Proctor, C. E. (1989): “The design and implementation of dialogue control in voice operated database inquiry systems”. In: Computer Speech and Language 3.4, pp. 329–353. doi: 10.1016/0885-2308(89)90002-8.


DIL, ein zweisprachiges Online-Fachwörterbuch der Linguistik (Deutsch-Italienisch)

Carolina Flinz
Universität Pisa
E-mail: c.flinz@ec.unipi.it

Abstract

DIL ist ein deutsch-italienisches Online-Fachwörterbuch der Linguistik. Es ist ein offenes Wörterbuch, und mit diesem Beitrag wird für eine mögliche Zusammenarbeit und Kollaboration plädiert. DIL ist noch im Aufbau begriffen; zurzeit ist nur die Sektion DaF komplett veröffentlicht, auch wenn andere Sektionen in Bearbeitung sind. Die Sektion LEX (Lexikographie), die zur Veröffentlichung ansteht, wird zusammen mit den wichtigsten Eigenschaften des Wörterbuches präsentiert.

Keywords: Fachwörterbuch, Linguistik, zweisprachig, deutsch-italienisch, Online-Wörterbuch

1. Einleitung

DIL (Dizionario tedesco-italiano di terminologia linguistica / deutsch-italienisches Fachwörterbuch der Linguistik) ist ein Online-Wörterbuch, das Lemmata aus dem Bereich der Linguistik und einiger ihrer Nachbardisziplinen auflistet. Es ist ein offenes Wörterbuch nach dem Muster von Wikipedia bzw. Glottopedia, um eine mögliche Beteiligung von Experten der unterschiedlichen Disziplinen zu fördern.

Im Handel und im Online-Medium existieren heute mehrere deutsche [1] und italienische [2] Wörterbücher der Linguistik, aber kein einziges Fachwörterbuch für das Sprachenpaar Deutsch-Italienisch. Hingegen ist der Bedarf an einem solchen „Instrument“ in Italien, sowohl für die universitäre Didaktik als auch für die Forschung, sehr stark: In einem Zeitraum, in dem das Fach „Deutsche Linguistik“ als Folge einer Universitätsreform (1999) einen starken Aufschwung erlebt hat, könnte DIL für die wissenschaftliche Kommunikation von großer Relevanz sein [3]. DIL könnte nämlich eine große Hilfe für die Suche nach Äquivalenten von deutschsprachigen linguistischen Fachtermini sein.

[1] Vgl. u.a. Bußmann, 2002; Conrad, 1985; Crystal, 1993; Ducrot & Todorov, 1975; Dubois, 1979; Glück, 2000; Heupel, 1973; Lewandowski, 1994; Meier & Meier, 1979; Stammerjohann, 1975; Ulrich, 2002.
[2] Vgl. u.a. Bußmann, 2007; Cardona, 1988; Casadei, 1991; Ceppellini, 1999; Courtes & Greimas, 1986; Crystal, 1993; Ducrot & Todorov, 1972; Severino, 1937; Simone, 1969.
[3] Die Relevanz von Fachwörterbüchern für die wissenschaftliche Kommunikation war Thema vieler lexikographischer Arbeiten: vgl. u.a. Wiegand, 1988; Pilegaard, 1994; Schaeder & Bergenholtz, 1994; Bergenholtz & Tarp, 1995; Hoffmann & Kalverkämper & Wiegand, 1998.

DIL ist ein Projekt des Deutschen Instituts der Fakultät Lingue e Letterature Straniere der Universität Pisa (daf, 2004:37), das 2008 online veröffentlicht worden ist (http://www.humnet.unipi.it/dott_linggensac/glossword) und an dem weiterhin gearbeitet wird. Es handelt sich um ein monolemmatisches Fachwörterbuch [4] (Wiegand, 1996:46): Die Lemmata sind in deutscher Sprache, während die Kommentarsprache Italienisch ist.

Ziele dieses Beitrags sind:
1) durch einen kurzen Überblick die wichtigsten Eigenschaften des Wörterbuches vorzustellen, wie Makro- und Mikrostruktur des Wörterbuches, Lemmabestand und Kriterien;
2) zu ähnlichen Arbeiten und zukünftigen Kollaborationen an diesem Projekt anzuregen, insbesondere für die geplante Sektion der Computerlinguistik;
3) die gerade neu erstellte Sektion LEX (Lexikographie) vorzustellen.

2. Makro- und Mikrostruktur

Die Makrostruktur und die Mikrostruktur von DIL wurden natürlich von der Funktion des Wörterbuches und der intendierten Benutzergruppe beeinflusst [5]. Die Benutzerbedürfnisse wurden mit Hilfe von Fragebögen, die sowohl im Printmedium als auch im Onlineformat versendet wurden, und einer Analyse der möglichen Benutzersituationen erkundet [6]. Jeder Benutzer kann weiterhin den Fragebogen von der Homepage aufrufen und beantworten, so dass ein ständiger Kontakt mit dem Benutzer vorhanden ist.

[4] Eine bilemmatische Ergänzung des Wörterbuches ist nicht ausgeschlossen.

DIL wendet sich im Allgemeinen an ein heterogenes Publikum: Es ist sowohl für Experten als auch für Laien gedacht, so dass die potentiellen Benutzer sowohl Lerner und Lehrende in den Bereichen Germanistik, Romanistik, Linguistik oder Deutsch/Italienisch als Fremdsprache sein können als auch Lehrbuchautoren, Lexikographen oder Fachakademiker. Das Online-Medium ist dank seiner Flexibilität in dieser Hinsicht von großem Vorteil.

DIL kann nämlich in folgenden Benutzungssituationen verwendet werden:
1) Der Benutzer sucht bestimmte fachliche Informationen, und das Wörterbuch, laut seiner Werkzeugnatur, erfüllt das Bedürfnis;
2) Der Benutzer greift zum Wörterbuch, um ein Kommunikationsproblem in der Textproduktion, Textrezeption oder Übersetzung zu lösen. DIL erfüllt deswegen mehrere Funktionen: Es kann sowohl für aktive/produktive als auch passive/rezeptive Tätigkeiten verwendet werden.
   a. Der italophone Benutzer (primärer Benutzer) wird es als dekodierendes Wörterbuch für die Herübersetzung verwenden, d.h. wenn er ein deutsches Fachwort verstehen will oder dessen Übersetzung sucht, oder wenn er spezifischere Informationen braucht und sich weiter informieren und weiterbilden möchte;
   b. Der deutschsprachige Benutzer wird es als enkodierendes Wörterbuch für die Hinproduktion benutzen, d.h. wenn er ins Italienische übersetzt und Fachtexte in italienischer Sprache erstellt.

Die Makrostruktur von DIL vereinigt sowohl Eigenschaften der linguistischen Printwörterbücher (1.) als auch der Onlinewörterbücher (2.):
1. Die Strukturierung der Umtexte im Printmedium beeinflusste den aktuellen Stand. DIL verfügt nämlich über folgende nach wissenschaftlichen Kriterien verfasste Umtexte: Einleitung, Abkürzungsverzeichnis, Benutzerhinweise, Redaktionsnormen, Register der Einträge [7];
2. Die Vorteile der Online-Wörterbücher wurden auch zum größten Teil ausgenutzt:
   a. Neue Einträge und neue Sektionen können sehr schnell veröffentlicht werden;
   b. DIL kann ständig erneuert, ergänzt und korrigiert werden;
   c. Es verfügt über ein klar strukturiertes Menü, in dem die wichtigsten Umtexte verlinkt sind, so dass der Benutzer schnell die gewünschten Informationen erreichen kann;
   d. Es verwendet sowohl interne [8] als auch externe [9] Hyperlinks;
   e. Es bietet dem Benutzer nützliche Informationen, wie die TOP 10 (vgl. u.a. die „zuletzt gesuchten“ oder „die am meisten geklickten Lemmata“);
   f. Es bietet wichtige Instrumente, wie die Suchmaschine, die Feedbackseite, das Login-Feld etc.

[5] Vgl. u.a. Storrer & Harriehausen, 1998; Barz, 2005.
[6] Für einen Überblick über mögliche Techniken zur Erforschung von Benutzerbedürfnissen vgl. u.a. Barz, 2005; Ripfel & Wiegand, 1988; Schaeder & Bergenholtz, 1994; Wiegand, 1977.

Die Mikrostruktur von DIL bietet sowohl sprachliche als auch sachliche Informationen und ist auf der Grundlage, dass der Erst-Adressat der italophone Benutzer ist, strukturiert worden. Jeder Eintrag wird von folgenden Angaben komplettiert (eine rein illustrative Skizze folgt im Anschluss an die Liste):
1) grammatische Angaben (Genus und Numerus);
2) das Äquivalent / die Äquivalente in italienischer Sprache;
3) die Markierung als Information zum fachspezifischen Bereich des Lemmas;
4) die enzyklopädische Definition;
5) Beispiele [10];
6) Angaben zur Paradigmatik, wie Synonyme; thematisch verbundene Lemmata;
7) bibliographische Angaben.

[7] Eine empirische Analyse linguistischer Online-Fachwörterbücher zeigte, wie „unwissenschaftlich“ Online-Wörterbücher oft mit Umtexten umgehen. Nur 45 % der analysierten Werkzeuge verfügten über solche Texte, und nur in seltenen Ausnahmen wurde wissenschaftlichen Kriterien gefolgt (Flinz, 2010:72).
[8] Der Benutzer kann von einem Eintrag zu thematisch verbundenen Lemmata springen.
[9] Es sind sowohl sprachliche Wörterbücher, wie Canno.net und Grammis, als auch sachliche, wie Glottopedia und DLM, verlinkt.
[10] Alle Lemmata folgen im Prinzip dem gleichen Schema, da die Standardisierung der Mikrostruktur eine wichtige Voraussetzung war. Da aber Beispiele nur in bestimmten Kontexten behilflich sind, wurden sie nur gelegentlich eingefügt.
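Rein zur Veranschaulichung (und nicht als Bestandteil von DIL selbst) lässt sich diese Mikrostruktur etwa als folgendes Python-Objekt skizzieren; Beispiel-Lemma, Feldnamen und alle Werte sind frei erfunden:

# Hypothetische Skizze eines Eintrags gemaess der oben genannten Mikrostruktur;
# Lemma, Werte und Feldnamen sind frei gewaehlt und nicht DIL entnommen.
eintrag = {
    "lemma": "Fachwörterbuch",
    "grammatik": {"genus": "n", "numerus": "Sg."},            # 1) grammatische Angaben
    "aequivalente": ["dizionario specialistico"],             # 2) Äquivalent(e)
    "markierung": "LEX",                                      # 3) fachspezifischer Bereich
    "definition": "Wörterbuch, das den Wortschatz eines Fachgebiets beschreibt.",  # 4) erfundenes Beispiel
    "beispiele": [],                                          # 5) nur gelegentlich belegt
    "paradigmatik": {"synonyme": [], "verwandte_lemmata": ["Fachlexikographie"]},  # 6)
    "bibliographie": ["vgl. Schaeder & Bergenholtz, 1994"],   # 7)
}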

3. Lemmabestand und Kriterien

Der Lemmabestand von DIL kann nur „eingeschätzt“ werden. Die Gründe dafür können wie folgt zusammengefasst werden:
1) Erstens handelt es sich um ein Online-Wörterbuch, das sich noch in der Projekt- und Testphase befindet;
2) Zweitens soll das Werk, wie es sein Format vorgibt, nicht als etwas Statisches und Vollendetes gesehen werden, sondern in ständiger Erweiterung und Erneuerung. Aus einem Vergleich der existierenden linguistischen Fachwörterbücher kann aber eine ungefähre Zahl von ca. 2.000 Lemmata errechnet werden, die allerdings ständig erweitert oder geändert werden kann.

Primärquellen waren allgemeine Wörterbücher der Linguistik (deutsch- wie italienischsprachige) sowie spezifische deutsche und italienische Glossare der Disziplin Lexikographie und Fachlexikographie. Es wurden Quellen sowohl im gedruckten als auch im Online-Medium herangezogen. Sekundärquellen waren Handbücher aus dem Bereich der jeweiligen Disziplin (für die Sektion LEX waren es zum Beispiel Standardwerke der Disziplin Lexikographie und Fachlexikographie) sowohl in deutscher als auch in italienischer Sprache.

Hauptkriterien für die Auswahl der Lemmata sind Frequenz und Relevanz (Bergenholtz, 1989:775) [11]:
1) Es wurde eine entsprechende Analyse der existierenden lexikographischen Wörterbücher sowohl im Print- als auch im Onlineformat hinsichtlich der dort aufgeführten Lemmata des jeweiligen Bereiches durchgeführt;
2) Es wurde ein kleiner Korpus von Fachtexten des betreffenden Faches hergestellt. Die im Endregister enthaltenen Termini wurden in Excel-Tabellen eingetragen, und die entstehenden Listen wurden auf Grund von Frequenzkriterien verglichen. Die aus diesem Prozess entstehende Endliste wurde zusätzlich auf der Basis des Relevanzkriteriums ergänzt (siehe die Skizze im Anschluss).

[11] Korpusanalysen im Sinne von automatischen Analysen von Textkorpora mit anschließender Korpusauswertung (Frequenzwerte) wurden bis jetzt ausgeschlossen. Jedoch wäre es interessant zu sehen, inwiefern eine solche Analyse mit einer Integrierung des Relevanzkriteriums die erhaltenen Ergebnisse widerspiegeln könnte oder nicht.
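Der unter Punkt 2 beschriebene Abgleich der Termlisten lässt sich (hier lediglich als frei erfundene Skizze, nicht als Abbild des tatsächlichen Excel-gestützten Arbeitsablaufs) programmatisch etwa so andeuten:

from collections import Counter

# Frei erfundene Skizze: Termlisten aus mehreren Quellen zusammenfuehren und
# nach Frequenz (Anzahl der Quellen, die einen Terminus fuehren) ordnen.
def kandidaten(termlisten, min_quellen=2):
    zaehler = Counter(term for liste in termlisten for term in set(liste))
    return [term for term, n in zaehler.most_common() if n >= min_quellen]

quelle_a = ["Makrostruktur", "Lemma", "Umtext"]
quelle_b = ["Lemma", "Mikrostruktur", "Umtext"]
quelle_c = ["Lemma", "Wörterbuchfunktion"]
print(kandidaten([quelle_a, quelle_b, quelle_c]))   # ['Lemma', 'Umtext']

Die so entstehende Liste würde anschließend, wie im Beitrag beschrieben, anhand des Relevanzkriteriums ergänzt.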

Die Einträge sind in strikt alphabetischer Reihenfolge angeordnet; die typischen Nachteile dieser Ordnung können dank des Online-Formats teilweise aufgehoben werden, da die begriffssystematischen Zusammenhänge durch verlinkte Verweise oft verdeutlicht werden.

Das Wörterbuch enthält zurzeit eine vollständige Sektion (DaF) mit 240 Einträgen, während andere Bereiche in Erarbeitung sind:
1) Historische Syntax;
2) Wortbildung;
3) Textlinguistik;
4) Fachsprachen.

Eine neu erstellte Sektion LEX (Lexikographie) wurde gerade fertiggestellt und steht zur Veröffentlichung an. Sie enthält Lemmata aus dem Bereich der Lexikographie und Fachlexikographie sowie Metalexikographie und Metafachlexikographie.

4. Die Sektion: LEX

Die Sektion LEX wird voraussichtlich ca. 120 Einträge (Stand Juni 2011) enthalten, die sich auf die wichtigsten Aspekte des Fachbereiches der Lexikographie konzentrieren. Dabei liegt das Augenmerk auf folgenden Themen:
a. Lexikographie;
b. Fachlexikographie;
c. Wörterbuchforschung;
d. Wörterbuchtypologie;
e. Wörterbuchbenutzer und Wörterbuchbenutzung;
f. Wörterbuchfunktionen;
g. lexikographische Kriterien;
h. Makrostruktur;
i. Umtexte;
j. Mediostruktur;
k. Mikrostruktur.

Im Folgenden wird ein Beispiel eines Eintrags aus dem Bereich LEX gezeigt (Bild 1). Es kann als Muster für die Erarbeitung von neuen Einträgen gelten. Jeder Autor kann die produzierten Lemmata an die Redaktion des Wörterbuches senden; nach der redaktionellen Prüfung wird der Eintrag veröffentlicht und mit der Abkürzung des Autorennamens versehen.

Bild 1: Das Lemma „Fachlexikographie“


5. Literatur

Abel, A. (2006): Elektronische Wörterbücher: Neue Wege und Tendenzen. In San Vincente, F. (Hg.), Akten der Tagung „Lessicografia bilingue e traduzione: metodi, strumenti e approcci attuali“ (Forlì, 17.–18.11.2005). Polimetrica Publisher (Open Access Publications), S. 35–56.
Almind, R. (2005): Designing Internet Dictionaries. Hermes, 34, S. 37–54.
Barz, I., Bergenholtz, H., Korhonen, J. (2005): Schreiben, Verstehen, Übersetzen, Lernen. Zu ein- und zweisprachigen Wörterbüchern mit Deutsch. Frankfurt a. M.: Peter Lang.
Bergenholtz, H. (1989): Probleme der Selektion im allgemeinen einsprachigen Wörterbuch. In Hausmann, F. J. et al. (Hg.), Wörterbücher: ein internationales Handbuch zur Lexikographie. Band 1. Berlin & New York: de Gruyter, S. 773–779.
Bergenholtz, H., Tarp, S. (1995): Manual of LSP Lexicography. Preparation of LSP dictionaries – problems and suggested solutions. Amsterdam, Netherlands & Philadelphia: J. Benjamins.
Foschi-Albert, M., Hepp, M. (2004): Zum Projekt: Bausteine zu einem deutsch-italienischen Wörterbuch der Linguistik. In daf Werkstatt, 4, S. 43–69.
Hoffmann, L., Kalverkämper, H., Wiegand, H.E. (Hg.) (1999): Fachsprachen. Handbücher zur Sprach- und Kommunikationswissenschaft (HSK 14.2). Berlin & New York: de Gruyter.
Pilegaard, M. (1994): Bilingual LSP Dictionaries. User benefit correlates with elaborateness of „explanation“. In Bergenholtz, H. & Schaeder, B., S. 211–228.
Schaeder, B., Bergenholtz, H. (1994): Fachlexikographie. Fachwissen und seine Repräsentation in Wörterbüchern. Tübingen: G. Narr.
Ripfel, M., Wiegand, H.E. (1988): Wörterbuchbenutzungsforschung. Ein kritischer Bericht. In Studien zur Neuhochdeutschen Lexikographie VI, 2. Teilband, S. 482–520.
Storrer, A., Harriehausen, B. (1998): Hypermedia für Lexikon und Grammatik. Tübingen: G. Narr.
Wiegand, H.E. (1977): Nachdenken über Wörterbücher. Aktuelle Probleme. In Drosdowski, H., Henne, H. & Wiegand, H.E., Nachdenken über Wörterbücher. Mannheim: Bibliographisches Institut / Dudenverlag, S. 51–102.
Wiegand, H.E. (Hg.) (1996): Wörterbücher in der Diskussion II. Vorträge aus dem Heidelberger Lexikographie-Kolloquium. Tübingen: Lexicographica Series Major 70.
Wiegand, H.E. (1988): Was ist eigentlich Fachlexikographie? In Munske, H.H., von Polenz, P., Reichmann, O. & Hildebrandt, R. (Hg.), Deutscher Wortschatz. Lexikologische Studien. Berlin & New York: de Gruyter, S. 729–790.



Knowledge Extraction and Representation: the EcoLexicon Methodology<br />

Pilar León Araúz, Arianne Reimerink<br />

Department of Translation and Interpreting, University of Granada<br />

Buensuceso 11, 18002, Granada, Spain<br />

E-mail: pleon@ugr.es, arianne@ugr.es<br />

Abstract<br />

EcoLexicon, a multilingual terminological knowledge base (TKB) on the environment, provides an internally coherent information<br />

system which aims at covering a wide range of specialized linguistic and conceptual needs. Knowledge is extracted through corpus<br />

analysis. Then it is represented and contextualized in several dynamic and interrelated information modules. This methodology<br />

solves two challenges derived from multidimensionality: 1) it offers a qualitative criterion to represent specialized concepts<br />

according to recent research on situated cognition (Barsalou, 2009), and 2) it is a quantitative and efficient solution to the problem of<br />

information overload.<br />

Keywords: knowledge extraction, knowledge representation, EcoLexicon, multidimensionality, context<br />

1. Introduction

EcoLexicon (http://ecolexicon.ugr.es) is a multilingual knowledge base on the environment. So far it has 3,283 concepts and 14,695

terms in Spanish, English and German. Currently, two<br />

more languages are being added: Modern Greek and<br />

Russian. It is aimed at users such as translators, technical<br />

writers, environmental experts, etc., who can access it through a user-friendly visual interface with different modules devoted to conceptual, linguistic, and graphical information.

In this paper, we will focus on some of the steps applied<br />

to extract and represent conceptual knowledge in<br />

EcoLexicon. According to Meyer et al. (1992),<br />

terminological knowledge bases (TKBs) should reflect<br />

conceptual structures in a similar way to how concepts<br />

relate in the human mind. The organization of semantic<br />

information in the brain should thus underlie any<br />

theoretical assumption concerning the retrieval and<br />

acquisition of specialized knowledge concepts as well as<br />

the design of specialized knowledge resources (Faber,<br />

2010). In Section 2, we explain how knowledge is<br />

extracted through corpus analysis. In Section 3, we show<br />

how conceptual knowledge is represented and<br />

contextualized in dynamic and interrelated networks.<br />

2. Conceptual Knowledge Extraction<br />

According to corpus-based studies, when a term is<br />

studied in its linguistic context, information about its<br />

meaning and its use can be extracted (Meyer &<br />

Mackintosh, 1996). In EcoLexicon, the corpus consists of specialized texts (e.g. scientific journal articles, theses), semi-specialized texts (textbooks, manuals, etc.) and

texts for the general public, all in the multidisciplinary<br />

domain of the environment. Each language has a separate<br />

corpus and the knowledge is extracted bottom-up from<br />

each of the corpora. The underlying ontology is language<br />

independent and based on the knowledge extracted from<br />

all the corpora. The extraction of conceptual knowledge<br />

combines direct term searches and knowledge pattern<br />

(KP) analysis. According to many studies on the subject,<br />

KPs are considered one of the most reliable methods for<br />

knowledge extraction (Barrière, 2004). Normally, the most recurrent KPs for each conceptual relation identified in previous research are

used to find related term pairs (Auger & Barrière, 2008).<br />

Afterwards, these terms are used for direct term searches<br />

to find new KPs and relations. Therefore, the<br />

methodology consists of the cyclic repetition of both<br />

procedures.<br />
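As a minimal sketch of this kind of pattern-based extraction, the Python fragment below matches a few knowledge patterns against concordance lines for a term and returns conceptual propositions. The pattern inventory, the helper function and the example sentences are invented for this illustration; they do not reproduce EcoLexicon's actual KP lists or its (largely manual) extraction workflow.

import re

# Illustrative knowledge patterns (KPs) for two conceptual relations.
KNOWLEDGE_PATTERNS = {
    "caused_by": ["caused by", "due to", "as a result of"],
    "has_result": ["results in", "leads to"],
}

def extract_propositions(term, sentences):
    """Return (concept, relation, related_concept) triples found near `term`."""
    triples = []
    for sentence in sentences:
        if term.lower() not in sentence.lower():
            continue  # direct term search: only inspect concordance lines for the term
        for relation, patterns in KNOWLEDGE_PATTERNS.items():
            for kp in patterns:
                # capture a short phrase following the knowledge pattern
                match = re.search(rf"\b{kp}\b\s+([A-Za-z][A-Za-z -]+)", sentence, re.IGNORECASE)
                if match:
                    triples.append((term.upper(), relation, match.group(1).strip().upper()))
    return triples

concordance = [
    "Coastal erosion is mainly caused by wave action",
    "Erosion of the dunes leads to profile steepening",
]
print(extract_propositions("erosion", concordance))
# [('EROSION', 'caused_by', 'WAVE ACTION'), ('EROSION', 'has_result', 'PROFILE STEEPENING')]

In the cyclic methodology described above, the related terms returned by such matches would in turn become the starting point for new direct term searches.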

When searching for the term EROSION, conceptual<br />

concordances show how different KPs convey different<br />

relations with other specialized concepts. The main<br />

relations are caused_by, affects, has_location and<br />

has_result, which highlight the procedural nature of the<br />

concept and the important role played by<br />

non-hierarchical relations.<br />

In Figure 1, EROSION is related to its diverse kinds of agents, such as STORM SURGE (1, 7), WAVE ACTION (2, 13), RAIN (3), CONSTRUCTION PROJECTS (6) and HUMAN-INDUCED FACTORS (11). They can be retrieved thanks to all KPs expressing the relation caused_by, such as resultant (1), agent for (2, 3), due to (6, 7), and responsible for (11). This relation can also be conveyed through compound names such as flood-induced (10) or storm-caused (12) and any expression containing cause as a verb or noun: one of the causes of (9), cause (4, 5, 8) and caused by (14). EROSION is also linked to the patients it affects, such as WATER (15), SEDIMENTS (16), and BEACHES (17). However, the affected entities, or patients, are often equivalent to locations (e.g. if EROSION affects BEACHES it actually takes place at the BEACH). The difference lies in the kind of KPs linking the propositions. The affects relation is often reflected through the preposition of (10) or verbs like threatens (18), damaged by (17) or provides (19), whereas the has_location relation is conveyed through prepositions linked to directions (around, 21; along, 22; downdrift, 23) or spatial expressions such as takes place (24). In this way, EROSION appears linked to the following locations: LITTORAL BARRIERS (21), COASTS (22) and STRUCTURES (23). Result is an essential dimension in the description of any process, since it also has certain effects, which can be the creation of a new entity (SEDIMENTS, 25; MARSHES, 29; BAYS, 31) or the beginning of another process (SEAWATER INTRUSION, 31; PROFILE STEEPENING, 32).

Figure 1: Non-hierarchical relations associated with EROSION

Figure 2: Hierarchical relations associated with EROSION

All these related concepts are quite heterogeneous. They<br />

belong to different paradigms in terms of category<br />

membership or hierarchical range. For instance, some of<br />

the agents of EROSION are natural (WIND, WAVE ACTION)<br />

or artificial (JETTY, MANGROVE REMOVAL) and others are<br />

general concepts (STORM) or very specific (MEANDERING<br />

CHANNEL). This explains why knowledge extraction must<br />

still be performed manually, but it also illustrates one of<br />

the major problems in knowledge representation:<br />

multidimensionality (Rogers, 2004).<br />

This is better exemplified in the concordances in Figure<br />

2, since multidimensionality is most often codified in the<br />

is_a relation. In the scientific discourse community,


concepts are not always described in the same way<br />

because they depend on perspective and subject-fields.<br />

For instance, EROSION is described as a natural process of<br />

REMOVAL (33), a GEOMORPHOLOGICAL PROCESS (34), a<br />

COASTAL PROCESS (35) or a STORMWATER IMPACT (36).<br />

The first two cases can be considered traditional<br />

ontological hyperonyms. The choice of any of them<br />

depends on the upper-level structure of the<br />

representational system and its level of abstraction.<br />

However, COASTAL PROCESS and STORMWATER IMPACT<br />

frame the concept in more concrete subject-fields and<br />

referential settings. The same applies to subtypes, where<br />

the multidimensional nature of EROSION is clearly shown.<br />

It can thus be classified according to the dimensions of<br />

result (SHEET, RILL, GULLY, 37; DIFFERENTIAL EROSION,<br />

38), direction (LATERAL, 39; HEADWARD EROSION, 49),<br />

agent (WAVE, 41; WIND, 43) and patient (SEDIMENT, 47;<br />

DUNE, 48; SHORELINE EROSION, 49).<br />

3. Dynamic Knowledge Representation<br />

Since categorization is a dynamic context-dependent<br />

process, the representation and acquisition of specialized<br />

knowledge should certainly focus on contextual<br />

variation. Barsalou (2009: 1283) states that a concept<br />

produces a wide variety of situated conceptualizations in<br />

specific contexts. Accordingly, dynamism in the<br />

environmental domain comes from the effects of context<br />

on the way concepts are interrelated. Multidimensionality<br />

is commonly regarded as a way of enriching traditional<br />

static representations (León Araúz and Faber, 2010).<br />

However, in the environmental domain it has caused a<br />

great deal of information overload, which ends up<br />

jeopardizing knowledge acquisition. This is mainly<br />

caused by versatile concepts, such as WATER, which are<br />

usually top-level general concepts involved in a myriad<br />

of events.<br />

Our claim is that any specialized domain contains<br />

sub-domains in which conceptual dimensions become<br />

more or less salient depending on the activation of<br />

specific contexts. As a result, a more believable<br />

representational system should account for<br />

re-conceptualization according to the situated nature of<br />

concepts. In EcoLexicon, this is done by dividing the<br />

global environmental specialized field into different

contextual domains: HYDROLOGY, GEOLOGY,<br />

BIOLOGY, METEOROLOGY, CHEMISTRY,<br />


ENGINEERING, WATER TREATMENT, COASTAL<br />

PROCESSES and NAVIGATION.<br />

Figure 3: EROSION context free network<br />

Nevertheless, not only versatile concepts, such as WATER,<br />

are constrained, since information overload can also<br />

affect any other concept that is somehow linked with<br />

versatile ones. For instance, Figure 3 shows EROSION in a<br />

context-free network, which appears overloaded mainly<br />

because it is strongly linked to WATER, since this is one of<br />

its most important agents.<br />

Figure 4: EROSION in the GEOLOGY domain<br />

Contextual constraints are neither applied to individual<br />

concepts nor to individual relations; instead, they are

applied to each conceptual proposition. When constraints<br />

are applied, EROSION is just linked to propositions<br />

belonging to the context of GEOLOGY (Figure 4) or<br />


HYDROLOGY (Figure 5).<br />


Figure 5: EROSION in the HYDROLOGY domain<br />
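The proposition-level filtering described above can be pictured with the following minimal sketch. The propositions, domain labels and relation names are invented toy data, not EcoLexicon's actual content; the point is only that each proposition carries the domains in which it holds, and the network shown for a concept is obtained by filtering against the active domain.

# Each conceptual proposition carries the set of contextual domains in which it holds.
PROPOSITIONS = [
    ("EROSION", "caused_by", "WATER", {"GEOLOGY", "HYDROLOGY"}),
    ("WIND EROSION", "type_of", "EROSION", {"GEOLOGY"}),
    ("FLUVIAL EROSION", "type_of", "EROSION", {"GEOLOGY", "HYDROLOGY"}),
    ("EROSION", "affects", "BEACH", {"COASTAL PROCESSES"}),
]

def network_for(concept, domain=None):
    """Return the propositions displayed for `concept`, optionally restricted to a domain."""
    return [p[:3] for p in PROPOSITIONS
            if concept in (p[0], p[2]) and (domain is None or domain in p[3])]

print(len(network_for("EROSION")))           # 4: the context-free network is overloaded
print(network_for("EROSION", "HYDROLOGY"))   # only the propositions that hold in HYDROLOGY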

Comparing both networks and especially focusing on<br />

EROSION and WATER, the following conclusions can be<br />

drawn. The number of conceptual relations changes from<br />

one network to another, as EROSION is not equally<br />

relevant in both domains. EROSION is a prototypical<br />

concept of the GEOLOGY domain, which is why it shows

more propositions. Nevertheless, since it is also strongly<br />

linked with WATER, the HYDROLOGY domain is also<br />

essential in the representation of EROSION. Relation types<br />

do not substantially change from one network to the<br />

other, but the GEOLOGY domain shows a greater<br />

number of type_of relations. This is due to the fact that<br />

the HYDROLOGY domain only includes types of<br />

EROSION whose agent is WATER, such as FLUVIAL<br />

EROSION and GLACIER EROSION. The GEOLOGY domain<br />

includes those and others, such as WIND EROSION, SHEET<br />

EROSION, ANTHROPIC EROSION, etc. The GEOLOGY<br />

domain, on the other hand, also includes concepts that are<br />

not related to HYDROLOGY such as ATTRITION because<br />

there is no WATER involved.<br />

By contrast, WATER displays more relations in the

HYDROLOGY domain. This is caused by the fact that<br />

WATER is a much more prototypical concept in<br />

HYDROLOGY. Therefore, its first hierarchical level<br />

shows more concepts. For example, in GEOLOGY, there<br />

are fewer WATER subtypes because the network only shows

those that are related to the geological cycle (MAGMATIC<br />

WATER, METAMORPHIC WATER, etc.). In HYDROLOGY,<br />

there are more WATER subtypes related to the<br />

hydrological cycle itself (SURFACE WATER,<br />

GROUNDWATER, etc.). Even the shape of each network<br />

illustrates the prototypical effects of WATER or EROSION.<br />

In Figure 4, EROSION is displayed in a radial structure that<br />

shows it as a central concept in GEOLOGY, whereas in<br />

Figure 5, the asymmetric shape of the network implies<br />

that, more than EROSION, WATER is the prototypical<br />

concept of HYDROLOGY.<br />

4. Acknowledgements<br />

This research has been carried out in project<br />

FFI2011-22397/FILO funded by the Spanish Ministry of

Science and Innovation.<br />

5. References<br />

Auger, A., Barrière, C. (2008): Pattern-based approaches<br />

to semantic relation extraction: A state-of-the-art.<br />

Special Issue on Pattern-Based Approaches to<br />

Semantic Relation Extraction, Terminology, 14(1),<br />

pp. 1–19<br />

Barrière, C. (2004): Knowledge-rich contexts discovery.<br />

In Proceedings of the 17th Canadian Conference on<br />

Artificial Intelligence (AI’2004). May 17–19, London,<br />

Ontario, Canada.<br />

Barsalou, L.W. (2009): Simulation, situated<br />

conceptualization and prediction. Philosophical<br />

Transactions of the Royal Society of London:<br />

Biological Sciences, 364, pp. 1281–1289.<br />

Faber, P. (2010): Conceptual modelling in specialized<br />

knowledge resources. In XII International Conference<br />

Cognitive Modelling in Linguistics. September,<br />

Dubrovnik.<br />

León Araúz, P., Faber, P. (2010): Natural and contextual<br />

constraints for domain-specific relations. In<br />

Proceedings of Semantic relations. Theory and<br />

Applications. 18–21 May, Valetta, Malta.<br />

Meyer, I., Mackintosh, K. (1996): The corpus from a

terminographer’s viewpoint. International Journal of<br />

Corpus Linguistics, 1(2), pp. 257–285.<br />

Meyer, I., Bowker, L., Eck, K. (1992): COGNITERM:<br />

An experiment in building a knowledge-based term<br />

bank. In Proceedings of Euralex ’92, pp. 159–172.<br />

Rogers, M. (2004): Multidimensionality in concepts<br />

systems: a bilingual textual perspective. Terminology,<br />

10(2), pp. 215–240.



Processing Multilingual Customer Contacts via Social Media<br />

Michaela Geierhos, Yeong Su Lee, Matthias Bargel<br />

Center for Information and Language Processing (CIS)<br />

Ludwig Maximilian University of Munich<br />

Geschwister-Scholl-Platz 1, D-80539 München, Germany<br />

E-mail: micha@cis.uni-muenchen.de, yeong@cis.uni-muenchen.de, matthias@cis.uni-muenchen.de<br />

Abstract<br />

Within this paper, we will describe a new approach to customer interaction management by integrating social networking channels<br />

into existing business processes. Until now, contact center agents still read these messages and forward them to the persons in charge of customer service in the company. But with the rise of Web 2.0 and social networking, clients are more likely to communicate with companies via Facebook and Twitter instead of filling in contact forms or sending e-mail requests. In order to maintain

an active communication with international clients via social media, the multilingual consumer contacts have to be categorized and<br />

then automatically assigned to the corresponding business processes (e.g. technical service, shipping, marketing, and accounting).<br />

This allows the company to follow general trends in customer opinions on the Internet, but also record two-sided communication for<br />

customer relationship management.<br />

Keywords: classification of multilingual customer contacts, contact center application support, social media business integration<br />

1. Introduction<br />

Considering that Facebook alone had more than 750<br />

million active users in August 2011 (http://www.facebook.com/press/info.php?statistics), it becomes apparent

that Facebook currently is the most preferred medium by<br />

consumers and companies alike. Since many businesses<br />

are moving to online communities as a means of<br />

communicating directly with their customers, social<br />

media has to be explored as an additional communication<br />

channel between individuals and companies. While the<br />

English speaking consumers on Facebook are more likely<br />

to respond to communication rather than to initiate<br />

communication with an organization (Browne et al.,<br />

2009), the German speaking community in turn directly<br />

contacts the companies. Therefore, some German<br />

enterprises already have regularly updated Facebook<br />

pages for customer service and support, e.g. Telekom.<br />

Using the traditional communication channels such as<br />

telephone and e-mail, there are already established<br />

approaches and systems for handling incoming requests. They are

used by companies to manage all client contacts through<br />

a variety of mediums such as telephone, fax, letter, e-mail,<br />

and online live chat. Contact center agents are therefore<br />


responsible for assigning all customer requests to internal

business processes. However, social networking has not<br />

yet been integrated into customer interaction<br />

management tools.<br />

1.1. Related Work<br />

With the growth of social media, companies and<br />

customers now use sites such as Facebook and Twitter to<br />

share information and provide support. More and more<br />

integrated cross-platform campaigns are dealing with<br />

product opinion mining or providing web traffic statistics<br />

to analyze customer behavior. There are plenty of

commercial solutions, of varying quality, for these tasks,<br />

e.g. GoogleAlerts, BuzzStream, Sysomos, Alterian,<br />

Visible Technologies, and Radian6.<br />

The current trend is towards virtual contact centers that integrate a company's fan profiles on social networking sites. Such a virtual contact center processes the customer contacts and forwards them to the company's

service and support team. For instance, Eptica provides a<br />

commercial tool for customer interaction management<br />

via Facebook.<br />

Other monitoring systems try to predict election results<br />

(Gryc & Moilanen, 2010) or success of movies and music<br />




(Krauss et al., 2008) by using scientific analysis of<br />

opinion polls or doing sentiment analysis on special web<br />

blogs or online forum discussions. Another relevant issue<br />

is the topic and theme identification as well as sentiment<br />

detection. Since blogs consist of news or messages<br />

dealing with various topics, blog content has to be<br />

divided into several topic clusters (Pal & Saha, 2010).<br />

1.2. Towards a Multilingual Social Media<br />

Customer Service<br />

Our proposed solution towards a web monitoring and<br />

customer interaction management system is quite simple.<br />

We focus on a modular architecture fully configurable for<br />

all components integrated in its work-flow (e.g. software,<br />

data streams, and human agents for customer service).<br />

Our first prototype, originally designed for processing<br />

customer messages posted on social networking sites<br />

about mobile-phone specific issues, can also deal with<br />

other topics and use different text types such as e-mails,<br />

blogs, RSS feeds etc. Unlike the commercial monitoring<br />

systems mentioned above, we concentrate on a linguistic,<br />

rule-based approach for message classification and<br />

product name recognition. One of its core innovations is<br />

its paraphrasing module for intra- and inter-lingual<br />

product name variations because of different national and<br />

international spelling rules or habits. By mapping product<br />

name variations to an international canonical form, our<br />

system allows for answering questions like Which<br />

statements are made about this mobile phone in which<br />

languages/in which social networks/in which countries?<br />

Its product name paraphrasing engine is designed in such<br />

a way that standard variants are assigned automatically,<br />

regular variants are assigned semi-automatically and<br />

idiosyncratic variants can be added manually. Moreover,<br />

our system can be adapted according to user’s language<br />

needs, i.e. the application can be easily extended on<br />

further natural languages. Until now, our prototype can<br />

deal with three very different languages: German, Greek,<br />

and Korean. It therefore provides maximum flexibility to<br />

service providers by enabling multiple services with only<br />

one system.<br />


2. System Overview<br />

Since customers first share their problems with a social<br />

networking community before directly addressing the<br />

company, the social networking site will be the interface<br />

between customer and company. For instance, Facebook<br />

users post messages on the wall of a telecommunication company concerning tariffs, technical malfunctions or bugs of its products, as well as positive and negative feedback. The collector should download data from the monitored social networking site every n seconds (e.g. 10 sec). Above all

it should be possible to choose the social networking site,<br />

especially the business pages, to be monitored. This can<br />

be configured by updating the collector’s settings. In<br />

order to retrieve data from Facebook, we use its graph<br />

API. Then customer messages will be stored in a database.<br />

After simplifying their structure (Facebook wall posts are represented as structured data that can easily be retrieved via the Facebook Graph API; we simplify this data format before using it for extraction and classification purposes), the requests have to be

categorized by the classification module. During the<br />

classification process, we assign both content and<br />

sentiment tags (cf. Sect. 3.2) as features to the user posts

before re-storing them in a database. According to the<br />

tags the messages are assigned to the corresponding<br />

business process. This n : 1 relationship is modeled in the<br />

contact center interface before passing these messages as<br />

e-mail requests to the customer interaction management<br />

tool used in contact centers. Finally, the pre-classified<br />

e-mails are automatically forwarded to the persons in<br />

charge of customer services. Those agents reply to the<br />

client requests and their responses will be delivered via<br />

e-mail to the contact center before being transformed into<br />

social network messages and sent back to the Facebook<br />

wall. Afterwards, the Facebook user can read his answer.<br />
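The collection step just described can be sketched roughly as follows in Python. The Graph API endpoint string, the field names, the polling interval and the store/classify hooks are illustrative assumptions rather than the actual configuration of our prototype.

import time
import requests  # assumed HTTP client library

GRAPH_URL = "https://graph.facebook.com/{page_id}/feed"  # illustrative endpoint/format
POLL_INTERVAL = 10  # the "every n seconds" of the collector

def collect(page_id, access_token, store, classify):
    # store() and classify() are hypothetical hooks standing in for the database
    # and for the classification module described below.
    while True:
        response = requests.get(GRAPH_URL.format(page_id=page_id),
                                params={"access_token": access_token})
        for post in response.json().get("data", []):
            message = {"id": post.get("id"), "text": post.get("message", "")}
            message["tags"] = classify(message["text"])  # content and sentiment tags
            store(message)                               # re-store the tagged message
        time.sleep(POLL_INTERVAL)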

3. Linguistic Processing of Customer

Contacts<br />

Within the customer requests, we try to discover<br />

relationships between clients and products, customers<br />

and technical problems, products and features that will be<br />

used for classification purposes. We are aware of the fact<br />

that many products (mobile phones, chargers, headsets,<br />

batteries, software, and operating systems) are sold in<br />

different countries under the same or under different<br />

names. Our system stores a unique international ID for<br />

each product. Product names and their paraphrases are<br />

language specific. Our prototype normalizes found<br />

product names to the international ID.<br />



3.1. International Product Name Paraphrasing<br />

Our first approach to product name paraphrasing was to<br />

use paraphrasing classes. Much as verbs are inflected<br />

according to their inflection class, product names were<br />

paraphrased according to their paraphrasing class. Yet,

paraphrasing classes had to be assigned manually and<br />

quite many classes were needed. Therefore, we decided<br />

to use a simplified system: Each product or manufacturer<br />

name is stored in a canonical form. Thus, a name of the

type glofiish g500 is stored in the form glofiish-g-500,<br />

even if glofiish g-500 or glofiish g500 should be more<br />

frequent. The minus characters tell our system where a<br />

new part of the product name begins. A product or<br />

manufacturer name has permutations: In German o2<br />

online tarif has the permutation tarif o2 online. Standard<br />

permutations are added automatically: A product or<br />

manufacturer name with three parts has the standard<br />

permutation 123. German tariff names of the type o2<br />

online tarif have the standard permutations 312 and 23<br />

von 1 as in online tarif von O2 (online tariff by o2).<br />

Apart from their canonical name and its variants, product<br />

names can also have spelling variants. Thus, android has<br />

the spelling variants androit, antroid, antroit, andorid,<br />

adroid, andoid and andoit. (These are some of the most<br />

frequent ways android is actually spelt in the customer<br />

messages.) For each spelling variant, our system<br />

automatically generates all paraphrases that exist<br />

according to the standard and the manually added<br />

permutations of the canonical name. I.e. the paraphrases<br />

of the mobile phone name e-ten glofiishg-500 include<br />

e-ten klofisch-g-500, e-ten klofisch-g 500, e-ten klofisch<br />

g-500, etc.<br />

Apart from spelling variants, product names can also<br />

have lexical variants. The mobile phone tct mobile<br />

one-touch-v-770-a has the lexical variant playboy-phone.<br />

The regular permutation transformations are not applied<br />

to lexical variants. But lexical variants and their<br />

manufacturer-based variants (e.g. tct playboy-phone and<br />

playboy-phone) are, of course, paraphrases, too.<br />
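The surface-variant part of this mechanism can be illustrated with the small sketch below: every minus character in the canonical form may be realised as a minus, a space or nothing. Word-order permutations (such as 312 or 23 von 1) and spelling variants would be applied on top of this; the code is only an illustration, not the prototype's actual paraphrasing engine.

from itertools import product

def surface_forms(canonical):
    """Each '-' boundary of the canonical form may surface as '-', ' ' or nothing."""
    parts = canonical.split("-")
    forms = set()
    for separators in product(["-", " ", ""], repeat=len(parts) - 1):
        name = parts[0]
        for sep, part in zip(separators, parts[1:]):
            name += sep + part
        forms.add(name)
    return forms

print(sorted(surface_forms("glofiish-g-500")))
# contains e.g. 'glofiish-g-500', 'glofiish g-500', 'glofiish g500' and 'glofiish g 500'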

3.2. Grammar-based classification<br />

Grammar experts can create any number of content and<br />

sentiment classifiers. A classifier’s grammar consists of a<br />

set of positive constraints and a set of negative<br />

constraints. To classify a message, our system simply<br />

applies the grammars of all its classifier objects to the<br />


message. If a content classifier’s grammar matches, its<br />

tag is added to the message’s content tags. Sentiment<br />

classification works analogously with the exception that<br />

exactly one tag is assigned.<br />

Content and sentiment classifiers are language and

URL specific: A classifier has exactly one language and a<br />

set of URLs. It will only be applied to messages that have<br />

the same language and that stem from one of the URLs in<br />

the classifier’s set of URLs. In general, content tags and<br />

product list are independent of each other. But many<br />

classifiers will have constraints that require that a product<br />

(or other entity) of a certain type be mentioned. Thus, a<br />

classifier that assigns the tag phone available? (e.g. to the<br />

message When will the new iPhone be released?) would<br />

probably include the mobile phone grammar in its<br />

constraints by using the special term \mobile_phone.<br />
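A much simplified sketch of this classification step is given below. Real classifier grammars are more expressive (they can, for instance, reference the product name grammar via \mobile_phone); here the positive and negative constraints are reduced to plain substring tests purely for illustration, and the example classifiers are invented.

class Classifier:
    def __init__(self, tag, positive, negative=()):
        self.tag = tag
        self.positive = positive   # at least one constraint must match
        self.negative = negative   # none of these may match

    def matches(self, message):
        text = message.lower()
        if any(term in text for term in self.negative):
            return False
        return any(term in text for term in self.positive)

def classify(message, classifiers):
    """Collect the tags of all content classifiers whose constraints match the message."""
    return [c.tag for c in classifiers if c.matches(message)]

classifiers = [
    Classifier("hotline problem", positive=["hotline"]),
    Classifier("phone available?", positive=["when will", "release"]),
]
print(classify("When will the new iPhone be released?", classifiers))
# ['phone available?']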

4. Discussion<br />

4.1. No statistical approach<br />

We think that the fact that contact center agents can invent new tags and assign new or old tags to (badly) classified messages, provided they mark the strings that justify the tag assignment, is a good reason for not using a statistical approach. If we used a

statistical approach, human work would be necessary at<br />

some point of the development process: Some algorithm<br />

would have to be trained. In our approach, the human<br />

work is done in the customer management process. This<br />

way, two things are achieved in one step: The customer’s<br />

request is answered and the classification algorithm is<br />

enhanced. The system is being enhanced while it is used.<br />

There is no need to interrupt the customer interaction in<br />

order to train it on new data that data specialists have<br />

created. Besides, manual intervention is much more<br />

straightforward and transparent, if a grammar of the type<br />

described above is used than it would be with a statistical<br />

algorithm. Our system is flexible in the sense that it can<br />

easily be modified in such a way that very specific<br />

requirements are met. If, e.g., a future user of our tool (a<br />

company that wants to interact with its customers) should<br />

want to assign every message that has the word hotline in<br />

it a certain tag (such as hotline problem), then this

requirement can be met by simply adding the line<br />

hotline to the positive constraints of the classifier<br />

called hotline problem.<br />




4.2. Applying the DRY principle<br />

Our prototype follows the DRY principle (Don't repeat<br />

yourself (Murrell, 2009:35)): Changes are only made in<br />

one place. An example: the Korean variants of the mobile<br />

phone name with the international ID google-nexus-s<br />

include google nexus s, google nexus-s, nexus s, nexus-s,<br />

구글 넥서스 에스, 구글 넥서스에스, 구글 넥서스 s, 구글 넥서스-s, 넥서스<br />

에스, 넥서스에스, 넥서스 s, 넥서스-s, 구글 nexus s, 구글 nexus-s,<br />

google 넥서스에스, 구글의 넥서스에스. This phenomenon is<br />

represented in our system as follows: The Korean<br />

producer name corresponding to the international ID<br />

google has the variants google and 구글. The Korean<br />

mobile phone name with the international ID nexus-s has<br />

the variants nexus-s, 넥서스에스 and 넥서스-s. This is the<br />

only information our users have to store in order to make<br />

the system generate these and many other variants. Our<br />

tool generates google nexus s, 구글 넥서스 s and similar<br />

variants using the general rule that in any permutation of a<br />

product name any minus character may be replaced by a<br />

space character. It generates 넥서스에스, 넥서스 s and similar<br />

variants using the general rule that the producer name may<br />

be omitted. And our system generates 구글의 넥서스에스<br />

using the two Korean variants of the producer name and<br />

the general rule that phone names can have the form [producer name]의 [product name]. (의 is a

genitive affix, i.e. 구글의 넥서스에스 literally means Google's<br />

Nexus S or Nexus S by Google.)<br />

We might, of course, add the general rule to our product<br />

name paraphrasing engine that any part of a Korean<br />

product name may be spelt either with Latin or with<br />

Hangul characters – according to several sets of<br />

transliteration conventions that are used in parallel.<br />

Any change in a producer, tariff or product name object,<br />

such as the Korean mobile phone name with the<br />

international ID nexus-s, has implications for the<br />

grammars of the message classifiers: Newly generated<br />

variants of the product name must be matched by all<br />

instances of \mobile_phone in all grammars. For<br />

efficiency reasons, we compile all product names, tariff<br />

names, producer names, message classification grammars,<br />

sentiment classification grammars, and so on, into one<br />

single function. This function is very efficient, because it<br />

doesn’t do much more than apply one very large,<br />

compiled regular expression. The compiling and<br />

reloading of this function is done in the background, so<br />


the users of our tool do not need to know anything about it.<br />

They don’t even have to understand the word compile.<br />

They just need to know that the system sometimes needs a<br />

few seconds to be able to use changed objects.<br />
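The compilation step can be pictured roughly as follows: all generated variants become alternatives of one large regular expression whose named groups map matches back to the international IDs. The variant lists below are tiny invented subsets and the code is a sketch of the idea, not our actual compiler.

import re

VARIANTS = {
    "google-nexus-s": ["google nexus s", "nexus-s", "nexus s"],  # illustrative subset
    "glofiish-g-500": ["glofiish g-500", "glofiish g500"],
}

def compile_matcher(variants):
    groups, ids = [], {}
    for i, (canonical, forms) in enumerate(variants.items()):
        alternatives = "|".join(re.escape(f) for f in sorted(forms, key=len, reverse=True))
        groups.append(f"(?P<id{i}>{alternatives})")
        ids[f"id{i}"] = canonical
    # one single compiled expression covering all names
    return re.compile("|".join(groups), re.IGNORECASE), ids

pattern, ids = compile_matcher(VARIANTS)
match = pattern.search("My Nexus S keeps rebooting")
print(ids[match.lastgroup])   # google-nexus-s, the international ID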

5. Conclusion and Outlook<br />

Within this paper, we described a new technical service<br />

dealing with the integration of social networking<br />

channels into customer interaction management tools.<br />

Mining social networks for classification purposes is no<br />

novelty; assigning customer messages to business processes instead of classifying them into topics, however, did not exist before. Above all, our system features

effective named entity recognition because of its name<br />

paraphrasing mechanism dealing with different types of<br />

misspellings in both intra- and interlingual names of<br />

tariffs, products, manufacturers and providers. Future<br />

research will expand upon this study, investigating other<br />

social networking sites and additional companies across a<br />

range of non-telecommunication products or services.<br />

6. Acknowledgements<br />

This work was supported by grant no. KF2713701ED0<br />

awarded by the German Federal Ministry of Economics<br />

and Technology.<br />

7. References<br />

Browne, R., Clements, E., Harris, R., Baxter, S. (2009):<br />

Business and consumer communication via online<br />

social networks: a preliminary investigation. In<br />

ANZMAC 2009.<br />

Gryc, W., Moilanen, K. (2010): Leveraging Textual<br />

Sentiment Analysis with Social Network Modeling:<br />

Sentiment Analysis of Political Blogs in the 2008 U.S.<br />

Presidential Election. In Proceedings of the From Text<br />

to Political Positions Workshop (T2PP 2010), Vrije<br />

Universiteit, Amsterdam, April 9–10 2010.<br />

Krauss, J., Nann, S., Simon, D., Fischbach, K., Gloor, P.A.<br />

(2008): Predicting Movie Success and Academy<br />

Awards Through Sentiment and Social Network<br />

Analysis. In ECIS 2008.<br />

Murrell, P. (2009): Introduction to Data Technologies.<br />

Auckland, New Zealand.<br />

Pal, J.K., Saha, A. (2010): Identifying Themes in Social<br />

Media and Detecting Sentiments. Technical Report<br />

HPL-2010-50, HP Laboratories.



ATLAS – A Robust Multilingual Platform for the Web<br />

Diman Karagiozov*, Svetla Koeva**, Maciej Ogrodniczuk***, Cristina Vertan****<br />

* Tetracom Interactive Solutions Ltd., ** Bulgarian Academy of Sciences,<br />

*** Polish Academy of Sciences, **** University of Hamburg,<br />

*Tetracom Ltd., Sofia, Bulgaria, **52 Shipchenski prohod, bl. 17, Sofia 1113, Bulgaria, ***ul. J.K. Ordona 21, 01-237 Warszawa, Poland, ****Von-Melle-Park 6, 20146 Hamburg, Germany

E-mail: diman@tetracom.com, svetla@dcl.bas.bg, maciej.ogrodniczuk@gmail.com,<br />

cristina.vertan@uni-hamburg.de<br />

Abstract<br />

This paper presents a novel multilingual framework integrating linguistic services around a Web-based content management system.<br />

The language tools provide semantic foundation for advanced CMS functions such as machine translation, automatic categorization<br />

or text summarization. The tools are integrated into processing chains on the basis of UIMA architecture and using uniform<br />

annotation model. The CMS is used to prepare two sample online services illustrating the advantages of applying language<br />

technology to content administration.<br />

Keywords: content management system, language processing chains, UIMA, language technology<br />

1. Introduction<br />

In recent years, the number of applications which are entirely Web-based or offer at least some Web front-end has grown dramatically. As a response to the

need of managing all this data, a new type of system<br />

appeared: the Web-content management system. In this<br />

article we will refer to these type of system as WCMS.<br />

Existent WCMS focus on storage of documents in<br />

databases and provide mostly full-text search<br />

functionality. These types of systems have limited<br />

applicability, due to two reasons:<br />

• data available online is often multilingual, and

• documents within a CMS are semantically related

(share some common knowledge, or belong to<br />

similar topics)<br />

In short, currently available CMS do not exploit modern

techniques from information technology like text mining,<br />

semantic Web or machine translation.<br />

The ICT PSP EU project ATLAS (Applied Technology for Language-Aided CMS) aims at filling this gap by providing three innovative Web services within a WCMS. (The work reported here was carried out within the Applied Technology for Language-Aided CMS project, co-funded by the European Commission under the Information and Communications Technologies (ICT) Policy Support Programme, Grant Agreement No 250467. The authors would like to thank all representatives of project partners for their contribution.)

These three Web services, i-Librarian, EUDocLib and i-Publisher, are not only thematically different but also offer different levels of intelligent information processing.

The ATLAS WCMS makes use of state-of-the-art text

technology methods in order to extract information and<br />

cluster documents according to a given hierarchy. A text<br />

summarization module and a machine translation engine<br />

are embedded as well as a cross-lingual semantic search<br />

engine (Belogay et al., <strong>2011</strong>).<br />

The cross-lingual search engine implements Semantic<br />

Web technology: the document content is represented as<br />

RDF triples and the search index is built up from these<br />

triples.<br />

The RDF representation of documents collects not only<br />

metadata information about the whole file but also<br />

exploits linguistic analysis of the document and also stores the mapping of the file onto ontological concepts.
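The idea can be pictured with a toy triple store; the vocabulary, identifiers and example values below are invented for illustration and are not the project's actual RDF schema.

# Document metadata, linguistic analysis results and ontology mappings represented
# as subject-predicate-object triples from which a semantic search index can be built.
TRIPLES = [
    ("doc:42", "dc:title",       "Coastal erosion in the Mediterranean"),
    ("doc:42", "dc:language",    "en"),
    ("doc:42", "atlas:mentions", "concept:Erosion"),      # from linguistic analysis
    ("doc:42", "atlas:mappedTo", "onto:CoastalProcess"),  # mapping onto an ontological concept
]

def search(predicate, obj):
    """Return the subjects of all triples matching (?, predicate, obj)."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]

print(search("atlas:mentions", "concept:Erosion"))   # ['doc:42']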

This paper presents the architecture of the ATLAS system<br />

with particular focus on the language processing<br />

components to be embedded, aiming to show how robust

NLP (natural language processing) tools can be wrapped<br />

in a common framework.<br />


2. Language resources in<br />

the ATLAS System<br />

The linguistic diversity in the project is a challenge not to<br />

be neglected: the languages belong to four language<br />

families and involve three alphabets. To our knowledge it<br />

is the first WCMS which will offer solutions for<br />

documents written in languages from Central and<br />

South-Eastern Europe.<br />

Whilst the standardised development of tools for<br />

widespread languages such as English and German is more

common, the situation is quite different when involving<br />

languages from Central and South Eastern Europe (see<br />

http://www.c-phil.uni-hamburg.de/view/Main/LrecWork<br />

shop2010).<br />

Tools with different processing depth, different output<br />

formats and sometimes very particular approaches are the current state of the art in the language technology map of

the above-mentioned area (Degórski, Marcińczuk &<br />

Przepiórkowski, 2008). One of the innovative issues in<br />

project ATLAS is the integration of linguistically and<br />

technologically heterogeneous language tools within a<br />

common framework.<br />

The following description presents the steps taken in<br />

order to provide such a common representation.

• Starting from the fixed desiderata to include text summarisation, automatic document classification, machine translation and cross-lingual information retrieval, the minimal list of tools required by such engines and which can be provided for all languages involved in the project has been collated and includes:

o tokeniser,<br />

o sentence boundary detector,<br />

o paragraph boundary detector,<br />

o lemmatizer,<br />

o PoS Tagger,<br />

o NP (noun phrase) chunker,<br />

o NE (named entity) extractor.<br />

Some of these tools are not completely available for<br />

particular languages (e.g. NP chunker for Croatian) but<br />

can be developed within the project. Regarding the NE<br />

extractor the following entities have been agreed upon:<br />

persons, dates, time, location and currency.<br />

• The annotation levels in the texts and the minimal features to be annotated have been defined: Paragraph, Sentence, Token, NP and NE. In order to provide a common representation, all linguistic information regarding lemma, PoS etc. has been agreed to be provided at the token level. For a token, the following features are retained:

o begin – an integer representing the offset of the<br />

first character of the token,<br />

o end – an integer representing the offset of the<br />

last character of the token,<br />

o pos – a string representing the morphosyntactic<br />

tag (PoS, gender, number) associated with the<br />

token,<br />

o lemma – a string containing the lemma of the<br />

token.<br />

• For each of the above-mentioned tools, the list of additional linguistic features to be represented (if necessary and available) has been defined, e.g.

antecedentBegin and antecedentEnd representing<br />

the offset of the first and respectively the last<br />

character of the referent in an NP. This feature is<br />

necessary for processing German NPs and is<br />

therefore included as optional in the NP annotation<br />

frame.<br />

A glossary of tagsets delivered by each tool is also<br />

maintained, ensuring cross-lingual processing.<br />

Each of the language tools can be included as a primitive engine, i.e. as part of a UIMA aggregate engine, but also as

an aggregate engine. In this way any language<br />

component can reuse results produced by a particular tool<br />

and exploit its full functionality if required.<br />
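A sketch of this uniform annotation model in plain Python is given below; the feature names follow the list above, while the example tag value and the use of dataclasses (instead of an actual UIMA type system descriptor) are illustrative assumptions.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Token:
    begin: int    # offset of the first character of the token
    end: int      # offset of the last character of the token
    pos: str      # morphosyntactic tag (PoS, gender, number, ...)
    lemma: str    # lemma of the token

@dataclass
class NP:
    begin: int
    end: int
    antecedentBegin: Optional[int] = None   # optional feature, needed e.g. for German NPs
    antecedentEnd: Optional[int] = None

@dataclass
class Document:
    text: str
    tokens: List[Token] = field(default_factory=list)
    nps: List[NP] = field(default_factory=list)

doc = Document(text="Die Katze schläft.")
doc.tokens.append(Token(begin=0, end=2, pos="ART", lemma="die"))   # tag value is illustrative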

3. Language Processing chains<br />

One of the goals of the ATLAS WCMS is to offer<br />

documented language processing chains (LPCs) for text<br />

annotation. A processing chain for a given language<br />

includes a number of existing tools, adjusted and/or<br />

fine-tuned to ensure their interoperability. In most<br />

respects a language processing chain does not require<br />

the development of new software modules but rather the combination of existing tools.

Most of the basic linguistic tools (sentence splitters,<br />

stopword filters, tokenizers, lemmatizers, part-of-speech<br />

taggers) for the languages within the scope of our interest already exist as standalone offline applications.

The multilinguality of the system services requires a high level of accuracy of each monolingual language chain – a simple example is that a word with part-of-speech tag


ambiguity in one language may correspond to an<br />

unambiguous word in the other language.<br />

The complexity grows at the level of structure, and sense ambiguity differs among languages. Thus the high precision and performance of language-specific chains determines to a great extent the quality of the system as a whole.

For example the Bulgarian PoS tagger has been<br />

developed as a modified version of the Brill tagger<br />

applying a rule-based approach and techniques for the<br />

optimization leading to the 98.3% precision (Koeva,<br />

2007). The large Bulgarian grammar dictionary used for<br />

the lemmatization is implemented as acyclic and<br />

deterministic finite-state automata to ensure a very fast<br />

dictionary look-up.<br />

The language processing chains have been fine-tuned and<br />

adjusted to facilitate integration into a common UIMA<br />

framework. Other tools (such as noun phrase extractors<br />

or named entity recognizers) had to be implemented or<br />

multilingually ported.<br />

The annotation produced by the chain along with<br />

additional tools (e.g. frequency counters) results in<br />

higher-level functions such as detection of keywords and<br />

phrases along with improbable phrases from the analyzed<br />

content, while more sophisticated user functionality relies on complex linguistic functions such as multilingual text summarisation and machine translation.

UIMA is a pluggable component architecture and<br />

software framework designed especially for the analysis<br />

of unstructured content and its transformation into<br />

structured information. Apart from offering common<br />

components (e.g. the type system for document and text<br />

annotations) it builds on the concept of analysis engines<br />

(in our case, language-specific components), taking the form of primitive engines, which can wrap up NLP (natural language processing) tools adding annotations, and aggregate engines, which define the sequence of execution of chained primitives.

Making the tools chainable requires ensuring their<br />

interoperability on various levels. Firstly, compatibility<br />

of formats of linguistic information is maintained within<br />

the defined scope of required annotation (Ogrodniczuk &<br />

Karagiozov, <strong>2011</strong>).<br />

The UIMA type system requires development of a<br />

uniform representation model which helps to normalize<br />

heterogeneous annotations of the component NLP tools.<br />


With ATLAS it covers properties vital for further<br />

processing of the annotated data, e.g. lemma, values for<br />

attributes such as gender, number and case for tokens<br />

necessary to run coreference module to be subsequently<br />

used for text summarisation, categorization and machine<br />

translation.<br />

To facilitate the introduction of further levels of annotation, a

general markable type has been introduced, carrying<br />

subtype and reference to another markable object. This<br />

way new annotation concepts can be tested and later<br />

included into the core model.<br />

4. Integration of language processing chains<br />

in ATLAS<br />

The language chains are used in order to extract relevant<br />

information such as named entities and keywords from<br />

the documents stored within the ATLAS WCMS.<br />

Additionally, they provide the baseline for further engines: text summarization, clustering and machine translation (Koehn et al., 2007), and as such they are the foundation

of the enhanced ATLAS platform.<br />

The core online service of the ATLAS platform is<br />

i-Publisher, a powerful Web-based instrument for<br />

creating, running and managing content-driven Web sites.<br />

It integrates the language-based technology to improve<br />

content navigation e.g. by interlinking documents based<br />

on extracted phrases, words and names, providing short<br />

summaries and suggested categorization concepts.<br />

Currently two different thematic content-driven Web<br />

sites, i-Librarian and EUDocLib, are being built on top of<br />

ATLAS platform, using i-Publisher as content<br />

management layer. i-Librarian is intended to be a<br />

user-oriented web site which allows visitors to maintain a<br />

personal workspace for storing, sharing and publishing<br />

various types of documents and have them automatically<br />

categorized into appropriate subject categories,<br />

summarized and annotated with important words,<br />

phrases and names.<br />

EUDocLib is planned as a publicly accessible repository<br />

of EU legal documents from the EUR-LEX collection<br />

with enhanced navigation and multilingual access.<br />

An important aspect of the ATLAS system is that all three services operate in a multilingual setting. Similar functionality will be implemented within the project for Bulgarian, Croatian, English, German, Greek, Polish and Romanian. The architecture of the system is




modular and allows a new language extension at any time. It is an asynchronous architecture based on queue processing of requests (see Figure 1).


Figure 1: Linguistic processing support<br />

in ATLAS System<br />

5. Conclusions<br />

In this paper we present an architecture which opens the<br />

door to standardized multilingual online processing of<br />

language and it offers localized demonstration tools built<br />

on top of the linguistic modules.<br />

The framework is ready for integration of new types of<br />

tools and new languages to provide wider online coverage of the needed linguistic services in a

standardized manner. New versions of the online services<br />

are planned to be launched in the beginning of 2012.<br />

6. References<br />

Belogay, A., Ćavar, D., Cristal, D., Karagiozov, D.,<br />

Koeva, S., Nikolov, R., Ogrodniczuk, M.,<br />

Przepiórkowski, A., Raxis P., Vertan C. (to appear):<br />

i-Publisher, i-Librarian and EUDocLib – linguistic<br />

services for the Web. In: Proceedings of the 8th<br />

Practical Applications in Language and Computers<br />

(PALC <strong>2011</strong>) conference. University of Łódź, Poland,<br />

13-15 April <strong>2011</strong><br />

Degórski, Ł., Marcińczuk, M., Przepiórkowski, A.<br />

(2008): Definition extraction using a sequential

combination of baseline grammars and machine<br />

learning classifiers. In: Proceedings of the 6th

International Conference on Language Resources and<br />

Evaluation, LREC 2008. ELRA, Marrakech,<br />

http://nlp.ipipan.waw.pl/~adamp/Papers/2008-lreclt4el/213_paper.pdf<br />

Koehn, P., Hoang H., Birch A., Callison-Burch, C.,<br />

Federico M., Bertoldi, N., Cowan, B., Shen, W.,<br />

Moran, C., Zens, R., Dyer, C., Bojar O., Constantin,<br />

A., Herbst, E. (2007): Moses: Open Source Toolkit for<br />

Statistical Machine Translation. In: ACL (ed.) Annual<br />

Meeting of the Association for Computational<br />

Linguistics, (ACL), demonstration session. Prague,<br />

http://acl.ldc.upenn.edu/P/P07/P07-2045.pdf

Koeva, S. (2007): Multi-word Term Extraction for<br />

Bulgarian. In: Piskorski, J., Pouliquen, B., Steinberger,<br />

R., Tanev, H. (eds.) Proceedings of the Workshop on<br />

Balto-Slavonic Natural Language Processing, pp.<br />

59–66. Association for Computational Linguistics,<br />

Prague, Czech Republic, June 2007.<br />

http://www.aclweb.org/anthology/W/W07/W07-1708<br />

Ogrodniczuk, M., Karagiozov, D. (to appear): ATLAS –<br />

The Multilingual Language Processing Platform. In:<br />

Proceedings of the 27th Conference of the Spanish<br />

Society for Natural Language Processing. University<br />

of Huelva, Spain, 5-7 September <strong>2011</strong>



Multilingual Corpora at the Hamburg Centre for Language Corpora<br />

Hanna Hedeland, Timm Lehmberg, Thomas Schmidt, Kai Wörner<br />

Hamburger Zentrum für Sprachkorpora (HZSK)

Max Brauer-Allee 60<br />

D-22765 Hamburg<br />

E-mail: hanna.hedeland@uni-hamburg.de, timm.lehmberg@uni-hamburg.de, thomas.schmidt@uni-hamburg.de,<br />

kai.wörner@uni-hamburg.de<br />

Abstract<br />

We give an overview of the content and the technical background of a number of corpora which were developed in various projects of<br />

the Research Centre on Multilingualism (SFB 538) between 1999 and 2011 and which are now made available to the scientific

community via the Hamburg Centre for Language Corpora.<br />

Keywords: corpora, spoken language, multilingualism, digital infrastructures<br />

1. Introduction<br />

In this paper, we give an overview of the content and the<br />

technical background of a number of corpora which were<br />

developed in various projects of the Research Centre on<br />

Multilingualism (SFB 538) between 1999 and 2011 and

which are now made available to the scientific<br />

community via the Hamburg Centre for Language<br />

Corpora.<br />

Between 1999 and 2011, the Research Centre on

Multilingualism (SFB 538) brought together researchers<br />

investigating various aspects of multilingualism<br />

focussing either on the language development of<br />

multilingual individuals, on communication in<br />

multilingual societies, or on diachronic change of<br />

languages in multilingual settings. Without exception, the<br />

projects of the Centre worked empirically, basing their<br />

analyses on corpora of spoken or written language. Over<br />

the years, an extensive and diverse data collection was<br />

thus built up consisting of language acquisition and<br />

attrition corpora, interpreting corpora, parallel<br />

(translation) corpora, corpora with a sociolinguistic<br />

design and historical corpora.<br />

Since corpus creation, management and analysis were<br />

thus crucial to the work of the Research Centre, a project<br />

was set up in June 2000 with the aim of designing and<br />

implementing methods for the computer-assisted<br />

processing of multilingual language data. One major<br />

result of that project is EXMARaLDA, a system for<br />

setting up and analysing spoken language corpora<br />

(Schmidt & Wörner, 2009, Schmidt et al., this volume).<br />

The focus of this paper will be on the spoken language<br />

corpora of the Research Centre which were either created<br />

or curated with the help of EXMARaLDA.<br />

2. Overview of corpora<br />

As the list of resources in the appendix shows, altogether<br />

31 resources constructed at the SFB 538 were transferred<br />

to the inventory of the Hamburg Centre for Language<br />

Corpora. 27 of these are spoken language corpora, 3 are<br />

corpora of modern written language, and one is a corpus<br />

of historical written language. More specifically, we are<br />

dealing with the following resource types:<br />

• Language acquisition corpora which document the

acquisition of two first languages or a second<br />

language. Most of these corpora are longitudinal<br />

studies of child language in different bilingual<br />

language combinations (German-French, German-<br />

Portuguese, German-Spanish, German-Turkish), but<br />

other corpus designs (e.g. cross-sectional studies)<br />

and other speaker types (e.g. adult learners or<br />

monolingual children) are also present.<br />

� Language attrition corpora which document the<br />

development of a “weaker” language in adult<br />

bilinguals. Three different language combinations<br />

227


Multilingual Resources and Multilingual Applications - Posters<br />

228<br />

(German-Polish, German-Italian, German-French)<br />

are involved.<br />

� Interpreting corpora which document consecutive<br />

and simultaneous interpreting involving trained and<br />

ad-hoc interpreters for different language<br />

combinations (German-Portuguese, German-<br />

�<br />

Turkish, German-Russian, German-Polish, German-<br />

Romanian) and in different settings (doctor-patient<br />

communication and expert discussion).<br />

Corpora with a sociolinguistic corpus design whose<br />

data are stratified according to biographic<br />

�<br />

characteristics (e.g. age) of the speakers and/or their<br />

regional provenance. This comprises a corpus<br />

documenting Faroese-Danish bilingualism on the<br />

Faroese Islands and a corpus documenting the use of<br />

Catalan in different districts of Barcelona.<br />

Parallel and comparable corpora in which originals<br />

and translations of texts are aligned or which consist<br />

of original texts from specific genres in different<br />

languages.<br />

The entirety of spoken language resources amounts to<br />

approximately 5500 transcriptions with approximately<br />

5.5 million transcribed words (not counting secondary<br />

annotations).<br />

3. Data model<br />

The spoken language corpora, while sharing the common<br />

theme of multilingualism, are still highly heterogeneous<br />

with respect to many parameters. As far as their content is<br />

concerned, they do not only cover a spectrum of fourteen<br />

different languages, but also greatly differ with respect to<br />

the recorded discourse types (e.g. interviews, free<br />

conversation, expert discussion, classroom discourse,<br />

semi-controlled settings, and institutional discourse).<br />

Even more variation is to be found with respect to the<br />

research interests pursued with the help of the corpora<br />

and, consequently, the methodology used to record,<br />

transcribe and annotate the data. To begin with, either<br />

only audio or both video and audio data are recorded,<br />

depending on whether or not non-verbal behavior plays a<br />

role for analysis (as is the case, for example, for data of<br />

young children). As some projects focused their research<br />

on syntactic aspects of language, while others were interested in phonological properties or discourse structures, different systems were applied in

transcribing (e.g. orthographic vs. phonetic transcription<br />

or complete vs. selective transcription) and annotating<br />

(e.g. prosodic annotations, annotation of code switches)<br />

the data.<br />

The challenge in representing the corpora on a common<br />

technical basis was thus to find a degree of abstraction<br />

which, on the one hand, allows operations common to all<br />

resources (such as time alignment of transcription and<br />

media) to be carried out efficiently on a unified structure,<br />

but, on the other hand, also makes it possible to apply<br />

theory or resource specific functions (such as<br />

segmentation according to a specific model) to the data.<br />

A data model based on annotation graphs (Bird &<br />

Liberman, 2001), but supplemented with additional<br />

semantic specifications and structural constraints, turned<br />

out to be suitable for this task (Schmidt, 2005).<br />
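To make this more concrete, the following is a minimal, purely illustrative sketch of an annotation-graph-like structure in Python. The class and field names are assumptions made for illustration only, not the actual EXMARaLDA or SFB 538 data model: labelled events anchored to a shared timeline, grouped into speaker and annotation tiers, with one time-based alignment operation that works the same way for all resources.

from dataclasses import dataclass, field

@dataclass
class Event:
    start: float        # timeline anchor (seconds into the recording)
    end: float
    label: str          # transcribed text or annotation value

@dataclass
class Tier:
    speaker: str
    category: str       # e.g. "transcription", "translation", "code-switching"
    events: list = field(default_factory=list)

@dataclass
class Transcription:
    tiers: list = field(default_factory=list)

    def events_at(self, t: float):
        """All events (across tiers) that overlap time point t, i.e. the kind of
        time-alignment operation that is common to all resources regardless of
        the transcription or annotation system used."""
        return [(tier.speaker, tier.category, ev)
                for tier in self.tiers
                for ev in tier.events
                if ev.start <= t < ev.end]

# Toy usage: one speaker tier with a transcription and a code-switch annotation
trans = Transcription(tiers=[
    Tier("SPK0", "transcription", [Event(0.0, 1.2, "ja okay"), Event(1.2, 2.0, "tamam")]),
    Tier("SPK0", "code-switching", [Event(1.2, 2.0, "TUR")]),
])
print(trans.events_at(1.5))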

4. Data curation<br />

The construction of a non-negligible part of the resources<br />

had been completed or started before EXMARaLDA was<br />

available as a working system. A number of legacy<br />

software tools (syncWriter, HIAT-DOS, LAPSUS,<br />

WordBase) were used for the construction of these corpora, resulting in data for which there was hardly a chance of

sustainable maintenance. The resources therefore had to<br />

be converted to EXMARaLDA in a laborious process<br />

described in detail in Schmidt & Bennöhr (2008).

From about 2003 onwards, all projects used<br />

EXMARaLDA or other compatible tools (e.g. Praat) for<br />

corpus construction. Although these resources were<br />

much easier to process once they had been completed,<br />

there was still a considerable amount of data curation to<br />

be done before they could be published. This involved<br />

various completeness and consistency checks on the<br />

transcription and annotation data and the construction of<br />

valid metadata descriptions for all parts of a resource.<br />

5. Data dissemination<br />

Completed resources are made available to interested<br />

users via the WWW 1<br />

through several methods:<br />

• A hypermedia representation of transcriptions, annotations, recordings and metadata allows users to browse corpora online (see Figure 1: a hypermedia representation of a transcription from the Hamburg Map Task Corpus, HAMATAC).

• Resources can be downloaded in the EXMARaLDA format and then edited and queried with the system’s tools (Partitur-Editor for editing transcriptions, Coma for editing and querying metadata, EXAKT for querying transcription and annotation data); a minimal sketch of such a query follows below.

• Queries via EXAKT can also be carried out on remote data, i.e. without downloading the resource first, or through a web interface, i.e. without the need to install local software first.

• A number of export formats are offered for each annotation file, making it possible to edit or query the data also with non-EXMARaLDA tools. Most importantly, most data are also available in the CHAT format of the CHILDES system, as ELAN annotation files, as Praat TextGrids and as TEI files.

1 http://www.corpora.uni-hamburg.de
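To illustrate the kind of transcription query such tools support, here is a minimal keyword-in-context sketch in plain Python. It is not EXAKT itself, and the sample transcription string is invented; real queries run over the XML transcriptions and their annotations, but the principle is the same.

import re

def kwic(texts, keyword, context=5):
    """Very small keyword-in-context search over plain transcription text."""
    hits = []
    for name, text in texts.items():
        tokens = re.findall(r"\w+|\S", text)
        for i, tok in enumerate(tokens):
            if tok.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - context):i])
                right = " ".join(tokens[i + 1:i + 1 + context])
                hits.append((name, left, tok, right))
    return hits

# Toy transcription excerpt (invented, not taken from the actual corpus):
sample = {"MapTask_01": "ähm genau du gehst dann nach links , genau , bis zur Kirche ."}
for name, left, kw, right in kwic(sample, "genau"):
    print(f"{name}: {left} [{kw}] {right}")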

Access to all corpora is password protected. The process<br />

for obtaining a password varies from resource to resource,<br />

but always requires the data owner’s consent. Due to<br />

privacy protection issues, a part of the spoken resources<br />

can only be made accessible in the form of transcriptions,<br />

not audio or video recordings.<br />

6. Future plans<br />

In order to cater for the long term archiving and<br />

availability of the data beyond the finite funding period<br />

of the Research Centre, in January 2011 the

Hamburg Centre for Language Corpora (HZSK,<br />

http://www.corpora.uni-hamburg.de) was set up. This<br />

institution is intended to provide a permanent basis not<br />


only for the corpora and tools referred to in this paper, but<br />

also for further resources existing or under construction<br />

at the University of Hamburg. The HZSK is part of the<br />

CLARIN-D network and will, in the years to come,<br />

integrate its resources into this infrastructure by<br />

providing protocols for metadata harvesting, assigning<br />

PIDs to resources, allowing for single-sign-on<br />

mechanisms and implementing interfaces as defined by<br />

CLARIN for access to metadata and annotations.<br />

7. References<br />

Bird, S., Liberman, M. (2001): A formal framework for<br />

linguistic annotation. In: Speech Communication (33),<br />

pp. 23-60.<br />

Schmidt, T. (2005): Computergestützte Transkription -<br />

Modellierung und Visualisierung gesprochener<br />

Sprache mit texttechnologischen Mitteln. Frankfurt a.<br />

M.: Peter Lang.<br />

Schmidt, T., Bennöhr, J. (2008): Rescuing Legacy Data.<br />

In: Language Documentation and Conservation (2),<br />

pp. 109-129.<br />

Schmidt, T., Wörner, K. (2009): EXMARaLDA –<br />

Creating, analysing and sharing spoken language<br />

corpora for pragmatic research. In: Pragmatics 19(4),<br />

pp. 565-582.<br />


Appendix: List of resources<br />

Spoken resources<br />

Corpus name | Project / Data Owner | Type | Short description | Language(s) | Size

HABLA (Hamburg Adult Bilingual LAnguage) | E11 / Tanja Kupisch | spoken/audio/exmaralda | Audio recordings of semi-spontaneous interviews (elicited grammaticality judgments and production data are collected from the same speakers) | deu, fra, ita | 169 communications, 127 speakers, 737797 transcribed words, 169 transcriptions

DUFDE (Deutscher und Französischer doppelter Erstspracherwerb) | E2 / Jürgen Meisel | spoken/video/exmaralda | Video recordings (longitudinal study) of seven French-German bilingual children aged between 1 year;6 months and 6 years;11 months (+ some later recordings) | deu, fra | 562 communications, 14 speakers, ca. 1000000 transcribed words, 849 transcriptions

BIPODE (Bilingualer Portugiesisch-Deutscher Erstspracherwerb) | E2 / Jürgen Meisel | spoken/video/exmaralda | Video recordings (longitudinal study) of three Portuguese-German bilingual children aged between 1 year;6 months and 5 years;6 months | deu, por | 250 communications, 48 speakers, ca. 250000 transcribed words, 227 transcriptions

CHILD-L2 | E2 / Jürgen Meisel | spoken/video/exmaralda | Video recordings of children who start acquiring French or German as a second language at the age of three or four years | deu, fra | 181 communications, 69 speakers, 376114 transcribed words, 181 transcriptions

ZISA (Zweitspracherwerb Italienischer und Spanischer Arbeiter) | E2 / Jürgen Meisel | spoken/audio/exmaralda | Recordings of adult L2-German learners | deu | 101 communications, 5 speakers, 119667 transcribed words, 100 transcriptions

BUSDE (Baskischer und Spanischer doppelter Erstspracherwerb) | E2 / Jürgen Meisel | spoken/video/other | Longitudinal language acquisition study on bilingual Basque-Spanish children | eus, spa | unknown

PAIDUS (Parameterfixierung im Deutschen und Spanischen) | E3 / Conxita Lleó | spoken/audio/exmaralda | Audio recordings of monolingual children | deu, spa | 253 communications, 66 speakers, 166976 transcribed words, 253 transcriptions

PHONBLA Longitudinalstudie Hamburg | E3 / Conxita Lleó | spoken/audio+video/exmaralda | Longitudinal data of Spanish/German bilingual children | deu, spa | 413 communications, 61 speakers, 303792 transcribed words, 413 transcriptions

PHONBLA Querschnittsstudie Madrid | E3 / Conxita Lleó | spoken/audio+video/exmaralda | Cross-sectional study of bilingual German-Spanish L1 acquisition | deu, spa | 113 communications, 34 speakers, 56722 transcribed words, 113 transcriptions

PEDSES (Phonologie-Erwerb Deutsch-Spanisch als Erste Sprachen) | E3 / Conxita Lleó | spoken/audio/exmaralda | Longitudinal data of Spanish/German bilingual children | deu, spa | 127 communications, 21 speakers, 101292 transcribed words, 127 transcriptions

PHON-CL2 | E3 / Conxita Lleó | spoken/audio/exmaralda | Recordings of German subjects/children who have learned (or are learning) Spanish after the age of two | deu, spa | 26 communications, 22 speakers, 17412 transcribed words, 26 transcriptions

PHONMAS | E3 / Conxita Lleó | spoken/audio/exmaralda | Recordings of monolingual Spanish children (as comparable data for Madrid-PhonBLA) | spa | 49 communications, 4 speakers, 3067 transcribed words, 49 transcriptions

TÜ_DE-cL2-Korpus | E4 / Monika Rothweiler | spoken/video/exmaralda | Video recordings (spontaneous and elicited language) of eight bilingual children with Turkish as their first language | deu | 112 communications, 19 speakers, 348292 transcribed words, 112 transcriptions

TÜ_DE-L1-Korpus | E4 / Monika Rothweiler | spoken/audio/exmaralda | Video recordings (spontaneous and elicited language) of twelve bilingual children with Turkish as their first language | tur | 12 communications, 22 speakers, 13 transcriptions


Rehbein-ENDFAS/Rehbein-SKOBI-Korpus | E5 / Jochen Rehbein | spoken/audio/exmaralda | Audio recordings of evocative field experiments with Turkish and German monolingual and Turkish/German bilingual children | deu, tur | 1017 communications, 523 speakers, 289012 transcribed words, 836 transcriptions

ENDFAS/SKOBI Gold Standard | E5 / Jochen Rehbein | spoken/audio/exmaralda | Audio recordings of Turkish and German monolingual and Turkish/German bilingual children; demo excerpt from the larger Rehbein-ENDFAS/Rehbein-SKOBI-Korpus | deu, tur | 3 communications, 8 speakers, 4862 transcribed words, 3 transcriptions

Catalan in a bilingual context | H6 / Conxita Lleó | spoken/audio/exmaralda | Prompted, read and spontaneous speech data of Catalan speakers from Barcelona, stratified according to district and age of speakers | cat | 225 communications, 234 speakers, 187967 transcribed words, 875 transcriptions

Hamburg Corpus of Polish in Germany | H8 / Bernhard Brehmer | spoken/audio/exmaralda | Audio recordings of bilingual (Polish and German) and monolingual (Polish) adults (16-46 years); recordings of semi-spontaneous data (3 topics) and renarration of a picture story (from 'Vater und Sohn') | pol | 354 communications, 94 speakers, ca. 350000 transcribed words, 358 transcriptions

Hamburg Corpus of Argentinean Spanish (HaCASpa) | H9 / Christoph Gabriel | spoken/audio/exmaralda | Recordings of spontaneous speech and laboratory data of speakers of Porteño Spanish in Argentina (read speech, story retelling, read question-answer pairs, intonation questionnaires, free interviews); 7 experiments altogether | spa | 259 communications, 63 speakers, 141321 transcribed words, 261 transcriptions

Dolmetschen im Krankenhaus | K2 / Kristin Bührig, Bernd Meyer | spoken/audio/exmaralda | Monolingual and interpreted doctor-patient communication in hospitals | deu, por, tur | 91 communications, 189 speakers, 165689 transcribed words, 92 transcriptions

SkandSemiko (Skandinavische Semikommunikation) | K5 / Kurt Braunmüller | spoken/audio/exmaralda | Radio recordings, recordings of group discussions and classroom discourse with speakers of two or more Scandinavian languages (Swedish, Danish, Norwegian) interacting | dan, nor, swe | 162 communications, 515 speakers, 269945 transcribed words, 74 transcriptions

CoSi (Consecutive and Simultaneous Interpreting) | K6 / Bernd Meyer | spoken/audio+video/exmaralda | Recordings of simultaneously and consecutively interpreted lectures | deu, por | 3 communications, 8 speakers, 35432 transcribed words, 5 transcriptions

FADAC Hamburg (Faroese Danish Corpus Hamburg) | K8 / Kurt Braunmüller | spoken/audio/exmaralda | Recordings of semi-structured interviews in Faroese and Danish with bilingual speakers living on the Faroe Islands | dan, fao | 92 communications, 82 speakers, 440194 transcribed words, 92 transcriptions

ALCEBLA | T4 / Conxita Lleó | spoken/audio/exmaralda | Recordings of Spanish-German bilingual children living in Germany and attending the Spanish complementary school at the first level | deu, spa | 66 communications, 23 speakers, 36717 transcribed words, 66 transcriptions

Simuliertes Dolmetschen im Krankenhaus | T5 / Kristin Bührig, Bernd Meyer | spoken/audio+video/exmaralda | Simulations of interpreted doctor-patient communication | deu, pol, ron, rus | 4 communications, 12 speakers, 4018 transcribed words, 4 transcriptions

EXMARaLDA Demo Corpus | Z2 / Hamburger Zentrum für Sprachkorpora | spoken/audio+video/exmaralda | A selection of short audio and video recordings in different languages for demonstration of the EXMARaLDA system | deu, eng, fra, ita, nor, pol, spa, swe, tur, vie | 19 communications, 50 speakers, 11659 transcribed words, 19 transcriptions

Hamburg Map Task Corpus | Z2 / Hamburger Zentrum für Sprachkorpora | spoken/audio/exmaralda | Audio recordings of map tasks with advanced learners of German | deu | 24 communications, 26 speakers, 24409 transcribed words, 24 transcriptions

Written resources<br />

Corpus name | Project / Data Owner | Type | Short description | Language(s) | Size

HaCOSSA (Hamburg Corpus of Old Swedish with Syntactic Annotations) | H3 / Kurt Braunmüller | written/tei | Bible translations, religious and secular prose, law texts, non-fiction literature (geographical, theological, historic, natural science), diploma | dan, deu, isl, lat, nob, swe | 35 texts

Covert translation: popular science | K4 / Juliane House | written/tei | Translation corpora of original texts with translations and comparable texts from the genre popular scientific prose | deu, eng | 114 texts, 500446 words

Covert Translation: business communication (old) | K4 / Juliane House | written/tei | Translation corpora of original texts with translations and comparable texts from the genre external business communication | deu, eng | 119 texts, 169154 words

Covert Translation: business communication (new) | K4 / Juliane House | written/tei | Translation corpora of original texts with translations and comparable texts from the genre external business communication | deu, eng | 198 texts



The English Passive and the German Learner –<br />

Compiling an Annotated Learner Corpus<br />

to Investigate the Importance of Educational Settings<br />

Verena Möller, Ulrich Heid<br />

Universität Hildesheim

Institut für Informationswissenschaft und Sprachtechnologie

- Sprachtechnologie / Computerlinguistik -<br />

Marienburger Platz 22<br />

31141 Hildesheim<br />

E-mail: verena.moeller@uni-hildesheim.de, ulrich.heid@uni-hildesheim.de<br />

Abstract<br />

In the south of Germany, a number of changes have recently been effected with respect to the possible environments in which pupils<br />

in primary and secondary schools learn/acquire English. The current co-existence of various educational settings allows for<br />

investigation of the effects that each of these settings has on the structure of learners' interlanguage. As different text types are used as<br />

input in the various educational environments which have been created in secondary schools, the English passive has been chosen as<br />

a diagnostic criterion for the analysis of the learners' production of written text. The present article describes the compilation of a<br />

corpus of teaching materials and a learner corpus. It outlines the procedures involved in annotating metadata, esp. those obtained<br />

from questionnaires and psychological tests. Tools for linguistic annotation (POS-taggers and a parser) are compared with respect to<br />

their effectiveness in dealing with data from students after 6-10 years of instruction and/or immersion.<br />

Keywords: second language acquisition, learner corpus, metadata, POS-tagging, parsing<br />

1. Co-Existence of Educational Settings<br />

In recent years, a number of changes in the educational<br />

system in Baden-Württemberg (Germany) have been<br />

effected, some of them directly related to language<br />

learning and acquisition. In addition to English as a<br />

Foreign Language (EFL) lessons in secondary schools,<br />

more and more CLIL (content and language integrated<br />

learning) programmes have been established. CLIL<br />

learners are taught History and Biology, as well as a<br />

combination of Geography, Economics and Politics in<br />

English during certain years specified by the curriculum.<br />

In addition, 'immersive-reflective' lessons (IRL) have<br />

been introduced at the primary level. These focus on<br />

situational context and communication, while at the same<br />

time allowing for reflection on language whenever this is<br />

deemed necessary.<br />

Due to the current co-existence of various educational<br />

settings, it is timely to compile a learner corpus in order<br />

to investigate the effects of educational settings on the<br />

interlanguages of the following four groups of learners:<br />

1) participants in EFL, but neither IRL nor CLIL;<br />

2) participants in EFL and IRL, but not CLIL;<br />

3) participants in EFL and CLIL, but not IRL;<br />

4) participants in EFL, CLIL and IRL.<br />

All learners participating in the study described below are<br />

in Year 11, i. e. they have entered the final stage of their<br />

school career.<br />

2. The Passive and the German Learner<br />

To test the impact of educational settings on the learner<br />

groups outlined above, grammatical structures need to be<br />

analysed with respect to the question which ones will<br />

most likely occur with different frequency in the types of<br />

input that is available to these learners. For the purpose of<br />

the present study, the English passive has been chosen as<br />

an indicator.<br />

Being exposed to scientifically-oriented writing, CLIL<br />

learners receive input from a genre that differs from those<br />


used in EFL classes. Based on the findings of Svartvik<br />

(1966), this genre may be assumed to contain a relatively

larger number of passive structures. This will be tested on<br />

a corpus of teaching materials. It is likely that passive<br />

constructions will also occur with higher frequency in the<br />

written output of CLIL learners.<br />

Different types of be Ved constructions, i. e. central<br />

passives with solely verbal features and semi-passives<br />

carrying verbal as well as adjectival characteristics, are<br />

included in an analysis of teaching materials and of

written learner language. Questions of verb valency are<br />

also taken into account.<br />


3. The Teaching Materials Corpus<br />

3.1. Input and Norm: TMCinp and TMCref<br />

To determine whether or not the various groups of<br />

learners are indeed exposed to different types of written<br />

input, a corpus of teaching materials (TMC) is being<br />

compiled. It includes written material for learners from<br />

Year 7 onwards, as both CLIL and the treatment of the<br />

English passive start in that year.<br />

The TMC serves two purposes: On the one hand, it<br />

compares input from EFL lessons to input from CLIL by<br />

means of an input subcorpus (TMCinp). An analysis of<br />

Year 7-10 materials for both groups will establish<br />

whether or not passive structures do indeed occur with<br />

higher frequency in CLIL materials than in EFL<br />

materials.<br />

Secondly, the TMC represents a reference norm. All four<br />

groups of learners take the same EFL exams at the end of<br />

their school career. Hence a target norm, against which<br />

the learners' written performance at that stage can be<br />

measured, is defined by compiling a reference subcorpus<br />

(TMCref). TMCref comprises Year 11-12 materials<br />

designed for use in the EFL classroom.<br />

The overall structure of the TMC is presented in Fig. 1.<br />

[Figure 1: Teaching Materials Corpus (TMC). TMCinp covers Year 7-10 materials for EFL and CLIL; TMCref covers Year 11-12 EFL materials.]

3.2. Metadata<br />

The TMC is, amongst others, annotated with the following metadata to enable efficient querying (a hypothetical record illustrating these fields is sketched below):
• learning environment (EFL vs. CLIL);
• publisher and title;
• targeted age group;
• type of material (textbook, workbook, newspaper, fiction not included into textbooks etc.);
• genre.
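A minimal sketch of what such a metadata record could look like, and how it might be filtered, follows. The field names and values are illustrative assumptions, not the actual TMC encoding.

# Hypothetical TMC metadata record and a simple filter over a list of such records.
tmc_record = {
    "environment": "CLIL",            # EFL vs. CLIL
    "publisher": "ExamplePublisher",  # assumed value
    "title": "History 9 (invented)",  # assumed value
    "target_year": 9,                 # targeted age group / school year
    "material_type": "textbook",      # textbook, workbook, newspaper, ...
    "genre": "historical account",
}

def select(records, **criteria):
    """Return all records matching every given metadata criterion."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

print(select([tmc_record], environment="CLIL", material_type="textbook"))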

The TMC includes written text as well as supplementary<br />

information and instructions referring to written text only,<br />

rather than to film sequences or listening comprehension<br />

exercises that may accompany textbooks. Skills files,<br />

which are used to acquire the techniques and vocabulary<br />

needed for various types of text production, are also<br />

excluded from the TMC.<br />

3.3. POS-Tagging and Parsing<br />

To linguistically annotate the TMC, the English versions<br />

of TreeTagger (Schmid, 1994) and of the MATE parser<br />

(Bohnet, 2010) were used. TreeTagger is a stochastic<br />

part-of-speech (POS) tagger that uses annotated<br />

reference texts, lexical entries (word form, lemma, POS),<br />

word endings and three-word windows (two items left of<br />

the candidate) as an input. It performs lemmatization<br />

together with tagging (U Penn Treebank tagset, 36 tags).<br />

MATE is a trainable dependency parser (trained on the U<br />

Penn Treebank). Both tools also perform sentence<br />

tokenization. Having the TMC tagged, lemmatized and<br />

parsed, we expect to be able to extract occurrences of<br />

passives with good precision and recall.<br />

4. Learner Corpus: Data Elicitation<br />

4.1. Personal Data<br />

If a difference in the use of passive constructions in<br />

learner text is to be attributed to a specific educational<br />

setting, it is essential to make sure that all groups of

learners are comparable with respect to a number of<br />

individual parameters. The collection of these personal<br />

data centres around two methods – a questionnaire and<br />

psychological testing. In the questionnaire, learners are<br />

asked to provide information e. g. on age, sex, mother<br />

tongue, learning environment, etc. (cf. sec. 5.1.).<br />

Moreover, information on cognitive capacities and<br />

motivation needs to be gathered by means of


psychological testing. Participation in CLIL lessons is<br />

not compulsory and there is room for the possibility that<br />

learners opt for these programmes because they possess<br />

better overall or language-related cognitive skills, or a<br />

higher level of motivation.<br />

The intelligence test used in this study (PSB-R 6-13,<br />

Horn, 2003) provides information on the two cognitive<br />

factors mentioned above, along with individual scales on<br />

lexical fluency in German and language-related logical<br />

thinking. Data from a pilot study with 28 subjects<br />

(cf. Table 1) suggest that the most reasonable procedure<br />

will be to sort participants into two groups according to<br />

the scores attained on the scales for overall and<br />

language-related cognitive capacities (SW 100-109/IQ<br />

100-114 vs. SW 110-119/IQ 115-129).<br />

SW        General (PSB-R 6-13 GL)   Language-related (PSB-R 6-13 V)
100-109   14                        19
110-119   10                         9
> 119      4                         0

Table 1: Pilot study – cognitive skills

The psychological test related to motivational factors<br />

(FLM 7-13, Petermann & Winkel, 2007) provides,<br />

amongst others, information on orientation towards<br />

performance and success as well as perseverance and<br />

effort. The study aims at learners with an average<br />

motivation (T-score 40-60), allowing for a margin on<br />

both sides (T-score 36-64). The results of the pilot study<br />

show that 23/24 out of 28 learners fall into this category<br />

for the two scales.<br />

4.2. Learner Text Data<br />

Learners are invited to write two short argumentative<br />

essays within a time frame of about 70 minutes. Students<br />

at this level are used to this kind of task, as it is widely<br />

practised throughout the years preceding their final<br />

exams. Learners key in their texts using a simple editor<br />

without a spellchecker. However, they are allowed to use<br />

a printed version of a monolingual dictionary.<br />

Some of the essay topics to choose from involve passive<br />

constructions, others do not. The following enumeration<br />

lists the topics most frequently chosen:<br />

1) In order to fight teenage drinking, the legal drinking<br />

age should be raised to 21. (18 essays)<br />


2) In Germany, the education system offers equality of<br />

opportunity to everyone, rich or poor. (9 essays)<br />

3) Privacy is a thing of the past. (9 essays)<br />

4) The death penalty should be reintroduced in<br />

Germany. (9 essays)<br />

In the pilot study, the average number of words produced<br />

in one essay was 308, resulting in a corpus of slightly<br />

more than 17,000 words.<br />

4.3. Experimental Data<br />

A study on the International Corpus of Learner English<br />

(ICLE) has revealed a marked underuse of the English<br />

passive even in more advanced German learners<br />

(cf. Granger, 2009). It can therefore be assumed that this<br />

will be the case with less advanced learners as well. Thus,<br />

to make sure that additional information is available as a<br />

backup, text data elicitation is supplemented with an<br />

experimental task to find out whether or not learners are<br />

able to transform active sentences into their passive<br />

counterparts. Not only are learners tested on the<br />

morphology of the English passive in various tenses<br />

(cf. sentences 1 and 2), but the task also involves<br />

ditransitive verbs to find out which object is most likely<br />

to be moved to the subject position of the passive<br />

sentence (cf. sentence 3). Moreover, learners are<br />

presented with constructions that have not or only<br />

marginally been part of their EFL instruction<br />

(e.g. prepositional verbs or complex-transitive verbs,<br />

cf. sentences 4 and 5).<br />

1) My sister's friends often invite me to parties.<br />

2) The teams will play the last match of the season next<br />

Friday.<br />

3) My grandparents have promised me a new computer.<br />

4) People look upon the construction of the railroad as<br />

a fantastic achievement.<br />

5) Everyone considered Pat a nice person.<br />

In the experimental task, learners respond to 12 sentences<br />

in about 20 minutes. In addition, they are asked to rate the<br />

reliability of their own responses on a 5-point Likert scale.<br />

These reliability scores are included into the learner<br />

corpus as metadata.<br />

5. Learner Corpus: Annotation<br />

5.1. Metadata<br />

As a result of the procedures described in sec. 4, the<br />


learner corpus comprises information on the following<br />

aspects, annotated as metadata:<br />

• age and sex;
• mother tongue and languages spoken at home;
• other second and foreign languages, duration of acquisition and self-rated competence;
• duration of the learner's longest stay in an English-speaking country;
• number of school years skipped or doubled;
• attendance of German primary school and participation in immersive-reflective lessons;
• textbooks used in the EFL classroom;
• participation in CLIL programmes and school subjects affected;
• exposure to English during the learner's spare time;
• aspects of cognitive capacities;
• aspects of motivation;
• self-rated reliability of responses in the experimental task;
• essay topic.

5.2. POS-Tagging<br />

The Learner Corpus was POS-tagged by means of<br />

TreeTagger, the same way as TMC. In addition, the<br />

CLAWS4 tagger was applied, a hybrid tagger that<br />

involves both probabilistic and rule-based procedures<br />

(Garside & Smith, 1997). For the purpose of the present<br />

pilot study, we have used the C7 tagset, which amounts to<br />

a number of 146 tags. CLAWS4 provides probability<br />

scores for tags assigned to potentially ambiguous word<br />

forms. For the 17,000 word pilot learner corpus,<br />

CLAWS4 lists 5,255 ambiguities; of these, 88.4 % are<br />

assigned a first tag alternative with 80 % probability or<br />

more.<br />

TreeTagger marked 423 misspelled words as unknown. Slightly more than half of these

nevertheless received a correct POS-tag. When CLAWS4<br />

was used, only two items received the unknown tag. These were misspellings identified as truncations.

However, 51 misspelled words received an additional marker alongside their POS-tag. 16 of these were

correctly POS-tagged despite their spelling error. It is<br />

remarkable that 19 of the 35 mistagged words involved<br />

proper nouns or adjectives denoting nationalities, spelt<br />

without a capital letter. In seven cases, the omission of<br />

apostrophes to mark either a genitive or a clitic made it<br />


impossible to assign a correct POS-tag. As CLAWS4<br />

operates using the probability of POS-tags for both<br />

individual words and tag sequences, this had rather<br />

far-reaching consequences for the tagging of the<br />

preceding and following units.<br />

5.3. Parsing<br />

As is the case for the TMC, the Learner Corpus was also<br />

parsed by means of MATE. As the parser assigns<br />

POS-tags to the word forms analysed, a comparison with<br />

TreeTagger and CLAWS4 was performed (cf. sec. 6.2.<br />

for details). Tested on the misspelled words tagged<br />

by TreeTagger, MATE performed<br />

slightly better on the assignment of correct POS-tags<br />

(245 vs. 219). MATE and CLAWS4 were almost equally<br />

successful on partly erroneous occurrences of be Ved<br />

(cf. Table 3). To retrieve English passive constructions<br />

from the Learner Corpus, in principle no parsing would<br />

be needed. Correct syntagms can be found by means of<br />

patterns formulated in terms of POS and lemmas; most<br />

erroneous occurrences are not classifiable for the parser<br />

and thus need to be searched with partial patterns<br />

(e.g. participle alone).<br />
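As a rough illustration of such pattern-based retrieval, the following sketch searches (word, POS, lemma) triples for "be + past participle", allowing a few intervening adverbs. It is an illustration only: the tag names follow the Penn Treebank conventions mentioned above, and the toy input is an assumption, not the actual TMC pipeline output.

# Minimal pattern-based search for "be + past participle" over tagged tokens,
# e.g. matching "was heavily criticised".
TAGGED = [
    ("The", "DT", "the"), ("law", "NN", "law"), ("was", "VBD", "be"),
    ("heavily", "RB", "heavily"), ("criticised", "VBN", "criticise"), (".", ".", "."),
]

def find_be_passives(tokens, max_gap=2):
    hits = []
    for i, (word, pos, lemma) in enumerate(tokens):
        if lemma == "be":
            # look a few tokens ahead for a past participle (VBN)
            for j in range(i + 1, min(i + 1 + max_gap + 1, len(tokens))):
                w2, pos2, _ = tokens[j]
                if pos2 == "VBN":
                    hits.append(" ".join(t[0] for t in tokens[i:j + 1]))
                    break
                if pos2 not in ("RB", "RBR", "RBS"):  # only skip adverbs
                    break
    return hits

print(find_be_passives(TAGGED))   # ['was heavily criticised']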

6. Retrieval of Passive Constructions<br />

6.1. Manual Analysis<br />

Before an automatic analysis was undertaken, instances<br />

of English be Ved constructions were retrieved manually<br />

from the pilot corpus. 151 occurrences were found, 22 of<br />

which being erroneous. The following types of error<br />

occurred:<br />

• Omission of be (6 instances): *Should the death penalty reintroduced in Germany?
• Morphological and/or orthographic errors in the form of be or related clitics (3 instances): *You arent forced to post anything in the internet.
• Morphological and/or orthographic errors in the past participle (11 instances): *[...] if the alcohol can just be buyed by 21 old people.
• Lexical errors (1 instance): *[...] so he is already prisoned by the police.
• A combination of these (1 instance): *[...] because it´s forbideden. 1

1 The fact that learners frequently use accents on the keyboard instead of apostrophes presents POS-taggers with problems. However, this will be solved by combining automatic annotation with manual editing (cf. Granger, 1997).


In addition, 9 instances of get-passives were retrieved,<br />

three of which were ungrammatical.<br />

6.2. Automatic Analysis<br />

An analysis of which POS-tags TreeTagger (TT),<br />

CLAWS4 (CL) and the tagger integrated into the MATE<br />

parser (MA) assign to the learners' grammatical be Ved<br />

and get Ved constructions has shown that only<br />

TreeTagger was able to find all instances 2 (cf. Table 2).<br />

                               TT    CL    MA
be + past participle (n=129)   129   128   123
get + past participle (n=6)      6     4     5

Table 2: Retrieval of be Ved and get Ved

An analysis of how the three taggers deal with erroneous<br />

occurrences of be Ved constructions has revealed that<br />

both CLAWS4 and MATE seem to have less difficulty in<br />

dealing with ungrammatical past participles than<br />

TreeTagger (cf. Table 3).<br />

                                                  TT    CL           MA
correct tag for be (n=16)                         12    12           11
correct tag for the past participle (n=22)        11    15 (fn. 3)   15
correct tags for be and past participle (n=16)     4     8            8

Table 3: Tags in erroneous occurrences of be Ved

7. Conclusion<br />

In this paper, work towards a richly annotated corpus of<br />

teaching materials (TMC) and of learner text was<br />

described. The corpora are particularly rich in metadata<br />

(both on the sources of TMC and on learner parameters),<br />

and they have been processed with two POS-taggers<br />

(TreeTagger and CLAWS4) and a dependency parser<br />

(MATE). Metadata and linguistic annotations can be queried together.

2 MATE had some difficulty processing said as a participle in passive constructions (4 instances).

3 It is interesting to note that in some cases in which learners overgeneralize the -ed suffix for the formation of past participles (e.g. *buyed, *payed, *splitted), CLAWS4 adds a marker to the POS-tag of the respective form, indicating that the occurrence is deemed unlikely.

As of summer 2011, the corpora are still very small

(TMC: 420,000 words, LC: 17,000 words); they will<br />

gradually be enlarged. Both TreeTagger and CLAWS4<br />

will continue to be used concurrently, as TreeTagger<br />

seems to perform better on correct forms, and CLAWS4<br />

to be more robust towards erroneous ones. All relevant<br />

passive constructions will be extracted from the enlarged<br />

corpora, with pattern-based search for the correct forms<br />

and semi-automatic procedures for erroneous ones. The<br />

retrieved data, together with the pertaining metadata,<br />

should allow for an interpretation in terms of the impact<br />

of educational settings on the interlanguage of learners.<br />

8. Acknowledgements<br />

The authors would like to thank the following companies:

Alfred Kärcher Vertriebs-GmbH, Cornelsen Verlag<br />

GmbH, Ernst Klett Verlag GmbH, SWN Kreissparkasse,<br />

Pearson Assessment & Information GmbH.<br />

9. References<br />

Bohnet, B. (2010): Very High Accuracy and Fast<br />

Dependency Parsing is not a Contradiction. In<br />

Proceedings of the 23rd International Conference on<br />

Computational Linguistics (Coling 2010), Beijing,<br />

pp. 89–97.<br />

Garside, R., Smith, N. (1997): A hybrid grammatical<br />

tagger: CLAWS4. In R. Garside, G. Leech & A.<br />

McEnery (Eds.), Corpus Annotation: Linguistic<br />

Information from Computer Text Corpora. London:<br />

Longman, pp. 102-121.<br />

Granger, S. (2009): More lexis, less grammar? What does<br />

the (learner) corpus say? Paper presented at the<br />

Grammar & Corpora conference, Mannheim,<br />

22-24 September 2009.

Granger, S. (1997): Automated Retrieval of Passives<br />

from Native and Learner Corpora. Precision and<br />

Recall. In Journal of English Linguistics 25(4),<br />

pp. 365-374.<br />

Horn, W. (2003): PSB-R 6-13. Prüfsystem für Schul- und Bildungsberatung für 6. bis 13. Klassen – revidierte

Fassung. Göttingen: Hogrefe.<br />

Petermann, F. & Winkel, S. (2007): FLM 7-13.<br />

Fragebogen zur Leistungsmotivation für Schüler der 7.

bis 13. Klasse. Frankfurt/Main: Harcourt.<br />


Schmid, H. (1994): Probabilistic Part-of-Speech Tagging<br />

Using Decision Trees. In Proceedings of International<br />

Conference on New Methods in Language Processing.<br />

Svartvik, J. (1966): On Voice in the English Verb. The

Hague/Paris: Mouton.<br />


Register, Genre, Rhetorical Functions:<br />

Variation in English Native-Speaker and Learner Writing<br />

Ekaterina Zaytseva<br />

Johannes Gutenberg-Universität Mainz, Department of English and Linguistics

Jakob-Welder-Weg 18, 55099 Mainz<br />

E-mail: zaytseve@uni-mainz.de<br />

Abstract<br />

The present paper explores patterns and determinants of variation found in the writing of two groups of novice academic writers:<br />

advanced learners of English and English native speakers. It focuses on lexico-grammatical means for expressing the rhetorical<br />

function of contrast in academic and argumentative writing. The study’s aim is to explore and to compare stocks of meaningful ways<br />

of expressing the rhetorical function of contrast employed by native and learner novice academic writers in two different written<br />

genres: argumentative essays and research papers. The following corpora are used for that purpose: the Louvain Corpus of Native<br />

English Essays (LOCNESS), the Michigan Corpus of Upper-level Student Papers (MICUSP), the British Academic Written English<br />

corpus (BAWE) and two corpora of learner English, i.e. the International Corpus of Learner English (ICLE) and the Corpus of<br />

Academic Learner English (CALE) – the latter being a corpus of advanced learner academic writing, currently being compiled at<br />

Johannes Gutenberg-Universität Mainz, Germany. The study adopts a variationist perspective and a functional-pedagogical

perspective on learner writing, aiming at contributing to the field of second language acquisition (SLA), by focusing on advanced<br />

stages of acquisition and teaching English for academic purposes.<br />

Keywords: novice academic writing, rhetorical function of contrast, variation, function-oriented annotation<br />

1. Introduction<br />

The branch of SLA research focusing on advanced levels of

proficiency puts forward issues that are problematic for<br />

researchers, EAP teachers, and foreign language learners<br />

alike. Those include the need for an exhaustive<br />

description of language performance on an advanced<br />

level and a set of defining characteristics which could be<br />

further developed into assessment criteria.<br />

One of the factors responsible for the problematic nature<br />

of “advancedness” is a somewhat narrow view of this<br />

stage of language acquisition as on the one hand, “no<br />

more than ‘better than intermediate level’ structural and<br />

lexical ability for use”, as pointed out by Ortega and<br />

Byrnes (2008:283); and yet, on the other hand, as<br />

language performance, not “flawless” enough to be<br />

considered native-like.<br />

2. Theoretical Background<br />

Advanced learner writing has recently been the object of<br />

a number of corpus-based studies (cf. e.g. Callies, 2008;<br />

Gilquin & Paquot, 2008; Paquot, 2010). It has generally<br />

been analysed from a pedagogical perspective, i.e.<br />

against the yardstick of English native-speakers’ writing,<br />

where features of learner writing have often been<br />

characterized as non-native-like. Among the areas<br />

identified as problematic for advanced learners are most<br />

notably accurate and appropriate use of lexis, register<br />

awareness, and information structure management. Yet,<br />

studies adopting a variationist perspective on advanced<br />

learners’ output and considering a possible influence of<br />

different kinds of variables are still scarce (cf., however,<br />

Ädel, 2008; Paquot, 2010; Wulff & Römer, 2009). One of<br />

the reasons for this could be the lack of corpora<br />

representing advanced academic learner writing (Granger<br />

& Paquot, forthcoming), which makes it difficult, for<br />

example, to analyse the importance of genre and writer’s<br />

genre (un)awareness as possible determinants of<br />

variation. The existing corpora include the following<br />

projects in progress: the ‘Varieties of English for Specific<br />

Purposes’ database (VESPA) (cf. Granger, 2009), the<br />


Corpus of Academic Learner English (CALE) 1<br />

, and the<br />

Cologne-Hanover Advanced Learner Corpus (CHALC)<br />

(Römer, 2007).<br />

The pedagogical approach to learners’ language<br />

production has brought forward particular kinds and<br />

methods of learner data analysis. One of them is<br />

annotating a learner corpus for errors (cf. Granger, 2004).<br />

Valuable as it is, this kind of corpus annotation, however,<br />

does not allow for a truly usage-based perspective on<br />

learner language production, where learners’ experience<br />

with language in particular social settings is the focus of<br />

attention.<br />

Corpus-based analyses of native English academic<br />

writing, meanwhile, have revealed that this register is<br />

characterised by a specific kind of vocabulary on the one<br />

hand (Biber et al., 1999; Coxhead, 2000; Paquot, 2010)<br />

and by certain kinds of grammatical structures on the<br />

other hand (e.g. Biber, 2006; Kerz & Haas, 2009). In

addition, it has been pointed out that the register of native<br />

English academic writing displays a certain degree of<br />

variation as well, e.g. there is discipline- and genre-based<br />

variation in the form and use of lexico-grammatical<br />

structures used in written discourse (Hyland, 2008).<br />

However, there is little information on possible variation<br />

in different genres produced by novice native English<br />

academic writers (cf., however, Wulff & Römer, 2009).<br />


3. Project Aims and Objectives<br />

The present paper reports on work in progress exploring<br />

patterns and determinants of variation found in the<br />

writing of two groups of novice academic writers:<br />

advanced learners of English and English native speakers.<br />

It focuses on lexico-grammatical ways for expressing the<br />

rhetorical function of contrast in academic and<br />

argumentative writing. The study’s aim is to explore and<br />

subsequently to compare stocks of meaningful ways of<br />

expressing contrast employed by native and learner<br />

novice academic writers in two different written genres:<br />

argumentative essays and research papers. For that<br />

purpose the following corpora are used: three corpora of<br />

native English: the Louvain Corpus of Native English Essays (LOCNESS) (Granger, 1996), the

Michigan Corpus of Upper-level Student Papers<br />

1 http://www.advanced-learner-varieties.info<br />

(MICUSP) 2 , the British Academic Written English corpus<br />

(BAWE) (Nesi, 2008) as well as two corpora of learner<br />

English, i.e. the International Corpus of Learner English<br />

(ICLE) (Granger, 2003) and the Corpus of Academic<br />

Learner English (CALE) 3<br />

- a corpus of advanced learner<br />

academic writing, currently being compiled at<br />

Johannes Gutenberg-Universität Mainz, Germany.

Another aim of the study is to investigate to what extent<br />

the influence of the variable ‘genre’ is a possible<br />

determinant of variation in the written production of<br />

various groups of academic writers. In this respect, it is<br />

important to address the issue of novice writers’ genre<br />

awareness and to discuss the question of native-speaker<br />

norm. In addition, the paper explores the existence of<br />

interlanguage (IL)-specific strategies used by advanced<br />

learners to express rhetorical functions in writing.<br />

The latter will be achieved by annotating both corpora of<br />

advanced learner writing for the rhetorical function of<br />

contrast. This kind of function-oriented annotation,<br />

though still rare in English learner corpus research,<br />

presents researchers with a valuable opportunity to view<br />

learners as active language users, rather than learners<br />

demonstrating deficient knowledge of the target language.<br />

In addition, the potential of multidimensional corpus<br />

analysis (Biber & Conrad, 2001) is currently being<br />

considered as a highly useful method of distinguishing<br />

between different registers and genres.<br />

The study thus adopts a variationist perspective on

novice academic writing, considering advanced<br />

interlanguage as a variety in its own right. At the same<br />

time, a functional-pedagogical perspective allows for a<br />

further analysis of those areas of language use that are<br />

still problematic for advanced learners, and reveals<br />

meaningful ways in which learners cope with<br />

writing-related tasks.<br />

4. Function-oriented annotation<br />

The advantage of adding a function-driven annotation is<br />

that it makes it possible to generally identify contrast in<br />

learner writing and to pin down an extensive stock of<br />

language means, treated as writers’ lexico-grammatical<br />

preferences for signaling this rhetorical function in<br />

written discourse.<br />

2 http://micusp.elicorpora.info/www.micusp.org
3 http://www.advanced-learner-varieties.info


Further on, the encoded information allows for<br />

function-driven, together with form-driven searches in<br />

learner writing, resulting in a comprehensive and<br />

accurate picture of the variety of lexico-grammatical<br />

means for expressing contrast used by two groups of<br />

(advanced) German learners in their writing. In addition,<br />

a subsequent quantitative analysis can provide valuable<br />

insights into general and individual preferences of<br />

learners in terms of which items are particularly favoured<br />

in the context of a specific writing-related task set in a<br />

specific situation of language use. Moreover, its<br />

combination with a qualitative analysis of patterns and<br />

determinants of variation in the ways of expressing<br />

contrast in writing promises to shed more light on general<br />

written argumentation strategies employed by (advanced)<br />

German learners.<br />

In order for this kind of annotation to be reliable, several<br />

conditions have to be met, which when applied to the<br />

present project, imply clarification of the concept of a<br />

rhetorical function and a clear definition of the rhetorical<br />

function of contrast in terms of its aim and distinctive<br />

characteristics, complemented by a list of possible<br />

language items for its realization in writing.<br />

The next step involves annotating each instance of<br />

contrast being expressed in written discourse in both<br />

corpora of (advanced) German learner writing (i.e.<br />

CALE-GE and ICLE-GE). This stage is followed by a<br />

detailed description and categorization of the<br />

lexico-grammatical means for expressing contrast in<br />

learner writing. Subsequently, comparative analyses,<br />

quantitative as well as qualitative, are carried out, in<br />

order to reveal possible patterns and determinants of<br />

variation that exist in the novice academic writing.<br />
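A minimal sketch of what such function-oriented, stand-off annotation could look like follows. It is purely illustrative: the small marker inventory, the input text and the attribute names are assumptions, not the actual CALE/ICLE annotation scheme, and the real annotation of contrast is done against a much fuller list of realizations.

import re

# Hypothetical, very small inventory of contrast signals.
CONTRAST_MARKERS = {"however", "whereas", "on the other hand", "in contrast", "yet"}

def annotate_contrast(text):
    """Return stand-off annotations: (start, end, surface form, function)."""
    spans = []
    for marker in CONTRAST_MARKERS:
        for m in re.finditer(r"\b" + re.escape(marker) + r"\b", text, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), m.group(0), "contrast"))
    return sorted(spans)

sample = "The essay genre is short. However, research papers differ, whereas both are written."
for start, end, form, function in annotate_contrast(sample):
    print(f"{start:3d}-{end:3d}  {form:20s} {function}")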

Preliminary findings reveal a slight degree of<br />

genre-induced variation in German learners’ writing in<br />

terms of sentence placement of the contrastive item<br />

however, see Table 1 below.<br />

Corpus     Corpus size (N of tokens)   Initial      Non-initial   Total
ICLE-GE    234.423                     103 (45%)    125 (55%)     228
CALE-GE    55.000                      49 (64%)     27 (36%)      76

Table 1: Position of the contrastive item however

As the table shows, German learners seem to prefer the<br />

initial sentence positioning of however in academic<br />

(CALE-GE), rather than in argumentative (ICLE-GE)<br />

writing. Thus, the item however found in the sentence<br />

initial position is almost 1.5 times more frequent in term

papers than in argumentative essays. This seems to tie in<br />

well with one of the findings recently reported by Wagner<br />

(2011). In her empirical study, she points out a tendency

for however to take up the initial sentence position in<br />

literature and cultural studies texts, rather than in<br />

linguistic texts and general corpora (2011:43). Due to a

modest number of words contained in the version of the<br />

CALE corpus used at the time of analysis (see Table 1),<br />

the preliminary finding reported in the current paper<br />

should be treated with caution. A further analysis of a<br />

greater number of occurrences in a bigger corpus is<br />

needed in order to provide more empirical evidence for<br />

supporting and accounting for this finding.<br />
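The kind of positional count underlying Table 1 can be sketched as follows. This is an illustrative script only: the sentence splitting is naive and the toy input is invented, not the actual ICLE/CALE query workflow.

import re

def however_positions(texts):
    """Count sentence-initial vs. non-initial occurrences of 'however'."""
    initial = non_initial = 0
    for text in texts:
        # naive sentence split on ., ! and ? followed by whitespace
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            tokens = re.findall(r"[A-Za-z']+", sentence)
            for i, tok in enumerate(tokens):
                if tok.lower() == "however":
                    if i == 0:
                        initial += 1
                    else:
                        non_initial += 1
    return initial, non_initial

# Toy input (invented sentences, not corpus data):
sample = ["However, the two genres differ. The essays are, however, much shorter."]
ini, non = however_positions(sample)
total = ini + non
print(f"initial: {ini} ({ini/total:.0%}), non-initial: {non} ({non/total:.0%})")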

5. Conclusion<br />

The project presented in the present paper sets out to<br />

explore advanced IL-specific strategies for coping with a<br />

writing-related task in the context of English academic<br />

and argumentative writing. This is achieved by<br />

combining a functional-pedagogical view with a<br />

variationist perspective on learner writing and annotating<br />

the rhetorical function of contrast in the two corpora of<br />

learner writing. At the same time, the findings of the<br />

project will contribute to the area of variation in novice<br />

native English academic writing and will further a<br />

definition of the native speaker norm, which advanced<br />

learners are generally expected to aim at.<br />

6. References<br />

Ädel, A. (2008): Involvement features in writing: do time<br />

and interaction trump register awareness? In G.<br />

Gilquin, S. Papp. & M. B. Díez-Bedmar (Eds.),<br />

Linking up Contrastive and Learner Corpus Research.<br />

Amsterdam, Atlanta: Rodopi, pp. 35-53.<br />

Biber, D., Johansson, S., Leech, G., Conrad, S., Finegan,<br />

E. (1999): Longman Grammar of Spoken and Written<br />

English. Harlow: Pearson Education.<br />

Biber, D., Conrad, S. (2001): Introduction:<br />

Multidimensional analysis and the study of register<br />

variation. In S. Conrad & D. Biber (Eds.), Variation in<br />

English: Multidimensional Studies. London: Longman,<br />


pp. 3-13.<br />

Biber, D. (2006): University Language: A Corpus-Based<br />

Study of Spoken and Written Registers. Amsterdam:<br />

John Benjamins.<br />

Callies, M. (2008): Easy to understand but difficult to use?<br />

Raising constructions and information packaging in<br />

the advanced learner variety. In G. Gilquin, S. Papp. &<br />

M. B. Díez-Bedmar (Eds.), Linking up Contrastive and<br />

Learner Corpus Research. Amsterdam, Atlanta:<br />

Rodopi, pp. 201-226.<br />

Coxhead, A. (2000): A new academic word list. TESOL<br />

Quarterly, 34(2), pp. 213-238.<br />

Gilquin, G., Paquot, M. (2008): Too chatty: Learner<br />

academic writing and register variation. English Text<br />

Construction, 1(1), pp. 41-61.<br />

Granger, S. (19<strong>96</strong>): From CA to CIA and back: An<br />

integrated approach to computerized bilingual and<br />

learner corpora. In K. Aijmer, B. Altenberg & M.<br />

Johansson (Eds.), Languages in Contrast. Text-Based<br />

Cross-Linguistic Studies. Lund Studies in English 88.<br />

Lund: Lund University Press, pp. 37-51.<br />

Granger, S. (2003): The international corpus of learner<br />

English: A new resource for foreign language learning<br />

and teaching and second language acquisition research.<br />

TESOL Quarterly, 37(3), pp. 538-546.<br />

Granger, S. (2004): Computer learner corpus research:<br />

Current status and future prospects. In U. Connor & T.<br />

Upton (Eds.), Applied Corpus Linguistics: A<br />

Multidimensional Perspective. Amsterdam, Atlanta:<br />

Rodopi, pp. 123-145.<br />

Granger, S. (2009): In search of a general academic<br />

vocabulary: A corpus-driven study. Paper Presented at<br />

the International Conference ‘Options and Practices of<br />

L.S.A.P Practitioners’, 7-8 February 2009. University<br />

of Crete, Heraklion, Crete.<br />

Granger, S., Paquot, M. (Forthcoming): Language<br />

for Specific Purposes. Retrieved from<br />

http://sites.uclouvain.be/cecl/archives/GRANGER_P<br />

AQUOT_Forthcoming_Language_for_Specific_Purp<br />

oses_Learner_Corpora.pdf , 17.12.2010.<br />

Hyland, K. (2008): As can be seen: lexical bundles and<br />

disciplinary variation. English for Specific Purposes,<br />

27(1), pp. 4-21.<br />

Kerz, E., Haas, F. (2009): The aim is to analyse NP: the<br />

function of prefabricated chunks in academic texts. In<br />

R. Corrigan, E. Moravcsik, H. Ouali & K. Wheatley<br />

242<br />

(Eds.), Formulaic Language: Volume 1. Distribution<br />

and historical change. Amsterdam, Philadelphia: John<br />

Benjamins, pp. 97-117.<br />

Nesi, H. (2008): BAWE: An introduction to a new<br />

resource. In A. Frankenberg-Garcia, T. Rkibi, M.<br />

Braga da Cruz, R. Carvalho, C. Direito & D.<br />

Santos-Rosa (Eds.), Proceedings of the 8th Teaching<br />

and Language Corpora Conference. Held 4-6 July<br />

2008 at the Instituto Superior de Línguas e<br />

Administração. Lisbon, Portugal: ISLA, pp. 239-246.<br />

Ortega, L., Byrnes, H. (2008): Theorizing advancedness,<br />

setting up the longitudinal research agenda. In L.<br />

Ortega & H. Byrnes (Eds.), The Longitudinal Study of<br />

Advanced L2 Capacities. New York: Routledge/Taylor<br />

& Francis, pp. 3-20.<br />

Paquot, M. (2010): Academic Vocabulary in Learner<br />

Writing: From Extraction to Analysis. United States:<br />

Continuum Publishing Corporation.<br />

Römer, U. (2007): Learner language and the norms in<br />

native corpora and EFL teaching materials: a case<br />

study of English conditionals. In: S. Volk-Birke & J.<br />

Lippert (Eds.), Anglistentag 2006 Halle. Proceedings.<br />

Trier: Wissenschaftlicher Verlag, pp. 355–63.<br />

Wagner, S. (<strong>2011</strong>): Concessives and contrastives in<br />

student writing: L1, L2 and genre differences.<br />

In J. Schmied (Ed.), Academic Writing in Europe:<br />

Empirical Perspectives. Göttingen: Cuvillier,<br />

pp. 23-49.<br />

Wulff, S. & Römer, U. (2009): Becoming a proficient<br />

academic writer: Shifting lexical preferences in the use<br />

of the progressive. Corpora, 4(2), pp. 115-133.



Tools to Analyse German-English Contrasts in Cohesion

Kerstin Kunz, Ekaterina Lapshinova-Koltunski
Universität des Saarlandes
Universität Campus, 66123 Saarbrücken
E-mail: k.kunz@mx.uni-saarland.de, e.lapshinova@mx.uni-saarland.de

Abstract
In the present study, we elaborate resources to semi-automatically analyse German-English contrasts in the area of cohesion. This work is an example of an application for corpus data extraction designed for the analysis of cohesion from both a system-based and a text-based contrastive perspective.
Keywords: cohesion, contrastive analysis, corpus linguistics, extraction of linguistic knowledge, German-English contrasts

1. Introduction

To obtain empirical evidence of cohesion in English and German texts, we carry out a corpus-linguistic analysis which includes investigating a broad range of cohesive phenomena. We particularly focus on the analysis of various types of cohesive devices, the linguistic expressions to which they connect (the antecedents), the nature of the semantic ties established, as well as the properties of cohesive chains. Our main research questions are: 1) Which cohesive resources provided by the language systems of English and German are instantiated in different registers? 2) How frequent are they? 3) Which cohesive meanings do they express?

Substantial research gaps in these areas justify such an enterprise: On the one hand, comprehensive accounts of cohesion exist only from a monolingual perspective, e.g. (Halliday & Hasan, 1976), (Schubert, 2008), (Linke et al., 2001), (Brinker, 2005). On the other hand, empirical monolingual or contrastive analyses on the level of text and discourse mainly deal with individual phenomena, cf. (Fabricius-Hansen, 1999) and (Doherty, 2006) for certain aspects of information packaging and (Bosch et al., 2007), (Gundel et al., 2004) for the investigation of particular cohesive devices. Thus, both system-based and text-based contrastive methods to compare English and German in terms of textuality have, to our knowledge, not received much attention so far, cf. Table 1.

With our research, we intend to focus on cohesion as one particular aspect of textuality. As a starting point for our empirical analysis, we take the classification by Halliday & Hasan (1976), according to which cohesion mainly includes five categories: reference, substitution, ellipsis, conjunctive relations and lexical cohesion.

Table 1: Contrastive system- and text-based studies available for English and German

2. Corpus Resources

In this contribution, we describe our tools to extract evidence for these categories from the English-German corpus GECCo, cf. (Amoia et al., submitted). Currently there are no comprehensive resources known to us that offer a repository of the coherence-building systems of one or more languages.[1] Our analysis design permits new insights into cohesive phenomena across languages, contexts and registers. The elaboration of the procedures to extract such phenomena includes compilation, annotation and exploitation of GECCo, which consists of 10 registers of both written and spoken texts, as shown in Table 2. The written part of GECCo includes 8 registers[2] which are based on the CroCo corpus, cf. (Neumann, 2005).

[1] We can only name some resources providing annotations of individual cohesive phenomena, e.g. pronoun coreference in the BBN Pronoun Coreference and Entity Type Corpus, cf. (Weischedel and Brunstein 2005), verbal phrase ellipsis in (Bos and Spenader 2011) or conjunctive relations in the PDTB, cf. (Prasad et al. 2008) for English, or annotation of anaphora in (Dipper and Zinsmeister 2009) for German.
[2] Popular-scientific texts (POPSCI), tourism leaflets (TOU), prepared speeches (SPEECH), political essays (ESSAY), fictional texts (FICTION), corporate communication (SHARE), instruction manuals (INSTR) and websites (WEB).

Languages                 Registers
EO, GO, ETrans, GTrans    written (imported from CroCo): FICTION, ESSAY, INSTR, POPSCI, SHARE, SPEECH, TOU, WEB
EO, GO                    spoken: INTERVIEW, ACADEMIC

Table 2: Registers in GECCo

The spoken part contains interviews (INTERVIEW) and academic speeches (ACADEMIC) produced by native speakers of the two languages.[3] We have chosen such a corpus constellation as we expect considerable differences in the frequency and function of cohesive devices between written and spoken registers. Moreover, we depart from the assumption that there is a continuum from written to spoken mode rather than a clear dividing line.

The written part of the multilingual corpus is already annotated with information on lemma, morphology and part of speech on the word level; sentences, grammatical functions and predicate-argument structures on the chunk level; and registers and metadata on the text level, as shown in Figure 1. It additionally contains clause-based alignment of originals and translations.[4] We intend to semi-automatically annotate the spoken registers with the information available for the written part, developing a set of automatic procedures for this task. The annotation layer on the text level will also be enhanced with metadata information on language variation, speaker age, etc. Further annotations such as coreference, lexical chaining and cohesion disambiguation, based on the analyses in (Kunz & Steiner, in progress) and (Kunz, 2010), will be integrated into both parts of GECCo.

[3] This corpus part will be public and available on the web.
[4] EO = English originals, GO = German originals, ETrans = English translations, GTrans = German translations in Table 2.

3. Procedures to Analyse Cohesion

The annotated corpus is encoded to be queried with CQP (Corpus Query Processor).[5] We also plan to encode it for further existing query engines, e.g. ANNIS2, described in (Zeldes et al., 2009). The extracted information on cohesion will be imported into semi-automatic annotation tools in order to refine the corpus annotations on different levels, cf. Figure 2.

Figure 1: Annotation layers in GECCo

[5] cf. (Christ, 1994).

As mentioned above, the annotated corpus can already be queried with CQP, which allows two types of attributes: positional (e.g. for part-of-speech and morphological features) and structural (e.g. for clauses or metadata). With the help of CQP-based queries that include string, part-of-speech, text and register constraints, we are able to extract linguistic items expressing the cohesion categories introduced in section 1 above and to classify them according to their specific textual functions. We use our linguistic knowledge on cohesive devices to develop sets of complex queries with CQP that enable the extraction of cohesion from GECCo. The obtained data are subject to statistical validation (e.g. significance tests or variation and cluster analysis) with R, with the help of which we can disambiguate and classify cohesive devices.
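The CQP queries themselves are not reproduced here; the following sketch merely illustrates the kind of counts they deliver, using a deliberately simplified corpus representation (register-labelled, POS-tagged sentences) instead of GECCo's actual encoding:

from collections import Counter

def sentence_initial_pronouns(sentences, targets=("it", "es")):
    """Count sentence-initial occurrences of the target pronouns per register.

    `sentences` is an iterable of (register, tokens) pairs, where tokens is a
    list of (word, pos) tuples -- a stand-in for the real corpus encoding.
    """
    counts = Counter()
    for register, tokens in sentences:
        if not tokens:
            continue
        word, pos = tokens[0]
        # PPER and PRP are the personal pronoun tags of STTS and the Penn
        # tagset; the tagsets actually used in GECCo may differ.
        if word.lower() in targets and pos in ("PPER", "PRP"):
            counts[register] += 1
    return counts

example = [
    ("GO_FICTION", [("Es", "PPER"), ("regnete", "VVFIN"), (".", "$.")]),
    ("EO_FICTION", [("It", "PRP"), ("rained", "VBD"), (".", ".")]),
]
print(sentence_initial_pronouns(example))

Separating cohesive from non-cohesive (e.g. expletive) uses of it/es, as done in the study, requires additional disambiguation that this sketch does not attempt.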

Moreover, CQP can also be employed to incrementally improve the corpus annotations, which allows us to semi-automatically enrich the corpus with annotations based on the extracted information, as shown in Figure 3. However, our observations show that representing nested structures or constituents containing gaps (necessary for the annotation of coreference or ellipsis) within CQP is rather problematic, cf. (Amoia et al., submitted). As mentioned above, we therefore attempt to exploit GECCo with further available query engines, e.g. ANNIS2.

4. Preliminary Results

Our preliminary extraction results already show that there exist systematic regularities of language- and register-dependent contrasts in frequency with respect to personal reference. As an example, consider our findings for the distribution of neuter forms of third person pronouns in sentence-initial position in Figure 4 (EO = English original, GO = German original, ETrans = English translation, GTrans = German translation, cf. Figure 2). The left side shows the percentage distribution of sentence-initial occurrences of cohesive it/es. The right side displays the total numbers for all instances and for cohesive instances of sentence-initial it/es.

In addition, we could already show in the analysis of the German demonstrative pronouns der, die, das that there is a heterogeneity in frequency and function across registers which goes beyond assumptions drawn in the frame of earlier systemic and also textual accounts. For instance, the findings displayed in Table 3 suggest a written-spoken continuum, with the register INSTR at one end and INTERVIEW at the other end of the continuum, rather than a clear-cut distinction between written and spoken registers (as already postulated above). Moreover, the differences in numbers between das and der, die call for an in-depth analysis with respect to distinct functions.

Figure 2: Procedures to analyse cohesion in GECCo

                   der   die   das
GO_SPEECH            4     4   173
GTrans_SPEECH        3     -    38
GO_FICTION          15    12   113
GTrans_FICTION      10     7   100
GO_POPSCI            4     1   110
GTrans_POPSCI        3     1    44
GO_TOU               9     2    31
GTrans_TOU           2     1    14
GO_SHARE             3     1    44
GTrans_SHARE         3     -    46
GO_ESSAY             1     3    90
GTrans_ESSAY         -     -    49
GO_INSTR             -     -    20
GTrans_INSTR         -     -    18
GO_WEB               1     2    31
GTrans_WEB           1     -    27
GO_INTERVIEW        19    47   506

Table 3: Occurrences of der, die, das in the German subcorpora
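As an illustration of the statistical validation mentioned in section 3 (the analyses in this project are carried out with R; the sketch below uses Python's scipy instead), a significance test on two rows of Table 3 can check whether the distribution over der, die and das differs between a written and a spoken register:

from scipy.stats import chi2_contingency

# Counts of der, die, das taken from Table 3.
go_speech = [4, 4, 173]
go_interview = [19, 47, 506]

chi2, p, dof, expected = chi2_contingency([go_speech, go_interview])
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")

A low p-value supports the claim that the demonstrative distribution differs between SPEECH and INTERVIEW; cluster analyses over all registers apply the same idea to the full table.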



5. Conclusion

The described resources to extract comprehensive linguistic knowledge on cohesion will find application in various linguistic areas. First, they should provide us with evidence for our hypotheses on English-German contrasts in cohesion described in (Kunz & Steiner, in progress). Second, they should yield an initial understanding of how contrast and contact phenomena on the level of cohesion affect language understanding and language production. Furthermore, the obtained information on cohesive mechanisms of English and German will provide valuable insights for language teaching, particularly for translator/interpreter training. Our tools will also offer new incentives for the automatic exploitation of cohesion, e.g. in machine translation, as they permit extraction from parallel corpora.

6. Acknowledgements

The authors thank the DFG (Deutsche Forschungsgemeinschaft) and the whole GECCo team for supporting this project.

7. References

Brinker, K. (2005): Linguistische Textanalyse: Eine Einführung in Grundbegriffe und Methoden. 6th edition. Berlin: Erich Schmidt.
Christ, O. (1994): A modular and flexible architecture for an integrated corpus query system. In Proceedings of the 3rd Conference on Computational Lexicography and Text Research. Budapest, Hungary.
Dipper, S., Zinsmeister, H. (2009): Annotating discourse anaphora. In Proceedings of the Third Linguistic Annotation Workshop (LAW III), ACL-IJCNLP 2009. Suntec, Singapore, pp. 166-169.
Doherty, M. (2006): Structural Propensities. Translating nominal word groups from English into German. Amsterdam/Philadelphia: Benjamins.
Fabricius-Hansen, C. (1999): Information packaging and translation: Aspects of translational sentence splitting (German - English/Norwegian). In Studia Grammatica, 47, pp. 175-214.
Gundel, J. K., Hedberg, N., Zacharski, R. (2004): Demonstrative pronouns in natural discourse. In Proceedings of the Fifth Discourse Anaphora and Anaphora Resolution Colloquium. São Miguel, Portugal, pp. 81-86.
Halliday, M.A.K., Hasan, R. (1976): Cohesion in English. London, New York: Longman.
Kunz, K., Steiner, E. (in progress): Towards a comparison of cohesion in English and German - contrasts and contact. Submitted for Functional Linguistics. London: Equinox Publishing Ltd.
Kunz, K. (2010): Variation in English and German Nominal Coreference. A Study of Political Essays. Frankfurt am Main: Peter Lang.
Linke, A., Nussbaumer, M., Portmann, P.R. (2001): Studienbuch Linguistik. 4th edition. Tübingen: Niemeyer.
Neumann, S. (2005): Corpus Design. Deliverable No. 1 of the CroCo Project.
Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., Webber, B. (2008): Penn Discourse Treebank Version 2.0. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008). Marrakech.
Schubert, C. (2008): Englische Textlinguistik. Eine Einführung. Berlin: Schmidt.
Weischedel, R., Brunstein, A. (2005): BBN Pronoun Coreference and Entity Type Corpus. Linguistic Data Consortium, Philadelphia.
Zeldes, A., Ritz, J., Lüdeling, A., Chiarcos, C. (2009): ANNIS: A search tool for multi-layer annotated corpora. In Proceedings of Corpus Linguistics 2009, Liverpool, July 20-23, 2009.



Comparison and Evaluation of ontology extraction systems

Stefanie Reimers
University of Hamburg
E-mail: 4reimers@informatik.uni-hamburg.de

Abstract
This paper presents the results of an evaluation and comparison of the two semi-automatic, corpus-based ontology extraction systems OntoLT and Text2Onto. Both systems were applied to a German corpus and their outputs were evaluated in two steps. First, the Text2Onto ontology was evaluated against a gold standard, represented by a manually created ontology. Second, the automatically extracted ontologies of both systems were compared to each other. In addition, the usability of the tools is discussed in order to provide some hints for improving the design of future ontology extraction systems.
Keywords: ontology, ontology learning, ontology extraction, ontology evaluation

1. Introduction

In recent years, the range of applications for ontologies has grown considerably. They are no longer only part of the vision of the Semantic Web, but are also used in intelligent search engines, in information systems and in the field of model-based systems engineering. The need for ontologies therefore increases accordingly. However, the creation of ontologies usually involves a huge manual effort, so that this process remains very time- and cost-intensive. Existing editors like Protégé [1] support the work of ontology developers and make it more comfortable, but they can only remove a small part of the required effort. Hence, techniques which reduce the manual part of the process by employing automatic methods are desirable. Corpus-based ontology extraction tools seem to be a solution: they take a domain-specific text corpus as input and output a domain ontology. Text is an excellent data source, especially nowadays, because it is permanently updated and available on the web. Under ideal circumstances, the ontology extraction process would be fully automatic and produce a domain ontology of good quality. To date this remains unrealizable, because an important part of knowledge cannot be inferred from text corpora: commonsense knowledge. Consequently, semi-automatic extraction systems represent the maximal degree of support available for the ontology engineering process. Several tools have been developed and some are freely available on the web. But how well do they perform? In the end, they are only useful if they considerably reduce the manual effort compared to the traditional ontology engineering process. This implies, on the one hand, that a tool should be easy to use and, on the other hand, that the resulting ontology should be of good quality, comparable to a manually created one. Another interesting question is how the outputs of the systems differ if they are applied to the same corpus.

Several works on the evaluation of ontology extraction systems have been published during the last years, but none of them considered a gold standard evaluation against a manually created ontology. Furthermore, there has been no attempt using a German text corpus as data source. This work aims at exploring these missing aspects by determining how great the advantage of OntoLT [2] and Text2Onto [3] is compared to a manual ontology creation process. To this end, both systems were applied to the German text corpus of the Language Technology for eLearning project (LT4eL [4]), their outputs were compared to each other and, finally, the Text2Onto ontology was evaluated against the manually created LT4eL ontology.

Section 2 introduces the ontology extraction systems OntoLT and Text2Onto as well as the LT4eL corpus and the LT4eL ontology. Section 3 gives a short review of current studies dealing with the evaluation of tools of this kind. Section 4 deals with the actual evaluation of the systems and the produced ontologies.

[1] http://protege.stanford.edu/
[2] http://olp.dfki.de/OntoLT/OntoLT.htm
[3] http://code.google.com/p/text2onto
[4] http://www.let.uu.nl/lt4el

2. Presentation of the used systems and the data resources

The ontology extraction systems used were chosen because they are freely available and because they are able to process German texts.

2.1. OntoLT

OntoLT is a Java-based Protégé plug-in. Version 2.0, employed in this work, is exclusively compatible with Protégé 3.2, which is also freely available online. It takes as input a corpus of linguistically annotated texts in XML [5] format. There are no requirements for a specific XML format, because the user can customize the tool for various formats. This is done by changing the implemented XPath [6] expressions, which allow addressing specific linguistic elements (like sentences, noun phrases, head nouns, etc.). They are needed for the extraction process, which is performed via so-called mapping rules. Those rules determine which concepts, instances and relations will be automatically extracted. Some rules are already implemented, but the user also has the possibility to integrate new ones by using OntoLT's native precondition language. Rules consist of two parts: constraints and operators. If certain constraints are satisfied, one or more operators take effect. Operators can create concepts and concept properties as well as attach instances to existing concepts.

In this work, only the implemented rules were used. They specify that concepts will be created for all heads of noun phrases in the corpus. If there are adjectives which belong to the nouns, they are combined with the concept and result in a subconcept. Another rule effects the extraction of relations, which are inferred from the predicates – together with the subject and its direct objects – of sentences. After the application of the rules, the extracted concepts, relations and instances can be manipulated with the help of Protégé (Buitelaar et al., 2004).

[5] http://www.w3.org/standards/xml
[6] http://www.w3.org/TR/xpath

2.2. Text2Onto

Text2Onto is also a Java-based application and is realized as a standalone system. It requires the prior installation of GATE 4.0 [7] and WordNet 2.0 [8], which are both open source. The input consists of a corpus in text, HTML or PDF format. No linguistic preprocessing is required because the system provides its own preprocessing. Supported languages are English and, partially, also Spanish and German. The extraction process consists of several steps, each of which draws on different implemented algorithms. The user can choose between the algorithms or employ a combination of them. For example, there are three different methods for identifying concept candidates: rtf [9], tf-idf [10] and C/NC-value. The results of the algorithms are saved in a so-called probabilistic ontology model (POM). It consists of a set of instantiated modeling primitives, which are independent of a specific ontology representation language. Each instance gets a numerical value between 0 and 1 (computed by the algorithms), indicating the probability that it represents an element relevant for the ontology. The elements, together with their values, are then presented to the user, who is supported in the selection process by the assigned values. The instantiation of the primitives takes place by accessing the declarative definitions in the modeling primitive library (MPL). Modeling primitives are: concepts, subconcepts, instances and relations. Ontology writers are responsible for the translation of the POM into a specific ontology language like OWL [11] or RDFS [12] (Cimiano & Völker, 2005).

[7] http://gate.ac.uk/download/index.html
[8] http://wordnet.princeton.edu
[9] relative term frequency
[10] term frequency - inverse document frequency
[11] http://www.w3.org/TR/2004/REC-owl-features-20040210
[12] http://www.w3.org/TR/2004/REC-rdf-concepts-20040210
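To make the role of these measures concrete (this is an illustration of relative term frequency and tf-idf in general, not of Text2Onto's actual implementation), candidate terms can be scored and normalised to the interval [0, 1], mirroring the relevance values stored in the POM:

import math
from collections import Counter

def candidate_scores(documents):
    """documents: list of token lists (lower-cased lemmas of candidate terms)."""
    n_docs = len(documents)
    tf = Counter(t for doc in documents for t in doc)           # corpus frequency
    df = Counter(t for doc in documents for t in set(doc))      # document frequency
    total = sum(tf.values())

    rtf = {t: tf[t] / total for t in tf}                        # relative term frequency
    tfidf = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}   # tf-idf

    # Normalise both measures to [0, 1], as in a probabilistic ontology model.
    def normalise(scores):
        m = max(scores.values(), default=0.0) or 1.0
        return {t: s / m for t, s in scores.items()}

    return normalise(rtf), normalise(tfidf)

docs = [["excel", "tabelle", "zelle"], ["word", "text", "tabelle"], ["excel", "zelle"]]
rtf, tfidf = candidate_scores(docs)
print(sorted(tfidf.items(), key=lambda kv: -kv[1])[:3])

The C/NC-value measure and the disambiguation of subconcept relations require considerably more machinery than this sketch suggests.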

2.3. The LT4eL corpus

The corpus originates from the LT4eL project and consists of 69 German texts. They were selected by the project participants and belong to the domain Information Technology for End Users & eLearning. All texts are introductions to the use of programs (like Excel and Word), the internet and eLearning. The corpus comprises 69 files with on average 5732 words per file and a total of 395547 words. 752 different domain-relevant keywords were identified manually, all of which are covered by the LT4eL ontology. Ideally, an ontology automatically extracted on the basis of this corpus should also semantically cover all of these keywords.

The files of the corpus are available in two formats: in plain text format and in a linguistically annotated XML format, all encoded in UTF-8 [13]. The text files serve as input for Text2Onto, the XML files for OntoLT. The XML format was determined by the LT4eL members. Sentence structure, noun phrases and tokens as well as the corresponding lemmas, parts of speech and some morpho-syntactic information (person, number, gender, case) are annotated. A snippet of an annotated file is presented in Figure 1.

Figure 1: Sample linguistic annotation

The linguistic information is located in the values of the attributes of the token tags: base references the lemma, ctag the part of speech [14] and msd contains morpho-syntactic data. These are the aspects relevant for OntoLT.

The complete corpus of 69 files is used for the gold standard evaluation of the Text2Onto ontology. Unfortunately, not all files could be processed by OntoLT; the reason for this could not be determined during this work. It was therefore not possible to perform a gold standard evaluation of the OntoLT ontology, because the gold standard ontology was generated on the basis of the whole corpus, so that a comparison would be unfair. Instead, the OntoLT ontology was compared to a Text2Onto ontology extracted from a reduced form of the corpus. This reduced corpus consists of the files which could be processed by OntoLT; it contains 43 files, with on average 4760 words per file and a total of 204378 words (Mossel, 2007).

[13] Universal Character Set Transformation Format, 8-bit
[14] STTS (Stuttgart-Tübingen-TagSet)
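Since Figure 1 is not reproduced here, the following sketch indicates how the annotation just described could be exploited programmatically. Only the attribute names base, ctag and msd come from the description above; the element name tok and the overall document structure are assumptions, and the head-noun rule is a much simplified stand-in for OntoLT's mapping rules:

from collections import Counter
from lxml import etree

def noun_lemma_candidates(xml_path):
    """Collect lemmas of nouns as naive concept candidates from an annotated file."""
    tree = etree.parse(xml_path)
    counts = Counter()
    # 'tok' is an assumed element name for the token tags described in the text.
    for tok in tree.getroot().iter("tok"):
        lemma = tok.get("base")
        pos = tok.get("ctag", "")
        if lemma and pos.startswith("N"):   # STTS noun tags: NN, NE
            counts[lemma.lower()] += 1
    return counts

print(noun_lemma_candidates("lt4el_sample.xml").most_common(10))

OntoLT's real rules additionally combine adjectives with head nouns to form subconcepts and derive relations from predicates, which goes beyond this token-level sketch.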

2.4. The LT4eL ontology

The LT4eL ontology was created on the basis of the manually annotated keywords of the corpus. The project members modeled adequate concepts corresponding to those keywords. They also added further sub- and superconcepts (for example, if Notepad was identified as a concept, text editor and editor were also added as superconcepts). Finally, the ontology was connected to the upper ontology DOLCE UltraLite [15]. All in all, the ontology contains 1275 concepts – 1002 of them domain concepts – 1612 subconcept relations and 116 further relations, including 42 subrelations. Each concept comes with an English definition and a natural language representation. The ontology is available as an OWL file in XML representation (Mossel, 2007).

[15] http://wiki.loa-cnr.it/index.php/LoaWiki:DOLCE-UltraLite

3. State of the art

During the last two years, three studies (among others) have been published in the field of evaluating semi-automatic ontology extraction tools which used OntoLT and/or Text2Onto.

Hatala et al. (2009) tested the systems OntoGen [16] and Text2Onto mainly with respect to their usability, but also in relation to the quality of the produced ontologies. They used English corpora. 28 participants used the tools and answered questionnaires afterwards. The evaluation showed that the ontology extraction process via Text2Onto was accompanied by two central issues: 1) Due to a missing user guide, the participants were not able to foresee what kind of effects the different algorithms or their combination would have on the resulting ontology. 2) The integrated extraction methods identified an enormous number of concept candidates (several thousand), and the user was supposed to review all items according to their adequacy. Furthermore, the quality of the produced ontologies was categorized as very poor, because they were flat and not appropriate to represent the demanded domain knowledge. The OntoGen tool was judged as more comfortable and user-friendly than Text2Onto. The participants felt more involved in the extraction process and were satisfied with the well-structured ontologies, which included several relations (Hatala et al., 2009).

Ahrens published her study of OntoLT in 2010. She implemented her own extraction rules and applied them to an English corpus. Since the extracted ontology was very flat, additional superconcepts were inserted. In the end the ontology was adequate enough to represent the domain of the corpus. Ahrens concluded that OntoLT – though having some issues – would be a good support during the ontology engineering process (Ahrens, 2010).

Also in 2010, Park et al. published their work. They evaluated the systems OntoLT, Text2Onto, OntoBuilder [17] and DODDLE [18] by applying them to an English corpus. They took the usability as well as the quality of the produced ontologies into account. They considered OntoLT to be less user-friendly because the input corpus has to be linguistically preprocessed. In the end, Text2Onto was judged the best tool because of its flexibility, on the one hand with respect to the input format and on the other hand with respect to the applicability of different extraction algorithms (Park et al., 2010).

All presented studies treat the evaluation of ontology extraction tools. Nevertheless, one cannot infer predictions or expectations for the evaluation scenario in this work. The results are somewhat contradictory: Hatala et al. were not satisfied with Text2Onto, but Park et al. judged it the best of all tested systems. Ahrens classified OntoLT as helpful, though Park et al. criticized its user-friendliness. In addition, none of the studies includes a comparison between an automatically constructed and a manually created ontology. This fact and the application of a German corpus distinguish this work from all previously published ones.

[16] http://ontogen.ijs.si/
[17] http://ontobuilder.bitbucket.org/
[18] http://doddle-owl.sourceforge.net/en/

4. Evaluation

4.1. Gold Standard Evaluation

The Text2Onto ontology contained 10174 concepts, 13 subconcept relations and 945 instances. However, only 981 concepts, 3 subconcept relations and 18 instances made sense. Most of the extracted items were either not domain-relevant or consisted of strings which could not be interpreted (due to the only partially supported linguistic analysis for German texts). No further relations were identified. The ontology covers ca. 56% of all domain-relevant terms of the corpus. Altogether, its quality is not as high as that of the manually created, well-structured LT4eL ontology. The Text2Onto ontology includes only few hierarchical relations, so that it is more a list of concepts than a real ontology. Also, the coverage of the domain-relevant terms is very low. Most of the concepts are very specific, e.g. PowerPoint and Excel are included, but more general concepts like editor are missing (although they appear in the texts).
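The coverage and "sensible concept" figures reported here amount to a lexical comparison between extracted concept labels and the gold-standard term list. A simplified sketch of such a comparison (pure string matching, ignoring the concept hierarchy, which the actual evaluation also had to judge) is:

def lexical_evaluation(extracted_concepts, gold_terms):
    """Compare extracted concept labels with gold-standard domain terms."""
    extracted = {c.lower() for c in extracted_concepts}
    gold = {t.lower() for t in gold_terms}
    matched = extracted & gold
    precision = len(matched) / len(extracted) if extracted else 0.0
    coverage = len(matched) / len(gold) if gold else 0.0   # recall w.r.t. gold terms
    return precision, coverage

# Hypothetical toy input, not the actual evaluation data.
precision, coverage = lexical_evaluation(
    ["Excel", "PowerPoint", "Zelle", "xyzgarbage"],
    ["Excel", "Editor", "Zelle", "Tabelle"])
print(f"precision = {precision:.2f}, coverage = {coverage:.2f}")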

4.2. OntoLT vs. Text2Onto

The OntoLT ontology consisted of 3939 concepts, 2565 subconcept relations, 105 further relations and 0 instances. 829 concepts, 299 subconcept relations and 87 further relations were considered domain-relevant. The ontology covers ca. 58% of all domain-relevant terms of the corpus. Many relevant concepts are missing, because the system only extracted terms which appeared together with a modifier in the text.

The comparison of the two semi-automatically extracted ontologies showed that OntoLT had more problems detecting acronyms, whereas Text2Onto often failed to identify compounds. The degree of coverage of domain-relevant terms was similar.

It turns out that both systems need to be improved. Especially Text2Onto extracts an enormous number of irrelevant concept candidates, so that the user has to spend a lot of time deleting them. In general, the underlying algorithms are not adequate for identifying suitable items, because they are based on statistical methods: the domain relevance of a term need not depend on the number of its occurrences in a text corpus (Lame, 2004).

5. References

Ahrens, M. (2010): Semi-automatische Generierung einer OWL-Ontologie aus domänenspezifischen Texten. Diploma thesis.
Buitelaar, P., Olejnik, D., Sintek, M. (2004): A Protégé Plug-in for Ontology Extraction from Text. In: Proc. of the 1st European Semantic Web Symposium.
Cimiano, P., Völker, J. (2005): Text2Onto - A Framework for Ontology Learning and Data-driven Change Discovery.
Hatala, M., Siadaty, M., Gasevic, D., Jovanovic, J., Torniai, C. (2009): Utility of Ontology Extraction Tools in the Hands of Educators. In: Proc. of the ICSC, USA.
Lame, G. (2004): Using NLP Techniques to Identify Legal Ontology Components. In: Artificial Intelligence and Law 12, Nr. 4, pp. 379-396.
Mossel, E. (2007): Crosslingual Ontology-Based Document Retrieval. In: Proc. of the RANLP 2007.
Park, J., Cho, W., Rho, S. (2010): Evaluation of ontology extraction tools. In: Data & Knowledge Engineering 69, pp. 1043-1061.


System Presentations

New and future developments in EXMARaLDA

Thomas Schmidt, Kai Wörner, Hanna Hedeland, Timm Lehmberg
Hamburger Zentrum für Sprachkorpora (HZSK)
Max Brauer-Allee 60
D-22765 Hamburg
E-mail: thomas.schmidt@uni-hamburg.de, kai.wörner@uni-hamburg.de, hanna.hedeland@uni-hamburg.de, timm.lehmberg@uni-hamburg.de

Abstract
We present some recent and planned future developments in EXMARaLDA, a system for creating, managing, analysing and publishing spoken language corpora. The new functionality concerns the areas of transcription and annotation, corpus management, query mechanisms, interoperability and corpus deployment. Future work is planned in the areas of automatic annotation, standardisation and workflow management.
Keywords: annotation tools, corpora, spoken language, digital infrastructure

1. Introduction

EXMARaLDA [1] (Schmidt & Wörner, 2009) is a system for creating, managing, analysing and publishing spoken language corpora. It was developed at the Research Centre on Multilingualism (SFB 538) between 2000 and 2011. EXMARaLDA is based on a data model for time-aligned multi-layer annotations of audio or video data, following the general idea of the annotation graph framework (Bird & Liberman, 2001). It uses open standards (XML, Unicode) for data storage, is largely compatible with many other widely used media annotation tools (e.g. ELAN, Transcriber, CLAN) and can be used with all major operating systems (Windows, Macintosh, Linux). The principal software components of the system are a transcription editor (Partitur-Editor), a corpus management tool (Corpus Manager) and a KWIC concordancing tool (EXAKT).

EXMARaLDA has been used to construct the corpus collection of the Research Centre on Multilingualism, comprising 23 multilingual corpora of spoken language (see Hedeland et al., this volume). It is also used for several larger corpus projects outside Hamburg, such as the METU corpus of Spoken Turkish [2] (Middle East Technical University, Ankara, see Ruhi et al., this volume), the GEWISS corpus of spoken academic discourse [3] (Universities of Leipzig, Wroclaw and Aston), the Corpus of Northern German Language Variation [4] (SiN – Universities of Hamburg, Bielefeld, Frankfurt/O., Münster, Kiel and Potsdam) and the Corpus of Spoken Language in the Ruhrgebiet [5] (KgSR, University of Bochum).

This paper focuses on new functionality added or improved during the last two years and sketches some plans for the future development of the system.

[1] http://www.exmaralda.org
[2] http://std.metu.edu.tr/
[3] https://gewiss.uni-leipzig.de/de/
[4] http://sin.sign-lang.uni-hamburg.de/drupal/
[5] http://www.ruhr-uni-bochum.de/kgsr/

2. New and improved functionality

2.1. Transcription and annotation

The Partitur-Editor now provides additional support for the time alignment of transcription and audio and/or video in the form of a time-based visualisation of the media signal. Navigation in this visualisation is synchronised with navigation in the transcript, and the visualisation can be used to specify the temporal extent of new annotations and to modify the start and end points of existing annotations. This has turned out to be a way to significantly improve transcription speed and accuracy.

Similarly, systematic manual annotation with (closed) tag sets is now supported through a configurable annotation panel which allows the user to define one or several hierarchical tag sets, assign tags to keyboard shortcuts and link them to specific labels of annotation layers. It is also possible to specify dependencies between different tag sets so that the user is offered only those tags which are applicable in a certain context. Annotation speed and consistency can thus be improved considerably.

Figure 1: Annotation Panel in the Partitur-Editor

For large-scale standoff annotation of corpora, a separate tool – Sextant (Standoff EXMARaLDA Transcription Annotation Tool, Wörner, 2010) – was added to the system's tool suite. Sextant can be used to efficiently add standoff tags from closed tag sets to a segmented EXMARaLDA transcription. Annotations are stored as TEI-conformant feature structures which point into transcriptions via ID references. For further processing, the standoff annotation can also be integrated into the main file.

2.2. Corpus management

The Corpus Manager was supplemented with a set of operations to aid in the maintenance of transcriptions, recordings and metadata. This includes functionality for checking the structural consistency (e.g. temporal integrity of time-alignment, correct assignment of annotations to primary layers etc.), the validity of transcriptions with respect to a given transcription convention, as well as the completeness and consistency of metadata descriptions. Furthermore, a set of analysis functions operating on a corpus as a whole was added. Users can now generate and manipulate global type/token and frequency lists for a given corpus, perform global search and replace routines or generate corpus statistics according to different parameters. These new features are intended to facilitate both corpus construction and corpus use.
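As a rough illustration of the corpus-wide statistics mentioned above, a global frequency list and type/token ratio can be computed from the word tokens alone; the sketch below operates on plain word lists rather than on EXMARaLDA's actual data model:

from collections import Counter

def corpus_statistics(transcription_tokens):
    """transcription_tokens: dict mapping transcription names to word lists."""
    all_words = [w.lower() for words in transcription_tokens.values() for w in words]
    freq = Counter(all_words)
    tokens = len(all_words)
    types = len(freq)
    ttr = types / tokens if tokens else 0.0   # type/token ratio
    return freq, tokens, types, ttr

freq, tokens, types, ttr = corpus_statistics({
    "t1": ["also", "äh", "das", "ist", "gut"],
    "t2": ["das", "ist", "äh", "schön"],
})
print(freq.most_common(3), tokens, types, round(ttr, 2))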

2.3. Query mechanisms

For the query tool EXAKT, several new features were added to support the user in formulating complex queries to a corpus.

A Levenshtein calculation was made available which selects from a given list of words all entries which are sufficiently similar to a form selected by the user. This can help to minimize the risk that (potentially unpredictable) variants – as are common in spoken language corpora – are accidentally overlooked in queries.

Figure 2: Word list with Levenshtein functionality in EXAKT
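The similarity measure behind this feature is the standard Levenshtein (edit) distance. A minimal sketch of how a word list might be filtered for forms close to a user-selected word is given below; the distance threshold and its exact handling in EXAKT are assumptions:

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similar_forms(word_list, selected, max_distance=2):
    return [w for w in word_list if levenshtein(w.lower(), selected.lower()) <= max_distance]

print(similar_forms(["nichts", "nix", "nischt", "nicht", "etwas"], "nicht"))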

A regular expression library can now be used to store and retrieve common queries. This is meant mainly as a help for those users who are not experts in the design of formal queries.

Through an extension of the original KWIC functionality, EXAKT is now also able to carry out queries across two or more annotation layers. This is achieved by adding one or more so-called annotation columns in which annotation data from a specified annotation level overlapping with the existing search results are added to the concordance. The type of overlap between annotations can be specified as illustrated in Figure 3.


Figure 3: Specifying the overlap type for a multilevel search in EXAKT

2.4. Interoperability

Much work has been invested to further improve and optimise EXMARaLDA's compatibility with other widely used transcription and annotation tools. Wizards for importing entire corpora from Transcriber, FOLKER, CLAN and ELAN were integrated into EXAKT, thereby considerably extending the tool's area of application.

Moreover, a proposal for a spoken language transcription standard based on the P5 version of the TEI guidelines was formulated (Schmidt, 2011), and a droplet application (TEI-Drop) was added to the EXMARaLDA toolset which enables users to easily transform Transcriber, FOLKER, CLAN, ELAN or EXMARaLDA files into this TEI-conformant format.

Figure 4: Screenshot of TEI-Drop

2.5. Corpus deployment

Completed EXMARaLDA corpora can now also be made available (i.e. queried) via a relational database with EXAKT. Compared to the deployment in the form of individual XML files which are then queried either locally or via HTTP with EXAKT, this method not only facilitates data access, but also considerably improves query performance (by a factor of about 10 for smaller corpora, probably more for larger corpora) and allows for a more fine-grained access management. Furthermore, making data available in this way is also a prerequisite for integrating EXMARaLDA data into evolving distributed infrastructures like CLARIN.

With the general availability of HTML5, methods for visualizing corpus data for web presentations could also be simplified and improved considerably. The integration of transcription text and underlying audio or video recording no longer depends on Flash technology, but can be efficiently realised with standard browser technology.

3. Future work

With the end of the maximum funding period of the Research Centre on Multilingualism in June 2011, EXMARaLDA's original context of development has also ceased to exist. Although the system is now in a stable state and should remain usable for quite some time with some minimal maintenance work, we still see much potential for future development in at least three areas.

3.1. Automatic annotation

Additional manual and automatic annotation methods are required in order to make spoken language corpora more useful for corpus-linguistic research. We have consequently started to explore the application of methods developed for written language, such as automatic part-of-speech tagging or lemmatisation, to EXMARaLDA corpora.

First tests were carried out on the Hamburg Map Task Corpus (HAMATAC, Hedeland & Schmidt, 2012) with TreeTagger (Schmid, 1995), which was integrated into EXMARaLDA via the TT4J interface (Eckart de Castilho et al., 2009). HAMATAC was POS-tagged and lemmatised with the default German parameter file, trained on written newspaper texts. The data were first tokenized using EXMARaLDA's segmentation functionality, which segments and distinguishes words, punctuation, pauses and non-phonological segments. Only words and punctuation were fed as input into the tagger, in the sequence in which they occur in the transcription. The tagging results were saved as EXMARaLDA standoff annotation files which can be further processed in the Sextant tool (see above). A student assistant was instructed to manually check and correct all POS tags. An evaluation shows that roughly 80% of POS tags were assigned correctly. The error rate is thus considerably higher than for the best results which can be obtained on written texts (about 97% correct tags). By far the most tagging errors, however, occurred with word forms which are specific to spoken language, such as hesitation markers ("äh", "ähm"), interjections and incomplete forms (cut-off words). Since especially the former are highly frequent but very limited in form (the three forms äh, ähm and hm account for about half of the tagging errors), we expect a retraining of the TreeTagger parameter file on the corrected data to lead to a much lower error rate.
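The evaluation just described boils down to comparing the automatically assigned tags with the manually corrected ones and inspecting which word forms attract the errors. A minimal sketch, assuming the two annotations are available as parallel lists of (word, tag) pairs (which is not EXMARaLDA's actual standoff format), could be:

from collections import Counter

def evaluate_tagging(auto, gold):
    """auto, gold: parallel lists of (word, tag) pairs for the same tokens."""
    assert len(auto) == len(gold)
    errors = Counter()
    correct = 0
    for (word, auto_tag), (_, gold_tag) in zip(auto, gold):
        if auto_tag == gold_tag:
            correct += 1
        else:
            errors[word.lower()] += 1
    accuracy = correct / len(gold) if gold else 0.0
    return accuracy, errors

# Toy example; the tags for hesitation markers are illustrative assumptions.
auto = [("äh", "NN"), ("ich", "PPER"), ("gehe", "VVFIN")]
gold = [("äh", "ITJ"), ("ich", "PPER"), ("gehe", "VVFIN")]
accuracy, errors = evaluate_tagging(auto, gold)
print(accuracy, errors.most_common(3))

Ranking the error counts by word form is what reveals the concentration of errors on äh, ähm and hm reported above.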

3.2. Standardisation

Further work in the standardisation of data models, metadata descriptions, file formats and transcription conventions is needed in order to integrate spoken language data on an equal footing with written data into the language resource landscape. EXMARaLDA, as one of the most interoperable systems of its kind, already provides a solid basis for developing and establishing such standards. Future work in this area should attempt to consolidate this basis with more general approaches like the guidelines of the Text Encoding Initiative, standardisation efforts within the ISO framework and emerging standards for digital infrastructures.

3.3. Workflow management

As we survey, train and support users in constructing and analysing spoken language corpora with EXMARaLDA, we observe how important it is to organise the tools' functionalities into an efficient workflow. Right now, the EXMARaLDA tools operate in a standalone fashion on local file systems, leaving many important aspects of the workflow (e.g. version control, consistency checking etc.) in the users' responsibility. A tight integration of the tools with a repository solution may make it much easier, especially for larger projects, to organise their workflows and to construct and publish their corpora in a maximally efficient and effective manner. We plan to explore this possibility further in the follow-up projects at the Hamburg Centre for Language Corpora (HZSK). [6]

4. Acknowledgements

Work on EXMARaLDA was funded by the University of Hamburg and by grants from the Deutsche Forschungsgemeinschaft (DFG).

5. References

Bird, S., Liberman, M. (2001): A formal framework for linguistic annotation. In: Speech Communication (33), pp. 23-60.
Eckart de Castilho, R., Holtz, M., Teich, E. (2009): Computational support for corpus analysis work flows: The case of integrating automatic and manual annotations. In: Linguistic Processing Pipelines Workshop at GSCL 2009 - Book of Abstracts (electronic proceedings), October 2009.
Hedeland, H., Schmidt, T. (2012): Technological and methodological challenges in creating, annotating and sharing a learner corpus of spoken German. To appear in: Schmidt, T., Wörner, K.: Multilingual Corpora and Multilingual Corpus Analysis. To appear as part of the series 'Hamburg Studies in Multilingualism' (HSM). Amsterdam: Benjamins.
Schmid, H. (1995): Improvements in Part-of-Speech Tagging with an Application to German. Proceedings of the ACL SIGDAT-Workshop. March 1995.
Schmidt, T., Wörner, K. (2009): EXMARaLDA - Creating, analysing and sharing spoken language corpora for pragmatic research. In: Pragmatics (19:4), pp. 565-582.
Schmidt, T. (2011): A TEI-based approach to standardising spoken language transcription. In: Journal of the Text Encoding Initiative (1).
Wörner, K. (2010): Werkzeuge zur flachen Annotation von Transkriptionen gesprochener Sprache. PhD thesis, Universität Bielefeld, http://bieson.ub.uni-bielefeld.de/volltexte/2010/1669/.

[6] http://www.corpora.uni-hamburg.de



The VLC Language Index

Dirk Schäfer, Jürgen Handke
Institut für Anglistik und Amerikanistik, Philipps-Universität Marburg
Wilhelm-Röpke-Straße 6D
E-mail: {dirk.schaefer,handke}@staff.uni-marburg.de

Abstract
The Language Index is a collection of audio data from the languages of the world. As part of the online learning platform "Virtual Linguistics Campus", the Language Index presents speech recordings in a standardised form, together with typological information, using web technologies, so that they can be used for analysis, for example in teaching.
Keywords: audio corpus, typology, web

1. Overview

The Language Index, part of the online learning platform "Virtual Linguistics Campus", is a collection of structured audio data from the languages of the world. In a system demonstration we show how the data are presented and how researchers can make use of the available audio data. The remainder of this article describes the data format for the speech recordings and the user interfaces.

2. Creation of speech recordings

The speech recordings constitute a parallel corpus, since every speaker produced the same words, phrases and sentences. For this purpose there is a collection of standardised data sheets, which is extended whenever a new language is added. For some languages several slightly diverging data sheets exist, because the speakers translated the data according to their regional dialect. At present we have data sheets for 110 languages and regional dialects, as well as recordings of 850 speakers.

2.1. Procedure for ensuring the quality of the recordings

To ensure the quality of the speech recordings, the following procedure has proven successful:
a. The speaker checks the existing data sheet for his or her language. If no data sheet is available for the language, the speaker translates the keywords and sentences.
b. If the language has no writing system, the data sheets are created on the basis of the IPA alphabet in interaction with the speaker.
c. The speaker reads out the keywords and sentences at normal speed. The recording is made on site with a digital recording device, over the web with the help of Skype, or with a headset on the speaker's own computer.
d. The recorded speech data are post-processed and provided with cue points.
e. The complete recording, together with its transcription and transliteration, is presented to the speaker for checking.
f. The recording is made available via the VLC Language Index.

Figure 1: User interface


3. User Interface
Representing such audio-based parallel corpora poses particular difficulties. For example, the interface must be easy to use, giving quick access to all desired data without a learning period. Moreover, it lies in the nature of a parallel corpus that comparisons must be possible.
The Language Index is a web-based application with a large proportion of Flash and Flex components. Since 2006 the Google Maps API has been used to display language data on maps. As the amount of data grew, the use of a database became necessary, and special techniques had to be employed to ensure efficient communication between PHP and the Flex-based user interfaces.
The audio data in the VLC Language Index can be accessed in several ways:
• A list of speech recordings, sorted by language.
• A Google map in which each recording is represented as a pin; clicking a pin opens a pop-up window.
• A filter interface in which specific syntactic, morphological, phonological and other parameters can be set.

4. Special Features
Additional special features can be realised with the data of the parallel corpus. With the “Cognate Comparison” tool, users can choose a cognate and compare its realisations acoustically by selecting pins on a map or entries in a list.
“Acoustic Vowel Charts” visualise the frequencies of the same vowels as produced by different speakers.

5. Outlook
The VLC Language Index can be used in a variety of scenarios, including teaching, theses, and research at Master's and PhD level.
The most recent feature is an mp3 download service with a bibliographic referencing system for all language data, so that the data can easily be used in other tools and attached to scholarly work based on them.

6. Weblink
Virtual Linguistics Campus (VLC):
http://www.linguistics-online.de

Figure 2: Acoustic Vowel Chart


Multilingual Resources and Multilingual Applications - System Presentations<br />

Topological Fields, Constituents and Coreference:<br />

A New Multi-layer Architecture for TüBa-D/Z<br />

Thomas Krause*, Julia Ritz+, Amir Zeldes*, Florian Zipser*‡<br />

* Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin
+ Universität Potsdam, Karl-Liebknecht-Straße 24-25, 14476 Potsdam
‡ INRIA

E-mail: krause@informatik.hu-berlin.de, jritz@uni-potsdam.de, amir.zeldes@rz.hu-berlin.de, f.zipser@gmx.de<br />

Abstract<br />

This presentation is concerned with a new multi-layer representation of the German TüBa-D/Z Treebank, which allows users to<br />

conveniently query and visualize annotations for syntactic constituents, topological fields and coreference either separately or in<br />

conjunction.<br />

Keywords: corpus search tool, multi-layer annotation, treebank, German<br />

1. The Original Corpus<br />

The TüBa-D/Z corpus (Tübinger Baumbank des<br />

Deutschen / Zeitungskorpus, Telljohann et al., 2003)<br />

was already at its release, in a sense, a multi-layer<br />

corpus, since it combined information about constituent<br />

syntax with topological field annotation. However, the<br />

corpus was originally constructed using the TigerXML<br />

format (Mengel & Lezius, 2000), which only allowed<br />

for one type of internal node: the syntactic category,<br />

which was used to express both types of annotation.<br />

Figure 1 shows a representation of a sentence from the<br />

corpus in the TigerSearch tool (Lezius, 2002).<br />

Though the layers of topological and constituent syntax<br />

annotations are in principle separate, users must take<br />

into account the intervening topological nodes when<br />

formulating syntactic queries, and vice versa.<br />

With the addition of coreference information in<br />

Version 5 of the corpus (Hinrichs et al., 2004), which was created using the MMAX tool (see Müller &

Strube, 2006 for the latest version), the facilities of the<br />

TigerSearch software could no longer be used to search<br />

through all annotations of the different layers (syntax,<br />

coreference and topological fields), since TigerSearch<br />

indexes only individual sentences, whereas coreference<br />

annotations require a full document context.<br />

Figure 1: TüBa-D/Z Version 3 in TigerSearch. Topological fields (VF: Vorfeld, LK:<br />

linke Klammer, MF: Mittelfeld) are represented as syntactic categories.<br />


2. The New Architecture<br />

Our goal is to make all existing layers of annotation<br />

available for simultaneous search, but in a way that<br />

allows each one to be searched separately without<br />

intervening nodes from other annotation layers. For this<br />

purpose, we have converted the latest Version 6 of<br />

TüBa-D/Z to the multi-layer XML format PAULA<br />

(Dipper, 2005). We then converted and edited the corpus<br />

using the SaltNPepper converter framework (Zipser &<br />

Romary, 2010), which gives us an in-memory<br />

representation of the corpus that can be manipulated<br />

more easily. During this step, we disentangled the<br />

syntactic, topological and coreference annotations. The<br />

resulting corpus was then exported and fed into ANNIS<br />

(Zeldes et al., 2009), a corpus search and visualization<br />

tool for multi-layer corpora. The resulting annotation<br />

layers are visualized in Figure 2, which shows a separate<br />

syntax tree (without topological fields), spans<br />

representing fields, and a full document view for the<br />

coreference annotation in which coreferent expressions<br />

are highlighted in the same color.<br />

Figure 2: TüBa-D/Z in ANNIS with separate<br />

annotation layers.<br />

3. Corpus Search<br />

Using the new architecture and the ANNIS Query<br />

Language (AQL) 1 it becomes possible to query syntax,<br />

topological fields and coreference more easily and<br />

intuitively, both simultaneously and separately. In the<br />

following, we will discuss three example applications<br />

briefly: one investigating topological fields only, one<br />

combining all three annotation layers, and one extracting<br />

syntactic frames with the help of the exporter<br />

functionality in ANNIS.<br />

3.1. Application 1: Topological Fields<br />

As a simple example of the easily accessible topological<br />

field information, we can consider the following query,<br />

which retrieves preverbal fields (Vorfeld, VF) before the left sentence bracket of main clauses that contain two complementizer fields (C) one after the other (the operator ‘>’ represents dominance, and ‘.*’

represents indirect precedence, the numbers ‘#1’ etc.<br />

refer to the nodes declared at the beginning of the query,<br />

in order):<br />

(1) field="VF" & field="C" & field="C"<br />

& #1 > #2 & #1 > #3 & #2 .* #3<br />

Figure 3 shows an example result with its separate field<br />

grid and syntax tree, for the sentence: Daß und wie<br />

Demokratie funktionieren kann, hat der zähe Kampf der<br />

Frauen um den Paragraphen 218 gezeigt ‘The women’s<br />

tenacious fight for paragraph 218 has shown that, and<br />

how, democracy can work.’ By directly querying the<br />

topological fields we can avoid having to consider<br />

possible syntactic nodes intervening between VF and C.<br />

3.2. Application 2: Coreference, Syntax and Fields<br />

As a next application, let us first search for objects that are cataphors,

but not reflexive pronouns. In TüBa-D/Z, cataphors are<br />

linked to their subsequents via the ‘cataphoric’ relation.<br />

The AQL expression is given in (2a): there is a node –<br />

any node – number 1 and another node, number 2, and<br />

node 1 points to node 2 using the ‘cataphoric’ relation.<br />

(2a.) node & node & #1 ->cataphoric #2<br />

We now add syntactic constraints: the cataphor, node 1,<br />

shall be an object (OA or OD, i.e. accusative or dative).<br />

In TüBa-D/Z, the grammatical function of a noun phrase (NX) is specified as a label of the dominance edge connecting this NX and its parent.

1 A tutorial of the query language can be found at http://www.sfb632.uni-potsdam.de/~d1/annis/.

Figure 3: Separate fields and syntactic phrases for VF with two C positions

(2b.) node & node & #1 ->cataphoric<br />

#2 & cat="NX" & #3 _=_ #1 & cat & #4<br />

>[func=/O[AD]/] #3 & pos!="PRF" & #5<br />

_=_ #1<br />

(read: there is a node, number 3, of category NX, and<br />

node 3 covers the same tokens as node 1, and there is a<br />

node of any category, and this node number 4 dominates<br />

‘>’ node number 3, with the edge label ‘func’ (function)<br />

= OA or OD. We use regular expressions to specify the<br />

label. To exclude reflexive pronouns (part of speech<br />

‘PRF’), we use negation (‘!=’)). The search yields 51<br />

results, with scalable contexts and color-highlighting of<br />

the matches (cataphors and their subsequents).<br />

Secondly, let us query for noun phrases in the VF with a definite determiner and their antecedents in the left-neighbouring sentences.

(2c.) field="VF" & cat="NX" & #1 _=_<br />

#2 & pos="ART" & #2 > #3 &<br />

tok=/[Dd]../ & #3 _=_ #4 & node & #5<br />

_=_ #2 & node & #5 ->coreferential #6<br />

& cat="TOP" & #7 _i_ #1 & cat="TOP" &<br />

#8 _i_ #6 & #8 . #7<br />

(‘>*’ represents indirect dominance)<br />

This query yields 766 results. Using the match counts of (2c.) and similar queries, we can create a contingency table of definite vs. pronominal VF-constituents and whether their respective antecedents occur in the left-neighbouring sentence (‘close’) or more distantly: 43% of the definites and 61% of the pronouns in VF have a ‘close’ antecedent – a difference that is highly significant (χ² = 142.72, p < .001).

3.3. Application 3: Syntactic Frames
Finally, we extract verb-object frames with the help of the exporter functionality:

(3) … & #1 > #2 & #2 > #3 & #1 >[func="OA"] #4 & #4 >[func="HD"] #5

This query searches for a verbal phrase dominated by a<br />

clause (SIMPX) and dominating the lemma schreiben,<br />

where the same clause also dominates a nominal phrase<br />

with the function OA, which in turn dominates its head<br />

noun (pos="NN", func="HD"). Using the built-in<br />

WEKA exporter, we can produce a list of all the nominal<br />

object arguments of a verb much like in a dependency<br />

treebank, along with the form and part-of-speech of the<br />

relevant verb, as shown in Figure 4. Note that both finite

and non-finite clauses are found, as well as verb-second<br />

and verb-final clauses, which now all have similar tree<br />

structures regardless of topological fields.<br />

'271192', 'wer immer seine Texte schreibt', 'SIMPX',<br />

'271186', 'Texte', 'apm', 'NN', '271187', 'seine Texte',<br />

'NX', '271189', 'schreibt', '3sis', 'VVFIN', '271190',<br />

'schreibt', 'VXFIN'<br />

'1134826', 'Songs schreiben', 'SIMPX', '1134820',<br />

'Songs', 'apm', 'NN', '1134821', 'Songs', 'NX',<br />

'1134823', 'schreiben', '--', 'VVINF', '1134824',<br />

'schreiben', 'VXINF'<br />

'1526561', 'Ich schreibe Satire', 'SIMPX', '1519602',

'Satire', 'asf', 'NN', '1526559', 'Satire', 'NX',<br />

'1519599', 'schreibe', '1sis', 'VVFIN', '1526557',<br />

'schreibe', 'VXFIN'<br />

Figure 4: Excerpt of results from the WEKA Exporter for query (3).

The exporter gives the values of all annotations for the<br />

nodes we have searched for, in order, as well as the text<br />

covered by those nodes. We can therefore easily get<br />


tabular access to the contents of the clause (e.g. Songs<br />

schreiben ‘to write songs’), the object (Songs), the form<br />

and part-of-speech of the verb (schreiben, VVINF),<br />

morphological annotation (apm for a plural masculine<br />

noun in the accusative), etc.<br />
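
To give an impression of how such an export can be processed further, the following is a minimal sketch in Python, assuming the exporter output has been saved as a plain text file (hypothetically named weka_export.txt) with one comma-separated, single-quoted record per match, as in the excerpt above; the file name and column positions are assumptions for illustration, not part of ANNIS.

import csv
from collections import Counter

# Count (verb form, object head) pairs from a saved ANNIS WEKA-style export.
# Assumptions for illustration: one record per line, fields separated by ", "
# and quoted with single quotes; the object head noun is the 5th field and
# the verb form the 12th field, mirroring the excerpt in Figure 4.
pair_counts = Counter()

with open("weka_export.txt", encoding="utf-8") as export_file:
    reader = csv.reader(export_file, delimiter=",", quotechar="'",
                        skipinitialspace=True)
    for record in reader:
        if len(record) < 12:
            continue  # skip truncated or malformed lines
        object_head = record[4]   # e.g. 'Texte', 'Songs', 'Satire'
        verb_form = record[11]    # e.g. 'schreibt', 'schreiben', 'schreibe'
        pair_counts[(verb_form, object_head)] += 1

# Print the ten most frequent verb-object combinations.
for (verb, noun), count in pair_counts.most_common(10):
    print(verb, noun, count, sep="\t")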


4. Conclusion<br />

We have suggested an advanced, layer-separated<br />

representation architecture for TüBa-D/Z. This<br />

architecture facilitates corpus querying and exploitation.<br />

By means of examples, we have shown that the corpus<br />

search tool ANNIS allows for a qualitative and<br />

quantitative study of the interplay of syntactic,<br />

topological and information structural factors annotated<br />

in TüBa-D/Z.<br />

5. References<br />

Dipper, S. (2005): XML-based Stand-off Representation<br />

and Exploitation of Multi-Level Linguistic<br />

Annotation. Proceedings of Berliner XML Tage 2005<br />

(BXML 2005). Berlin, Germany, pp. 39-50.<br />

Hinrichs, E. W., Kübler, S., Naumann, K., Telljohann,<br />

H., Trushkina, J. (2004): Recent developments in<br />

linguistic annotations of the TüBa-D/Z treebank.<br />

Proceedings of the Third Workshop on Treebanks and<br />

Linguistic Theories.<br />

Lezius, W. (2002): Ein Suchwerkzeug für syntaktisch annotierte Textkorpora. PhD Thesis, Institut für maschinelle Sprachverarbeitung Stuttgart.

Mengel, A., Lezius, W. (2000): An XML-based<br />

encoding format for syntactically annotated corpora.<br />

Proceedings of the Second International Conference<br />

on Language Resources and Engineering (LREC<br />

2000). Athens.<br />

Müller, C., Strube, M. (2006): Multi-Level Annotation<br />

of Linguistic Data with MMAX2. In: Braun, Sabine,<br />

Kohn, Kurt & Mukherjee, Joybrato (eds.), Corpus<br />

Technology and Language Pedagogy. Frankfurt: Peter<br />

Lang, pp. 197-214.<br />

Telljohann, H., Hinrichs, E. W., Kübler, S. (2003):<br />

Stylebook for the Tübingen Treebank of Written<br />

German.<br />

Zeldes, A., Ritz, J., Lüdeling, A., Chiarcos, C. (2009):<br />

ANNIS: A Search Tool for Multi-Layer Annotated<br />

Corpora. Proceedings of Corpus Linguistics 2009,<br />

Liverpool, July 20-23, 2009.<br />

Zipser, F., Romary, L. (2010): A model oriented<br />

approach to the mapping of annotation formats using<br />

standards. Proceedings of the Workshop on Language<br />

Resource and Language Technology Standards,<br />

LREC 2010. Malta, pp. 7-18.


Multilingual Resources and Multilingual Applications - System Presentations<br />

MT Server Land Translation Services<br />

Christian Federmann<br />

DFKI – Language Technology Lab<br />

Stuhlsatzenhausweg 3, D-66123 Saarbrücken, GERMANY<br />

E-mail: cfedermann@dfki.de<br />

Abstract<br />

We demonstrate MT Server Land, an open-source architecture for machine translation services that is developed by the MT group at<br />

DFKI. The system can flexibly be extended and allows lay users to make use of MT technology within a web browser or by using<br />

simple HTTP POST requests from custom applications. A central broker server collects and distributes translation requests to several<br />

worker servers that create the actual translations. User access is realized via a fast and easy-to-use web interface or through an<br />

XML-RPC-based API that allows integrating translation services into external applications. We have implemented worker servers<br />

for several existing translation systems such as the Moses SMT decoder or the Lucy RBMT engine. We also show how other,<br />

web-based translation tools such as Google Translate can be integrated into the MT Server Land application. The source code is<br />

published under an open BSD-style license and is freely available from GitHub.<br />

Keywords: Machine Translation, Web Service, Translation Framework, Open-Source Tool<br />

1. Introduction<br />

Machine translation (MT) is a field of active research<br />

with lots of different MT systems being built for shared<br />

tasks and experiments. The step from the research<br />

community towards real-world application of available<br />

technology requires easy-to-use MT services that are<br />

available via the Internet and allow collecting feedback<br />

and criticism from real users. Such applications are an important means of increasing the visibility of MT research and of helping to shape the multilingual web. Applications such

as Google Translate 1<br />

allow lay users to quickly and<br />

effortlessly create translations of texts or even complete<br />

web pages; the continued success of such services shows<br />

the potential that lies in usable machine translation,<br />

something both developers and researchers should be<br />

targeting.<br />

In the context of ongoing MT research projects at DFKI's<br />

language technology lab, we have decided to design and<br />

implement such a translation application. We have<br />

released the source code under a permissive open-source<br />

license and hope that it becomes a useful tool for the<br />

MT community. A screenshot of the MT Server Land<br />

application is shown in Figure 1.<br />

1 http://translate.google.com<br />

Figure 1: Screenshot of MT Server Land<br />

2. System Architecture<br />

The system consists of two different layers: first, we have<br />

the so-called broker server, which handles all direct requests from end users as well as API calls. Second, we

have a layer of worker servers, each implementing some<br />

sort of machine translation functionality. All<br />

communication between users and workers is channeled<br />

through the broker server that acts as a central “proxy”<br />

server. An overview of the system architecture is given in<br />

Figure 2.<br />

For users, both broker and workers “constitute” the MT<br />


Server Land system; the broker server is the “visible”<br />

part of the application while the various worker servers<br />

perform the “invisible” translation work. The system has<br />

been designed to make it easier for lay users to access and<br />

use machine translation technology without the need to<br />

fully dive into the complexities of current MT research.<br />

Within MT Server Land, translation functionality is made available by starting up suitable worker server instances

for a specific MT engine. The startup process for workers<br />

is standardized using some easy-to-understand<br />

parameters for, e.g., the hostname/IP address or port<br />

number of the worker server process. All “low-level”<br />

work (de-/serialization, transfer of requests/results, etc.)<br />

is handled by the worker server instances. Of course, it is<br />

possible to design and create new worker server instances,<br />

e.g., to demonstrate new features in a research translation<br />

system or to integrate other MT systems.<br />

Human users connect to the system using any modern<br />

web browser; API access can be implemented using<br />

HTTP POST and/or XML-RPC requests. It would be<br />

relatively easy to extend the current API interface to<br />

support other protocols such as SOAP or REST. By<br />

design, all internal method calls that connect to the<br />

worker layer have to be implemented with XML-RPC. In<br />

order to prevent encoding problems with the input text,<br />

we send and receive all data encoded as Base64 strings

between broker and workers; the broker server takes care<br />

of the necessary conversion steps. Translation requests<br />

are converted into serialized binary strings using Google

protocol buffer compilation.<br />

Figure 2: Architecture overview of MT Server Land<br />
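
As a rough illustration of this broker-to-worker communication, here is a minimal Python sketch assuming a worker that exposes an XML-RPC method called translate which accepts and returns Base64-encoded text; the host, port and method name are assumptions for illustration, not the actual MT Server Land interface.

import base64
import xmlrpc.client

# Hypothetical worker endpoint; host, port and method name are assumptions.
worker = xmlrpc.client.ServerProxy("http://localhost:8001")

def request_translation(source_text):
    """Send Base64-encoded text to the worker and decode its Base64 reply."""
    encoded_source = base64.b64encode(source_text.encode("utf-8")).decode("ascii")
    encoded_result = worker.translate(encoded_source)  # assumed method name
    return base64.b64decode(encoded_result).decode("utf-8")

if __name__ == "__main__":
    print(request_translation("Das ist ein Test."))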


2.1. Broker Server<br />

The broker server has been implemented using the<br />

django web framework 2<br />

, which takes care of low-level<br />

tasks and allows for rapid development and clean design<br />

of components. We have used the framework for other<br />

project work before and think it is well suited to the task.<br />

The framework itself is available under an open-source<br />

BSD-license.<br />

2.1.1. Object Models<br />

The broker server implements two main django models<br />

that we describe subsequently. Please note that we have<br />

also developed additional object models, e.g. for quota<br />

management. See the source code for more information.<br />

• WorkerServer stores all information related to a remote worker server. This includes source and target language, the respective hostname and port address, as well as a name and a short description. Available worker servers within MT Server Land can be constrained to function for specific user and/or API accounts only.
• TranslationRequest models a translation job and related information such as the chosen worker server, the source text and the assigned request id. Furthermore, we store additional metadata information. Once the translation result has been obtained from the translation worker server, it is also stored within the instance so that it can be removed from the worker server’s job queue (see the sketch below).
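
A minimal sketch of how these two models might look in django is given below; the field names and types are illustrative assumptions and are simplified with respect to the actual source code, which also covers quota management and further metadata.

from django.db import models

class WorkerServer(models.Model):
    """A remote worker server offering one translation direction (sketch)."""
    name = models.CharField(max_length=100)
    description = models.TextField(blank=True)
    hostname = models.CharField(max_length=200)
    port = models.PositiveIntegerField()
    source_language = models.CharField(max_length=10)  # e.g. "de"
    target_language = models.CharField(max_length=10)  # e.g. "en"

class TranslationRequest(models.Model):
    """A translation job handled by the broker server (sketch)."""
    request_id = models.CharField(max_length=32, unique=True)
    worker = models.ForeignKey(WorkerServer, on_delete=models.CASCADE)
    source_text = models.TextField()
    result_text = models.TextField(blank=True)  # filled once the worker is done
    created = models.DateTimeField(auto_now_add=True)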

2.1.2. User Interface<br />

We developed a browser-based web interface to access<br />

and use the MT Server Land application. End users first<br />

have to authenticate before they can access their<br />

dashboard that lists all known translation requests for<br />

the current user and also allows creating new requests.<br />

When creating a new translation request, the user may<br />

choose which translation worker server should be used to<br />

generate the translation for the chosen language pair. We<br />

use a validation step to ensure that the respective worker<br />

server supports the selected language pair and is currently<br />

able to receive new translation requests from the broker<br />

server; after successful validation, the new translation<br />

request is sent to the worker server that starts processing<br />

the given source text.<br />

2 http://www.djangoproject.com/


Multilingual Resources and Multilingual Applications - System Presentations<br />

Once the chosen worker server has completed a<br />

translation request, the result is transferred to (and also<br />

cached by) the object instance inside the broker server's<br />

data storage. The user can view the result within the<br />

dashboard or download the file to a local hard disk.<br />

Translation requests can be deleted at any time,<br />

effectively terminating the corresponding thread within<br />

the connected worker server (if the translation is still<br />

running). If an error occurs during translation, the system<br />

will recognize this and deactivate the respective<br />

translation request.

2.1.3. API Interface<br />

In parallel to the browser interface, we have designed and<br />

implemented an API that allows external applications to connect to the MT functionality of our system using HTTP POST requests. Again, we first require

authentication before any machine translation can be<br />

used. We provide methods to list all requests for the<br />

current “user” (i.e. the application account) and to create,<br />

download, or delete translation requests. Extension to<br />

REST or SOAP protocols is possible.<br />
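
As a usage illustration, the following Python sketch shows how an application might create a translation request via HTTP POST using the requests library; the URL, parameter names, authentication scheme and response handling are assumptions for illustration and would have to be adapted to the actual API documented in the source code.

import requests

# Hypothetical base URL and credentials for an application account.
BASE_URL = "http://mt-serverland.example.org/api"

session = requests.Session()
session.auth = ("app_account", "secret")  # assumed HTTP authentication

# Create a new translation request on an assumed endpoint with assumed fields.
response = session.post(
    BASE_URL + "/requests/",
    data={
        "worker": "moses_de_en",          # assumed worker name
        "source_text": "Das ist ein Test.",
    },
)
response.raise_for_status()
print("Created translation request:", response.text)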

2.2. Worker Server Implementations<br />

A layer of so-called worker servers that are connected to<br />

the central broker server implements the actual machine<br />

translation functionality. For the MT Server Land, we<br />

have implemented worker servers for the following MT<br />

systems:<br />

• Moses SMT: a Moses (Koehn et al., 2007) worker is configured to serve exactly one language pair. We use the Moses Server mode to keep translation and language model in memory, which helps to speed up the translation process. As the limitation to one language pair effectively means that a huge number of Moses worker server instances has to be started in a typical usage scenario, we have also worked on a better implementation which allows serving any number of language pairs from one worker instance. Future improvements could be achieved by integrating “on-the-fly” configuration switching and remote language models to reduce the amount of resources required by the Moses worker server.
• Lucy RBMT: our Lucy (Alonso & Thurmair, 2003) worker is implemented using a Lucy Server mode wrapper. This is a small Python program running on the Windows machine on which Lucy is installed. We have implemented a simple XML-RPC based API interface to send translation requests to the Lucy engine and later fetch the corresponding results. For integration in MT Server Land, we simply had to “tunnel” our Lucy worker server calls to this Lucy server mode implementation.
• Joshua SMT: similar to the Moses worker, we have created a Joshua (Li et al., 2010) worker that works by creating a new Joshua instance for each translation request.

We have also created worker servers for popular online<br />

translation engines such as Google Translate, Microsoft<br />

Translator, or Yahoo! BabelFish. We will demonstrate<br />

the worker servers in our presentation.<br />
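
To make the worker layer more concrete, here is a minimal Python sketch of a worker server skeleton that exposes an XML-RPC translate method following the Base64 convention described above; the port and method name are assumptions for illustration, and the placeholder translation step would be replaced by a call to a real engine such as Moses, Lucy or Joshua.

import base64
from xmlrpc.server import SimpleXMLRPCServer

def translate(encoded_source):
    """Decode Base64 input, translate it, and return Base64 output."""
    source_text = base64.b64decode(encoded_source).decode("utf-8")
    # Placeholder: a real worker would call an MT engine (Moses, Lucy, ...) here.
    translated_text = source_text
    return base64.b64encode(translated_text.encode("utf-8")).decode("ascii")

if __name__ == "__main__":
    # Port and registered method name are assumptions for illustration.
    server = SimpleXMLRPCServer(("0.0.0.0", 8001), allow_none=True)
    server.register_function(translate, "translate")
    server.serve_forever()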

3. Acknowledgements<br />

This work was supported by the EuroMatrixPlus project<br />

(IST-231720) that is funded by the European Community<br />

under the Seventh Framework Programme for Research<br />

and Technological Development.<br />

4. References<br />

Alonso, J. A., Thurmair, G. (2003). The Comprendium<br />

Translator System. In Proceedings of the Ninth<br />

Machine Translation Summit.<br />

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C.,<br />

Federico, M., Bertoldi, N., Cowan, B., Shen, W.,<br />

Moran, C., Zens, R., Dyer, C. J., Bojar, O., Constantin,<br />

A., Herbst, E. (2007). Moses: Open Source Toolkit for<br />

Statistical Machine Translation. In Proceedings of the<br />

45th Annual Meeting of the Association for<br />

Computational Linguistics Companion Volume<br />

Proceedings of the Demo and Poster Sessions,<br />

pp. 177–180, Prague, Czech Republic. Association for<br />

Computational Linguistics.<br />

Li, Z., Callison-Burch, C., Dyer, C., Ganitkevitch, J.,<br />

Irvine, A., Khudanpur, S., Schwartz, L., Thornton, W.,<br />

Wang, Z., Weese, J., Zaidan, O. (2010). Joshua 2.0: A<br />

Toolkit for Parsing-based Machine Translation with<br />

Syntax, Semirings, Discriminative Training and other<br />

Goodies. In Proceedings of the Joint Fifth Workshop<br />

on Statistical Machine Translation and MetricsMATR,<br />

pp. 133–137, Uppsala, Sweden. Association for<br />

Computational Linguistics.<br />



WORKING PAPERS IN MULTILINGUALISM • Series B
ARBEITEN ZUR MEHRSPRACHIGKEIT • Folge B
Publications to date • Bisher erschienen:

1. Jürgen M. Meisel: On transfer at the initial state<br />

of L2 acquisition: Revisiting Universal Grammar.<br />

2. Kristin Bührig, Latif Durlanik & Bernd Meyer:<br />

Arzt-Patienten-Kommunikation im Krankenhaus:<br />

konstitutive Handlungseinheiten, institutionelle<br />

Handlungslinien.<br />

3. Georg Kaiser: Dialect contact and language<br />

change. A case study on word-order change in<br />

French.<br />

4. Susanne S. Jekat & Lorenzo Tessiore: End-to-<br />

End Evaluation of Machine Interpretation Systems:<br />

A Graphical Evaluation Tool.<br />

5. Thomas Ehlen: Sprache - Diskurs - Text. Überlegungen<br />

zu den kommunikativen Rahmenbedingungen<br />

mittelalterlicher Zweisprachigkeit für das

Verhältnis von Latein und Deutsch.<br />

Nikolaus Henkel: Lateinisch-Deutsch.<br />

6. Kristin Bührig & Jochen Rehbein: Reproduzierendes<br />

Handeln. Übersetzen, simultanes und konsekutives<br />

Dolmetschen im diskursanalytischen<br />

Vergleich.<br />

7. Jürgen M. Meisel: The Simultaneous Acquisition<br />

of Two First Languages: Early Differentiation<br />

and Subsequent Development of Grammars.<br />

8. Bernd Meyer: Medizinische Aufklärungsgespräche:<br />

Struktur und Zwecksetzung aus diskursanalytischer<br />

Sicht.<br />

9. Kristin Bührig, Latif Durlanik & Bernd Meyer<br />

(Hrsg.): Dolmetschen und Übersetzen in medizinischen<br />

Institutionen. Beiträge zum Kolloquium<br />

‘Dolmetschen in Institutionen’ vom 17. - 18.03.

2000 in Hamburg.<br />

10. Juliane House: Concepts and Methods of Translation<br />

Criticism: A Linguistic Perspective.<br />

11. Bernd Meyer & Notis Toufexis (Hrsg.):<br />

Text/Diskurs, Oralität/Literalität unter dem Aspekt<br />

mehrsprachiger Kommunikation.<br />

12. Hans Eideneier: Zur mittelalterlichen Vorgeschichte<br />

der neugriechischen Diglossie.<br />

13. Kristin Bührig, Juliane House, Susanne J. Jekat:<br />

Abstracts of the International Symposium on Linguistics<br />

and Translation, University of Hamburg,<br />

20th - 21st November 2000.<br />

14. Sascha W. Felix: Theta Parametrization. Predicate-Argument<br />

Structure in English and Japanese.<br />

15. Mary A. Kato: Aspects of my Bilingualism: Japanese<br />

as L1 and Portuguese and English as L2.<br />

16. Natascha Müller, Katja Cantone, Tanja Kupisch<br />

& Katrin Schmitz: Das mehrsprachige Kind: Italienisch<br />

– Deutsch.<br />

17. Kurt Braunmüller: Semicommunication and Accommodation:<br />

Observations from the Linguistic<br />

Situation in Scandinavia.<br />

18. Tessa Say: Feature Acquisition in Bilingual Child<br />

Language Development.<br />

19. Kurt Braunmüller & Ludger Zeevaert: Semikommunikation,<br />

rezeptive Mehrsprachigkeit und verwandte<br />

Phänomene. Eine bibliographische Bestandsaufnahme.<br />

20. Nicole Baumgarten, Juliane House & Julia<br />

Probst: Untersuchungen zum Englischen als ‘lingua<br />

franca’ in verdeckter Übersetzung. Theoretischer<br />

Hintergrund, Weiterentwicklung des Analyseverfahrens<br />

und erste Ergebnisse.<br />

21. Per Warter: Lexical Identification and Decoding<br />

Strategies in Interscandinavian Communication.<br />

22. Susanne J. Jekat & Patricia J. Nüßlein: Übersetzen<br />

und Dolmetschen: Grundlegende Aspekte und<br />

Forschungsergebnisse.<br />

23. Claudia Böttger & Julia Probst: Adressatenorientierung<br />

in englischen und deutschen Texten.<br />

24. Anja Möhring: The acquisition of French by<br />

German children of pre-school age. An empirical<br />

investigation of gender assignment and gender<br />

agreement.<br />

25. Jochen Rehbein: Turkish in European Societies.<br />

26. Katja Francesca Cantone & Marc-Olivier Hinzelin:<br />

Proceedings of the Colloquium on Structure,<br />

Acquisition, and Change of Grammars: Phonological<br />

and Syntactic Aspects. Volume I.<br />

27. Katja Francesca Cantone & Marc-Olivier Hinzelin:<br />

Proceedings of the Colloquium on Structure,<br />

Acquisition, and Change of Grammars: Phonological<br />

and Syntactic Aspects. Volume II.<br />

28. Utta v. Gleich: Multilingualism and multilingual<br />

Literacies in Latin American Educational Systems.<br />

29. Christine Glanz & Utta v. Gleich: Mehrsprachige<br />

literale Praktiken im religiösen Alltag. Ein Vergleich<br />

literaler Ereignisse in Uganda und Bolivien.<br />

30. Jürgen M. Meisel: From bilingual language acquisition<br />

to theories of diachronic change.<br />

31. Florian Coulmas & Makoto Watanabe: Japan's

Nascent Multilingualism.<br />

32. Tanja Kupisch: The acquisition of the DP in<br />

French as the weaker language.<br />

33. Utta v. Gleich, Mechthild Reh & Christine Glanz:<br />

Mehrsprachige literale Praktiken im Kulturvergleich:<br />

Uganda und Bolivien. Die Datenerhebungs-<br />

und Auswertungsmethoden.<br />

34. Thomas Schmidt: EXMARaLDA - ein System zur<br />

Diskurstranskription auf dem Computer.<br />

35. Esther Rinke: On the licensing of null subjects in<br />

Old French.<br />

36. Bernd Meyer & Ludger Zeevaert: Sprachwechselphänomene<br />

in gedolmetschten und semikommunikativen<br />

Diskursen.


37. Annette Herkenrath & Birsel Karakoç: Zum Erwerb<br />

von Verfahren der Subordination bei türkisch-deutsch<br />

bilingualen Kindern – Transkripte<br />

und quantitative Aspekte.<br />

38. Guntram Haag: Illokution und Adressatenorientierung<br />

in der Zwettler Gesamtübersetzung und<br />

der Melker Rumpfbearbeitung der ‘Disticha Catonis’:<br />

funktionale und sprachliche Einflussfaktoren.<br />

39. Kristin Bührig: Multimodalität in gedolmetschten<br />

Aufklärungsgesprächen. Grafische Abbildungen<br />

in der Wissensvermittlung.<br />

40. Jochen Rehbein: Pragmatische Aspekte des Kontrastierens<br />

von Sprachen – Türkisch und Deutsch<br />

im Vergleich.<br />

41. Christine Glanz & Okot Benge: Exploring Multilingual<br />

Community Literacies. Workshop at the<br />

Ugandan German Cultural Society, Kampala,<br />

September 2001.<br />

42. Christina Janik: Modalisierungen im Dolmetschprozess.<br />

43. Hans Eideneier: „Rhetorik und Stil“ – der griechische<br />

Beitrag.<br />

44. Annette Herkenrath, Birsel Karakoç & Jochen<br />

Reh-bein: Interrogative elements as subordinators<br />

in Turkish – aspects of Turkish-German bilingual<br />

children’s language use.<br />

45. Marc-Olivier Hinzelin: The Aquisition of Subjects<br />

in Bilingual Children: Pronoun Use in Portuguese-German<br />

Children.<br />

46. Thomas Schmidt: Visualising Linguistic Annotation<br />

as Interlinear Text.<br />

47. Nicole Baumgarten: Language-specific Realization<br />

of Extralinguistic Concepts in Original and<br />

Translation Texts: Social Gender in Popular<br />

Film.<br />

48. Nicole Baumgarten: Close or distant: Constructions<br />

of proximity in translations and parallel<br />

texts.<br />

49. Katrin Monika Schmitz & Natascha Müller:<br />

Strong and clitic pronouns in monolingual and<br />

bilingual first language acquisition: Comparing<br />

French and Italian.<br />

50. Bernd Meyer: Bilingual Risk communication.<br />

51. Bernd Meyer: Dolmetschertraining aus diskursanalytischer<br />

Sicht: Überlegungen zu einer Fortbildung<br />

für zweisprachige Pflegekräfte.

52. Monika Rothweiler, Solveig Kroffke & Michael<br />

Bernreuter: Grammar Acquisition in Bilingual<br />

Children with Specific Language Impairment:<br />

Prerequisites and Questions.<br />

Solveig Kroffke & Monika Rothweiler: The Bilingual´s<br />

Language Modes in Early Second Language<br />

Acquisition – Contexts of Language Use<br />

and Diagnosis of Language Disorders.<br />

53. Gerard Doetjes: Auf falsche[r] Fährte in der interskandinavischen<br />

Kommunikation.<br />

54. Angela Beuerle & Kurt Braunmüller: Early Germanic<br />

bilingualism? Evidence from the earliest<br />

runic inscriptions and from the defixiones in Roman<br />

utility epigraphy.<br />

Kurt Braunmüller: Grammatical indicators for<br />

bilingualism in the oldest runic inscriptions?<br />

55. Annette Herkenrath & Birsel Karakoç: Zur Morphosyntax<br />

äußerungsinterner Konnektivität bei<br />

mono- und bilingualen türkischen Kindern.<br />

56. Jochen Rehbein, Thomas Schmidt, Bernd Meyer,<br />

Franziska Watzke & Annette Herkenrath: Handbuch<br />

für das computergestützte Transkribieren

nach HIAT.<br />

57. Kristin Bührig & Bernd Meyer: Ad hocinterpreting<br />

and the achievement of communicative<br />

purposes in specific kinds of doctor-patient<br />

discourse.<br />

58. Margaret M. Kehoe & Conxita Lleó: The emergence<br />

of language specific rhythm in German-<br />

Spanish bilingual children.<br />

59. Christiane Hohenstein: Japanese and German ‘I<br />

think–constructions’.<br />

60. Christiane Hohenstein: Interactional expectations<br />

and linguistic knowledge in academic expert discourse<br />

(Japanese/German).<br />

61. Solveig Kroffke & Bernd Meyer: Verständigungsprobleme<br />

in bilingualen Anamnesegesprächen.<br />

62. Thomas Schmidt: Time-based data models and<br />

the Text Encoding Initiative’s guidelines for transcription<br />

of speech.<br />

63. Anja Möhring: Against full transfer during early<br />

phases of L2 acquisition: Evidence from German<br />

learners of French.<br />

64. Bernadette Golinski & Gerard Doetjes: Sprachverstehensuntersuchungen<br />

im semikommunikativen<br />

Kontext.<br />

65. Lukas Pietsch: Re-inventing the ‘perfect’ wheel:<br />

Grammaticalisation and the Hiberno-English<br />

medial-object perfects.<br />

66. Esther Rinke: Wortstellungswandel in Infinitivkomplementen<br />

kausativer Verben im Portugiesischen.<br />

67. Imme Kuchenbrandt, Tanja Kupisch & Esther<br />

Rinke: Pronominal Objects in Romance: Comparing<br />

French, Italian, Portuguese, Romanian<br />

and Spanish.<br />

68. Javier Arias, Noemi Kintana, Martin Rakow &<br />

Susanne Rieckborn: Sprachdominanz: Konzepte<br />

und Kriterien.<br />

69. Matthias Bonnesen: The acquisition of questions<br />

by two German-French bilingual children<br />

70. Chrystalla A. Thoma & Ludger Zeevaert: Klitische<br />

Pronomina im Griechischen und Schwedischen:<br />

Eine vergleichende Untersuchung zu synchroner<br />

Funktion und diachroner Entwicklung<br />

klitischer Pronomina in griechischen und schwedischen<br />

narrativen Texten des 15. bis 18. Jahrhunderts<br />

71. Thomas Johnen: Redewiedergabe zwischen<br />

Konnektivität und Modalität: Zur Markierung<br />

von Redewiedergabe in Dolmetscheräußerungen<br />

in gedolmetschten Arzt-Patientengesprächen<br />

72. Nicole Baumgarten: Converging conventions?<br />

Macrosyntactic conjunction with English ‘and’<br />

and German ‘und’<br />

73. Susanne Rieckborn: Entwicklung der ‚schwachen<br />

Sprache‘ im unbalancierten L1-Erwerb


74. Ludger Zeevaert: Variation und kontaktinduzierter<br />

Wandel im Altschwedischen<br />

75. Belma Haznedar: Is there a relationship between<br />

inherent aspect of predicates and their finiteness<br />

in child L2 English?<br />

76. Bernd Heine: Contact-induced word order<br />

change without word order change<br />

77. Matthias Bonnesen: Is the left periphery a vulnerable<br />

domain in unbalanced bilingual first language<br />

acquisition?<br />

78. Tanja Kupisch & Esther Rinke: Italienische und<br />

portugiesische Possessivpronomina im diachronischen<br />

Vergleich: Determinanten oder Adjektive?<br />

79. Imme Kuchenbrandt, Conxita Lleó, Martin Rakow,<br />

Javier Arias Navarro: Große Tests für kleine

Datenbasen?<br />

80. Jürgen M. Meisel: Exploring the limits of the<br />

LAD<br />

81. Steffen Höder, Kai Wörner, Ludger Zeevaert:<br />

Corpus-based investigations on word order<br />

change: The case of Old Nordic<br />

82. Lukas Pietsch: The Irish English “After Perfect”<br />

in context: Borrowing and syntactic productivity<br />

83. Matthias Bonnesen & Solveig Kroffke: The acquisition<br />

of questions in L2 German and French<br />

by children and adults<br />

84. Julia Davydova: Preterite and present perfect in<br />

Irish English: Determinants of variation<br />

85. Ezel Babur, Solveig Chilla & Bernd Meyer:<br />

Aspekte der Kommunikation in der logopädischen<br />

Diagnostik mit ein- und mehrsprachigen Kindern<br />

86. Imme Kuchenbrandt: Cross-linguistic influences<br />

in the acquisition of grammatical gender?<br />

87. Anne Küppers: Sprecherdeiktika in deutschen<br />

und französischen Aktionärsbriefen<br />

88. Demet Özçetin: Die Versprachlichung mentaler<br />

Prozesse in englischen und deutschen Wirtschaftstexten<br />

89. Barbara Miertsch: The acquisition of gender<br />

markings by early second language learners of<br />

French<br />

90. Kurt Braunmüller: On the relevance of receptive<br />

multilingualism in a globalised world: Theory,<br />

history and evidence from today’s Scandinavia<br />

91. Jill P. Morford & Martina L. Carlson: Sign perception<br />

and recognition in non-native signers of<br />

ASL<br />

92. Andrea Bicsar: How the “Traveling Rocks” of<br />

Death Valley become “Moving Rocks”: The Case<br />

of an English-Hungarian Popular Science Text<br />

Translation<br />

93. Anne-Kathrin Preißler: Subjektpronomina im<br />

späten Mittelfranzösischen: Das Journal de Jean<br />

Héroard<br />

94. Ingo Feldhausen, Ariadna Benet, Andrea<br />

Pešková: Prosodische Grenzen in der Spontansprache:<br />

Eine Untersuchung zum Zentralkatalanischen<br />

und porteño-Spanischen<br />

95. Manuela Schönenberger, Franziska Sterner, Tobias<br />

Ruberg: The realization of indirect objects<br />

and case marking in experimental data from child<br />

L1 and child L2 German<br />

96. Hanna Hedeland, Thomas Schmidt, Kai Wörner

(eds.): Multilingual Resources and Multilingual<br />

Applications – Proceedings of the Conference of<br />

the German Society for Computational Linguistics<br />

and Language Technology (GSCL) 2011
