Tuesday, 21 December 2010

PARALLEL TEXTS IN COMPUTER-ASSISTED LANGUAGE LEARNING

JOHN NERBONNE

University of Groningen, The Netherlands

Key words:

Language Learning, Computer-Assisted Language Learning, Computer-Aided
Instruction, Vocabulary

Abstract:

Parallel bilingual texts are a valuable source of information to advanced language
learners, particularly in the area of lexis, notably subtle lexical dependencies. Typically this
information is either not available or sporadically available only in very large
dictionaries. To be most effective, the corpora in question should be indexed by
lexeme (not string, or word form), and should be aligned into parallel sentences. This
paper surveys use and prospects.

INTRODUCTION

This brief paper surveys the use of parallel bilingual texts in language learning.
Although it contains sections on language learning and computer-assisted language
learning (CALL), the focus is entirely on the potential use of parallel, bilingual
texts. There is a review of the literature on the use of parallel, bilingual corpora in
CALL. These sections make no pretense at comprehensiveness except with
respect to the focus. The following sections of the paper report on a working
prototype of a system which allowed intermediate-level students of French whose
native language is Dutch to examine, inter alia, bilingual, aligned texts as a source of
information on unknown words. The students were positive about the prototype,
making it worthwhile to note some issues about preparing such parallel texts for
pedagogical use. The final section draws some conclusions about prospects for
future work.

LANGUAGE LEARNING

Foreign and second language learning is studied in applied linguistics; a
distinction is drawn between foreign language learning, which normally takes
place in classrooms, remote from any extensive natural opportunity to use
the foreign language, and second language learning, which occurs in a
“naturalistic” environment, normally in a country where the language is spoken.
There are researchers who prefer the term “second language acquisition”, because
“acquisition” (as opposed to “learning”) emphasizes the degree to which automatic
processes may play a role in the more natural situation when a language from the
immediate environment is adopted. The two branches of language learning share
an applied focus: both consistently research not only how language learning
normally proceeds, but also how it succeeds best. They seek to optimize learning,
naturally with respect to the goals of language learners (e.g., scientific literature,
tourism, or commerce), their (linguistic) backgrounds, and their age and
educational level. Van Els et al. (1977) is an excellent reference on issues in this
branch of applied linguistics. One principle on which the different schools agree is
that the material to which learners are exposed must be comprehensible to the
learners in order for learning to proceed optimally (Widdowson, 1990:111, citing
Krashen, 1982).
Parallel texts have long played a role in traditional language learning,
even if they are not a popular object of current research interest. Parallel texts
show translations alongside originals, and they are a reasonable guarantee that textual
material will be comprehensible, in accordance with the requirement just noted.
Linguistics scholars, but also school children, are fond of foreign language texts
for which parallel translations are provided. An example may be evocative:

Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae,…
Gaul is a whole divided into three parts, one of which the Belgae inhabit …
(Caesar, De Bello Gallico, Loeb Classical Library, Harvard)

While the largest market for such texts may well be school children cramming
for exams they might better prepare for by learning Latin, the texts serve a
legitimate purpose in allowing less experienced readers to approach natural, even
challenging texts more quickly than they otherwise might. Sometimes parallel
texts are accompanied by glosses, i.e., word-by-word translations accompanied by
brief notes on the grammatical information in inflections.

COMPUTER-ASSISTED LANGUAGE LEARNING

Computer-Assisted Language Learning (CALL) seeks to employ computers in
order to improve language-learning techniques. CALL spans the range of
activities in language pedagogy (hearing, speaking, reading, and writing) and
draws from nearly all areas of information and communication technology (ICT).
Even if most CALL applications are automated language exercises, exploiting
hypertext, simple database and network technology, and digital audio and video,
one finds many others, including ingenious applications of everyday technology
such as word-processing and email. Levy (1997) surveys the surprisingly long
history of CALL, reports on the field’s extensive reflection on its proper relation
to applied linguistics, computer science, and psychology, and presents his own
astute view of its proper, technology-driven nature in the final chapters. There is
no mention of opportunities for text alignment software, however. Jager, Nerbonne
and van Essen (1998) explore especially the opportunities for language technology
in CALL, and include several reports on CALL applications that exploit parallel
texts.

Corpora and CALL

There is substantial, focused interest in using language corpora for CALL
(Wichmann et al., 1997). Corpora are valued for providing access to authentic
language use, unmediated by grammarians’ theories, prescriptivists’ tastes,
pedagogy’s traditions, or even lexicographers’ limitations. There are moderate
and extreme views on how corpora should best be utilized. The moderate view
espouses the value of corpora, especially when accompanied by good search and
concordancing tools, for instructors and very advanced students, those for whom
unabridged dictionaries and comprehensive grammars are insufficient as sources
of information on nuances of meaning, common usage, or stylistic level. Let’s not
attribute the extreme view to Tim Johns, but it is nonetheless associated with what
Johns has dubbed ‘data-driven learning’, which emphasizes the role of discovery
in the language classroom, facilitated by tools for corpus analysis. Johns (1991,
p.2) finds that “the language learner is also, essentially, a research worker whose
learning needs to be driven by access to linguistic data”.
The fundamental reason to explore bilingual texts in CALL is that they grant
the language learner the same access to authentic language use, only now
accompanied by convenient translation into a known language. This increases the
chances, of course, that the foreign-language corpus material will be
comprehensible to learners, which, as noted above, is one of the prime
requirements of all effective foreign-language pedagogical material (Krashen,
1982). The advantages of immediate access to genuine material thus accrue to





The project foresaw two main areas where Glosser-like applications might
profitably be used: first, in language learning, and second, as a tool for users who
have a bit of knowledge of a foreign language, but cannot read it easily or reliably.
The latter group might not be trying to learn, only to cope with a specific text. A
user might, for instance, need to read a software manual that contains a number of
unfamiliar words. Glosser provides the user (or learner) with a means of looking
up information on unfamiliar words in a straightforward and user-friendly manner.
The guiding vision behind Glosser was to recast the basic idea of the glossed
text using modern means, including both restrictions and extensions. The idea was
recast by using automatic morphological analysis to provide the glosses: both the
grammatical information carried by the morphological inflections and the
dictionary equivalent. This means that essentially any French text is now available
with Dutch glosses, for essentially the low cost of computer processing (ignoring
the amortization of development). A further modernization of the idea has been to
move the glosses to a hyper plane, so that readers control how many words are
glossed. Practically, this just means that the glosses are supplied only on request.
The idea of the classical glossed text has been extended by providing other
examples of word use, drawn automatically from corpora. Wherever possible,
examples from bilingual corpora were offered to users. Note that the program is
capable of finding alternatively inflected forms of words, just as in this example in
Figure 2, in which the string atteindra is found based on a search for examples of
atteindre. This dictionary entry was found based on the (inflectional) alternative
atteignissent. This extension to the fundamental concept of glossing was intended
to supplement dictionary explanation for advanced users. It has a welcome side
effect, however, that we’ll want to note: in dealing with highly inflected
languages, such as French or Spanish, searches for the same string will be much
less effective than searches for the same lemma. This is so because a single
lemma, a French verb such as atteindre, may have hundreds of inflected forms. A
further extension to the classical glossed text is that a complete morphological
analysis was provided. This was done by Xerox software.
The restriction of the software that has been accepted (vis-à-vis the older glossed
texts) is that the third line, the coherent translation, is in general not available.
This is not technically feasible unless a humanly prepared translation is accessible.
The latter option is explored in the corpus of examples (wherever parallel bilingual
corpora could be found).
The metaphor of the glossed text suggests why Glosser is successful, just as
these texts have been. Simple, quick dictionary access alleviates the tedium and
wasted time of dictionary lookup by hand (or by an online dictionary that isn't
integrated into a reading browser). In stating this so baldly, we are ignoring
objections from occasional language teachers that dictionary lookup time is the
motivating factor behind lexical learning.

Figure 2. The windows displaying examples from bilingual, parallel corpora. In every case, the right
window (French) is shown, and the left (Dutch) window showing the translation is available on
demand (see button at upper right).

Technical Issues

Nerbonne and Dokter (1999) present Glosser technically. We note here only
some issues with respect to the parallel bilingual corpora. We sought texts of
different sorts in order to provide a variety of examples; we attempted to vary the
inflection form of examples in order to provide the student with a feeling for this
variety (this would be less valuable to advanced students); and, finally, we
preferred examples in bilingual texts because of the added information the
translation provides.
A corpus of approximately 6.2 MB was created, including 16,701 different
lexemes. Of this only 2 MB was bilingual text, because this kind of text (French-
Dutch) proved difficult to obtain. A rough calculation suggests that ten times as
much text would be necessary to provide examples of the approx. 30,000 entries in
advanced learners’ dictionaries. The bilingual texts were predominantly texts from
official documents or records of the European Union, e.g. the treaty of Maastricht,
and hearings before the European parliament. Some of the alignment was done
with MARK ALISTeR, a tool developed by the Bulgarian Academy of Science, a
project partner in Glosser (Paskaleva and Mihov, 1998). Alignment accuracy was
not measured, but seemed reliable to within a sentence or two. It would have been
more useful to users to have corresponding words marked where possible. The
monolingual corpus was more varied, including poetry, political and commercial
texts, literature etc. Early experiments showed that indexing was necessary for
acceptable behavior. To allow for inflectional variety, a mapping was generated
from inflected forms (in texts) to base-forms or lexemes (dictionary forms which
served as basis for searches). A desirable side effect was that many more
examples could be found within a fixed-size corpus. For lemmatization Xerox
software Locolex was used (Bauer, Segond and Zaenen, 1995). Locolex provides
part-of-speech (POS) disambiguation, morphological analysis, and lemmatization.
In a preprocessing phase, the entire corpus is lemmatized, and each lexeme is
written to the index file, with its POS, the file it occurs in, and the byte position of
the surface form. The byte position of the translation is also recorded if a
translation is available.
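
The indexing step just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the Glosser implementation: the toy lookup table TOY_LEMMATIZER stands in for a real lemmatizer such as Locolex, and the names IndexEntry and build_index, the in-memory dictionary used in place of an index file, and the alignment format (pairs of source and translation byte offsets, one per aligned sentence) are all hypothetical.

```python
import bisect
import re
from collections import defaultdict
from typing import NamedTuple, Optional


class IndexEntry(NamedTuple):
    pos: str                           # part of speech of the lexeme
    filename: str                      # file in which the surface form occurs
    byte_offset: int                   # byte position of the surface form
    translation_offset: Optional[int]  # byte position of the aligned translation, if any


# Hypothetical stand-in for a lemmatizer: surface form -> (lexeme, POS).
TOY_LEMMATIZER = {
    "atteindra": ("atteindre", "VERB"),
    "atteignissent": ("atteindre", "VERB"),
    "délai": ("délai", "NOUN"),
}


def build_index(filename, text, alignment=None):
    """Map each lexeme to the places where any of its inflected forms occur.

    alignment is an optional list of (source_byte_offset, translation_byte_offset)
    pairs, one per aligned sentence, sorted by source offset (an assumed format).
    """
    index = defaultdict(list)
    starts = [src for src, _ in alignment] if alignment else []
    byte_pos, last = 0, 0
    for match in re.finditer(r"\w+", text):
        # Advance the running byte offset; non-ASCII characters such as "é"
        # occupy more than one byte in UTF-8.
        byte_pos += len(text[last:match.start()].encode("utf-8"))
        last = match.start()
        form = match.group().lower()
        lexeme, pos = TOY_LEMMATIZER.get(form, (form, "UNK"))
        translation = None
        if starts:
            # Find the aligned sentence that contains this token.
            i = bisect.bisect_right(starts, byte_pos) - 1
            if i >= 0:
                translation = alignment[i][1]
        index[lexeme].append(IndexEntry(pos, filename, byte_pos, translation))
    return dict(index)


if __name__ == "__main__":
    french = "Le délai atteindra un an.\nIl fallait qu'ils atteignissent le but.\n"
    idx = build_index("demo.fr", french, alignment=[(0, 0), (27, 34)])
    for entry in idx["atteindre"]:
        print(entry)
```

A search for the lexeme atteindre then retrieves both atteindra and atteignissent, which a search on the string alone would miss.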
We also subjected the prototype to an analysis of errors (Nerbonne and Dokter,
1999). As expected, the most frequent corpus-related error was a failure to find
examples. Lemmatization also regarded derivationally related forms as word
variants, which is not appropriate for this application.

Users’ Reactions

Nerbonne, Dokter and Smit (1998) report more completely on Glosser from a
language learning perspective, including, in particular, the results of a user study in
which a group of second-year French students were randomly divided into a group
using Glosser and a second one using a hand-held dictionary. Glosser users were
more effective in dictionary access and also understood the text better (but the
latter improvement was not statistically significant). There was unanimity among
those tested that they would prefer to continue using the program.
The users were also nearly unanimous in identifying the dictionary as the most
valuable single source of information. The other facilities (morphological
analysis and examples from monolingual French and bilingual French-Dutch
corpora) were not appreciated to anywhere near the same extent. This may reflect
the task the users were given, that of answering questions about a text they read.
But all users were given opportunity to experiment with the program before
actually beginning. So we might expect that user evaluation reflects the value of
the information sources at least to some degree.
Besides learners’ reactions, we were also interested in instructors’ reactions,
and, although we conducted no formal study, we presented Glosser to a number of
groups of foreign language instructors. The instructors did not question the
positive reactions of the students, but they viewed Glosser, with or without
justification, as a better version of a dictionary, a tool for which they feel little
responsibility. With few exceptions, they were unenthusiastic about incorporating
Glosser or similar tools into their instruction. One often heard the reasoning that
one simply cannot teach vocabulary, and that it is therefore up to students to pick it
up on their own. When asked whether they thought Glosser might help in this,
they answered affirmatively, but that they would leave it up to the students
whether or not to use it.

Using Glosser

Although Glosser has not been used in extensive instruction of any sort, we
would be interested in an experiment in which it is used, whether in self-instruction
or tutored, in an academic or commercial setting. Most attractive
would be a group focused on reading for professional purposes.
We noted above the generally accepted principle of foreign language learning
that students should practice on comprehensible material. This principle implies
that texts will not be appropriate for all students, without regard to proficiency
level. In general, the choice of texts to be read will be left up to the instructor. It
might also be possible to make search procedures depend on learner level, using
some simple measures such as percentage of vocabulary (in texts) among the most
common words. This could improve the usefulness of parallel bilingual texts for
language learners.
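
One way such a level-dependent measure might look is sketched below. It is only an illustration under assumptions: the function names top_words, coverage and filter_examples are hypothetical, the list of common words is taken from whatever reference text is supplied, and word forms are compared directly, whereas a fuller version would compare lexemes.

```python
import re
from collections import Counter


def top_words(reference_text, n):
    """Return the n most frequent word forms in a reference text."""
    counts = Counter(re.findall(r"\w+", reference_text.lower()))
    return {word for word, _ in counts.most_common(n)}


def coverage(text, common_words):
    """Fraction of the tokens in text that belong to common_words."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in common_words for token in tokens) / len(tokens)


def filter_examples(examples, common_words, min_coverage):
    """Keep only the example sentences whose coverage reaches the threshold."""
    return [ex for ex in examples if coverage(ex, common_words) >= min_coverage]


if __name__ == "__main__":
    common = {"le", "la", "chat", "dort"}
    examples = ["Le chat dort", "Le chat atteignit la cime"]
    # A lower threshold would admit harder examples for more advanced learners.
    print(filter_examples(examples, common, min_coverage=0.8))
```

The threshold could be lowered as learners progress, so that examples with a larger share of rarer vocabulary are shown only to more proficient readers.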

CONCLUSIONS

Parallel bilingual texts are a valuable source of information to advanced
language learners, particularly in the area of lexis, notably subtle lexical dependencies.
Typically this information is either not available or is sporadically available only
in very large dictionaries. To be most effective, the corpora in question should be
indexed by lexeme (not string, or word form), and should be aligned into parallel
sentences. The best chances to provide the information to language learners may
be in larger CALL systems, offering several useful sources of information.

REFERENCES

Anderman, G. (1996). The Word is my Oyster. In Anderman, G. and Rogers, M. (eds.) Words,
Words, Words: The Language Learner and the Translator. Multilingual Matters, Clevedon. 41-55.
Bailin, A. (1995). Intelligent Computer-Assisted Language Learning: A Bibliography. Computers and the Humanities 29 (5), 375-387.
Barlow, M. (1996). Parallel Texts in Language Teaching. In Botley et al. (eds.) Proceedings of Teaching and Language Corpora 1996. 45-56.
Bauer, D., Segond, F., Zaenen, A. (1995). LOCOLEX: Translation Rolls off your Tongue. In Conference Abstracts: ACH-ALLC ’95, Santa Barbara, USA. Available at
http://www.xrce.xerox.com/publis/mltt/mlttart.html
Blank, I. (this volume). Terminology Extraction from Parallel Technical Texts.
Bonhomme, P. and Romary, L. (1995). The Lingua Parallel Concordancing Project: Managing
Multilingual Texts for Educational Purposes. In: Proceedings of Language Engineering.
Botley, S., Glass, J., McEnery, T., Wilson, A. (eds.) (1996). Proceedings of Teaching and Language
Corpora 1996. Technical Paper 9, University Centre for Computer Corpus Research on
Language, Lancaster.
Danielsson, P. and Ridings, D. (1996). PEDANT. Parallel Texts in Göteborg. Research Reports
from the Department of Swedish, GU-ISS-96-2.
Danielsson, P. and Ridings, D. (1996a). Corpus and Terminology: Software for the Translation
Program at Göteborgs Universitet, or Getting Students to do the Work. In Botley et al. (eds.)
Proceedings of Teaching and Language Corpora 1996, 57-67.
Fung, P. (this volume). A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-Parallel Corpora.
Gaussier, E., Hull, D., and Ait-Mokhtar, S. (this volume). Word Alignment in Use: Translation Memory and Cross-Language Retrieval.
Glosser (1997). Glosser, Final Report. Final project report, University of Groningen. Available at www.let.rug.nl/~glosser.
Jager, J., Nerbonne, J., van Essen, A. (eds.) (1998). Language Teaching and Language Technology. Swets & Zeitlinger, Lisse.
Johns, T. (1991). Should you be Persuaded: Two Samples of Data-Driven Learning Materials. In English Lan
