by Charles Brewster, BSc, BA – iEARN UK, iEARN Europe WebMaster
Background
After some 15 years of the internet being available as a public resource, it is
hardly a surprise that English should have become the most widely used language
on the World Wide Web. Given that English was the first language of many of the
pioneers, and given the network's American origin, it would be more remarkable
if it were otherwise.
A recent analysis of web content by language shows 68% of all pages in English,
yet for 65% of web users English is not the first language [1].
This
imbalance is being addressed with online resources like Altavista's 'BabelFish'
and Google's page translation service, but the range of languages on offer is
limited.
After some preliminary testing carried out with iEARN Europe colleagues during
2006, it seemed worthwhile to proceed with 'Project Lingo', which aims to
contribute to online multilingual resources.
Open Source and 'Wiki'
For
a project seeking to facilitate and support translation between multiple
languages, it makes sense to frame concepts of ownership of the data (and of
the means to make use of it) as widely and inclusively as possible. Lingo
dictionaries and cross-translation phrase lists will be available for download
under the GNU General Public License (www.gnu.org/copyleft/gpl.html).
It is hoped that the sense of communal ownership of the resource will encourage
wide, Wiki-like participation, with users contributing both content and
translation material to websites making use of the Lingo database.
Although
the reference language for the prototype server is English, it is anticipated
that co-hosting arrangements will be agreed with individuals or organisations
in other first-language locales, to bring online servers with other reference
languages.
Server
system software is entirely Open Source, and therefore free in terms of
licensing.
Project Sustainability
Hardware
component cost for the prototype server has been significant, and current
running costs (aside from any matter of salary) are not insubstantial. Going
forward, there will need to be similarly configured back-up servers, and almost
certainly relocation to premises equipped to offer higher bandwidth
communications than are currently available (or likely to be any time soon) in
a typical Welsh mountain village.
In
parallel with Project Lingo development, two 'public service' websites are
being launched; these are:
The business model for these sites involves
context-sensitive web advertising. Another potential revenue stream lies in
hosting multi-lingual websites both for advertisers and for third parties.
A
company, OurWorld Internet Services Ltd, has been formed to promote and manage
the commercial aspects. It is not ruled out that control of that company could
pass to an established professional organisation that would be appropriately
placed to drive it forward. Hosting
arrangements for iEARN-related sites, and the Open Source ethos of the Lingo
Project itself, would not be compromised by any such transaction, albeit that
all are served from the same machines today.
A Language Teaching and Learning Resource
Students
are encouraged to contribute material and local reportage to sites using the
Lingo system, including iEARN Europe (www.iearneurope.org), and to test their
skills by offering translations to sentences and phrases drawn from real
websites. Both assessment and tutorial modes of use will be available:
"Machine Hints" for students may be permitted by the teacher, or switched off
for assessment purposes.
Future Directions
An
ultimate long-term goal has to be real-time machine translation (MT) between
any pair of supported languages, probably using a system reference language as
an intermediate step. Statistical methods and algorithms which could accomplish
this, given sufficiently powerful hardware, and an adequate body of
already-translated material, are available on the web [2].
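The idea of using the reference language as an intermediate step can be illustrated with a minimal sketch. This is not the Lingo implementation and it is not statistical MT: it is an exact-match phrase lookup, and the table contents, language codes and function name are all invented for illustration. What it shows is the structural benefit of a pivot: each supported language needs only one table to and from the reference language, rather than a separate table for every pair of languages.

```python
# Toy phrase tables, invented for illustration only. Each table maps a phrase
# in one language to its equivalent in the reference language (English here).
TO_REFERENCE = {
    "cy": {"bore da": "good morning", "diolch": "thank you"},
    "de": {"guten Morgen": "good morning", "danke": "thank you"},
}

# Build the reverse direction (reference language -> each language) by
# inverting every table.
FROM_REFERENCE = {
    lang: {reference: phrase for phrase, reference in table.items()}
    for lang, table in TO_REFERENCE.items()
}


def translate_via_reference(phrase, source, target):
    """Translate an exact phrase by pivoting through the reference language.

    Returns None when the phrase or either language is unknown.
    """
    reference = TO_REFERENCE.get(source, {}).get(phrase)
    if reference is None:
        return None
    return FROM_REFERENCE.get(target, {}).get(reference)


print(translate_via_reference("bore da", "cy", "de"))
```

Adding a further language to this sketch means adding one entry to TO_REFERENCE; every existing language can then be translated to and from it without any new pair-specific data.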
The current phase of the project seeks to assess whether currently affordable
hardware is fit for purpose, and to augment the available copyright-free
translated material (usually more than 50 years old) with contributed material
which reflects modern usage and idiom.
Supported Languages
In
its earliest phase, Lingo was framed to support the languages of the countries
of Europe. It soon became clear that an inclusive approach, encompassing
expatriate communities living within Europe, would mean effectively supporting
languages from all parts of the world. This prospect was, however, made much
less daunting by the adoption across the computing industry in October 2006 of
the 'Unicode' standard for encoding of all the writing systems for the world's
languages [3].
Unicode
is supported by all modern web browsers, and the Lingo server software is
Unicode-capable throughout by design. This takes care of intricacies like
treatment of left-to-right and right-to-left directional renditions of content
in different language versions of the same material. Text is stored in the
character-order in which it is entered through a keyboard, and displayed text
ordering is handled by the web browser's internal processing of the Unicode
characters and their attributes.
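The directional attributes mentioned above can be inspected directly. The short sketch below uses Python's standard unicodedata module, which is an assumption made for illustration (the article does not say what language the Lingo server is written in), and the sample string is invented. It shows that mixed-direction text is held in the order it was typed, with each character carrying its own directional class that a browser can use when laying out the line.

```python
import unicodedata


def directional_classes(text):
    """Return (character, bidirectional class) pairs in stored order."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# Invented sample: the Latin word 'Lingo', a space, then the Hebrew word
# 'shalom' written with escapes so the source stays plain ASCII.
mixed = "Lingo \u05e9\u05dc\u05d5\u05dd"

classes = directional_classes(mixed)
for ch, bidi_class in classes:
    # 'L' marks left-to-right characters, 'R' right-to-left, 'WS' whitespace.
    print(repr(ch), bidi_class)
```

The stored sequence is never reordered here; only the display layer needs to apply the right-to-left rendering for the Hebrew run.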
The
Ethnologue database (www.ethnologue.org)
holds details of more than 6000 "living" languages across the world,
and Lingo would be happy to support any of those which is taught or studied at
any level in a school or university anywhere, either as a curriculum subject or
as an instruction medium.
The
key criterion in deciding which to support initially has been whether there
exists some enthusiasm for the presence of a particular language on the
Internet. Evidence for this has been drawn from Wikipedia (http://meta.wikimedia.org/wiki/List_of_Wikipedias)
- broadly, if there is an active Wikipedia attracting entries in a particular
language right now, then Lingo supports it. Minority languages having
"official" status within a sovereign territory are also supported in
general, although there may be exceptions.
Please have a look at the languages listed on the new iEARN Europe site:
www.iearneurope.org. If your language isn't there, even after you click
"List All", just ask and it will be added. Contribute, use and enjoy!
--
References:
1. Global Reach Internet Statistics: www.glreach.com/globstats/refs.php3
2. Knight, K (1999) ‘A statistical MT Tutorial Workbook’ www.isi.edu/natural-language/mt/wkbk.rtf
3. The Unicode Consortium (2006) ‘The Unicode 5.0 Standard’ (ISBN 0-321-48091-0)