Project Lingo - Working Together to Build a Multi-Language Database

by Charles Brewster, BSc, BA – iEARN UK, iEARN Europe WebMaster

Background

After some 15 years of the internet being available as a public resource, it is hardly a surprise that English has become the most widely used language on the World Wide Web. Given that English was the first language of many of the pioneers, and given the network's American origins, it would be more remarkable if it were otherwise.

A recent analysis of web content by language shows 68% of all pages in English, yet for 65% of web users English is not the first language [1].

This imbalance is being addressed with online resources like Altavista's 'BabelFish' and Google's page translation service, but the range of languages on offer is limited.

After some preliminary testing carried out with iEARN Europe colleagues during 2006, it seemed worthwhile to proceed with 'Project Lingo', which aims to make a contribution towards online multilingual resources.

Open Source and 'Wiki'

For a project seeking to facilitate and support translation between multiple languages, it makes sense to frame concepts of ownership of the data (and of the means to make use of it) as widely and inclusively as possible. Lingo dictionaries and cross-translation phrase lists will be available for download under the GNU General Public License (www.gnu.org/copyleft/gpl.html).

It is hoped that the sense of communal ownership of the resource will encourage wide, Wiki-like participation, with contributions of both content and translation material to websites making use of the Lingo database.

Although the reference language for the prototype server is English, it is anticipated that co-hosting arrangements will be agreed with individuals or organisations in other first-language locales, to bring online servers with other reference languages.

Server system software is entirely Open Source, and therefore free in terms of licensing.

Project Sustainability

Hardware component cost for the prototype server has been significant, and current running costs (aside from any matter of salary) are not insubstantial. Going forward, there will need to be similarly configured back-up servers, and almost certainly relocation to premises equipped to offer higher bandwidth communications than are currently available (or likely to be any time soon) in a typical Welsh mountain village.

In parallel with Project Lingo development, two 'public service' websites are being launched.

The business model for these sites involves context-sensitive web advertising. Another potential revenue stream lies in hosting multi-lingual websites both for advertisers and for third parties.

A company, OurWorld Internet Services Ltd, has been formed to promote and manage the commercial aspects. It is not ruled out that control of that company could pass to an established professional organisation which would be appropriately placed to drive it forward. Hosting arrangements for iEARN-related sites, and the Open Source ethos of the Lingo Project itself, would not be compromised by any such transaction, even though all are served from the same machines today.

A Language Teaching and Learning Resource

Students are encouraged to contribute material and local reportage to sites using the Lingo system, including iEARN Europe (www.iearneurope.org), and to test their skills by offering translations of sentences and phrases drawn from real websites. Both assessment and tutorial modes of use will be available. "Machine Hints" for students can be permitted by the teacher, or switched off for assessment purposes.

Future Directions

An ultimate long-term goal has to be real-time machine translation (MT) between any pair of supported languages, probably using a system reference language as an intermediate step. Statistical methods and algorithms which could accomplish this, given sufficiently powerful hardware, and an adequate body of already-translated material, are available on the web [2].
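The idea of translating through a reference language can be illustrated with a minimal sketch. Everything in it is hypothetical: the phrase tables, the language codes and the function names are invented for illustration and are not taken from the Lingo server, and a simple dictionary lookup stands in for the statistical methods mentioned above.

```python
# Hypothetical phrase tables: each maps a phrase in one language to its
# equivalent in the reference language (English in this sketch).
TO_REFERENCE = {
    "cy": {"bore da": "good morning", "diolch": "thank you"},
    "de": {"guten Morgen": "good morning", "danke": "thank you"},
}


def invert(table):
    """Build the reverse table: reference-language phrase -> local phrase."""
    return {ref: phrase for phrase, ref in table.items()}


def translate(phrase, source, target, reference="en"):
    """Translate by pivoting through the reference language.

    Returns None when either step has no entry for the phrase.
    """
    # Step 1: source language -> reference language.
    if source == reference:
        pivot = phrase
    else:
        pivot = TO_REFERENCE.get(source, {}).get(phrase)
    if pivot is None:
        return None
    # Step 2: reference language -> target language.
    if target == reference:
        return pivot
    return invert(TO_REFERENCE.get(target, {})).get(pivot)


print(translate("bore da", "cy", "de"))
print(translate("danke", "de", "cy"))
```

The attraction of this design is that each new language needs only one table, to and from the reference language, rather than a separate table for every pair of supported languages.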

The current phase of the project seeks to assess whether currently affordable hardware is fit for purpose, and to augment the available copyright-free translated material (usually more than 50 years old) with contributed material which reflects modern usage and idiom.

Supported Languages

In its earliest phase, Lingo was framed to support the languages of the countries of Europe. It soon became clear that an inclusive approach, encompassing expatriate communities living within Europe, would mean effectively supporting languages from all parts of the world. This prospect was, however, made much less daunting by the publication in 2006 of version 5.0 of the 'Unicode' standard, adopted across the computing industry for encoding all the writing systems of the world's languages [3].

Unicode is supported by all modern web browsers, and the Lingo server software is Unicode-capable throughout by design. This takes care of intricacies like treatment of left-to-right and right-to-left directional renditions of content in different language versions of the same material. Text is stored in the character-order in which it is entered through a keyboard, and displayed text ordering is handled by the web browser's internal processing of the Unicode characters and their attributes.
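The directional attributes referred to above can be inspected with Python's standard `unicodedata` module. The sketch below is illustrative only and is not part of the Lingo server software; the `base_direction` helper is a hypothetical name, and it applies a simplified "first strong character" rule rather than the full Unicode bidirectional algorithm that a browser implements.

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


def base_direction(text):
    """Guess paragraph direction from the first strongly directional character.

    'L' is a left-to-right letter; 'R' and 'AL' are right-to-left letters
    (for example Hebrew and Arabic respectively). Text with no strong
    character defaults to left-to-right in this simplified sketch.
    """
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    return "ltr"


# Hebrew "shalom", held in the order the letters are typed.
hebrew = "\u05e9\u05dc\u05d5\u05dd"
mixed = "iEARN " + hebrew

print(bidi_classes(mixed))
print(base_direction(mixed), base_direction(hebrew))
```

The string itself is never reversed in storage; only the per-character attributes differ, and the display software uses them to decide the visual order.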

The Ethnologue database (www.ethnologue.org) holds details of more than 6000 "living" languages across the world, and Lingo would be happy to support any of those which is taught or studied at any level in a school or university anywhere, either as a curriculum subject or as an instruction medium.

The key criterion in deciding which to support initially has been whether there exists some enthusiasm for the presence of a particular language on the Internet. Evidence for this has been drawn from Wikipedia (http://meta.wikimedia.org/wiki/List_of_Wikipedias): broadly, if there is an active Wikipedia attracting entries in a particular language right now, then Lingo supports it. Minority languages having "official" status within a sovereign territory are also supported in general, although there may be exceptions.

Please have a look at the languages listed on the new iEARN Europe site: www.iearneurope.org. If your language isn't there, even after you click "List All", just ask and it will be added. Contribute, use and enjoy!

--

References:

1. Global Reach Internet Statistics: www.glreach.com/globstats/refs.php3

2. Knight, K (1999) ‘A statistical MT Tutorial Workbook’: www.isi.edu/natural-language/mt/wkbk.rtf

3. The Unicode Consortium (2006) ‘The Unicode 5.0 Standard’ (ISBN 0-321-48091-0)

 