Projects

From Goireasan Akerbeltz

I'm writing these up in reverse order: I've done a dozen or so projects by now, but I'm starting with the ones I'm currently working on.

Adaptxt

A neat open-source project for predictive texting. Given how much people text, you should really consider this for your language if no such thing exists currently. The product site is here but the hard work happens here. You might as well bookmark both.

You'll most likely need someone who can do a little bit of Unix-style code, and probably a Linux system or at least an emulator.

Basics

You'll need the following ingredients:

  • an inclusion.txt file
  • a corpus.txt file
  • an xml file

The inclusion file is a long text file containing all the words in your language you want in the program. If a spellchecker exists for your language, that can be a good starting point.

The corpus file is used to give statistical weighting to each word. For example, thimble and thing both begin with thi. But thing is way more common than thimble and the corpus file contains the raw data which will add that info. Before you moan that your language doesn't have a corpus - Adaptxt will work reasonably well even if you can't provide such statistical data, not least of all because it learns as you use it. But it works better straight off if you do.

The xml file is fairly easy. It contains data about the codes for your language (for example, Irish Gaelic is gleie: gle for Irish, ie for Ireland) and so on.

The inclusion file

Checklist:

  • each word once, followed by a space and a comma
  • encoding must be Unicode (UCS 2 Little Endian to be precise)
  • an initial list of items beginning with numbers
  • an initial list of items beginning with www
  • each letter used in your language
  • any letter you may be using for elision once with and once without the eliding character (more on that in the Elision section)

The list doesn't have to be alphabetical. I find it helpful to have the numbers in one section, the words in the next, the www stuff and the letters of the alphabet and the elision characters at the end.

So mine starts with
0:00 ,
0:30 ,
and so on, then
AE ,
Abar ,
Aden ,
Adhamhnan ,
Afganastan ,
Afraga ,
Ailean ,
and so on, then
www ,
www.aa.com ,
www.adaptxt.com ,
www.adobe.com ,
and then finally
a ,
à ,
á ,
b ,
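
The checklist above can be sketched in a few lines of Python. This is only a rough illustration, not part of the Adaptxt tooling: the function names are made up, and I'm assuming that UTF-16 LE with a byte order mark and plain newline line endings is what "UCS 2 Little Endian" comes down to in practice.

```python
def build_inclusion_text(words):
    """Return inclusion-file text: each distinct word once, followed by ' ,'."""
    seen = set()
    lines = []
    for word in words:
        word = word.strip()
        if word and word not in seen:  # keep each word only once, in input order
            seen.add(word)
            lines.append(f"{word} ,")
    return "\n".join(lines) + "\n"


def encode_inclusion(text):
    """Encode the text as little-endian UTF-16 with a leading byte order mark.

    Assumption: whether Adaptxt wants the byte order mark is not stated
    in the notes above, so check against a known-good file.
    """
    return b"\xff\xfe" + text.encode("utf-16-le")


if __name__ == "__main__":
    sample = ["0:00", "0:30", "AE", "Abar", "www", "www.adaptxt.com", "a", "à", "á", "b"]
    print(build_inclusion_text(sample), end="")
```

To write the file itself you would pass the result of encode_inclusion() to open("inclusion.txt", "wb").write().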

The corpus file

At the very basic level, the corpus file needs to contain the same entries as your inclusion file, but if you leave it at that, remember that the prediction won't be very smart. To have smarter prediction, what you need is a separate corpus (basically a load of text in your language). You analyse that for word frequencies. Let's say we're working on Gaelic and you used the following mini-corpus:

Tha cù aig mo mhàthair. Tha cù aig mo bhràthair cuideachd. Ach chan eil cù agamsa.

If you analyse that for word frequencies, you get the following:

Ach 1
Tha 2
agamsa 1
aig 2
bhràthair 1
chan 1
cuideachd 1
cù 3
eil 1
mhàthair 1
mo 2
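
If you don't have a tool for counting, here is a minimal Python sketch of the analysis. The function name is my own, and it simply treats every run of letters as a word, which is good enough for the mini-corpus but will need refining for apostrophes and hyphens in real text.

```python
import re
import unicodedata
from collections import Counter


def word_frequencies(text):
    """Count how often each word occurs in a piece of text."""
    # Compose accented letters first so that 'ù' is one character, not u + accent.
    text = unicodedata.normalize("NFC", text)
    # \w+ matches runs of letters and digits, including accented letters.
    return Counter(re.findall(r"\w+", text))


CORPUS = (
    "Tha cù aig mo mhàthair. Tha cù aig mo bhràthair cuideachd. "
    "Ach chan eil cù agamsa."
)

if __name__ == "__main__":
    counts = word_frequencies(CORPUS)
    for word in sorted(counts):
        print(word, counts[word])
```

Note that it keeps upper and lower case apart (Tha and tha would be counted separately), the same as in the list above.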

Firefox

If you're a good translator but no great shakes at code, your best bet is the Locamotion project. It's essentially a Pootle server (a server which handles po translation files) where you grab the po files and translate them, either online or using a translation memory, whatever you prefer, and then the admins deal with most of the black magic to put them where they need to be on Mozilla. Gaelic and Welsh are on there, as are a lot of African languages. [More to come...]

Aurora, Beta and Central?

When you look at the signoff table for a locale (for example the Gaelic one), you will see three versions of each product: Aurora, Beta and Central. The short version is, they're different ... buckets with more or less the same stuff in them. Central is where the developers play, so it changes a lot and is not really interesting for translators. From a translation point of view, you really only want to maintain Aurora.

Careful though, occasionally you get a project which takes its translations from Beta rather than Aurora. Lightning, the calendar application, is one example.

Jitsi

Jitsi has a funny thing about using ' in strings which have placeholders. Very annoying if your language uses a lot of them. The simplest solution is to grab your po file and run the following script in the terminal of a Linux machine. It turns all single ' into '' which solves the problem:

$ cat gd.po | msgcat --no-wrap - | sed "/^msgstr/s/''/'/g" | sed "/^msgstr/s/'/''/g" | msgcat - > gd2.po

If you've done some manually, don't worry, you won't end up with ''''.
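
If you have no Linux terminal to hand, the two sed steps can be approximated in Python. This is just a sketch under one big assumption: it skips the msgcat step, so it only touches lines that start with msgstr and will miss wrapped continuation lines. The function name is made up.

```python
def double_apostrophes(po_text):
    """Turn every ' in msgstr lines into '' without producing ''''."""
    fixed = []
    for line in po_text.splitlines(keepends=True):
        if line.startswith("msgstr"):
            # Step 1: collapse any '' already doubled by hand back to '.
            # Step 2: double every ' that is left.
            line = line.replace("''", "'").replace("'", "''")
        fixed.append(line)
    return "".join(fixed)


if __name__ == "__main__":
    sample = 'msgid "It\'s {0}"\nmsgstr "\'S e {0}"\n'
    print(double_apostrophes(sample), end="")
```

Running it twice over the same text gives the same result, which is the same safety net the sed version has.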

reddit

On the whole, a fairly straightforward project but with some niggly annoyances. Here's a roadmap of what to do:

  1. Sign up on the reddit Transifex page AND post to the reddit i18n site that you want to translate into your language. Start translating.
  2. Make sure translations are reviewed, otherwise they won't go live. If you're from a bigger language, it's best to find a collaborator and get them to request reviewer rights on /r/i18n. If you're a small locale, the (single) translator can also be granted reviewer rights.
  3. When you get to about 25%, post on /r/i18n and request that someone enables the new locale.
  4. Continue translating and reviewing (this isn't stressed enough in the documentation).
  5. Updating the strings from Transifex is a manual job for the admins; it happens about once every 2-4 weeks.

Ubuntu

The translation of Ubuntu takes place on a platform called Launchpad. Beware: there are thousands of localization projects on Launchpad but there are several catches which are not immediately obvious. The two most fundamental ones are:

  1. What's in Launchpad, as a rule of thumb, only feeds into Ubuntu, not any of the other Linux versions or even the upstream projects. For example, if you translate GIMP in Launchpad, it will go into Ubuntu but not into the main GIMP project, so you won't get a build (version) for Windows or Mac.
  2. There are many dead projects which should not be on Launchpad at all, like Firefox. Only about 5% are actually specifically for Ubuntu.

Soooo... it's safe to translate the main Ubuntu parts on Launchpad (that's the official localization site for Ubuntu). That's basically anything on the first page of the modules. For anything else, best ask the Ubuntu mailing list. I know it sucks. If you do translate stuff like GIMP on Launchpad, download the file and make sure it goes to the main GIMP project. When in doubt, ask the list. I think Launchpad had really big plans which didn't work out. I'm working on a detailed list here.

Locale stuff

The locale data for Ubuntu is held here on 2xlibre. Steps for correcting locale data:

  1. register and log in
  2. edit with the online editor
  3. export in glibc format (text) with the link at bottom
  4. review the differences with the original file and keep only those that make sense
  5. make a diff and attach it to a new bug report in bugzilla for glibc. Then be patient and wait for a glibc maintainer
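
For the last two steps, any diff tool will do. As a rough sketch, Python's difflib can produce a unified diff from the original and the edited text. The function name, the gd_GB locale name and the path labels in the diff header are only my assumptions about how you might label the files, and the sample lines are simplified.

```python
import difflib


def locale_diff(original_text, edited_text, name="gd_GB"):
    """Return a unified diff between the original and the edited locale file."""
    return "".join(
        difflib.unified_diff(
            original_text.splitlines(keepends=True),
            edited_text.splitlines(keepends=True),
            # Hypothetical labels; use whatever paths the maintainers expect.
            fromfile=f"a/localedata/locales/{name}",
            tofile=f"b/localedata/locales/{name}",
        )
    )


if __name__ == "__main__":
    old = 'yesexpr "^[yY].*"\nnoexpr "^[nN].*"\n'
    new = 'yesexpr "^[tTyY].*"\nnoexpr "^[nN].*"\n'
    print(locale_diff(old, new), end="")
```

Reading through the output line by line is a handy way of doing the review in step 4 before you attach anything.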

The yes/no stuff just looks confusing. The English is ^[yY].* which means that any answer beginning with y or Y will be interpreted as a Yes response. You can add to that, so for Gaelic we changed this to ^[tTyY].* (the extra t and T are for Tha).
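
To see what the two expressions accept, here is a small Python check. It's only an approximation, since glibc uses its own regular expression engine rather than Python's, but for patterns this simple the two should agree. The names are made up for the example.

```python
import re

YES_ENGLISH = re.compile(r"^[yY].*")   # the English expression
YES_GAELIC = re.compile(r"^[tTyY].*")  # the Gaelic one, with t and T added for Tha


def is_yes(answer, pattern):
    """Return True if the answer counts as a Yes under the given expression."""
    return pattern.match(answer) is not None


if __name__ == "__main__":
    for answer in ["y", "Yes", "Tha", "tha", "Chan eil"]:
        print(answer, is_yes(answer, YES_ENGLISH), is_yes(answer, YES_GAELIC))
```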

Keyboard

The default keyboards are defined in the xkeyboard-config file. If there is no localized layout file, the defaults in keyboard-configuration will be used. Current layouts are here; click on plain next to your locale to see the keyboards for it. Basically, the one at the top is the default and the others are offered as alternatives.

Common translation issues

  • Plural strings: some plural strings that use %'d as a placeholder throw an error; use %d instead, that should solve the problem.
  • Placeholder order: if you get two %s or %d placeholders in a string and you have to change the order because of the syntax of your language, change %s %s to %2s %1s. Don't use %$2s %$1s as in some other localization projects, or Launchpad will throw a stupid "msgstr is not a valid Python format string, unlike msgid. Reason: In the directive number 1, the character $ is not a valid conversion specifier." error.
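
Since the error message talks about Python format strings, a quick check in plain Python shows which placeholder forms are accepted at all. This is only a rough stand-in for whatever Launchpad actually runs, and try_format is a made-up helper. One thing it does show: Python itself reads %2s as "pad to two characters" rather than "second argument", so it's worth checking in the running program that the words really do change places. Named placeholders such as %(first)s, where the original string uses them, can be moved around freely.

```python
def try_format(template, args):
    """Format the template with %, or report why Python rejects it."""
    try:
        return template % args
    except (ValueError, TypeError) as exc:
        return f"rejected: {exc}"


if __name__ == "__main__":
    pair = ("cat", "dog")
    print(try_format("%s %s", pair))        # accepted
    print(try_format("%2s %1s", pair))      # accepted, the digits are widths
    print(try_format("%2$s %1$s", pair))    # rejected because of the $
    print(try_format("%'d", (5,)))          # rejected because of the '
    print(try_format("%(second)s %(first)s", {"first": "cat", "second": "dog"}))
```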

Working offline

You can get the po files and translate offline, then put them online again. Common issues:

  • An uploaded po does not go live immediately, even if you have rights as a locale admin. Sucks, I know. All in all, allow 24-48 hours before it will show.

WordPress

Nice project. DUH setup. If you're thinking of localizing WordPress, the first question you actually have to ask is - which one? There are two. There's WordPress.org, which is the one you install yourself on your own hosting package. Has the advantage of being ad-free. Then there's WordPress.com where you just sign up and WordPress hosts your blog. Easier to set up but unless you pay, you get a small amount of advertising. Yes, I know, really clear naming convention. Anyway. They have also, in their inscrutable wisdom, split the translation projects into two (for .org and .com), so your entry points are different. You should really use a translation memory if you're wanting to do both because there are hundreds, if not thousands, of strings which are the same.

WordPress.org

In short (more to follow) this is the overview page which lists all the versions which are (being) localized and if you drill down (i.e. click around) you will get the full list of what you have to translate. To get permissions (if you're a new locale) your entry point is WP Polyglots, which confusingly is hosted on WordPress.com. See what I mean? Anyway, get an account and post that you need a new locale set up. Take a note of the tagging policy because otherwise you might wait a long time before anyone answers.

WordPress.com

All the current languages are listed here. Yup, it's a large file but there are two bits of good news. Once you reach about 30%, they will activate your language, so you can go live or at least proofread before you've done them all. And you update the po files yourself; about twice a week someone does some black magic at the WordPress end to make them go live. It's not instantaneous.

The support site for WordPress.com is - yes, you guessed it - different from .org. <sigh> They call it GlotPress Support.

One word of advice - if you're a tiny team or pressed for time otherwise, I'd get an account on .com and work off the most important strings first.

VLC - VideoLAN

Neat little multimedia program, plays just about anything under the sun on just about any platform. Translating it into a new language is surprisingly easy, for once a project I'd give an A rating. So what's the best way?

Translating

  1. Unless you're working with a big team, it's a big file. There are over 30,000 words to do, plus about 400 for the main page of the website, if you want that in your language. But here's a trick: sign up to Locamotion, a Pootle server, and grab the Minimal VLC file. It's a little out of date but not much. It's only 3,400 words or so but covers 95% of the interface that most people will ever see (I've been using VLC for years and it would certainly cover most of what I see of the interface). Translate that.
  2. Sign up to the VLC mailing list and ask someone to merge your Minimal VLC file with the main po file. Transifex seems to be the platform of choice for VLC at the moment for languages which want a low-tech approach, so they'll probably put it up there and getting an account will make sense.
  3. Once you've checked them in the live program (hang on, that's coming up below), tell them you want to be included in the next release, whenever that is.
  4. At least for some time you'll be wanting to update your translations regularly. Once you've done a chunk, upload them to Transifex and post to the list to let them know you've updated them there (or that you've uploaded the latest .po file somewhere else).

Once you have a release, the updates to the translations are tied to the release cycle. Meaning that because the translations are bundled with the download users grab, you can only update the translation for everybody else when there's a new version, for example moving from x.0.2 to x.0.3.

Testing

Now, testing - there's a neat trick that's not obvious in the help stuff (probably the only shortfall of the VLC translation project, it's a bit thin in general). It's a little convoluted but allows you to test your translation in the live program immediately - no need for nightly builds or stuff like that.

  1. Take your po file and, even if you don't normally use Poedit to translate, open it in Poedit and save it. No need to edit anything. This creates a .mo file in the same folder that your .po file is in. Rename that to vlc.mo (if it's not already called that).
  2. Copy that file
  3. Find the installation folder of VLC. On my machine, that happens to be Programmes (86) » VideoLAN.
  4. Go to VLC » Locale and pick a language you don't use. Say, Spanish. Go down into the LC_MESSAGES folder and paste your vlc.mo over the .mo file already in there.
  5. Go to VLC, open Tools » Options and select Spanish as your interface language. Then restart VLC and hey presto.
  6. You can keep replacing the same .mo file, just remember to restart and you'll see your translations pop up.
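
If you end up doing this copy-and-paste routine a lot, the steps above can be scripted. Here's a rough Python sketch; the function name is my own, and the folder layout (a locale folder inside the VLC folder, then the language code, then LC_MESSAGES) and the es code for Spanish are assumptions based on my own install, so check yours first. It also keeps a backup of the .mo file it overwrites.

```python
import shutil
from pathlib import Path


def install_test_mo(mo_file, vlc_dir, borrowed_locale="es"):
    """Copy a compiled .mo file over the vlc.mo of a language you don't use."""
    target_dir = Path(vlc_dir) / "locale" / borrowed_locale / "LC_MESSAGES"
    if not target_dir.is_dir():
        raise FileNotFoundError(f"No such folder: {target_dir}")
    target = target_dir / "vlc.mo"
    backup = target_dir / "vlc.mo.bak"
    # Keep the original translation once, so the borrowed language can be restored.
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(mo_file, target)
    return target
```

You would call it with the path to your own .mo file and the VLC installation folder, then restart VLC as in step 5. On Windows you may need administrator rights to write into the installation folder.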

The really neat thing, in passing, is the fluid approach they've taken to width - not messing about with setting menu widths or having to truncate or shorten strings. Within reason, the menu adapts to the width of the longest string. Such a refreshing difference I can tell you...

Release checklist

Now, getting it to release... make sure that in the runup to a release you keep reminding them to commit the latest version of your po. Check this page and if the percentage looks wrong, ask them.

There's a cutoff for being included in the dropdown menu under Tools » Options. Guidance isn't clear on this but anything above 20% should be good enough. Make sure that you pester them often enough about putting it on the list or you'll end up with a release that you can't get to (meaning that the language file will be distributed with the install file but that unless you're good with code, you can't get at it)... yes, speaking from experience.
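
If you want a rough idea of your own percentage before asking, here is a small Python sketch that counts translated entries in a po file. It's a deliberately naive reader, not a replacement for the gettext tools: it counts strings rather than words, treats fuzzy entries as untranslated, and the VLC statistics may well be calculated differently. The function names and the sample entries are made up for the example.

```python
import re


def _quoted(line):
    """Return the text between the first and last double quote on a line."""
    start, end = line.find('"'), line.rfind('"')
    return line[start + 1:end] if 0 <= start < end else ""


def translation_stats(po_text):
    """Return (translated, total) entry counts for the text of a po file."""
    translated = total = 0
    for block in re.split(r"\n\s*\n", po_text.strip()):
        lines = block.splitlines()
        fuzzy = any(l.startswith("#,") and "fuzzy" in l for l in lines)
        msgid, forms, target = [], [], None
        for line in lines:
            if line.startswith("msgid_plural"):
                target = None  # the plural source text is not needed here
            elif line.startswith("msgid"):
                msgid.append(_quoted(line))
                target = msgid
            elif line.startswith("msgstr"):
                forms.append([_quoted(line)])
                target = forms[-1]
            elif line.startswith('"') and target is not None:
                target.append(_quoted(line))  # continuation line
        if not "".join(msgid):
            continue  # header (empty msgid), obsolete or comment-only block
        total += 1
        if forms and not fuzzy and all("".join(form) for form in forms):
            translated += 1
    return translated, total


def percent_translated(po_text):
    translated, total = translation_stats(po_text)
    return 100.0 * translated / total if total else 0.0


SAMPLE_PO = '''msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\\n"

msgid "Play"
msgstr "Cluich"

msgid "Pause"
msgstr ""

#, fuzzy
msgid "Stop"
msgstr "Stad"

msgid "Open"
msgstr "Fosgail"
'''

if __name__ == "__main__":
    print(percent_translated(SAMPLE_PO))
```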
