Difference between revisions of "Projects"

From Goireasan Akerbeltz

Revision as of 19:55, 13 February 2012

I'm writing these in reverse order in the sense that I've done a dozen or so projects now but I'm starting with the ones I'm currently working on first.

Adaptxt

A neat open-source project for predictive texting. Given how much people text, you should really consider this for your language if no such thing exists currently. The product site is here but the hard work happens here. You might as well bookmark both.

You'll most likely need someone who can write a little bit of Unix-style code, and probably a Linux system or at least an emulator.

Basics

You'll need the following ingredients:

  • an inclusion.txt file
  • a corpus.txt file
  • an xml file

The inclusion file is a long text file containing all the words in your language you want in the program. If a spellchecker exists for your language, that can be a good starting point.

The corpus file is used to give statistical weighting to each word. For example, thimble and thing both begin with thi. But thing is way more common than thimble and the corpus file contains the raw data which will add that info. Before you moan that your language doesn't have a corpus - Adaptxt will work reasonably well even if you can't provide such statistical data, not least of all because it learns as you use it. But it works better straight off if you do.

The xml file is fairly easy. It contains data about the codes for your language (for example, Irish Gaelic is gleie: gle for Irish, ie for Ireland) and so on.

The inclusion file

Checklist:

  • each word once, followed by a space and a comma
  • encoding must be Unicode (UCS-2 Little Endian to be precise)
  • a list of items that begin with a number (0:00, 0:30 and so on)
  • a list of items that begin with www (www.adaptxt.com and so on)
  • each letter used in your language
  • any letter you may be using for elision once with and once without the eliding character (more on that in the Elision section)

The list doesn't have to be alphabetical. I find it helpful to have the numbers in one section, the words in the next, the www stuff and the letters of the alphabet and the elision characters at the end.

So mine starts with
0:00 ,
0:30 ,
and so on, then
AE ,
Abar ,
Aden ,
Adhamhnan ,
Afganastan ,
Afraga ,
Ailean ,
and so on, then
www ,
www.aa.com ,
www.adaptxt.com ,
www.adobe.com ,
and then finally
a ,
à ,
á ,
b ,
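
The checklist above can be sketched in Python. This is only a rough illustration, not part of Adaptxt: the function names are made up, and the byte-order mark and the line ending are assumptions, since the checklist only specifies the encoding and the "word, space, comma" layout.

```python
def format_inclusion_lines(words):
    """Return one 'word ,' line per unique word, keeping first-seen order."""
    seen = set()
    lines = []
    for word in words:
        word = word.strip()
        if word and word not in seen:  # each word once
            seen.add(word)
            lines.append(f"{word} ,")  # word, then a space and a comma
    return lines


def encode_inclusion(words, newline="\n", bom=True):
    """Encode the list as UTF-16 little-endian bytes.

    For characters in the Basic Multilingual Plane this is the same as
    UCS-2 Little Endian. The BOM and the newline style are assumptions.
    """
    text = newline.join(format_inclusion_lines(words)) + newline
    data = text.encode("utf-16-le")
    return (b"\xff\xfe" + data) if bom else data


if __name__ == "__main__":
    sample = ["0:00", "0:30", "AE", "Abar", "www", "www.adaptxt.com", "a", "à", "á", "b"]
    # "inclusion.txt" is the file name used in the ingredients list above.
    with open("inclusion.txt", "wb") as handle:
        handle.write(encode_inclusion(sample))
```

If your real word list comes from a spellchecker dictionary, you would feed its entries in as the words argument instead of the short sample.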

The corpus file

The corpus file contains at least one thing. At the very basic level, it needs to contain the same entries as your inclusion file but if you do that, remember that the prediction won't be very smart. To have smarter prediction, what you need is a separate corpus (basically a load of text in your language). You analyse that for word frequencies. Let's say we're working on Gaelic and you used the following mini-corpus:

Tha cù aig mo mhàthair. Tha cù aig mo bhràthair cuideachd. Ach chan eil cù agamsa.

If you analyse that for word frequencies, you get the following:

Ach 1
Tha 2
agamsa 1
aig 2
bhràthair 1
chan 1
cuideachd 1
cù 3
eil 1
mhàthair 1
mo 2
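
One way to produce such a count is a short Python script. This is a minimal sketch, not the tool used for this page: the function name is invented, and the simple word pattern is an assumption that splits words at apostrophes and hyphens, which you may need to adjust for your language.

```python
import re
import unicodedata
from collections import Counter


def word_frequencies(text):
    """Count how often each word form occurs (case-sensitive)."""
    # Normalise to NFC so that accented letters such as ù are single
    # characters and are matched as part of the word.
    text = unicodedata.normalize("NFC", text)
    # Assumption: a word is a run of letters, digits or underscores.
    words = re.findall(r"\w+", text)
    return Counter(words)


corpus = (
    "Tha cù aig mo mhàthair. Tha cù aig mo bhràthair cuideachd. "
    "Ach chan eil cù agamsa."
)

frequencies = word_frequencies(corpus)
# Sorting by code point puts capitalised words first, as in the list above.
for word, count in sorted(frequencies.items()):
    print(word, count)
```

For a real corpus you would read the text from a file rather than from a string, and you may want to decide whether Tha and tha should count as the same word.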
