How glossary matching works in Felix

Built-in glossary searching is one of the key features of Felix. In this post, I want to describe how the glossary searching algorithm works, and how results are displayed.

Finding Matches

Felix offers two modes for glossary searches. If you set a minimum score below 100% (Tools >> Preferences >> Glossary >> “Minimum fuzzy score”), then Felix will do fuzzy matching based on the Levenshtein (edit) distance.

If you set the score to 100%, Felix will count only perfect matches.

To give an idea of what this means, consider this glossary entry:


Now, say you’re translating this sentence:

Put the aaCaa in the box.

If you’re not using fuzzy matching, then no match will be found for aaCaa. If you set the fuzzy threshold to around 80%, then this will be retrieved as a candidate.
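To make the scoring concrete, here is a minimal sketch in Python of fuzzy glossary matching based on edit distance. It is not Felix's actual code: the score formula (one minus the distance divided by the longer length), the fixed-width sliding window, and the function names are all assumptions made for illustration, and the glossary term "aaBaa" is a hypothetical stand-in entry.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_score(term: str, candidate: str) -> float:
    """Similarity in [0, 1]. The formula is an assumption, not Felix's own."""
    longest = max(len(term), len(candidate))
    if longest == 0:
        return 1.0
    return (longest - levenshtein(term, candidate)) / longest


def find_glossary_matches(sentence, glossary_terms, minimum_score):
    """Return (term, matched_text, score) for each term that meets the threshold.

    Simplification: only windows exactly as long as the term are compared.
    """
    matches = []
    for term in glossary_terms:
        width = len(term)
        best = None
        for start in range(max(1, len(sentence) - width + 1)):
            window = sentence[start:start + width]
            score = fuzzy_score(term, window)
            if best is None or score > best[2]:
                best = (term, window, score)
        if best is not None and best[2] >= minimum_score:
            matches.append(best)
    return matches


sentence = "Put the aaCaa in the box."
hypothetical_glossary = ["aaBaa"]  # stand-in entry, not taken from Felix

print(find_glossary_matches(sentence, hypothetical_glossary, 1.0))  # -> []
print(find_glossary_matches(sentence, hypothetical_glossary, 0.8))
# -> [('aaBaa', 'aaCaa', 0.8)]
```

With one substitution in a five-character term, the score under this formula is 4/5, so the candidate appears at an 80% threshold and disappears at 100%.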

You can also set whether to ignore case, wide/narrow character distinctions, and distinctions between Hiragana and Katakana.

Case: “aaa” is the same as “AAA”
Wide/narrow: “１２３” is the same as “123”
Hiragana/Katakana: “いろは” is the same as “イロハ”
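One way to picture these three options is as a normalization step applied to both the glossary term and the text before they are compared. The sketch below is an assumption about how this could be done in Python, not a description of Felix's internals: it uses Unicode NFKC normalization for width (which also folds some other compatibility characters), `casefold()` for case, and a fixed code-point offset to map Katakana onto Hiragana. The function and parameter names are hypothetical.

```python
import unicodedata

KATAKANA_START = 0x30A1  # small a (Katakana)
KATAKANA_END = 0x30F6    # small ke (Katakana)
KANA_OFFSET = 0x60       # distance between the Katakana and Hiragana blocks


def katakana_to_hiragana(text: str) -> str:
    """Shift each Katakana character to its Hiragana counterpart."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_START <= ord(ch) <= KATAKANA_END
        else ch
        for ch in text
    )


def normalize(text, ignore_case=True, ignore_width=True, ignore_kana=True):
    """Fold away the distinctions the user has chosen to ignore."""
    if ignore_width:
        text = unicodedata.normalize("NFKC", text)  # wide -> narrow forms
    if ignore_case:
        text = text.casefold()
    if ignore_kana:
        text = katakana_to_hiragana(text)
    return text


print(normalize("AAA") == normalize("aaa"))        # -> True
print(normalize("１２３") == normalize("123"))      # -> True
print(normalize("イロハ") == normalize("いろは"))    # -> True
```

Normalizing both sides first keeps the matching code itself unchanged, whichever options are switched on.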

Displaying Results

All the glossary matches for the current sentence are displayed in the glossary window. The matches are sorted by reference count, then string length, then score. That is, the match with the highest reference count is shown first in the list of matches; if two matches have the same reference count, then the longer match goes first; and any remaining ties are broken by score.

Reference count: The number of times the translation has been retrieved by the user
String length: How long the source word/phrase is
Score: How close the match is (when fuzzy glossary matching is used)
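The ordering described above amounts to a sort on three keys. Here is a small Python sketch of it; the `GlossaryMatch` class and its field names are hypothetical, and it assumes that a higher score ranks first when the other two keys tie.

```python
from dataclasses import dataclass


@dataclass
class GlossaryMatch:
    source: str            # source word/phrase
    translation: str
    reference_count: int   # times the translation has been retrieved
    score: float           # 1.0 for a perfect match


def sort_matches(matches):
    """Order by reference count, then source length, then score (all descending)."""
    return sorted(
        matches,
        key=lambda m: (-m.reference_count, -len(m.source), -m.score),
    )


matches = [
    GlossaryMatch("box", "箱", 1, 1.0),
    GlossaryMatch("the box", "その箱", 3, 0.8),
    GlossaryMatch("Put", "置く", 3, 1.0),
]
for match in sort_matches(matches):
    print(match.reference_count, match.source, match.score)
# -> 3 the box 0.8
# -> 3 Put 1.0
# -> 1 box 1.0
```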

Room for Improvement

There are several ways in which the glossary matching algorithm could be improved. Felix user Steven Venti proposed a search algorithm that I would characterize as based on “closeness” or “stickiness,” and gave the program Jamming (Japanese) as an example of a program that does dictionary searches very well.

Another feature I’ve been thinking about for a while is the ability to create rule-based glossary entries, using wildcards or regular expressions. For example, you could use such rules to translate dates, or product names that follow set patterns.
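As a rough illustration of what a rule-based entry might look like, the Python sketch below pairs a regular expression for Japanese-style dates with a function that builds the English rendering. This is purely hypothetical: the feature does not exist in this form, and the pattern, the output format, and all names are assumptions.

```python
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

# Hypothetical rule: a date written as year, month, day with kanji markers.
DATE_RULE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def match_date_rule(sentence):
    """Return (source_text, suggested_translation) for each date found."""
    results = []
    for found in DATE_RULE.finditer(sentence):
        year, month, day = (int(group) for group in found.groups())
        if 1 <= month <= 12:  # skip strings that only look like dates
            results.append((found.group(0), f"{MONTHS[month - 1]} {day}, {year}"))
    return results


print(match_date_rule("2008年11月12日に取得した。"))
# -> [('2008年11月12日', 'November 12, 2008')]
```

A single rule like this would stand in for what would otherwise be one glossary entry per date.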

The way that matches are displayed can also be improved. I could make it possible for users to determine the sort criteria (what order matches are displayed in), both through preferences and dynamically. I’m also planning to make it possible to easily show and hide details about glossary matches — for example, click “details” to show all the information about the match, such as creator and date created, and “minimal” to show just the source and translation (thus allowing more matches to be shown at once).

In a way, being able to specify the order in which matches are displayed could make up for the “feast or famine” problem that Steven mentions: getting either too few or too many matches. If you could set the minimum score low enough to get lots of matches, but arrange for the matches you want to be shown first, I think that would go a long way toward improving usability.

Felix resources

Felix glossaries compiled from Wiktionary

I’ve just added 1,388 new glossaries covering 43 languages, compiled from the Wiktionary project.

Go to Felix Wiktionary glossaries page

Wiktionary is a community-contributed dictionary site that is a spin-off of Wikipedia. There are hundreds of languages on Wiktionary, but I narrowed this down to 43 using a list of the 50 most widely spoken languages in the world.

The glossaries were compiled from a site snapshot taken on November 12, 2008. I scanned through the XML site download, created lists of all translation pairs, and then compiled Felix glossaries from them.
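The general shape of that process can be sketched in Python as follows. This is not the script that was actually used. It assumes the dump is a MediaWiki XML export with `page`, `title`, and `text` elements, and that translations appear in the wikitext as templates of the form `{{t|fr|chien}}`; the real snapshot may mark translations differently, and the function names are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumed translation markup: {{t|lang|word}}, also {{t+|...}} and {{t-|...}}.
TRANSLATION = re.compile(r"\{\{t[+-]?\|([a-z-]+)\|([^|}]+)")


def local_name(tag: str) -> str:
    """Drop any XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def extract_pairs(xml_stream):
    """Stream through the dump and collect (headword, translation) per language."""
    pairs = {}
    title = None
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text" and title and elem.text:
            for lang, word in TRANSLATION.findall(elem.text):
                pairs.setdefault(lang, []).append((title, word.strip()))
        elif name == "page":
            title = None
            elem.clear()  # free memory; real dumps are very large
    return pairs


sample = (
    "<mediawiki><page><title>dog</title><revision>"
    "<text>* French: {{t|fr|chien}}\n* German: {{t+|de|Hund}}</text>"
    "</revision></page></mediawiki>"
)
print(extract_pairs(io.BytesIO(sample.encode("utf-8"))))
# -> {'fr': [('dog', 'chien')], 'de': [('dog', 'Hund')]}
```

Each per-language list could then be written out as one glossary file.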

Wiktionary is licensed under the GNU Free Documentation License, and so are the Felix glossaries compiled from it.

Other tools

Using Microsoft Excel as a glossary-conversion tool

As translators, we get glossaries in all sorts of formats: XML, HTML, tab-delimited text, comma-separated value (CSV), …

A good example is the Microsoft terminology glossary: a monstrous CSV file of terminology used for localizing Microsoft user interfaces.

We often need to convert these glossaries into other formats, especially to get them into a terminology management program. Microsoft Excel is actually a great tool for doing this. It can open all the formats listed above, and more. If you’re using Felix, you can then import the glossary directly; if you’re using some other tool, you can save the glossary in one of many popular formats, such as tab-delimited text or CSV. Chances are your terminology manager will support one of them.
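For a simple CSV glossary, the same kind of conversion can also be scripted. The following Python sketch turns CSV text into tab-delimited text using the standard `csv` module; the function name is hypothetical, and it assumes that no field contains a tab or a line break.

```python
import csv
import io


def csv_to_tab_delimited(csv_text: str) -> str:
    """Convert comma-separated glossary text to tab-delimited text."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    output = io.StringIO(newline="")
    writer = csv.writer(output, delimiter="\t", lineterminator="\n")
    for row in reader:
        if row:  # skip blank lines
            writer.writerow(row)
    return output.getvalue()


sample = 'source,target\n"Open, then save",開いて保存\n'
print(csv_to_tab_delimited(sample), end="")
```

The quoted field in the sample shows why a CSV parser is used rather than a plain split on commas: the comma inside "Open, then save" belongs to the term itself.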

Another cool trick with Excel is loading glossaries from the Internet. When Excel is installed, the context menu in Internet Explorer gets an “Export to Microsoft Excel” command, so when you have a glossary in a table on a website, you can simply right-click it, export it to Excel, and from there save it in any of a number of formats.

Export to Microsoft Excel menu selection

Of course, there are limitations to using Excel as an intermediary for glossary conversion. The main one arises when a terminology manager uses a special format that Excel can’t interpret in a meaningful way. In that case, you can often work around the problem by using one of the generic “save as” file options of your terminology manager.


New Felix resource added: TM and glossary of legal terms (J-E)

I’ve converted the “Standard Bilingual Dictionary” into a Felix translation memory (TM) and glossary, and posted them to the Felix website:

Felix TM and glossary of Japanese-English legal terms

These should be of use to anyone who has to translate Japanese laws into English.

About the Standard Bilingual Dictionary

The Standard Bilingual Dictionary is a glossary of official translations of terms from Japanese law. It’s part of a major effort by the Japanese government to translate its laws into English.