How glossary matching works in Felix

Jun. 16th 2009

Built-in glossary searching is one of the key features of Felix. In this post, I want to describe how the glossary searching algorithm works, and how results are displayed.

Finding Matches

Felix offers two modes for glossary searches. If you set a minimum score below 100% (Tools >> Preferences >> Glossary >> “Minimum fuzzy score”), Felix does fuzzy matching based on the Levenshtein (edit) distance.

If you set the score to 100%, only exact matches count.

To give an idea of what this means, consider this glossary entry:

aaBaa

Now, say you’re translating this sentence:

Put the aaCaa in the box.

Without fuzzy matching, no match is found for aaCaa. The two strings differ by a single substitution out of five characters, so if you set the fuzzy threshold to around 80% or lower, aaBaa will be retrieved as a candidate.
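The aaBaa example can be sketched in Python. This is a minimal illustration, not Felix's actual code: the function names are hypothetical, and the scoring formula (edits counted against the length of the glossary term, using the best-matching substring of the sentence) is an assumption chosen because it yields 80% for this example.

```python
def best_substring_distance(term: str, text: str) -> int:
    """Smallest Levenshtein distance between `term` and any substring of `text`."""
    # Row 0 is all zeros: a match may start at any position in the text for free.
    prev = [0] * (len(text) + 1)
    for i, term_char in enumerate(term, start=1):
        curr = [i]  # matching i term characters against nothing costs i edits
        for j, text_char in enumerate(text, start=1):
            cost = 0 if term_char == text_char else 1
            curr.append(min(prev[j] + 1,          # drop a term character
                            curr[j - 1] + 1,      # absorb an extra text character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    # The match may end anywhere, so take the minimum over the last row.
    return min(prev)


def fuzzy_score(term: str, text: str) -> float:
    """Assumed score: percentage of the term left untouched by edits."""
    if not term:
        return 0.0
    distance = best_substring_distance(term, text)
    return 100 * (len(term) - distance) / len(term)


def is_candidate(term: str, text: str, minimum_score: float) -> bool:
    """True if the term scores at or above the minimum fuzzy score."""
    return fuzzy_score(term, text) >= minimum_score


sentence = "Put the aaCaa in the box."
print(fuzzy_score("aaBaa", sentence))        # 80.0
print(is_candidate("aaBaa", sentence, 100))  # False
print(is_candidate("aaBaa", sentence, 80))   # True
```

Comparing the term against every substring of the sentence, rather than against the whole sentence, keeps the surrounding words from dragging the score down.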

You can also set whether to ignore case, wide/narrow character distinctions, and distinctions between Hiragana and Katakana.

Ignore…
Case: “aaa” is the same as “AAA”
Wide/narrow: “１２３” is the same as “123”
Hiragana/Katakana: “いろは” is the same as “イロハ”
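One way to picture these options is as a normalization step applied to both the glossary entry and the sentence before they are compared. The sketch below is an assumption about how this could be done, not a description of Felix's internals; the function and parameter names are hypothetical. It uses Unicode NFKC normalization for the wide/narrow distinction, which also folds some other compatibility characters.

```python
import unicodedata

KATAKANA_START = 0x30A1  # ァ
KATAKANA_END = 0x30F6    # ヶ
KANA_OFFSET = 0x60       # code point distance between katakana and hiragana


def katakana_to_hiragana(text: str) -> str:
    """Map each katakana character to its hiragana counterpart."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_START <= ord(ch) <= KATAKANA_END
        else ch
        for ch in text
    )


def normalize(text: str, ignore_case: bool = True,
              ignore_width: bool = True, ignore_kana: bool = True) -> str:
    """Return a comparison key for `text` under the selected options."""
    if ignore_width:
        # NFKC turns full-width letters and digits into their narrow forms
        # and half-width katakana into standard katakana.
        text = unicodedata.normalize("NFKC", text)
    if ignore_case:
        text = text.casefold()
    if ignore_kana:
        text = katakana_to_hiragana(text)
    return text


print(normalize("aaa") == normalize("AAA"))        # True
print(normalize("１２３") == normalize("123"))      # True
print(normalize("いろは") == normalize("イロハ"))    # True
```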

Displaying Results

All the glossary matches for the current sentence are displayed in the glossary window. The matches are sorted by reference count, then string length, then score. That is, the match with the highest reference count is shown first in the list of matches; if two matches have the same reference count, then the longer match goes first; and if those are also equal, the match with the higher score goes first.

Reference count: The number of times the translation has been retrieved by the user
String length: How long the source word/phrase is
Score: If you use fuzzy glossary matching, how close the match is.
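The ordering can be expressed as a sort on a three-part key. The following is a minimal sketch: the `GlossaryMatch` class, its field names, and the sample entries are made up for illustration, and it assumes that all three criteria sort from highest to lowest.

```python
from dataclasses import dataclass


@dataclass
class GlossaryMatch:
    source: str           # source word or phrase
    translation: str
    reference_count: int  # times the user has retrieved this translation
    score: float          # fuzzy match score, in percent


def sort_matches(matches: list[GlossaryMatch]) -> list[GlossaryMatch]:
    """Order by reference count, then source length, then score (all descending)."""
    return sorted(
        matches,
        key=lambda m: (m.reference_count, len(m.source), m.score),
        reverse=True,
    )


matches = [
    GlossaryMatch("box", "hako", reference_count=2, score=100.0),
    GlossaryMatch("aaBaa", "widget", reference_count=2, score=80.0),
    GlossaryMatch("the", "sono", reference_count=5, score=100.0),
]
for match in sort_matches(matches):
    print(match.source)
# the
# aaBaa
# box
```

Here "the" comes first because of its reference count, and "aaBaa" beats "box" on length even though its score is lower.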

Room for Improvement

There are several ways in which the glossary matching algorithm could be improved. Felix user Steven Venti proposed a search algorithm that I would characterize as based on “closeness” or “stickiness,” and gave the program Jamming (Japanese) as an example of a program that does dictionary searches very well.

Another feature I’ve been thinking about for a while is the ability to create rule-based glossary entries, using wildcards or regular expressions. For example, you could do this to create translations for dates, or product names consisting of set patterns.
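To make the idea concrete, here is a rough sketch of what a rule-based entry for dates might look like. Nothing here exists in Felix: the pattern, the function name, and the choice of Japanese source dates rendered in US English format are all assumptions for illustration.

```python
import re

MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Hypothetical rule: a Japanese date such as 2009年6月16日.
DATE_PATTERN = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def date_rule_matches(sentence: str) -> list[tuple[str, str]]:
    """Return (source, translation) pairs for every date found in the sentence."""
    matches = []
    for found in DATE_PATTERN.finditer(sentence):
        year, month, day = (int(part) for part in found.groups())
        if 1 <= month <= 12:  # skip strings that only look like dates
            translation = f"{MONTH_NAMES[month - 1]} {day}, {year}"
            matches.append((found.group(0), translation))
    return matches


print(date_rule_matches("会議は2009年6月16日に開催されます。"))
# [('2009年6月16日', 'June 16, 2009')]
```

A single rule like this would stand in for what would otherwise be one glossary entry per date.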

The way that matches are displayed can also be improved. I could make it possible for users to determine the sort criteria (what order matches are displayed in), both through preferences and dynamically. I’m also planning to make it possible to easily show and hide details about glossary matches — for example, click “details” to show all the information about the match, such as creator and date created, and “minimal” to show just the source and translation (thus allowing more matches to be shown at once).

In a way, being able to specify the order in which matches are displayed could make up for the “feast or famine” problem that Steven mentions: getting either too few or too many matches. If you set the match score low enough to get lots of matches, but could arrange for the matches you want to be shown first, I think that would go a long way toward improving usability.

Posted by Ryan Ginstrom | in Felix | 3 Comments »