Currently I work on a project, where we use Lucene for full text indexing. The project is a web portal for historical data and texts. All searches are based on Lucene using the German analyzer.
During testing the users complained that wrong results are found. For example they complained that a search for “Hans” (German for John) found also documents that highlighted “hant” and “han” (old German verb variations for have). And when searching for “und” nothing is found, but when searching for “nit” (old German word for not) they get results.
One side of the problem is the user expectation. Because of that this paper explains the difference between searches against a relational database and a full text search. For the full text search it is also describing how Lucene’s full text engine is handling text when using the German analyzer.
Differences between classical database search and full text search
The users that were testing the system use a software product (for many years) that uses relational database search. When doing a search the users can select if they want to make an exact search, or if they want to search for words that begin (or end) with the search term they enter. The search that is executed against the database looks something like
Select * from tbl_sample where text like '<searchTerm>'
for exact matching or
Select * from tbl_sample where text like '<searchTerm>%'
for searches for words that start with the search term.
They are used to the behavior that they exactly find what they have entered. If one character is different in the database from what is entered, nothing will be found. So if they search for “Haus” they won’t find “Häuser”, or if they search for “Zürich” they won’t find “Zuerich”.
Sometimes this is a problem and so they start changing the search term using wildcards. For example they search for “H?us*” which will find “Haus” and “Häuser” and they understand why the search is returning “Hausmann”, “Hausmannskost” etc. as well.
One of the big problems with this classical database search is, that it may be (very) slow. Searching for a name in a field called “lastname” will (always) be very fast, given the field is properly indexed. However if the user wants to search for words that may occur within a larger text, indexes can’t be used. Having large databases this results in searches that can take minutes or even hours.
Full text search engines on the other hand work completely differently, allowing for fast searches for words that are buried in large texts within large data amounts.
Full text engines can index large amounts of data and allows the user to search for words or phrases within that index. Before a user can start a search, the full text engine must analyze the documents and build up the search index.
No matter what kind of full text index engine you take, the basic steps are all the same. Take the text and tokenize it. In the simplest form you can say that the tokens equal the individual words. Then the process may perform some additional operation on the tokens. For example a lowercase filter can be applied so the resulting searches are not case sensitive. Typically the engine will also remove frequent but meaningless tokens from the input (the so called stop words like a, an, the, in and so on in English text.)
Similarly, it’s common to process input tokens to reduce them to their roots, by using a StemFilter. For example the GermanStemFilter will reduce both “Haus” and “Häuser” to the root “Haus”. (But “Hausmann” will remain “Hausmann”).
After the text is analyzed and the text is searched, the search input is also analyzed and the same transformations are applied. With the resulting token(s) the search engine looks up in which documents (records) this word occurs and returns the matches. Because it only searches tokens (words) this search is very fast.
The full text search has another advantage. The search engine has some statistics for the search index. It knows how many documents are index, how many different words the index contains, how many words each individual document has and so on.
This information allows to create a ranking of the search results. That is why full text search engines can present the most meaningful result at the top. Something a relational database search never can accomplish.
How the rankings are calculated is influenced by many factors. Most search engines allow for some customizations. A typical rule could be that words that occur in the title field of a document are weighed more than the same word in the body text.
But to make things easy one can say that smaller documents that contain the searched term a few times are ranked higher than larger documents that contain the term only once.
Lucence German Analyzer
At the beginning of the article I wrote that the users complained about false results, because a search for “Hans” found matches like “han” or “hant” in some documents. To be precise these matches were found in the lowest ranking documents, whereas the highest ranking documents really contained the word “Hans”.
But anyway, why would Lucene highlight “han” or “hant” when we search for “Hans”? The answer for this question lies in the inner workings of the German analyzer that is used in this project.
How the German Analyzer works
If you create a default German Analyzer with the “minimal” constructor like this
you will get a predefined pipeline of Tokenizers and Filters. Each text that gets indexed will be run through this pipeline. The pipeline consist of:
At the end of the pipeline we do have a TokenStream (a list of words) that is stored in the Lucene index. But because the tokens have been filtered, the tokens that are stored in the index may look quite different from their occurrence in the text.
The picture shows how the tokens that occur in the title field are stored in the Lucene index. (Screenshot is taken from Luke, a tool for looking at the Lucence index.)
As you can see Gemeinde as been reduced to gemei and Corpus is reduced to corpu.
When we do a search, our search text is sent through the same pipeline again and our search words will be transformed as well. That is why we find things in the index again, and that is why we have to use the same Analyzer for searching as we used when we indexed the documents.
So the questions is, what each tokenizer and filter is doing to our text.
The standard tokenizer should be good enough for most European-language documents. It’s job is to spliz a text into individual tokens (words).
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
If an applications has specific tokenizer needs, you can create your own tokenizer.
The standard filter normalizes the result of the standard tokenizer.
- Removes ’s from the end of the words
- Removes dots from acronyms.
That is an easy one. Simply normalizes each token to lower case.
Removes all tokens that have been defined as stop words. For the default German Analyzer the stop words are:
"einer", "eine", "eines", "einem", "einen", "der", "die", "das", "dass", "daß", "du", "er", "sie", "es", "was", "wer", "wie", "wir", "und", "oder", "ohne", "mit", "am", "im", "in", "aus", "auf", "ist", "sein", "war", "wird", "ihr", "ihre", "ihres", "als", "für", "von", "dich", "dir", "mich", "mir", "mein", "kein", "durch", "wegen"
The German stem filter is basically the heart and soul of the German analyzer. Here specialties of the German language gets filtered out, so the users search is as meaningful as possible. The algorithm is based on the report “"A Fast and Simple Stemming Algorithm for German Words" by Jörg Caumanns (email@example.com).
In this report the author points out: If one knows that stemming is never perfect, why should one try? Form most applications it doesn’t matter if the error rate caused by wrong stemming is 2.1% or 3.1%. In either case the application has to deal with some amount of uncertainty.
This statement is important, because it clearly says that users cannot expect 100% perfect matches. But what happens in the GermanStemFilter anyway?
- First every term is checked if it can be stemmed. If a term does contain a character that is not a letter, it is not stemmable. Thus the next token is processed.
- If a token is stemmable, do some substitutions for the term to reduce overstemming:
- Substitute Umlauts with their corresponding vowel: äöü ==> aou, "ß" is substituted by "ss"
- Substitute a second char of a pair of equal characters with an asterisk: ?? ==> ?* for example mm ==> m*
- Substitute some common character combinations with a token:
sch/ch/ei/ie/ig/st ==>$ / § / % / & / # / !
- Suffix stripping (stemming) on the current term. The stripping is removing these "base" suffixes "e", "s", "n", "t", "em", "er" and "nd" recursively. The simplification causes some overstemming, and way more irregular stems, but still provides unique discriminators in the most of those cases.
The algorithm is context free, except of the length restrictions.
Here it is where “hans” (we already have lower case) is reduced to “han”.
- A small optimization step where female plurals are cut off. If a token ends with “erin*” the last character is removed from the term and again suffix stripping is performed.
The other optimization is for special cases of plurals. If a term ends with a “z” the “z” is converted into an “x”.
You may ask what the hell, but with suffix tripping done before it will turn a word like “Matrizen” into “Matrix”.
- Resubstitute: Undoes the changes made by Substitute. That is character pairs and character combinations that were substituted are replaced with their original values. But there are exceptions: Umlauts will remain as their corresponding vowel, as "ß" remains as "ss".
- Removes a particle denotion ("ge") from a term. Meaning words that start with “gege” like gegeben will be stripped from the starting ge resulting in geben.
When creating the GermanAnalyzer you can optionally set a flag that indicates to normalize according to DIN2.
public GermanAnalyzer(Version matchVersion, bool normalizeDin2)
The documentation says: If the flag set to true, the DIN-2007-2 style stemmer should be used in addition to DIN1. This will cause words with 'ae', 'ue', or 'oe' in them (expanded umlauts) to be first converted to 'a', 'u', and 'o' respectively, before the DIN1 stemmer is invoked.
Actually it’s invoking the substitution of the character pairs as described above, before the “default” substitution step is invoked.
Currently this GermanDin2Stemmer does have a bug and is not working. (Lucene.NET 3.0). It might have been fixed in the meantime.
We have explained the differences between classic database search and full text search. We have seen how a full text index is trying to reduce the words to a common root during its analyze process. This “reduction” is called stemming and a different stemming algorithm must be used for each language.
The algorithm for stemming the words is never 100% accurate and that is why the same roots may be produced for seemingly unrelated words.
The main advantage of a full text search is (of course) its speed. But then a full text search finds all records with the same linguistic roots of the search term. But as explained this advantage comes with a cost attached to it. Depending on the structure of the indexed text an error rate between 1%-3% percent is normal.
In these cases the users should be mature enough to see quickly and easily if that hit is relevant for their context or not.