Author: Craig Stancl, Scott Bauer, Cory Endle
Email: Stancl.craig@mayo.edu, bauer.scott@mayo.edu, endle.cory@mayo.edu
Team: LexEVS
Contract: S13-500 MOD4
Client: NCI CBIIT
National Institutes of Heath
US Department of Health and Human Services
Overview
The extent of text match algorithms in LexEVS has grown quite a lot over the decade the application has been in existence. Many matching algorithms overlap in their functionality and dependencies. We've created a review of each of these algorithms with notes on their index dependencies and search focus with an eye towards simplifying and updating the search functionality. NCI should review and decide if any of these can be removed or updated.
Current Text Matches
- Lucene Query
- phrase
- contains
- leading and trailing wild card
- exact match
- substring
- spelling error tolerant substring match
- stemmed Lucene query
- literal contains
- starts with
- non leading wild card literal substring
- literal
- Weighted double metaphone Lucene query
- literal substring
- Double metaphone Lucene query
- Regular expression
Text Match Breakdown:
Lucene Query
Search with the Lucene query syntax. http://lucene.apache.org/core/5_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description
Phrase
Searches for a Phrase in text using the regular Lucene query parser. The only addition is a set of escaped quotation marks at the beginning and end of the phrase. It could be done in the regular Lucene Query by the user. No special indexing.
Contains
Equivalent to ' term* ' - in other words - a trailing wildcard on a term (but no leading wild card) and the term can appear at any position. Searches on property value only.
Leading and Trailing Wild Card
Equivalent to '*term*' This should be a very poor performing search and is not recommended especially when entering a phrase.
Exact Match
Exact match (case insensitive). Requires it's own indexed value, a lower case, untokenized property value.
SubString
Search based on a \"*some sub-string here*\". Functions much like the Java String.indexOf method. This requires two indexed fields to manage this without significant overhead. One field is the tokenized property value which causes no extra indexing, the other is reversed which requires an extra indexed field.
Spelling Error Tolerant Substring Match
Adds Spelling-error tolerance to 'subString' search. This makes use of the double metaphone indexed value as well as literal property values. Since it shares these with other algorithms it probably doesn't add much to the index. However this doesn't look like a high performing search.
Stemmed Lucene Query
Search with the Lucene query syntax, using stemmed terms. A search for 'trees' will get a hit on 'tree' This requires an extra indexed field when it is enabled in the load.
Literal Contains
Works the same as contains but uses the literal property value enabling searches on special characters.
Starts With
Equivalent to 'term*' (case insensitive) This runs against the same indexed property value as exactMatch so no extra indexing is needed. The query may require increased overhead however.
Non Leading Wild Card Literal Substring
Search based on a \"*some sub-string here*\" Functions much like the Java String.indexOf method. Single term searches will match '*term' and 'term*' but not '*term*'. This is because leading wildcards are very inefficient. Special characters are included. This seems to be very similar to the literal contains, but makes use of the reverse index.
Literal
All special characters are taken literally. Since we usually normalize we can search on a string with colons, parentheses and other special characters using this search.
Weighted Double Metaphone Lucene Query
Search with the Lucene query syntax, using a 'sounds like' algorithm. A search for 'atack' will get a hit on 'attack' Also, the exact user-entered text is taken into account -- so correct spelling will override the 'sounds like' algorithm. Searches on the same indexed property value as the other double metaphone search. Does not add anything more to the index, but does add more overhead to the search.
Literal Substring
The same as the substring search but with special characters enabled. However this doesn't seem to make use of the optimized reverse string. It may be slower as a result.
Double Metaphone Lucene Query
Search with the Lucene query syntax, using a 'sounds like' algorithm. A search for 'atack' will get a hit on 'attack' Searches on a property value that has been double metaphone enabled.
Regular Expression
A Regular Expression query. Searches against the lowercased text, so a regular expression that specifies an uppercase character will never return a match. Additionally, this searches against the entire string as a single token, rather than the tokenized string - so write your regular expression accordingly. This is the apache implementation of Regular Expression so follow their documentation as needed. This is from the old Jakarta project. We may want to update to the latest Lucene supported version of this. This runs against the same indexed value as exact match so it doesn't cost any more in terms of indexed data.