
Document Information

Author: Craig Stancl, Scott Bauer, Cory Endle
Email: Stancl.craig@mayo.edu, bauer.scott@mayo.edu, endle.cory@mayo.edu
Team: LexEVS
Contract: S13-500 MOD4
Client: NCI CBIIT
National Institutes of Health
US Department of Health and Human Services


Overview

The number of text match algorithms in LexEVS has grown considerably over the decade the application has existed. Many of them overlap in functionality and index dependencies. We have reviewed each algorithm, noting its index dependencies and search focus, with an eye toward simplifying and updating the search functionality. NCI should review the list and decide whether any of these algorithms can be removed or updated.

Current Text Matches

  • Lucene Query
  • phrase
  • contains
  • leading and trailing wild card
  • exact match
  • substring
  • spelling error tolerant substring match
  • stemmed Lucene query
  • literal contains
  • starts with
  • non leading wild card literal substring
  • literal
  • Weighted double metaphone Lucene query
  • literal substring
  • Double metaphone Lucene query
  • Regular expression

Text Match Breakdown:

...

Searches for a phrase in text using the regular Lucene query parser. The only addition is a pair of escaped quotation marks, one at the beginning and one at the end of the phrase. The user could do the same thing in a regular Lucene query. No special indexing is required.
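As a rough illustration of the quoting step, the sketch below wraps a user phrase in double quotes before it would be handed to a query parser. The class and method names are hypothetical and are not LexEVS code. Escaping embedded quotes and backslashes is an added assumption, since the description only mentions the surrounding quotation marks.

```java
/** Hypothetical helper, not part of LexEVS: builds a phrase query string. */
final class PhraseQuerySketch {

    /**
     * Wraps the text in double quotes so a Lucene-style query parser
     * treats it as one phrase. Backslashes and embedded double quotes are
     * escaped first (an assumption) so they cannot end the phrase early.
     */
    static String toPhraseQuery(String text) {
        String escaped = text.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }
}
```

Calling toPhraseQuery on the two words heart attack yields the same two words enclosed in one pair of double quotes, which is what a user could also type directly into a regular Lucene query.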

...

Equivalent to ' term* ': a trailing wildcard on a term (but no leading wildcard), where the term can appear at any position. Searches on the property value only.
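The matching rule can be pictured with a small in-memory sketch. It assumes whitespace tokenization and case folding, neither of which is stated here, so the analyzer behind the real index may behave differently. All names are hypothetical.

```java
import java.util.Locale;

/** Hypothetical sketch of 'contains' semantics; not LexEVS code. */
final class ContainsSketch {

    /**
     * Returns true if any token of the property value starts with the term,
     * i.e. the term behaves like 'term*' at any token position.
     * Assumes whitespace tokenization and case folding.
     */
    static boolean contains(String propertyValue, String term) {
        String prefix = term.toLowerCase(Locale.ROOT);
        for (String token : propertyValue.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, "neo" matches "Malignant Lung Neoplasm" because it begins the third token, while "plasm" does not, since that would need a leading wildcard.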

Leading and Trailing Wild Card

...

Exact match (case insensitive). Requires its own indexed value: a lowercase, untokenized property value.

SubString

Search based on a "*some sub-string here*" pattern. Functions much like the Java String.indexOf method. Doing this without significant overhead requires two indexed fields. One is the tokenized property value, which causes no extra indexing; the other is a reversed copy, which requires an extra indexed field.
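To see why a reversed field helps, note that a leading wildcard on the forward text is the same as a trailing wildcard on the reversed text. The sketch below is illustrative only: the class and method names are hypothetical, and it works on plain strings rather than a real index.

```java
/** Hypothetical sketch of the reversed-field idea; not LexEVS code. */
final class ReversedFieldSketch {

    /** Value that would be stored in the extra, reversed field. */
    static String reversed(String value) {
        return new StringBuilder(value).reverse().toString();
    }

    /**
     * Evaluates a suffix pattern such as "*oma" as a prefix test on the
     * reversed value, so no leading wildcard is needed at search time.
     */
    static boolean endsWithViaReversedField(String reversedFieldValue, String suffix) {
        return reversedFieldValue.startsWith(reversed(suffix));
    }
}
```

For example, a search for "*oma" becomes a prefix search for "amo" against the reversed copy of "carcinoma".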

...

Works the same as contains but uses the literal property value, enabling searches on special characters.

...

Equivalent to 'term*' (case insensitive). This runs against the same indexed property value as exactMatch, so no extra indexing is needed. The query itself may carry more overhead, however.
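Because this search reads the same indexed value as exactMatch, the two can be pictured as two tests on one normalized key. This sketch assumes the normalization is plain lowercasing of the whole, untokenized value. The names are hypothetical and it does not use the real index.

```java
import java.util.Locale;

/** Hypothetical sketch; not LexEVS code. */
final class LowerCaseKeySketch {

    /** The single lowercase, untokenized value both searches would share. */
    static String key(String propertyValue) {
        return propertyValue.toLowerCase(Locale.ROOT);
    }

    /** exactMatch: the whole key must equal the lowercased search text. */
    static boolean exactMatch(String indexedKey, String text) {
        return indexedKey.equals(key(text));
    }

    /** startsWith: 'term*' evaluated against the whole key. */
    static boolean startsWith(String indexedKey, String text) {
        return indexedKey.startsWith(key(text));
    }
}
```

The extra query cost comes from the prefix test: an exact match is a single key lookup, while a prefix may have to be expanded over every key that begins with it.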

...

Search based on a "*some sub-string here*" pattern. Functions much like the Java String.indexOf method. Single-term searches will match '*term' and 'term*' but not '*term*', because leading wildcards are very inefficient. Special characters are included. This seems very similar to literal contains, but it makes use of the reverse index.
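The single-term behavior can be sketched as a prefix test on the literal value plus a prefix test on its reversed copy. This is an assumption about how the two fields combine. It applies the patterns to the whole literal value, ignores case handling, and uses hypothetical names on plain strings.

```java
/** Hypothetical sketch of single-term matching; not LexEVS code. */
final class NonLeadingWildcardSketch {

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    /**
     * Matches 'term*' on the literal value or '*term' via the reversed copy.
     * A term that appears only in the middle ('*term*') is not matched.
     */
    static boolean matchesSingleTerm(String literalValue, String term) {
        // term* : prefix test on the forward literal value
        boolean prefixHit = literalValue.startsWith(term);
        // *term : prefix test on the reversed value with the reversed term
        boolean suffixHit = reverse(literalValue).startsWith(reverse(term));
        return prefixHit || suffixHit;
    }
}
```

With the literal value "alpha-1-antitrypsin", the terms "alpha-1" and "trypsin" match, while "anti" does not because it occurs only in the middle.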

...