NIH | National Cancer Institute | NCI Wiki  

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Lucene stores information in documents, and these documents have fields that are used to hold information. Each document has a unique id. For example, an index of people may be indexed in Lucene as:

<source>

Code Block
Document: id 1
First Name: John
Last Name: Doe
Sex: Male
Age: 45

Document: id 2
First Name: Jane
Last Name: Doe
Sex: Female
Age: 40

... etc.

...

LexEVS stores information about entities in this way. Property names and values, as well as qualifiers, language, and various other information about the entity are held in Lucene indexes.

...

Lazy retrieval can be leveraged to increase performance in LexEVS. Consider this simplified LexEVS entity index:<source>

Code Block
Document: id 1
Code: C12345
Name: Heart

Document: id 2
Code: C67890
Name: Foot

Document: id 3
Code: C98765
Name: Heart Attack

</source>

If a user constructs a query (Name = Heart*), the query will return with the matching Document ids (1 and 2). Previously, LexEVS would immediately retrieve the Code and Name fields from the matches, and use them to construct the results that would be ultimately returned to the user. This does not scale well, especially for general queries in large ontologies. In a large ontology, a query of (Name = Heart*) may match tens of thousands of documents. Retrieving the information from all these documents is a significant performance concern.

...

This algorithm does not automatically assume that the user has spelled the terms incorrectly. Searches are also based on the actual text that the user has input, along with the Metaphone value. Again, if the user input "Breast", the query will still match "Breast" and "Prostrate", but "Breast" will have a higher match score, because the actual user text is considered. This algorithm adds a greater precision to this fuzzy-type query.

Algorithm: <source>

Code Block
get: user text input
2: total score = 0
3: metaphone score = 0
4: actual score = 0
5: metaphone value = lucene.computeMetaphoneValue(user text input)
6: metaphone score = lucene.scoreMetaphoneValue(metaphone value)
7: actual score = lucene.score(user text input)
8: total score = metaphone score + actual score
9: halt

...

Case-insensitive Substring

...

Code Block
'The patient wi_th a heart atta_ck was seen today.' 

Algorithm:
<source>

Code Block
get: user text input
2: user text input = '*' + user text input + '*'
3: score = lucene.score(user text input)
4: halt

</source>

Sorting - The org.LexGrid.LexBIG.Extensions.Extendable.Sort Interface

...

  • Sorting on Different Class types A single Sort may be applicable for a variety of Class types. For instance, both an 'Association' and an 'Entity' may be sorted by 'Code', but the actual implementation of retrieving the Code and comparing it may be different between the two. It is the job of the Sort to implement a Comparator for each potential Class that it is eligible to sort.
  • Default Sorting All result sets are sorted by default by Lucene Score, meaning that the best match according to Lucene will always be returned first by default. Note that if two or more result sets are being Unioned, Intersected, or Differenced, the user must explicitly call a 'matchToQuery' sort on the result set as a whole to order all of the results.
  • Sort Contexts Sorts may be applicable in one or more 'Contexts.' (see: org.LexGrid.LexBIG.DataModel.InterfaceElements.types.SortContext). This means that a Sort may apply only to a CodedNodeSet, or only to a CodedNodeGraph, or some combination. Sorts will only be employed by the API if they match the Context in which the results are being sorted.
  • Performance Issues Sorting is generally computationally expensive, because in order to correctly sort, the field to be sorted has to be fully retrieved for the entire result set. For very specific or refined queries, this may not be a problem, but for large ontologies or very general queries, performance may be a concern. To alleviate this, 'Post sort' has been introduced.
  • Post Sorting In order to minimize the performance impact of sorting, users are encouraged to use a 'Post sort' where possible. A Post sort is done after the result set has been restricted, thus limiting the amount of information that must be retrieved in order to perform the sort. For instance, a query may match a set of Entities:

...

Code Block
\{"Heart", "Heart Failure", "Heart Attack", "Arm", "Finger", ...\}

...

As described earlier, all results are by default sorted by Lucene score, so if we limit the result set to the top 3, the result is:
<source>

Code Block
\{"Heart", "Heart Failure", "Heart Attack"\}

...

The restricted set can then be 'Post' sorted; and because the result set has been limited to a reasonable number of matches, sorting and retrieval time can be minimized.

Algorithm:
<source>

Code Block
1: get: Sort requested by user
2: get: Context sort is being applied to
3: if: sort is not valid for Context
halt
4: else:
5: get: Class to be sorted on
6: if: sort is not valid for Class
halt
7: get: Comparator for Sort - given (Class to be sorted on)
8: sort results using Comparator for Sort
9: halt

</source>

SQL Optimizations

The n+1 SELECTS Problem

...

Code

Qualifier

C01234

isAnOrgan

C98765

isADisease

<source>

Code Block
SELECT * FROM Codes

...

Results in:

Code

Name

C01234

Heart

C98765

Heart Attack

To get the Qualifiers, separate SELECTs must be used for each.
<source>

Code Block
SELECT * FROM Qualifiers where Code = C01234
And
SELECT * FROM Qualifiers where Code = C98765

</source>

This sequence results in 1 Query to retrieve the data from the Codes table, and then n Queries from the Qualifiers table. This results in n+1 total Queries.

...

Code

Qualifier

C01234

isAnOrgan

C98765

isADisease

...

Code Block
SELECT * FROM Codes JOIN Qualifiers ON Code

...

Results in:

Code

Name

Qualifier

C01234

Heart

isAnOrgan

C98765

Heart Attack

isADisease

...