
...

After the Lucene search is complete, the system stores only the ids of the documents that match the search criteria. When information from a matching document is actually needed, it is retrieved by id at that point. This is helpful in iterator-type scenarios, where documents can be retrieved one at a time.

...

Background - Lucene Documents

Lucene stores information in documents, and these documents have fields that are used to hold information. Each document has a unique id. For example, an index of people may be indexed in Lucene as:

<source>

Document: id 1
  First Name: John
  Last Name: Doe
  Sex: Male
  Age: 45

Document: id 2
  First Name: Jane
  Last Name: Doe
  Sex: Female
  Age: 40

... etc.

</source>

LexEVS stores information about entities in this way. Property names and values, as well as qualifiers, language, and various other information about the entity are held in Lucene indexes.

...

Background - Querying Lucene

Lucene provides a query mechanism to search through the indexed documents. Given a search query, Lucene will provide the document id and the score of the match. (Lucene assigns every match a score, depending on the strength of the match given the query.)

...

Lazy retrieval can be leveraged to increase performance in LexEVS. Consider this simplified LexEVS entity index:

<source>

Document: id 1
  Code: C12345
  Name: Heart

Document: id 2
  Code: C67890
  Name: Foot

Document: id 3
  Code: C98765
  Name: Heart Attack

</source>

If a user constructs a query (Name = Heart*), the query will return the matching Document ids (1 and 3). Previously, LexEVS would immediately retrieve the Code and Name fields from the matches and use them to construct the results that would ultimately be returned to the user. This does not scale well, especially for general queries in large ontologies. In a large ontology, a query of (Name = Heart*) may match tens of thousands of documents, and retrieving the information from all of these documents is a significant performance concern.

Instead of retrieving the information up front, LexEVS will simply store the document id for later use. When this information is actually needed by the user (for example, the information needs to be displayed), it is retrieved on demand.
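The idea can be sketched in plain Java as follows. This is a minimal illustration and not LexEVS code: the in-memory `INDEX` map stands in for the Lucene index, and `LazyResultIterator`, `searchNamePrefix`, and `load` are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

final class LazyRetrievalDemo {
    /** Stand-in for the index (assumption): document id -> Name field. */
    static final Map<Integer, String> INDEX = new LinkedHashMap<>();
    static {
        INDEX.put(1, "Heart");
        INDEX.put(2, "Foot");
        INDEX.put(3, "Heart Attack");
    }

    /** Counts how many documents have actually been loaded. */
    static int loads = 0;

    /** Simulates the (expensive) retrieval of one document by id. */
    static String load(int docId) {
        loads++;
        return INDEX.get(docId);
    }

    /** Simulates the search step: returns matching ids only, no field values. */
    static List<Integer> searchNamePrefix(String prefix) {
        List<Integer> ids = new ArrayList<>();
        for (Map.Entry<Integer, String> entry : INDEX.entrySet()) {
            if (entry.getValue().startsWith(prefix)) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        Iterator<String> results =
            new LazyResultIterator<String>(searchNamePrefix("Heart"), LazyRetrievalDemo::load);
        System.out.println("loads after search: " + loads);      // prints 0
        System.out.println("first result: " + results.next());
        System.out.println("loads after first next(): " + loads); // prints 1
    }
}

/** Holds only document ids; each document is loaded when next() is called. */
final class LazyResultIterator<T> implements Iterator<T> {
    private final Iterator<Integer> ids;
    private final IntFunction<T> loader;

    LazyResultIterator(List<Integer> matchingDocIds, IntFunction<T> loader) {
        this.ids = matchingDocIds.iterator();
        this.loader = loader;
    }

    @Override
    public boolean hasNext() {
        return ids.hasNext();
    }

    @Override
    public T next() {
        return loader.apply(ids.next());
    }
}
```

A caller that displays only the first page of results pays the retrieval cost for that page alone, however many documents matched.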

Searching

The org.LexGrid.LexBIG.Extensions.Extendable.Search

...

Interface

This interface enables the user to plug in custom search algorithms. Users can construct any type of query from the given search text. The query can include wildcards, group search terms, and so on.

...

This algorithm does not automatically assume that the user has spelled the terms incorrectly. Searches are based on the actual text that the user has input, along with its Metaphone value. Again, if the user inputs "Breast", the query will still match "Breast" and "Prostrate", but "Breast" will have a higher match score, because the actual user text is also considered. This adds greater precision to this fuzzy-type query.

Algorithm:
<source>

1: get user text input
2: total score = 0
3: metaphone score = 0
4: actual score = 0
5: metaphone value = lucene.computeMetaphoneValue(user text input)
6: metaphone score = lucene.scoreMetaphoneValue(metaphone value)
7: actual score = lucene.score(user text input)
8: total score = metaphone score + actual score
9: halt

</source>
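The additive scoring above can be sketched in self-contained Java. Several parts are assumptions made for illustration: `phoneticKey` is a crude stand-in (it keeps the first letter, drops later vowels, and collapses repeated letters) and is not the real Metaphone algorithm, and the 0/1 scores stand in for the graded scores that Lucene would compute.

```java
import java.util.Locale;

final class PhoneticPlusExactScore {
    /**
     * Crude phonetic key used only for this sketch (NOT real Metaphone):
     * uppercase, letters only, keep the first letter, drop later vowels,
     * and skip a letter that repeats the previous key character.
     */
    static String phoneticKey(String text) {
        String letters = text.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < letters.length(); i++) {
            char c = letters.charAt(i);
            boolean vowel = "AEIOU".indexOf(c) >= 0;
            boolean repeat = key.length() > 0 && key.charAt(key.length() - 1) == c;
            if ((i == 0 || !vowel) && !repeat) {
                key.append(c);
            }
        }
        return key.toString();
    }

    /** total score = phonetic score + actual-text score, as in steps 5 to 8. */
    static double totalScore(String userInput, String indexedTerm) {
        double phoneticScore =
            phoneticKey(userInput).equals(phoneticKey(indexedTerm)) ? 1.0 : 0.0;
        double actualScore = userInput.equalsIgnoreCase(indexedTerm) ? 1.0 : 0.0;
        return phoneticScore + actualScore;
    }

    public static void main(String[] args) {
        System.out.println(totalScore("Breast", "Breast")); // 2.0: phonetic and exact match
        System.out.println(totalScore("Breast", "Brest"));  // 1.0: phonetic match only
        System.out.println(totalScore("Breast", "Foot"));   // 0.0: no match
    }
}
```

Because the exact-text score is added on top of the phonetic score, a term that matches the user's actual spelling always ranks above one that only sounds similar.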

Case-insensitive

...

Substring

The SubStringSearch algorithm is intended to find substrings within a larger string. For example:
'with a heart attack'
...will match:
'The patient with a heart attack was seen today.'

Also, a leading and trailing wildcard will be added, so
'th a heart atta'
...will also match:
'The patient with a heart attack was seen today.'

Algorithm:
<source>

1: get user text input
2: user text input = '*' + user text input + '*'
3: score = lucene.score(user text input)
4: halt

</source>
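The wildcard wrapping in step 2 can be illustrated without Lucene. The sketch below is an assumption-laden stand-in: it translates the `*` wildcard into a regular expression and returns a plain true/false match rather than a Lucene score, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class SubstringSearchSketch {
    /** Step 2 of the algorithm: add a leading and trailing wildcard. */
    static String addWildcards(String userInput) {
        return "*" + userInput + "*";
    }

    /** Translates a '*' wildcard query into a regex, quoting all literal text. */
    static Pattern toRegex(String wildcardQuery) {
        String[] literals = wildcardQuery.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*"); // each '*' matches any run of characters
            }
            regex.append(Pattern.quote(literals[i]));
        }
        return Pattern.compile(regex.toString());
    }

    /** True when the wildcard-wrapped input matches the whole indexed text. */
    static boolean matches(String userInput, String indexedText) {
        return toRegex(addWildcards(userInput)).matcher(indexedText).matches();
    }

    public static void main(String[] args) {
        String text = "The patient with a heart attack was seen today.";
        System.out.println(matches("with a heart attack", text)); // true
        System.out.println(matches("th a heart atta", text));     // true
        System.out.println(matches("heart failure", text));       // false
    }
}
```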

Sorting

The org.LexGrid.LexBIG.Extensions.Extendable.Sort

...

Interface

This interface allows users to plug in customized Sort algorithms to sort query results:

Class:

org.LexGrid.LexBIG.Extensions.Extendable.Sort

Method:

public <T> Comparator<T> getComparatorForSearchClass(Class<T> searchClass) throws LBParameterException

Description:

Given a Class that this Sort is valid for, return the correct Comparator to compare the results and sort.

Method:

public boolean isSortValidForClass(Class<?> clazz)

Description:

Return whether or not this Sort is valid for Sorting on a given Class.
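As an illustration of how the two methods fit together, here is a minimal self-contained sketch. It does not use the LexEVS classes: `SortSketch` is a local stand-in that mirrors the two method signatures described above, `CodedEntry` and `NameSort` are hypothetical, and `IllegalArgumentException` stands in for `LBParameterException`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SortSketchDemo {
    public static void main(String[] args) {
        List<CodedEntry> entries = new ArrayList<>(Arrays.asList(
            new CodedEntry("C98765", "Heart Attack"),
            new CodedEntry("C01234", "Heart")));
        SortSketch sort = new NameSort();
        if (sort.isSortValidForClass(CodedEntry.class)) {
            entries.sort(sort.getComparatorForSearchClass(CodedEntry.class));
        }
        for (CodedEntry entry : entries) {
            System.out.println(entry.code + " " + entry.name);
        }
    }
}

/** Local stand-in for the Sort extension interface (assumption, not the LexEVS type). */
interface SortSketch {
    <T> Comparator<T> getComparatorForSearchClass(Class<T> searchClass);

    boolean isSortValidForClass(Class<?> clazz);
}

/** Hypothetical result type with the two fields used in the examples. */
final class CodedEntry {
    final String code;
    final String name;

    CodedEntry(String code, String name) {
        this.code = code;
        this.name = name;
    }
}

/** Sorts CodedEntry results alphabetically by name. */
final class NameSort implements SortSketch {
    @Override
    public boolean isSortValidForClass(Class<?> clazz) {
        return CodedEntry.class.isAssignableFrom(clazz);
    }

    @Override
    @SuppressWarnings("unchecked")
    public <T> Comparator<T> getComparatorForSearchClass(Class<T> searchClass) {
        if (!isSortValidForClass(searchClass)) {
            // The real interface declares LBParameterException; this is a stand-in.
            throw new IllegalArgumentException("Sort not valid for " + searchClass.getName());
        }
        Comparator<CodedEntry> byName = Comparator.comparing(entry -> entry.name);
        // Safe here because the validity check guarantees T is a CodedEntry type.
        return (Comparator<T>) byName;
    }
}
```

Checking `isSortValidForClass` before asking for a comparator lets a caller skip a sort that does not apply to the result type instead of handling an exception.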

...

Given two database tables, retrieve the Code, Name, and Qualifier for each Code.

Table Codes

Code   | Name
C01234 | Heart
C98765 | Heart Attack

Table Qualifiers

Code   | Qualifier
C01234 | isAnOrgan
C98765 | isADisease

...

SELECT * FROM Codes

</source>

Results in:

Code   | Name
C01234 | Heart
C98765 | Heart Attack

...

Given two database tables, retrieve the Code, Name, and Qualifier for each Code.

Table Codes

Code   | Name
C01234 | Heart
C98765 | Heart Attack

Table Qualifiers

Code   | Qualifier
C01234 | isAnOrgan
C98765 | isADisease

...

SELECT Codes.Code, Name, Qualifier FROM Codes JOIN Qualifiers ON Codes.Code = Qualifiers.Code

</source>

Results in:

Code   | Name         | Qualifier
C01234 | Heart        | isAnOrgan
C98765 | Heart Attack | isADisease

...