Page History

Versions Compared

Old Version 6

changes.mady.by.user Jackel, Laura (NIH/NCI) [C]

Saved on Dec 20, 2011

compared with

New Version Current

changes.mady.by.user Unknown User (wileyal)

Saved on Jan 20, 2014

Key

This line was added.
This line was removed.
Formatting was changed.

Comment: Migration of unmigrated content due to installation of a new plugin

Scrollbar

icons	false

Page info
title
title

Section

Column

Panel

title	Contents of this Page

Table of Contents

minLevel	2

Column

Align

align	right

Include Page

	Menu LexEVS 6.x Programmers to Include
	Menu LexEVS 6.x Programmers to Include

Introduction

This document is a section of the LexEVS 6.x Programmer's Guide.

LexEVS 6.0 implements the following performance and behavior enhancements in the Query Services Extension:

Lucene lazy loading for improved query retrieval performance
Search interface for plugging in custom search algorithms
Enhanced and new search algorithms for improved accuracy and performance
Sort interface for plugging in custom sort algorithms
SQL optimization for improved performance in large scale query retrievals

Lucene Lazy Loading

After the Lucene search is complete, the system stores only the Document id of documents that match the search criteria. Then, when information from the document is needed, it is retrieved from the document. This is helpful in iterator-type scenarios, where retrieval can be done one at a time.

Background - Lucene Documents

Lucene stores information in documents, and these documents have fields that are used to hold information. Each document has a unique id. For example, an index of people may be indexed in Lucene as:

Code Block
{scrollbar:icons=false} {highlight:red}This document refers to 5.1. I left it that way for review, but changed it to the 6.0 Programmer's Guide link.{highlight} h1. {page-info:title} {panel:title=Contents of this Page} {toc:minLevel=2} {panel} h2. Introduction This document is a section of the [WIP LJJ LexEVS 6.0 Programmer's Guide]. LexEVS v5.1 implements the following performance and behavior enhancements in the Query Services Extension: * Lucene lazy loading for improved query retrieval performance * Search interface for plugging in custom search algorithms * Enhanced and new search algorithms for improved accuracy and performance * Sort interface for plugging in custom sort algorithms * SQL optimization for improved performance in large scale query retrievals h2. Lucene Lazy Loading After the Lucene search is complete, the system stores only the Document id of documents that match the search criteria. Then, when information from the document is needed, it is retrieved from the document. This is helpful in iterator-type scenarios, where retrieval can be done one at a time. h3. Background - Lucene Documents Lucene stores information in documents, and these documents have fields that are used to hold information. Each document has a unique id. For example, an index of people may be indexed in Lucene as: <source> {code} Document: id 1 First Name: John Last Name: Doe Sex: Male Age: 45 Document: id 2 First Name: Jane Last Name: Doe Sex: Female Age: 40 ... etc. {code} </source> LexEVS stores information about entities in this way. Property names and values, as well as qualifiers, language, and various other information about the entity are held in Lucene indexes. h3. Backgroud - Querying Lucene Lucene provides a query mechanism to search through the indexed documents. Given a search query, Lucene will provide the document id and the score of the match. (Lucene assigns every match a score, depending on the strength of the match given the query.) So, if the above index is queried for "First Name = Jane AND Last Name = Doe", the result will be the document id of the match (2), and the score of the match (a float number, usually between 1 and 10). Notice that none of the other information is returned, such as sex or age. It is useful for that extra information to be there, because if it exists in the Lucene indexes we do not have to make a database query for it. But, retrieving data from Lucene documents is expensive, just as retrieving data from a database would be. h3. Lazy Retrieval Lazy retrieval can be leveraged to increase performance in LexEVS. Consider this simplified LexEVS entity index: <source> {code}

Code Block

Wiki Markup

{scrollbar:icons=false}
{highlight:red}This document refers to 5.1. I left it that way for review, but changed it to the 6.0 Programmer's Guide link.{highlight}

h1. {page-info:title}
{panel:title=Contents of this Page}
{toc:minLevel=2}
{panel}


h2. Introduction
This document is a section of the [WIP LJJ LexEVS 6.0 Programmer's Guide].  

LexEVS v5.1 implements the following performance and behavior enhancements in the Query Services Extension:
* Lucene lazy loading for improved query retrieval performance
* Search interface for plugging in custom search algorithms
* Enhanced and new search algorithms for improved accuracy and performance
* Sort interface for plugging in custom sort algorithms
* SQL optimization for improved performance in large scale query retrievals

h2. Lucene Lazy Loading
After the Lucene search is complete, the system stores only the Document id of documents that match the search criteria. Then, when information from the document is needed, it is retrieved from the document. This is helpful in iterator-type scenarios, where retrieval can be done one at a time.

h3. Background - Lucene Documents

Lucene stores information in documents, and these documents have fields that are used to hold information. Each document has a unique id. For example, an index of people may be indexed in Lucene as:

<source>
{code}
Document: id 1
First Name: John
Last Name: Doe
Sex: Male
Age: 45

Document: id 2
First Name: Jane
Last Name: Doe
Sex: Female
Age: 40

... etc.
{code}
</source>

LexEVS stores information about entities in this way. Property names and values, as well as qualifiers, language, and various other information about the entity are held in Lucene indexes.

h3. Backgroud - Querying Lucene

Lucene provides a query mechanism to search through the indexed documents. Given a search query, Lucene will provide the document id and the score of the match. (Lucene assigns every match a score, depending on the strength of the match given the query.)

So, if the above index is queried for "First Name = Jane AND Last Name = Doe", the result will be the document id of the match (2), and the score of the match (a float number, usually between 1 and 10).

Notice that none of the other information is returned, such as sex or age. It is useful for that extra information to be there, because if it exists in the Lucene indexes we do not have to make a database query for it. But, retrieving data from Lucene documents is expensive, just as retrieving data from a database would be.

h3. Lazy Retrieval

Lazy retrieval can be leveraged to increase performance in LexEVS. Consider this simplified LexEVS entity index:

<source>
{code}

LexEVS stores information about entities in this way. Property names and values, as well as qualifiers, language, and various other information about the entity are held in Lucene indexes.

Background - Querying Lucene

Lucene provides a query mechanism to search through the indexed documents. Given a search query, Lucene will provide the document id and the score of the match. (Lucene assigns every match a score, depending on the strength of the match given the query.)

So, if the above index is queried for "First Name = Jane AND Last Name = Doe", the result will be the document id of the match (2), and the score of the match (a float number, usually between 1 and 10).

Notice that none of the other information is returned, such as sex or age. It is useful for that extra information to be there, because if it exists in the Lucene indexes we do not have to make a database query for it. But, retrieving data from Lucene documents is expensive, just as retrieving data from a database would be.

Lazy Retrieval

Lazy retrieval can be leveraged to increase performance in LexEVS. Consider this simplified LexEVS entity index:

Code Block
Document: id 1 Code: C12345 Name: Heart Document: id 2 Code: C67890 Name: Foot Document: id 3 Code: C98765 Name: Heart Attack {code} </source> If a user constructs a query (Name =

If a user constructs a query (Name = Heart*),

...

the

...

query

...

will

...

return

...

with

...

the

...

matching

...

Document

...

ids

...

and

...

2).

...

Previously,

...

LexEVS

...

would

...

immediately

...

retrieve

...

the

...

Code

...

and

...

Name

...

fields

...

from

...

the

...

matches,

...

and

...

use

...

them

...

construct

...

the

...

results

...

that

...

would

...

ultimately

...

returned

...

the

...

user.

...

This

...

does

...

not

...

scale

...

well,

...

especially

...

for

...

general

...

queries

...

large

...

ontologies.

...

large

...

ontology,

...

query

...

(Name

...

Heart*)

...

may

...

match

...

tens

...

thousands

...

documents.

...

Retrieving

...

the

...

information

...

from

...

all

...

these

...

documents

...

significant

...

performance

...

concern.

...

Instead

...

retrieving

...

the

...

information

...

front,

...

LexEVS

...

will

...

simply

...

store

...

the

...

document

...

for

...

later

...

use.

...

When

...

this

...

information

...

actually

...

needed

...

the

...

user

...

(for

...

example,

...

the

...

information

...

needs

...

displayed),

...

retrieved

...

demand

...

Searching

The org.LexGrid.LexBIG.Extensions.Extendable.Search

...

Interface

...

This
...
interface
...
enables
...
the
...
user
...
to
...
plug
...
in
...
custom
...
search
...
algorithms.
...
Users
...
can
...
construct
...
any
...
type
...
of
...
query
...
given
...
search
...
text.
...
The
...
query
...
can
...
include
...
wildcards,
...
it
...
can
...
group
...
search
...
terms,
...
etc.
...
Class:
...
`org.LexGrid.LexBIG.Extensions.Extendable`
...
`.Search`
Method:
`public org.apache.lucene.search.Query`
...
`buildQuery(String`
...
`searchText)`
...
Description:
...
Given
...
a
...
String
...
search
...
string,
...
build
...
a
...
query
...
object
...
to
...
match
...
indexed
...
Lucene
...
documents

	`org.LexGrid.LexBIG.Extensions.Extendable`

`.Search`
Method:	`public org.apache.lucene.search.Query`

	Given

Default AND

Previously,

...

for

...

most

...

algorithms

...

Lucene

...

applied

...

the

...

terms

...

multiple

...

terms

...

were

...

input

...

text.

...

For

...

example,

...

'heart

...

attack'

...

would

...

match

...

all

...

documents

...

containing

...

'heart'

...

'attack'.

...

This

...

lead

...

non-intuitive

...

query

...

results

...

being

...

returned.

...

LexEVS

...

the

...

Lucene

...

default

...

changed

...

AND.

...

Consequently,

...

precision

...

increased

...

and

...

returned

...

results

...

are

...

intuitive.

...

most

...

cases

...

the

...

AND

...

shrinks

...

the

...

number

...

results

...

returned

...

for

...

given

...

query,

...

which

...

turn

...

increases

...

overall

...

performance.

Algorithms

More Precise DoubleMetaphoneQuery

DoubleMetaphoneQueries enable the user to input incorrectly spelled search text, while still returning results. Because this is a 'fuzzy' search, it is important to structure the Query in a way that the most appropriate results are returned to the user first. For example, the Metaphone computed value for "Breast" and "Prostrate" is the same. Given the search term "Breast", both "Breast" and "Prostrate" will match with exactly the same score. Technically, this is correct behavior, but to the end user this is not desirable. To overcome this, LexEVS 6.0 has introduced a new query, WeightedDoubleMetaphoneQuery.

WeightedDoubleMetaphoneQuery

This algorithm does not automatically assume that the user has spelled the terms incorrectly. Searches are also based on the actual text that the user has input, along with the Metaphone value. Again, if the user input "Breast", the query will still match "Breast" and "Prostrate", but "Breast" will have a higher match score, because the actual user text is considered. This algorithm adds a greater precision to this fuzzy-type query.

Algorithm:

Code Block
h3. Algorithms h4. More Precise DoubleMetaphoneQuery DoubleMetaphoneQueries enable the user to input incorrectly spelled search text, while still returning results. Because this is a 'fuzzy' search, it is important to structure the Query in a way that the most appropriate results are returned to the user first. For example, the Metaphone computed value for "Breast" and "Prostrate" is the same. Given the search term "Breast", both "Breast" and "Prostrate" will match with exactly the same score. Technically, this is correct behavior, but to the end user this is not desirable. To overcome this, LexEVS v5.1 has introduced a new query, WeightedDoubleMetaphoneQuery. h4. WeightedDoubleMetaphoneQuery This algorithm does not automatically assume that the user has spelled the terms incorrectly. Searches are also based on the actual text that the user has input, along with the Metaphone value. Again, if the user input "Breast", the query will still match "Breast" and "Prostrate", but "Breast" will have a higher match score, because the actual user text is considered. This algorithm adds a greater precision to this fuzzy-type query. _Algorithm:_ <source> {code} get: user text input 2: total score = 0 3: metaphone score = 0 4: actual score = 0 5: metaphone value = lucene.computeMetaphoneValue(user text input) 6: metaphone score = lucene.scoreMetaphoneValue(metaphone value) 7: actual score = lucene.score(user text input) 8: total score = metaphone score + actual score 9: halt {code} </source> h4.

Code Block



h3. Algorithms

h4. More Precise DoubleMetaphoneQuery
DoubleMetaphoneQueries enable the user to input incorrectly spelled search text, while still returning results. Because this is a 'fuzzy' search, it is important to structure the Query in a way that the most appropriate results are returned to the user first. For example, the Metaphone computed value for "Breast" and "Prostrate" is the same. Given the search term "Breast", both "Breast" and "Prostrate" will match with exactly the same score. Technically, this is correct behavior, but to the end user this is not desirable. To overcome this, LexEVS v5.1 has introduced a new query, WeightedDoubleMetaphoneQuery.

h4. WeightedDoubleMetaphoneQuery

This algorithm does not automatically assume that the user has spelled the terms incorrectly. Searches are also based on the actual text that the user has input, along with the Metaphone value. Again, if the user input "Breast", the query will still match "Breast" and "Prostrate", but "Breast" will have a higher match score, because the actual user text is considered. This algorithm adds a greater precision to this fuzzy-type query.

_Algorithm:_
<source>
{code}
get: user text input
2: total score = 0
3: metaphone score = 0
4: actual score = 0
5: metaphone value = lucene.computeMetaphoneValue(user text input)
6: metaphone score = lucene.scoreMetaphoneValue(metaphone value)
7: actual score = lucene.score(user text input)
8: total score = metaphone score + actual score
9: halt
{code}
</source>

h4.

Case-insensitive Substring

The SubStringSearch algorithm is intended to find substrings within a large string.

For example:

Code Block
substring The SubStringSearch algorithm is intended to find substrings within a large string. For example: {code} 'with a heart attack' {code}

...will

...

match:

{

Code Block
} 'The patient _with a heart attack_ was seen today.' {code}

Also,

...

leading

...

and

...

trailing

...

wildcard

...

will

...

added,

...

{

Code Block
} 'th a heart atta' {code}'

..will

...

also

...

match:

{

Code Block
} 'The patient wi_th a heart atta_ck was seen today.' {code} _

Algorithm:

Code Block
_ <source> {code} get: user text input 2: user text input = '' + user text input + '' 3: score = lucene.score(user text input) 4: halt {code} </source> h2. Sorting h3. The

Sorting - The org.LexGrid.LexBIG.Extensions.Extendable.Sort

...

Interface

...

This

...

interface

...

allows

...

users

...

plug

...

customized

...

Sort

...

algorithms

...

sort

...

query

...

results:

...

Class:

...

	`org.LexGrid.LexBIG.Extensions.Extendable.Sort`

...

Method:

...

	`public`

...

<T>

...

Comparator<T>

...

getComparatorForSearchClass(Class<T>

...

searchClass)

...

throws

...

`LBParameterException`
Description:	Given a Class that this Sort is valid for, return the correct Comparator to compare the results and sort.
Method:	`public boolean isSortValidForClass(Class<?>`

...

clazz)

...

Description:

...

	Return

...

whether

...

not

...

this

...

Sort

...

valid

...

for

...

Sorting

...

given

...

Class.

...

Sorting on Different Class types A single Sort may be applicable for a variety of Class types. For instance, both an 'Association' and an 'Entity' may be sorted by 'Code', but the actual implementation of retrieving the Code and comparing it may be different between the two. It is the job of the Sort to implement a Comparator for each potential Class that it is eligible to sort.
Default Sorting All result sets are sorted by default by Lucene Score, meaning that the best match according to Lucene will always be returned first by default. Note that if two or more result sets are being Unioned, Intersected, or Differenced, the user must explicitly call a 'matchToQuery' sort on the result set as a whole to order all of the results.
Sort Contexts Sorts may be applicable in one or more 'Contexts.' (see: org.LexGrid.LexBIG.DataModel.InterfaceElements.types.SortContext

...

This

...

means

...

that

...

Sort

...

apply

...

only

...

CodedNodeSet,

...

only

...

CodedNodeGraph,

...

some

...

combination.

...

Sorts

...

will

...

only

...

employed

...

they

...

match

...

Context

...

which

...

results

...

being

...

sorted.

...

Performance

...

Issues

...

Sorting

...

generally

...

computationally

...

expensive,

...

because

...

order

...

correctly

...

sort,

...

field

...

sorted

...

fully

...

retrieved

...

entire

...

result

...

set.

...

very

...

specific

...

refined

...

queries,

...

this

...

problem,

...

large

...

ontologies

...

very

...

general

...

queries,

...

performance

...

concern.

...

alleviate

...

this,

...

'Post

...

sort'

...

been

...

introduced.

...

Post

...

Sorting

...

order

...

minimize

...

performance

...

impact

...

sorting,

...

users

...

encouraged

...

'Post

...

sort'

...

where

...

possible.

...

Post

...

sort

...

done

...

after

...

result

...

been

...

restricted,

...

thus

...

limiting

...

amount

...

information

...

that

...

must

...

retrieved

...

order

...

perform

...

sort.

...

instance,

...

query

...

match

...

Entities:

Code Block
<source> {code} \{"Heart", "Heart Failure", "Heart Attack", "Arm", "Finger", ...\} {code} </source> As described

As described earlier,

...

all

...

results

...

are

...

default

...

sorted

...

Lucene

...

score,

...

limit

...

the

...

result

...

set

...

the

...

top

...

the

...

result

...

is:

Code Block
<source> {code} \{"Heart", "Heart Failure", "Heart Attack"\} {code} </source> The restricted set can then be

The restricted set can then be 'Post'

...

sorted;

...

and

...

because

...

the

...

result

...

set

...

has

...

been

...

limited

...

reasonable

...

number

...

matches,

...

sorting

...

and

...

retrieval

...

time

...

can

...

minimized.

...

Algorithm:

Code Block

_
<source>
{code}
1: get: Sort requested by user
2: get: Context sort is being applied to
3: if: sort is not valid for Context
halt
4: else:
5: get: Class to be sorted on
6: if: sort is not valid for Class
halt
7: get: Comparator for Sort - given (Class to be sorted on)
8: sort results using Comparator for Sort
9: halt
{code}
</source>

h2. SQL Optimizations

h3. The

SQL Optimizations

The n+1

...

SELECTS

...

Problem

...

The
...
n+1
...
SELECTS
...
Problem
...
refers
...
to
...
how
...
information
...
can
...
optimally
...
be
...
retrieved
...
from
...
the
...
database,
...
preferably
...
using
...
as
...
few
...
queries
...
as
...
possible.
...
This
...
is
...
desirable
...
because
...
query
...
overhead
...
is
...
a
...
concern.
...
Every
...
query
...
must
...
be
...
packaged
...
and
...
sent
...
to
...
the
...
database
...
engine,
...
processed,
...
packaged
...
again
...
and
...
transferred
...
to
...
the
...
client.
...
Although
...
the
...
overhead
...
may
...
be
...
minimal
...
(a
...
few
...
milliseconds),
...
it
...
does
...
not
...
scale.
...
Although
...
sometimes
...
obvious,
...
n+1
...
queries
...
can
...
remain
...
in
...
a
...
system
...
undetected
...
until
...
scaling
...
problems
...
are
...
noticed.
...
To
...
avoid
...
this
...
problem,
...
a
...
JOIN
...
query
...
can
...
be
...
used.
...
In
...
LexEVS
...
6.
...
0,
...
there
...
were
...
three
...
n+1
...
SELECT
...
queries
...
fixed:
...
The
...
EntryState
...
while
...
building
...
the
...
CodedEntry
...
The
...
EntityDescription
...
on
...
AssociatedConcepts
...
AssociationQualifiers
...
on
...
AssociatedConcepts
...

The

...

n+1

...

SELECTS

...

Problem

...

Example

...

Given

...

two

...

database

...

tables,

...

retrieve

...

the

...

Code,

...

Name,

...

and

...

Qualifier

...

for

...

each

...

Code.

...

Table Codes

Code	Name
C01234	Heart
C98765	Heart Attack

Table Qualifiers

Code	Qualifier
C01234	isAnOrgan
C98765	isADisease

Code Block
SELECT * FROM Codes

Results in:

Code	Name
C01234	Heart
C98765	Heart Attack

To get the Qualifiers, separate SELECTs must be used for each.

Code Block

 Codes_
|
| *Code* | *Name* |
| C01234 | Heart |
| C98765 | Heart Attack |

_Table Qualifiers_
|
| *Code* | *Qualifier* |
| C01234 | isAnOrgan |
| C98765 | isADisease |

<source>
{code}
SELECT * FROM Codes
{code}
</source>

_Results in:_
|
| *Code* | *Name* |
| C01234 | Heart |
| C98765 | Heart Attack |

To get the Qualifiers, separate SELECTs must be used for each.
<source>
{code}
SELECT * FROM Qualifiers where Code = C01234
And
SELECT * FROM Qualifiers where Code = C98765
{code}
</source>

This sequence results in 1 Query to retrieve the data from the Codes table, and then n Queries from the Qualifiers table. This results in n+1 total Queries.

h3. The n+1 SELECTS Solution Example
Given two database tables, retrieve the Code, Name, and Qualifier for each Code.

_Table Codes_
|
|*Code* | *Name* |
| C01234 | Heart |
| C98765 | Heart Attack |

_Table Qualifiers_
|
| *Code* | *Qualifier* |
| C01234 | isAnOrgan |
| C98765 C98765

This sequence results in 1 Query to retrieve the data from the Codes table, and then n Queries from the Qualifiers table. This results in n+1 total Queries.

The n+1 SELECTS Solution Example

Given two database tables, retrieve the Code, Name, and Qualifier for each Code.

Table Codes

Code	Name
C01234	Heart
C98765	Heart Attack

Table Qualifiers

Code	Qualifier
C01234	isAnOrgan
C98765	isADisease

Code Block

 | isADisease |

<source>
{code}
SELECT * FROM Codes JOIN Qualifiers ON Code
{code}
</source>

_Results in:_
|
| *Code* | *Name* | *Qualifier* |
| C01234 | Heart | isAnOrgan |
| C98765 | Heart Attack | isADisease |

Because of the JOIN, only one Query is needed to retrieve all of the data from the database.

{scrollbar:icons=false}

Results in:

Code	Name	Qualifier
C01234	Heart	isAnOrgan
C98765	Heart Attack	isADisease

Because of the JOIN, only one Query is needed to retrieve all of the data from the database.

Scrollbar

icons	false

Content

Space Tools

Page History

Versions Compared

Old Version 6

New Version Current

Key

Page infotitletitle

Introduction

Lucene Lazy Loading

Background - Lucene Documents

Background - Querying Lucene

Lazy Retrieval

Searching

The org.LexGrid.LexBIG.Extensions.Extendable.Search

Interface

Default AND

Algorithms

More Precise DoubleMetaphoneQuery

WeightedDoubleMetaphoneQuery

Case-insensitive Substring

Sorting - The org.LexGrid.LexBIG.Extensions.Extendable.Sort

Interface

SQL Optimizations

The n+1

SELECTS

Problem

The

n+1

SELECTS

Problem

Example

The n+1 SELECTS Solution Example

Page info
title
title