Design Scope
GForge items
Please visit the LexEVS 5.1 Scope document found at: https://wiki.nci.nih.gov/display/EVS/LexEVS+5.1+Scope+document
Solution Architecture
Proposed technical solution to satisfy the following requirements:
High Level Architecture
The LexEVS 5.1 infrastructure exhibits an n-tiered architecture with client interfaces, server components, domain objects, data sources, and back-end systems (Figure 1.1). This n-tiered system divides tasks or requests among different servers and data stores. This isolates the client from the details of where and how data is retrieved from different data stores.
The system also performs common tasks such as logging and provides a level of security for protected content. Clients (browsers, applications) receive information through designated application programming interfaces (APIs). Java applications communicate with back-end objects via domain objects packaged within the client.jar. Non-Java applications can communicate via SOAP (Simple Object Access Protocol) or REST (Representational State Transfer) services.
Most of the LexEVS API infrastructure is written in the Java programming language and leverages reusable, third-party components. The service infrastructure is composed of the following layers:
Application Service layer - accepts incoming requests from all public interfaces and translates them, as required, to Java calls in terms of the native LexEVS API. Non-SDK queries are invoked against the Distributed LexEVS API, which handles client authentication and acts as proxy to invoke the equivalent function against the LexEVS core Java API. The caGrid and SDK-generated services are optionally run in an application server separate from the Distributed LexEVS API.
Core API layer - underpins all LexEVS API requests. Search of pre-populated Lucene index files is used to evaluate query results before incurring cost of database access. Access to the LexGrid database is performed as required to populate returned objects using pooled connections.
Data Source layer - is responsible for storage of and access to all data required to represent the objects returned through API invocation.
High Level Design Diagram
Figure 1.1 - High Level Diagram
1.0 Query Performance Enhancements
Lucene is a very fast search engine. Given a text string, it can find matching documents in huge indexes quickly; this is its purpose and strength. Lucene is not, however, a database. Retrieving information from the documents that the search found as 'hits' is slow. Consider this scenario: a user searches for 'heart' in the NCI MetaThesaurus. The Lucene search will probably return 50,000+ 'hits', and it does so very quickly. LexEVS previously would retrieve all of those documents to populate the ResolvedConceptReference, and retrieving this many documents from Lucene is slow. The solution is to lazy load the documents as needed. After the Lucene search is complete, we store only the Document Id. Then, when information from a document is needed, that document is retrieved by its Id. This is helpful in Iterator-type scenarios, where retrieval can be done one at a time.
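The lazy-loading idea can be sketched in plain Java without any Lucene dependency. This is a minimal illustration under stated assumptions, not the LexEVS implementation: the class LazyHit, the loadDocument method, and the field names "code" and "name" are all hypothetical, and the simulated index simply counts how many documents were read.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

public class LazyHitDemo {

    /** A search hit that keeps only the document id and score until a field is requested. */
    static final class LazyHit {
        private final int docId;
        private final float score;
        // Hypothetical stand-in for "read this document from the index".
        private final IntFunction<Map<String, String>> loader;
        private Map<String, String> fields; // stays null until first access

        LazyHit(int docId, float score, IntFunction<Map<String, String>> loader) {
            this.docId = docId;
            this.score = score;
            this.loader = loader;
        }

        int docId() { return docId; }

        float score() { return score; }

        /** Loads the stored fields on first use and caches them for later calls. */
        String field(String name) {
            if (fields == null) {
                fields = loader.apply(docId);
            }
            return fields.get(name);
        }
    }

    /** Counts how many documents were actually read from the simulated index. */
    static int documentsLoaded = 0;

    /** Simulated (and deliberately trivial) document retrieval. */
    static Map<String, String> loadDocument(int docId) {
        documentsLoaded++;
        Map<String, String> doc = new HashMap<>();
        doc.put("code", "C" + docId);
        doc.put("name", "Concept " + docId);
        return doc;
    }

    /** Simulates a search that returns many hits without loading any document. */
    static List<LazyHit> search(int hitCount) {
        List<LazyHit> hits = new ArrayList<>();
        for (int id = 0; id < hitCount; id++) {
            hits.add(new LazyHit(id, 1.0f, LazyHitDemo::loadDocument));
        }
        return hits;
    }

    public static void main(String[] args) {
        List<LazyHit> hits = search(50_000);
        System.out.println("Documents loaded after search: " + documentsLoaded);
        System.out.println("First hit name: " + hits.get(0).field("name"));
        System.out.println("Documents loaded after one access: " + documentsLoaded);
    }
}
```

In an Iterator-style consumer, only the hits that are actually visited pay the retrieval cost; the remaining hits stay as an id and a score.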
As we move forward, it is important to keep current with the latest Lucene API. Not only is this important for performance reasons -- it will limit our ability to upgrade our Lucene dependencies if we rely on
We advertise our Searches as being 'extensions', but in reality it is very difficult (or impossible) for a user to create a plug-in type Search. The Interface org.LexGrid.LexBIG.Extensions.Query.Search will be introduced. The purpose of this interface is to give users a plug-in type Interface to implement different search strategies. This interface will accept
As with Searching, Sort algorithms are not currently easy to extend. A well-defined, 'Extension-ready' interface would allow users to add sort functionality on demand, without rebuilding or recompiling. The existing Interface org.LexGrid.LexBIG.Extensions.Query.Search will be expanded to allow for easy implementation and flexibility, allowing rapid creation of new Sort Algorithms and techniques.
Join EntityDescription when building AssociatedConcepts
Furthermore, this will allow the 'EntityDescription' to be available without requiring the actual 'CodedEntry' to be resolved. For most use cases, this should enable users to resolve Graphs with 'CodedEntryDepth=0'. Avoiding any resolving of the CodedEntry will keep resolve times to a minimum.
Join EntryState when building CodedEntry
2.0 Metathesaurus Content (RRF)
3.0 Value Domain Support
Overview
The LexEVS Value Domain and Pick List service will provide the ability to load Value Domain and Pick List Definitions into the LexGrid repository, to apply user restrictions, and to resolve the definitions dynamically at run time. Both the Value Domain and Pick List services are an integrated part of the LexEVS core API.
The LexEVS Value Domain and Pick List service will provide programmatic access to load Value Domain and Pick List Definitions using the domain objects that are available via the LexGrid logical model. The service will also provide the ability to apply certain user restrictions (for example, pickListId or valueDomain URI) and to resolve the Value Domain and Pick List definitions dynamically at run time.
The LexEVS Value Domain and Pick List Service is meant to expose the API specifically for the Value Domain and Pick List elements of the LexGrid Logical Model. For more information on the LexGrid model see http://informatics.mayo.edu
LexEVS Value Domain and Picklist Service Class Diagram
Common Services Class Diagram
These are the classes that are used commonly across the Value Domain and Pick List implementations.
Value Domain Class Diagram
Classes that implement the LexEVS Value Domain API
Picklist Class Diagram
Classes that implement the LexEVS Pick List API
LexBIG Services Class Diagram
An interface to the LexEVS Value Domain and Pick List Services can be obtained using an instance of LexBigService.
4.0 Improved Loader Framework
Cross product dependencies
Include a link to the Core Product Dependency Matrix.
Changes in technology
Include any new dependencies in the Core Product Dependency Matrix and summarize them here.
Assumptions
List any assumptions.
Risks
Detailed Design
Specify how the solution architecture will satisfy the requirements. This should include high level descriptions of program logic (for example in structured English), identifying container services to be used, and so on.
Query Performance Enhancements
Lucene Lazy Loading
Background - Lucene Documents
For example, an index of People may be indexed in Lucene as:
... etc.
LexEVS stores information about Entities in this way. Property names and values, as well as Qualifiers, Language, and various other information about the Entity are held in Lucene indexes.
Background - Querying Lucene
Lucene provides a Query mechanism to search through the indexed documents. Given a search query, Lucene will provide the Document id and the score of the match (Lucene assigns every match a 'score', depending on the strength of the match given the query). So, if the above index is queried for "First Name = Jane AND Last Name = Doe", the result will be the Document id of the match (2), and the score of the match (a float number, usually between 1 and 10). Notice that none of the other information is returned, such as Sex or Age. It is useful for that extra information to be there, because if it exists in the Lucene indexes we do not have to make a database query for it. However, retrieving data from Lucene Documents is expensive, just as retrieving data from a database would be.
If a user constructs a Query (Name = Heart*), the query will return the matching Document ids (1 and 2). Previously, LexEVS would immediately retrieve the 'Code' and 'Name' fields from the matches, and use them to construct the results that would ultimately be returned to the user. This does not scale well, especially for general queries in large ontologies. In a large ontology, a Query of (Name = Heart*) may match tens of thousands of Documents. Retrieving the information from all these Documents is a significant performance concern. Instead of retrieving the information up front, LexEVS will simply store the Document id for later use. When this information is actually needed by the user (for example, the information needs to be displayed), it is retrieved on demand.
Searching
To allow users to plug in custom search algorithms, the LexEVS Extension framework needed to be extended to include Searches. The org.LexGrid.LexBIG.Extensions.Extendable.Search interface consists of one method to be implemented:
This enables the user to construct any type of Query given search text. Wildcards may be added, search terms may be grouped, etc.
Algorithms
More precise DoubleMetaphoneQuery
For example, the Metaphone computed value for "Breast" and "Prostrate" is the same. Given the search term "Breast", both "Breast" and "Prostrate" will match with exactly the same score. Technically, this is correct behavior, but it is not what the end user wants. To overcome this, we have introduced a new query, WeightedDoubleMetaphoneQuery.
WeightedDoubleMetaphoneQuery Algorithm
Case-insensitive substring
SubStringSearch - This algorithm is intended to find substrings within a large string. For example:
Also, a leading and trailing wildcard will be added, so
Algorithm
Sorting
Sorting matched results is an important part of interacting with the LexEVS API. Allowing users to plug in customized Sort algorithms makes LexEVS flexible enough to serve more groups of users. To implement a Sorting algorithm, a user must implement the org.LexGrid.LexBIG.Extensions.Extendable.Sort Interface.
As described earlier, all results are by default sorted by Lucene score, so if we limit the result set to the top 3, the result is:
The restricted set can then be 'Post' sorted. Because the result set has been limited to a reasonable number of matches, sorting and retrieval time can be minimized.
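The "limit by score, then post-sort" sequence can be sketched with standard Java collections. This is an illustration only and does not use the LexEVS Sort interface: the Match class, the topThenSort helper, and the sample codes, names, and scores are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class PostSortDemo {

    /** Hypothetical match record: a code, a display name, and a Lucene-style score. */
    static final class Match {
        final String code;
        final String name;
        final float score;

        Match(String code, String name, float score) {
            this.code = code;
            this.name = name;
            this.score = score;
        }
    }

    /** Keeps the n best matches by descending score, then applies the caller's post-sort. */
    static List<Match> topThenSort(List<Match> matches, int n, Comparator<Match> postSort) {
        return matches.stream()
                .sorted(Comparator.comparingDouble((Match m) -> m.score).reversed())
                .limit(n)
                .sorted(postSort)
                .collect(Collectors.toList());
    }

    /** Made-up sample data, listed in no particular order. */
    static List<Match> sample() {
        return Arrays.asList(
                new Match("C4", "Heart Failure", 2.1f),
                new Match("C1", "Heart Valve", 9.5f),
                new Match("C5", "Artificial Heart", 1.4f),
                new Match("C3", "Heart Attack", 6.0f),
                new Match("C2", "Heart", 7.2f));
    }

    public static void main(String[] args) {
        Comparator<Match> byName = Comparator.comparing((Match candidate) -> candidate.name);
        for (Match match : topThenSort(sample(), 3, byName)) {
            System.out.println(match.code + " " + match.name + " " + match.score);
        }
    }
}
```

Only the three highest-scoring matches reach the name comparator, so "Artificial Heart" is never sorted even though it would come first alphabetically. The cost of the post-sort depends on the limit rather than on the total number of hits.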
SQL Optimizations
The n+1 SELECTS Problem
The n+1 SELECTS Problem arises when one query retrieves n rows and a further query is then issued for each of those rows to fetch related data. Information should instead be retrieved from the database using as few queries as possible. This is desirable because:
To avoid this, a JOIN query can be used.
The n+1 SELECTS Problem Example
Given two database tables, retrieve the Code, Name, and Qualifier for each Code.
Table Codes
Table Qualifiers
Results in:
To get the Qualifiers, separate SELECTs must be used for each.
This sequence results in 1 Query to retrieve the data from the Codes table, and then n Queries against the Qualifiers table. This results in n+1 total Queries.
The n+1 SELECTS Problem Example (Solution)
Given two database tables, retrieve the Code, Name, and Qualifier for each Code.
Table Codes
Table Qualifiers
Results in:
Because of the JOIN, only one Query is needed to retrieve all of the data from the database. Although sometimes obvious, n+1 queries can remain in a system undetected until scaling problems are noticed. In LexEVS there were 3 n+1 SELECT queries fixed:
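Stepping back to the Codes and Qualifiers example, the difference in query counts can be made concrete with a small in-memory simulation. This is a sketch under stated assumptions rather than LexEVS code: FakeDb stands in for a real database, the SQL in the comments uses hypothetical column names, and the sample rows are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NPlusOneDemo {

    /** In-memory stand-in for a database that counts each "query" it serves. */
    static final class FakeDb {
        final Map<String, String> codes = new LinkedHashMap<>();      // code -> name
        final Map<String, String> qualifiers = new LinkedHashMap<>(); // code -> qualifier
        int queryCount = 0;

        /** Roughly: SELECT code, name FROM Codes */
        List<String> selectAllCodes() {
            queryCount++;
            return new ArrayList<>(codes.keySet());
        }

        /** Roughly: SELECT qualifier FROM Qualifiers WHERE code = ? */
        String selectQualifier(String code) {
            queryCount++;
            return qualifiers.get(code);
        }

        /** Roughly: SELECT c.code, c.name, q.qualifier FROM Codes c JOIN Qualifiers q ON c.code = q.code */
        List<String[]> selectJoined() {
            queryCount++;
            List<String[]> rows = new ArrayList<>();
            for (Map.Entry<String, String> entry : codes.entrySet()) {
                String code = entry.getKey();
                rows.add(new String[] {code, entry.getValue(), qualifiers.get(code)});
            }
            return rows;
        }
    }

    /** Made-up sample data with three codes, one qualifier each. */
    static FakeDb sampleDb() {
        FakeDb db = new FakeDb();
        db.codes.put("C1", "Heart");
        db.codes.put("C2", "Lung");
        db.codes.put("C3", "Liver");
        db.qualifiers.put("C1", "organ");
        db.qualifiers.put("C2", "organ");
        db.qualifiers.put("C3", "organ");
        return db;
    }

    /** One query for the codes, then one more per code: n+1 queries in total. */
    static int queriesWithNPlusOne(FakeDb db) {
        db.queryCount = 0;
        for (String code : db.selectAllCodes()) {
            db.selectQualifier(code);
        }
        return db.queryCount;
    }

    /** A single joined query returns the same data. */
    static int queriesWithJoin(FakeDb db) {
        db.queryCount = 0;
        db.selectJoined();
        return db.queryCount;
    }

    public static void main(String[] args) {
        System.out.println("n+1 approach: " + queriesWithNPlusOne(sampleDb()) + " queries");
        System.out.println("JOIN approach: " + queriesWithJoin(sampleDb()) + " queries");
    }
}
```

With three codes the first approach issues four queries and the second issues one. The gap grows linearly with the number of rows, which is why the problem often goes unnoticed on small test data.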
Metathesaurus Content (RRF)
Loads of the NCI MetaThesaurus RRF formatted data into the LexGrid model require a number of adjustments in order to accurately reflect the state of the data as it exists in the current RRF files.
Data Model Elements
Most data elements will be loaded as either properties or property qualifiers:
A few will be loaded as qualifiers to associations.
Retrieval and API Documentation
No new API retrieval methods will be implemented in the scope of LexEVS 5.1. However, some may be required in the scope of 6.0 for any mapping elements implemented as new model elements or model extensions to LexGrid. No changes to user interfaces will occur. Service methods for loading these elements will be consistent with the new Spring Batch loader framework.
MRREL.RRF File
Problem: REL and RELA column elements from the RRF source need to be connected.
Requirement: A single relationship should be loaded for a REL/RELA combination for a particular SAB between two CUIs.
Solution: Since RELA type RRF elements have been defined as relationship names specific to sources and not independent relationships themselves, these elements will be loaded as association qualifiers in the LexGrid model.
Problem and Requirement: The user is unable to distinguish individual relationships from one source or another. The same association "entity" exists only once but has two "source" qualifiers.
Proposed Solution: Propose AUI to AUI - the way CUI to CUI are currently handled in the implementation. Load supporting column elements from MRREL.RRF including contents of:
These will be available as elements of the overriding Metathesaurus Association and loaded as association qualifiers.
Problem: Self Referencing Relationships (CUI1 = CUI2) cannot be fully represented in our model. Previously, these were loaded as PropertyLinks. This fit into the LexEVS model well, but left out important RRF information. Most notably, PropertyLinks cannot contain Qualifiers like normal relations can. Because of the increased number of Qualifiers that are required to be placed on relations, much information would be lost representing these relations as PropertyLinks.
Solution: Do not treat a CUI1 = CUI2 relationship differently than a CUI1 != CUI2 relationship. For API and query purposes, qualify these relationships with a 'selfReferencing=true' Qualifier. In this way, we can still avoid cycles in the API, but maintain all relevant Qualifier information in the relation.
MRSAT.RRF
Problem: MRSAT.RRF is not loaded but only accessed for given preferred term algorithms. This data should be loaded as concept properties (STYPE=CUI), properties on properties (STYPE=AUI, SAUI, CODE, SCUI, SDUI), and qualifiers on associations (STYPE=RUI, SRUI). Some complexity may arise as concept properties can have additional qualifiers, but property-properties cannot and association-qualifiers cannot.
Requirement: If the STYPE is something other than RUI or SRUI, you can load:
CUI - used as the entityCode and loaded as such in the table
METAUI - load as a propertyQualifier (name=METAUI, value)
STYPE - load as a propertyQualifier (name=STYPE, value)
ATUI - load as propertyId
ATN - load as property name
SAB - load as a propertyQualifier (typeName=source)
ATV - load as a propertyValue
SUPPRESS - load as propertyQualifier if value != N
MRRANK.RRF
Problem: SAB specific ranking of representational form in MRRANK is not exposed to the user (it is used in an underlying ranking and specifying of preferred presentations for a given concept).
Requirement: Load elements of MRRANK so that they are available to the user.
Proposed Solution: Load MRRANK as property qualifier on Presentation type property with the property Name of "mrrank."
Retrieval: Available in current LexEVS API
MRSAB.RRF
Problem: MRSAB.RRF file data is not loaded or is otherwise unavailable to the user.
Requirement: Load MRSAB.RRF file data as metadata.
Implemented Solution: The entire content of each row of the MRSAB file is loaded as metadata to an external XML file, with tags created from column names and values inserted between tags as appropriate.
MRMAP.RRF, MRSMAP.RRF
Problem: MRMAP.RRF source load is not supported in the current load. Currently this RRF file is not populated in NCI Metathesaurus distributions. Mapping is not explicitly supported in the LexGrid Model.
Requirement: Load MRMAP data.
Solution: To be evaluated for a load to current model elements or possible new model mapping elements. The general agreement is that this is more appropriately implemented in 6.0.
MRHIER.RRF
Problem: HCD is loaded as a property on the presentation but the SAB is not associated with it, so we do not know the source of the HCD (only rows that have the HCD field populated are considered).
Requirement: These elements need to be loaded and available from the LexEVS API.
Solution: Load HCD associated field SAB as property qualifier when HCD is present. Load PTR as property.
MRDOC.RRF
Problem: MRDOC contains metadata unavailable to the user. It is not loaded by LexEVS.
Requirement: This metadata will be made available to the user.
Solution: MRDOC's column names and content will be processed as tag/value mappings to a metadata file.
MRDEF.RRF
Problem: Some values from each row are not loaded by LexEVS.
Requirement: AUI should be loaded to connect it with the presentation. ATUI, SUPPRESS, CVF, SATAUI column values will be loaded as property qualifiers on the Definition type property derived from the MRDEF column.
MRCONSO.RRF
Problem: Some elements from the columns of MRCONSO.RRF are not loaded by LexEVS.
Requirement: Load LUI, SUI, SAUI, SDUI, SUPPRESS, CVS fields and expose to the user.
Solution: All noted values will be loaded as property qualifiers.
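The MRSAT.RRF field mapping listed above can be sketched as a small row parser. This is a minimal illustration, not the LexEVS loader: LoadedProperty is a hypothetical holder rather than a LexGrid model class, the column positions assume the standard UMLS MRSAT layout (CUI|LUI|SUI|METAUI|STYPE|CODE|ATUI|SATUI|ATN|SAB|ATV|SUPPRESS|CVF), and the sample row values are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class MrsatRowDemo {

    /** Hypothetical holder for the mapped values; not a LexGrid model class. */
    static final class LoadedProperty {
        String entityCode;
        String propertyId;
        String propertyName;
        String propertyValue;
        final Map<String, String> qualifiers = new LinkedHashMap<>();
    }

    // Assumed column positions in a pipe-delimited MRSAT.RRF row:
    // CUI|LUI|SUI|METAUI|STYPE|CODE|ATUI|SATUI|ATN|SAB|ATV|SUPPRESS|CVF
    static final int CUI = 0, METAUI = 3, STYPE = 4, ATUI = 6, ATN = 8, SAB = 9, ATV = 10, SUPPRESS = 11;

    /** Maps one MRSAT row to a property, or returns empty for RUI/SRUI rows. */
    static Optional<LoadedProperty> map(String row) {
        String[] fields = row.split("\\|", -1); // -1 keeps trailing empty columns
        String stype = fields[STYPE];
        if (stype.equals("RUI") || stype.equals("SRUI")) {
            // Relationship attributes become association qualifiers and are not handled here.
            return Optional.empty();
        }
        LoadedProperty property = new LoadedProperty();
        property.entityCode = fields[CUI];
        property.propertyId = fields[ATUI];
        property.propertyName = fields[ATN];
        property.propertyValue = fields[ATV];
        property.qualifiers.put("METAUI", fields[METAUI]);
        property.qualifiers.put("STYPE", stype);
        property.qualifiers.put("source", fields[SAB]);
        if (!fields[SUPPRESS].equals("N")) {
            property.qualifiers.put("SUPPRESS", fields[SUPPRESS]);
        }
        return Optional.of(property);
    }

    public static void main(String[] args) {
        // Made-up row for illustration only.
        String row = "C0000001|L0000001|S0000001|A0000001|AUI|X123|AT0000001||EXAMPLE_ATN|EXAMPLE_SAB|example value|N||";
        map(row).ifPresent(p ->
                System.out.println(p.entityCode + " " + p.propertyName + "=" + p.propertyValue + " " + p.qualifiers));
    }
}
```

Returning an empty Optional for RUI and SRUI rows keeps the association-qualifier path separate, which mirrors the split the requirement draws between property rows and relationship rows.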
Value Domain Support
Improved Loader Framework
Implementation Plan
This will include the technical environment (HW, OS, Middleware), external dependencies, teams/locations performing development and procedures for development (e.g. lifecycle model, CM), and a detailed schedule.
Technical environment
No new environment requirements exist for LexEVS 5.1, with the exception of additional storage to accommodate larger content loads.
Software (Technology Stack)
Operating System
Application Server
Database Server
Other Software Components
Server Hardware
Server
Minimum Processor Speed
Minimum Memory
Storage
Expected file server disk storage (in MB)
Expected database storage (in MB)
Networking
Application specific port assignments
JBoss Container Considerations
LexEVS 5.1 places specific requirements on JBoss containers, in particular when multiple versions of LexEVS (for example 5.0 and 5.1) must be supported.
External dependencies
N/A
Team/Location performing development
Procedures for Development
Development will follow procedures as defined by NCI.
Detailed schedule
The LexEVS 5.1 project plan is located in Gforge at: LexEVS 5.1 Project Plan and LexEVS 5.1 Project Plan (PDF)
The LexEVS 5.1 BDA Project plan is located at: LexEVS 5.1 BDA Project Plan
Training and documentation requirements
N/A
Download center changes
N/A