  • Mayo has looked at the report from the SI group from CBIIT.  
  • Larry indicated that SPARQL queries are the main focus for the NCIT, along with the ability to federate queries across SPARQL endpoints.  Would like to have consistent results across LexEVS and SPARQL.
  • Jason and Kim have been working on a project 
  • Gilberto - there are no use cases prepared.  However, there are things that a terminology server cannot provide.  Would like to have more integrated services.
    • For example, if researching Cancer and looking for gene data: how do I glue this information together? If both are in RDF, then both can be queried together with SPARQL.
    • Another example is data elements: are there other data that might be appropriate for my research?  Users can start to explore ontologies for this data discovery.
    • Federation of data from other SPARQL endpoints is the primary interest.  
  • Larry suggested that, instead of LexEVS, hierarchy and traversals might be better implemented in SPARQL.
  • Gilberto - 
    • Federated queries - yes.
    • SPARQL doesn't need to support reasoning - however, some minimal reasoning 
    • Performance isn't a priority, but it can't be a bottleneck.  (A graph DB isn't under consideration.)
    • LexEVS/CTS2 doesn't need to tie to the triple-store (all shouldn't be exposed through the triple store)
  • Kevin provided an overview of "what does a terminology database need to do?" and reviewed how key-value stores, document stores (MongoDB, CouchDB), relational databases and graph databases satisfy specific functionality required by a terminology.
    • KVS - Key-Value Store; DS - Document Store; RDBMS - Relational Database; GDB - Graph Database  

 

Datastore Feature                          Datastore Type that Performs Well
Store a resource with an ID                KVS, DS, RDBMS, GDB
Find a resource by ID                      KVS, DS, RDBMS, GDB
Find a resource by a set of properties     DS, RDBMS, GDB
Find all edges of a resource               GDB, RDBMS
Traverse a graph                           GDB
Compute subgraphs                          GDB
Perform set operations on subgraphs        GDB
Calculate paths                            GDB
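
To make the first four rows of the table concrete, here is a minimal Python sketch of what each feature asks of a store. It is only an in-memory illustration, not any of the products discussed: the function names, IDs and sample resources are all hypothetical, and a real store would use indexes rather than the linear scans shown here.

```python
# Toy in-memory store; every name and sample value here is hypothetical.
resources = {}  # id -> property dict (what a key-value or document store holds)
edges = []      # (source_id, relation, target_id) triples (what a graph store adds)


def store(resource_id, **properties):
    """Store a resource with an ID."""
    resources[resource_id] = properties


def find_by_id(resource_id):
    """Find a resource by ID: a single keyed lookup."""
    return resources.get(resource_id)


def find_by_properties(**wanted):
    """Find resources matching a set of properties (linear scan, no index)."""
    return sorted(
        rid
        for rid, props in resources.items()
        if all(props.get(key) == value for key, value in wanted.items())
    )


def edges_of(resource_id):
    """Find all edges that touch a resource, in either direction."""
    return [e for e in edges if resource_id in (e[0], e[2])]


store("C1", name="Event", status="active")
store("C2", name="Accident", status="active")
store("C3", name="Retired event", status="inactive")
edges.append(("C2", "is_a", "C1"))
edges.append(("C3", "is_a", "C1"))

print(find_by_id("C2"))
print(find_by_properties(status="active"))
print(edges_of("C1"))
```

A pure key-value store only needs the first two functions. The property search and the edge lookup are where document, relational and graph stores start to differ.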

 

    • Need to look closely at your requirements and needs when choosing a solution.  

    • Kevin looked at Neo4J, OrientDB, and others by performing benchmarks to determine how well these tools were improving.   
    • Overall, Kevin found ArangoDB to be the best all-around solution.  It is a mix of document and graph solutions.  
      • Modeling is open for documents, graphs, and key value pairs
      • Allows for Joins
      • Provides graph functionality.
    • Gilberto - does ArangoDB provide a SPARQL endpoint plugin?  Kevin indicated that ArangoDB may not support SPARQL.
    • Demo of ArangoDB
      • CTS2 JSON for parts of SNOMED loaded into ArangoDB.
      • Benchmarks attempted
        • Neighborhood (Qualifier value) - LexEVS and CTS2 do this
          • returns in less than a second
        • Descendants (Qualifier value) - more difficult, as maxDepth is -1 (all)
          • returns in just over a second
          • typically done by building a table to traverse
        • Leaves (Event) (Return all the leaves)
          • Expensive to do in a DB
          • SNOMED Event branch - return all the leaves.
          • 7300 returned in less than 2 seconds.
        • Sub-Graphs (value set resolution related)
          • SNOMED root node - all of the Event branch with everything below, all of the Observation branch and all of the Organism branch.
          • Return how many in each branch and then provide intersection of these branches and see what is returned.
          • returns in 3 seconds.
            • all - 354,000 
            • event - 8500
            • obs - 855
            • organism - 34000
            • intersection - 1
          • Slightly slower results on OrientDB.
        • Graph neighbors - count only
          • How many nodes are in the graph - is difficult in LexEVS
          • extremely fast result. 
        • JOINS from nodes to edges
          • Joining the edges to the entity.
          • returns relation, to and from
        • Shortest Path to Root
          • Returns vertices and edges
      • Gilberto - how much difference was there between the reviewed tools?  
        • Kevin - OrientDB and ArangoDB are similar.   Neo4J is the most mature of all, but didn't have same performance and was more of a pure graph database.  
      • Tracy - to satisfy the need for a SPARQL endpoint, is Neo4J best?
        • Kevin - suggests that ArangoDB is not the way to go for SPARQL requirements.  
      • Kevin's use cases for ArangoDB are based on performance and the ability to quickly meet the requirements of users.  
      • Larry - how could this be used in combination with LexEVS and other tools?   
        • Kevin - the use of multiple stores/services is becoming more common to accomplish specific tasks.
    • NG (Kim and Jason) have been working on a SPARQL endpoint for LexEVS
      • Doesn't have to go through database layer so it is faster. 
      • Kim demoed some working code as part of the browser.  
      • Trees and hierarchy are faster. 
      • Continuing to review and understand how SPARQL can apply to EVS tools.
    • Larry - how difficult will it be to deploy a triple store and a graph DB in the NCI environment?
      • Sara - if it is part of build and deploy (aside from security concerns), then the tools support team can use it (for example Struts, Spring, etc.).
      • This impacts the DBAs more than systems.  It depends if the project teams need DBA support.
      • CBIIT managed hosting (supported by infrastructure teams) is currently how EVS is supported.
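
For readers unfamiliar with the graph benchmarks from the ArangoDB demo (descendants, leaves, sub-graph intersection, shortest path to root), the following is a minimal Python sketch of what each query computes. It is not the code that was demoed and does not use ArangoDB or its query language. The hierarchy is a made-up toy with hypothetical node names (E1, O1, X1, ...), not SNOMED content, and the function names are assumptions chosen for illustration.

```python
from collections import deque

# Hypothetical toy hierarchy: node -> list of parents. A node may have more
# than one parent, which is what makes branch intersections non-empty.
PARENTS = {
    "Event": ["Root"],
    "Observation": ["Root"],
    "Organism": ["Root"],
    "E1": ["Event"],
    "E2": ["E1"],
    "E3": ["E1"],
    "O1": ["Organism"],
    "X1": ["E2", "O1"],  # sits in both the Event and Organism branches
}

# Invert to parent -> children for downward traversal.
CHILDREN = {}
for child, parents in PARENTS.items():
    for parent in parents:
        CHILDREN.setdefault(parent, []).append(child)


def descendants(node):
    """All nodes below `node` at any depth (breadth-first, no depth limit)."""
    found, queue = set(), deque([node])
    while queue:
        for child in CHILDREN.get(queue.popleft(), []):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


def leaves(node):
    """Descendants of `node` that have no children of their own."""
    return {n for n in descendants(node) if n not in CHILDREN}


def shortest_path_to_root(node, root="Root"):
    """Shortest list of vertices from `node` up to `root`, or None if unreachable."""
    queue, seen = deque([[node]]), {node}
    while queue:
        path = queue.popleft()
        if path[-1] == root:
            return path
        for parent in PARENTS.get(path[-1], []):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return None


event, organism = descendants("Event"), descendants("Organism")
print(len(event), len(organism), sorted(event & organism))
print(sorted(leaves("Event")))
print(shortest_path_to_root("X1"))
```

The sub-graph benchmark corresponds to the set intersection on the branch results. In a relational store the same queries typically need a precomputed traversal table, as noted under the descendants benchmark.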

 
