NIH | National Cancer Institute | NCI Wiki  

We serve a number of vocabularies that come from outside sources, such as ChEBI and HGNC.  These procedures describe how to handle each one.

Scheduling

The NCI Thesaurus (NCIt) is published on the last Monday of the month and requires loading of supporting history files, mappings and value sets.  Depending on where that Monday falls, this processing can extend beyond the first of the next month.  The monthly vocabularies can be downloaded as soon as they are ready, but loading should not start until the NCIt is on Stage.  Any additional vocabularies can be loaded along with the monthlies or as part of a third release cycle once the monthlies reach Stage.

NCIT → DataQA → Editor review → Stage → Prod → post release tasks → announcement

                                                        |→ Load monthlies to DataQA → Editor review → Stage → Prod → announcement

We wait until the previous cycle reaches Stage to avoid any inconsistencies in the database.  During data promotion the entire DataQA database is sent up to the Stage tier.  We don't want half-finished loads to be carried along with it.  
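
To see how much of the month is left after an NCIt release, the release date can be computed ahead of time.  The following is a minimal sketch, not part of the existing tooling; the function names are hypothetical and it assumes only the "last Monday of the month" rule described above.

```python
import calendar
from datetime import date, timedelta


def last_monday(year: int, month: int) -> date:
    """Return the date of the last Monday in the given month."""
    last_day = date(year, month, calendar.monthrange(year, month)[1])
    # date.weekday() is 0 for Monday, so subtracting it steps back
    # from the end of the month to the most recent Monday.
    return last_day - timedelta(days=last_day.weekday())


def days_left_after_release(year: int, month: int) -> int:
    """Number of days remaining in the month after the NCIt release date."""
    last_day = date(year, month, calendar.monthrange(year, month)[1])
    return (last_day - last_monday(year, month)).days
```

A month with few days left after the release is one where the NCIt cycle is likely to run into the next month, which pushes back the start of the monthly loads.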

Monthly Vocabularies

ChEBI - Chemical Entities of Biological Interest

Location: ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology

Update frequency: Monthly

File to download: chebi.obo.gz

Supporting files needed: manifest and metadata

Loader: LoadOBO.sh

Summary: ChEBI releases within a day or two of the first of the month, depending on weekend and holiday schedules.  The file is downloaded directly from ChEBI and loaded with no processing needed.
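
The download can be scripted.  This is a sketch only: it assumes the loader is given an uncompressed .obo file (the page does not say whether the .gz can be loaded as-is), and the function names and working directory are hypothetical.  The URL is the Location above plus the file name.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

# Location and file name as listed above.
CHEBI_URL = "ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.obo.gz"


def gunzip(src: Path, dest: Path) -> Path:
    """Decompress a .gz file to dest and return dest."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


def fetch_chebi(workdir: Path) -> Path:
    """Download chebi.obo.gz into workdir and return the unpacked .obo path."""
    archive = workdir / "chebi.obo.gz"
    urllib.request.urlretrieve(CHEBI_URL, archive)
    return gunzip(archive, workdir / "chebi.obo")

# Example use (performs a network download):
#   obo_path = fetch_chebi(Path("."))
```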

GO - Gene Ontology

Location: http://current.geneontology.org/ontology/

Update frequency: Monthly

File to download: go.obo

Supporting files needed: manifest and metadata

Loader: LoadOBO.sh

Summary: The OBO file is generated on the first of every month as an automated export. The file is downloaded directly from GO and loaded with no processing needed.

GO is under development constantly with daily downloads available in OWL.  We have discussed migrating to OWL but need further editor and government review.


HGNC - HUGO Gene Nomenclature Committee

Location: https://www.genenames.org/download/statistics-and-files/

Update frequency: Monthly

File to download:  ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt

Supporting files needed: manifest, metadata and preferences

Loader: LoadOWL2.sh

Summary: HGNC updates its data continuously.  We download the complete data set once per month in tab-delimited text format.  We then process it using an in-house data transformation application to convert it into OWL.  Details here: HGNC Processing
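
Before running the transformation it can help to confirm that the download is complete and well formed.  The sketch below is not the in-house transformation application; it only reads the tab-delimited file and counts records.  The column names hgnc_id and symbol are assumptions about the file header and should be checked against the actual download.

```python
import csv

# Assumed column names; verify against the header of hgnc_complete_set.txt.
REQUIRED_COLUMNS = ("hgnc_id", "symbol")


def count_hgnc_records(handle) -> int:
    """Count data rows in an open tab-delimited HGNC file.

    Raises ValueError if any of the assumed columns is missing.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    # Count only rows that actually carry an identifier.
    return sum(1 for row in reader if row["hgnc_id"])

# Example use:
#   with open("hgnc_complete_set.txt", encoding="utf-8", newline="") as fh:
#       print(count_hgnc_records(fh))
```

A record count that drops sharply compared with the previous month is a sign of a truncated download.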

BioPortal Ontologies

These vocabularies are all downloaded from the BioPortal ontology repository.  Each of the links below goes to a summary page for the listed ontology.  We generally take the OWL version of the ontology, if available.  Be cautious: a released date more recent than our version does not mean that there has been an actual release.  BioPortal's internal processes sometimes require re-uploads, and adding text files or other formats can also change the released date.  Always check the version recorded inside the vocabulary data to confirm that the data has actually been updated before embarking on a load.
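
One way to check the version inside an OWL download is to read the owl:Ontology header.  The sketch below assumes the file is serialized as RDF/XML and that the ontology records its version in owl:versionIRI or owl:versionInfo; not every ontology fills in both, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

OWL_NS = "http://www.w3.org/2002/07/owl#"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def owl_version(source):
    """Return (versionIRI, versionInfo) from the owl:Ontology element.

    source is a file path or a binary file object holding RDF/XML.
    Either value is None when the ontology does not declare it.
    """
    # iterparse streams the file, so large ontologies are not fully loaded.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == f"{{{OWL_NS}}}Ontology":
            iri_elem = elem.find(f"{{{OWL_NS}}}versionIRI")
            info_elem = elem.find(f"{{{OWL_NS}}}versionInfo")
            version_iri = None
            if iri_elem is not None:
                version_iri = iri_elem.get(f"{{{RDF_NS}}}resource")
            version_info = None
            if info_elem is not None:
                version_info = (info_elem.text or "").strip()
            return version_iri, version_info
    return None, None
```

If both values match what is already loaded, the newer released date on BioPortal most likely reflects a re-upload rather than a new release.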

(Image: example summary page with download links)

OBI - Ontology for Biomedical Investigations

Location: https://bioportal.bioontology.org/ontologies/OBI

Update frequency: Every 3 months

Supporting files needed: manifest, metadata and preferences

Loader: LoadOWL2.sh

Summary: This is downloaded in OWL format and loaded directly.  Due to some editorial design decisions with the hierarchy, this has failed to load into LexEVS recently.  The last successful OBI load was in October 2016.  There is a LexEVS loader feature in 6.5.2 that should fix this issue.


Zebrafish

Location: http://bioportal.bioontology.org/ontologies/ZFA

Update frequency: Approximately twice a year

Supporting files needed: manifest and metadata

Loader: LoadOWL2.sh

Summary: We download the OBO file and load it with no processing.  This is one of the vocabularies where the released date can be misleading.  It is uploaded as OBO, then BioPortal eventually processes this and adds CSV, RDF and diff data, which changes the released date. 
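
For OBO downloads, the release recorded in the file header is a more reliable signal than the BioPortal released date.  This sketch reads the data-version header tag and falls back to the date tag; it assumes the file declares at least one of them, and the function name is hypothetical.

```python
def obo_release_tag(path):
    """Return the data-version (or, failing that, date) from an OBO header.

    Returns None if the header declares neither tag.
    """
    found = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith("["):
                # The first stanza (e.g. [Term]) ends the header section.
                break
            for tag in ("data-version", "date"):
                if line.startswith(tag + ":"):
                    # Split once so values containing colons stay intact.
                    found[tag] = line.split(":", 1)[1].strip()
    return found.get("data-version") or found.get("date")
```

Comparing this value with the version already loaded shows whether the new released date reflects new data or only BioPortal's post-processing.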


MA - Adult Mouse Anatomy

Location: https://bioportal.bioontology.org/ontologies/MA

Update frequency: Approximately once a year

Supporting files needed: manifest and metadata

Loader: LoadOBO.sh

Summary: We download this as OBO and load it with no processing.


OBIB - Ontology for BioBanking

Location: https://bioportal.bioontology.org/ontologies/OBIB

Update frequency: Infrequently - last updated 2017

Supporting files needed: manifest and metadata

Loader: LoadOWL2.sh

Summary: We download this as OWL and load it with no processing.  It has some structural issues that caused loads to fail in earlier versions of LexEVS.  The 6.5.2 version should fix those problems.


MGED - Microarray Gene Expression Data

Location: https://bioportal.bioontology.org/ontologies/MO

Update frequency: Infrequently - last updated 2015

Supporting files needed: manifest, metadata and preferences

Loader: LoadOWL2.sh

Summary: We download this as OWL and load it with no processing.

NPO - NanoParticle Ontology

Location: https://bioportal.bioontology.org/ontologies/NPO

Update frequency: Infrequently - last updated 2011

Supporting files needed: manifest, metadata and preferences

Loader: LoadOWL2.sh

Summary: We download this as OWL and load it with no processing.


UMLS Semnet - UMLS Semantic Network

Location: https://lhncbc.nlm.nih.gov/semanticnetwork/download.html

Update frequency: Infrequently - Last update is 2020AA

Supporting files needed: metadata

Loader: LoadUMLSSemnet.sh

Summary: This underpins much of how UMLS, NCI Metathesaurus and NCIt work.  Each concept in NCIt is given a semantic type to categorize it.  Because it is such a backbone, it is very rarely changed.  LexEVS has a custom loader for it that hasn't been used in a decade because there have been no updates to the data.

SEER CanMED

Location: https://seer.cancer.gov/oncologytoolbox/canmed

Update frequency: Undetermined

Supporting files needed: manifest, metadata and preferences

Loader: LoadOWL2.sh

Summary:  We download two files from the CanMED website and then process these into an OWL file. 

