caBIG®-NCI Data Standards
Common information building blocks, or data standards, for capturing and reporting data facilitate the understanding and sharing of cancer research information. Variation in data descriptors (metadata) makes it nearly impossible to aggregate and manage even modest-sized data sets, or to ask basic questions and obtain meaningful answers. One approach to data standards is to use vocabulary-driven metadata created with terms that represent common semantic concepts.
The NCI, together with its partners in the Cancer Biomedical Informatics Grid (caBIG®) community, is actively developing common data elements (CDEs) and standard vocabularies to be used as metadata descriptors for NCI-sponsored research and for caCORE CBIIT and caBIG® applications. These data standards are based on health information business needs or use cases from data collection forms, databases, clinical applications, data exchange formats, UML models, and common vocabularies.
The Cancer Data Standards Registry (caDSR), based on the ISO/IEC 11179 metadata registry standard, provides the means to register CDEs with assignment of common semantic concepts to facilitate information discovery.
Data Standards Development Process
The NCI and the caBIG® community support a broad initiative to develop standard tools and best practices including controlled vocabularies, reusable CDEs, and logical models of entities within and across life science domains. The caBIG® Data Standard Development and Governance Model, a plan for the development, review, acceptance, and maintenance of data standards, was adopted early and helps guide development of data standards within the caBIG® and NCI community.
The caBIG® Vocabulary and Common Data Element Workspace (VCDE) oversees this process. Specific subject areas or external standards proposed for adoption as data standards may be presented to the VCDE by the CBIIT Context Administrators, the caBIG® Workspaces, and other interested parties. Each group developing a data standard is expected to review current caDSR usage and any relevant external resources or standards before presenting a data standard package to the VCDE for consideration.
A proposed data standard is reviewed in detail by the VCDE. The VCDE may ask the submitter to clarify the proposal or to expand the scope of the proposed data standard to meet additional business needs or satisfy use cases uncovered during the review. When a final proposal is approved by the VCDE, the candidate data standard is submitted to the caBIG® community for review. Any comments received during the review period are addressed by the VCDE and may result in modification of the proposal. At the end of the review period, the VCDE makes a final determination on promotion of the candidate as a caBIG®-NCI data standard. Once a data standard is accepted, the user community may petition the VCDE for review and modification of the data standard. Any changes will undergo the same review process as the initial data standard.
Current caBIG®-NCI Data Standard Efforts
Data standards that are under review or have been approved by the caBIG®-NCI community are listed below. The Registration Status associated with a CDE indicates its progression through the caBIG®-NCI review and approval process. While a data standard is being developed, its component CDEs are assigned a caDSR Registration Status of Proposed or Candidate. When a data standard has been approved, the CDEs are assigned a caDSR Registration Status of Standard. A complete list of the CDEs included in the data standards may be found in the CDE Browser, using the left-hand tree to display the caBIG® Context Classification for Data Standards, or on the caBIG® Data Standards site. Click the link for each standard to download a zip file of descriptive documentation.
Approved caBIG®-NCI Data Standards
Accepted: September 8, 2006
The Age data standard addresses the need to exchange information about the age of a study subject. The age of a subject may be calculated based on the difference in an event date and the date of birth of the subject, or a numeric age value may be self-reported by the subject.
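The calculated form of age described above can be sketched in a few lines of Python. This is only an illustration: the function name `age_in_years` and the choice to report completed years are assumptions, not part of the data standard.

```python
from datetime import date


def age_in_years(date_of_birth: date, event_date: date) -> int:
    """Return completed years between date_of_birth and event_date.

    Hypothetical helper; the standard itself only states that age may be
    derived from the difference between an event date and the date of birth.
    """
    if event_date < date_of_birth:
        raise ValueError("event_date precedes date_of_birth")
    years = event_date.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the event year.
    if (event_date.month, event_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years
```

A self-reported age would bypass this calculation and be stored directly as a numeric value.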
Accepted: June 28, 2007
The Body Mass Index (BMI) standard consists of the components required to calculate and/or capture the body mass index of a person. The generic Common Data Element (CDE) for BMI includes a derivation rule that specifies the calculation of BMI using metric or international customary units. Four generic CDEs are included for height and weight and their respective units of measure using the Unified Code for Units of Measure (UCUM). Template CDEs provide guidance for the creation of descriptive CDEs when an Object Class other than Person is required.
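As a rough illustration of such a derivation rule, the sketch below computes BMI as weight in kilograms divided by the square of height in meters, after converting the inputs from the unit named by a UCUM code. The function name, the lookup tables, and the small set of supported unit codes are assumptions made for this example; the actual derivation rule is the one registered with the CDE.

```python
# Conversion factors to kilograms and meters, keyed by UCUM unit code.
# Only a few codes are listed here for illustration.
KG_PER_UNIT = {"kg": 1.0, "g": 0.001, "[lb_av]": 0.45359237}
M_PER_UNIT = {"m": 1.0, "cm": 0.01, "[in_i]": 0.0254}


def body_mass_index(weight: float, weight_unit: str,
                    height: float, height_unit: str) -> float:
    """Return BMI in kg/m2 from a weight and height in the given UCUM units."""
    weight_kg = weight * KG_PER_UNIT[weight_unit]
    height_m = height * M_PER_UNIT[height_unit]
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / (height_m * height_m)
```

Converting to metric first keeps a single formula for both unit systems; with pounds and inches this is equivalent to the familiar factor of roughly 703 applied to lb/in2.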
Accepted: August 17, 2007
The Body Surface Area (BSA) standard consists of the components required to capture the Body Surface Area and the method of calculation for a person. The generic Common Data Element (CDE) for BSA includes a derivation rule that allows for the calculation of an estimate of BSA using a standard formula, nomogram, or linear regression (requiring only weight). An additional CDE is proposed to capture the method used to calculate BSA. This CDE includes a Value Domain with Permissible Values that apply to humans and other animals to provide flexibility for reuse. The Object Class for both of the Data Element Concepts for the generic CDEs is Person (C25190). Template CDEs provide guidance for the creation of descriptive CDEs when an Object Class other than Person is required. Example CDEs are included in the description of the standard, but are not being considered for standardization at this time. These example CDEs have the Registration Status of Qualified. Graphic illustrations of the Templates and Examples are included in the Documentation.
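The pairing of a BSA value with the method used to calculate it can be sketched as follows. The source does not name the formulas covered by the derivation rule, so the Mosteller and Du Bois formulas used here are assumptions chosen because they are widely used; the function and key names are likewise hypothetical.

```python
import math


def bsa_mosteller(weight_kg: float, height_cm: float) -> float:
    """Mosteller formula: sqrt(weight_kg * height_cm / 3600), in m2."""
    return math.sqrt(weight_kg * height_cm / 3600.0)


def bsa_du_bois(weight_kg: float, height_cm: float) -> float:
    """Du Bois formula: 0.007184 * W^0.425 * H^0.725, in m2."""
    return 0.007184 * weight_kg ** 0.425 * height_cm ** 0.725


# Method names are illustrative, not the registered Permissible Values.
BSA_METHODS = {"Mosteller": bsa_mosteller, "Du Bois": bsa_du_bois}


def body_surface_area(weight_kg: float, height_cm: float,
                      method: str = "Mosteller") -> dict:
    """Return the BSA estimate together with the method used to obtain it."""
    if weight_kg <= 0 or height_cm <= 0:
        raise ValueError("weight and height must be positive")
    value = BSA_METHODS[method](weight_kg, height_cm)
    return {"bsa_m2": value, "method": method}
```

Returning the method alongside the value mirrors the idea of a separate CDE that records how the BSA was calculated.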
Accepted: April 2, 2005
The CDC Race and Ethnicity data standard is based on the Centers for Disease Control and Prevention (CDC) Race and Ethnicity codes. The code set uses a detailed set of Race and Ethnicity categories from the U.S. Bureau of Census to support more granular race and ethnicity reporting than is possible using the OMB categories. The code set has a hierarchical structure so that the information collected may be assigned to the OMB categories for reporting purposes. This data standard is also being used as the Health Level 7 (HL7) vocabulary set for Race and Ethnicity, and by reference, serves as the code set recommended by the Consolidated Health Informatics (CHI) Initiative.
Accepted: April 15, 2005
The Date and Time data standard includes data elements that can be used to record and exchange a full HL7 point in time specification or the individual date and time components of the specification.
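As an illustration of combining individual date and time components into a single point-in-time value, the sketch below renders a Python `datetime` in the compact `YYYYMMDDHHMMSS[+/-ZZZZ]` layout commonly used for HL7 timestamps. That this layout is the one targeted by the standard is an assumption, and the function name `to_hl7_timestamp` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_hl7_timestamp(moment: datetime) -> str:
    """Render a datetime as YYYYMMDDHHMMSS, plus a UTC offset if one is known."""
    text = moment.strftime("%Y%m%d%H%M%S")
    if moment.tzinfo is not None:
        # %z yields the offset as +HHMM or -HHMM for timezone-aware values.
        text += moment.strftime("%z")
    return text


# Example: components recorded separately, then combined.
eastern = timezone(timedelta(hours=-5))
stamp = to_hl7_timestamp(datetime(2005, 4, 15, 13, 30, 0, tzinfo=eastern))
```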
Accepted: November 15, 2007
This standard captures the education level associated with an individual regardless of their role (study participant, patient, investigator, etc.). The education levels reflect typical patterns of education progression. Existing CDEs were considered for harmonization as well as external sources including: the U.S. Dept of Education, U.S. Census Bureau, U.S. Dept of Labor, HL7v3 Ballot (RIM Value Domain), BRIDG, and the United Nations Educational, Scientific and Cultural Organization (UNESCO) International Standard Classification of Education (ISCED). Following current best practices and curation processes, lists from these sources were harmonized with the EVS NCIt Education Level super concept (C17953) to create the enumerated value domain for this standard.
Accepted: January 25, 2007
The Email Address standard consists of three generic data elements (CDEs) with a standard property term and a value domain that describes the format of an email address. A Template is included to provide guidance for the creation of descriptive data elements using Role permissible values as a substitution for the Object Class in the generic CDEs. Two example CDEs are provided to illustrate how the template is to be used. These are not part of the data standard.
Accepted: April 3, 2008
The family member relationship standard consists of the administered items required to capture the name of the relationship between a person and their family members. The generic Common Data Element (CDE) for family member relationship includes a Value Domain with Permissible Values that apply to relationship data that may be collected to identify family members for registry purposes, but also to identify relatives for contact purposes. The Object Class for the Data Element Concept is Family Member (C41256), with a Property Concept of Relationship (C25648). The Value Domain includes a concept term, Type (C25284) for the representation of the instance data.
Accepted: April 27, 2005
The Functional Performance Status Scale standard consists of data elements that describe several commonly used scales for assessment of a person's physical state or performance. Doctors and researchers use these scales and criteria to assess how a patient's disease is progressing, assess how the disease affects the daily living abilities of the patient, and determine appropriate treatment and prognosis. The generic data elements may be used without modification or as templates to create other data elements by the addition of descriptive terms to meet programmatic needs.
Accepted: February 24, 2006
The Gene Identifier data standard addresses the lack of a common genomic identifier in biomedical databases, which creates an interoperability problem in the caBIG™ environment. It addresses how to deal with multiple identifiers (e.g., RefSeq ID, GenBank ID, Entrez Gene ID, Ensembl ID, UniProt ID) that all point to the same object.
Accepted: June 13, 2006
The Language Name standard consists of data elements that describe the components needed to record and exchange language information. The data standard is based on the ISO 639-2 guidelines for reporting Language Name and will be updated to use ISO 639-3 when it becomes available. ISO 639-2 includes 523 languages. ISO 639-3 will include more than 7599 languages by incorporation of the Linguist List and Ethnologue list of languages.
Accepted: February 24, 2006
The Mailing Address standard includes components from the United States Postal Service (USPS) and the Universal Postal Union (UPU) specifications for mailing addresses. The address components provide for a Full United States Address based on USPS formatting conventions, map to the HL7v3 Address data type, and can be used to represent mailing addresses for 44 additional countries based on UPU formatting guidance.
Accepted: September 1, 2006
The Marital Status standard consists of a generic data element with a list of permissible values describing a person's self-reported, current marital status. The data element can be used without additional qualifiers. The permissible values were based on the Core Health Data Elements of the National Committee on Vital and Health Statistics (NCVHS). Modifications of the list and the definitions were agreed upon by consensus of the NCI Context Administrators.
Accepted: October 18, 2007
Additional CDE standards for Person Name were proposed to the VCDE WS to capture name components for prefix abbreviations that are used with a person's name for salutations and titles and suffix abbreviations that are used with a person's name to capture formal educational or professional titles.
Accepted: November 1, 2007
Additional CDE standards for Person Name were proposed to the VCDE WS to capture name components for the training and education suffix abbreviations that are used with a person's name to capture formal educational or professional titles.
Accepted: March 3, 2005
The OMB Race and Ethnicity data standard consists of CDEs, based on the Office of Management and Budget (OMB) requirements for Race and Ethnicity, with extensions to the Value Domain (added Unknown and Not Reported) to accommodate caBIG™ and NCI reporting needs.
Accepted: February 24, 2006
The Organization standard consists of data elements that describe the organization components, provides for full organization identification referencing the HL7v3 organization data type convention, and documents guidance for development of data elements that include an Affiliation/Role related to an Organization.
Accepted: October 28, 2005
The Person Name standard consists of data elements that describe the generic components of Person Name, provides for a HL7 Person Full-formatted Name, and documents guidance for recording and exchanging this type of information. This standard recognizes that there are conceptual differences in the determination of each term and relies on cultural practices, external standards, and business rules to determine which name part is appropriate where.
Accepted: September 1, 2006
The Person Religion Designation standard consists of data elements that describe the components of religious designation names and application to provide guidance for recording and exchanging this type of information. The following Candidate data standards are based on the HL7 version 2.5 guidelines for reporting Religion and include a generic data element and an instructional template data element. An example of a descriptive data element is presented to illustrate the use of the template and is not included in the data standard proposal.
Accepted: September 6, 2005
The Sex and Gender standard consists of data elements that describe the components of Sex and Gender information, and provides guidance for recording and exchanging this type of information. This standard recognizes that there are conceptual differences in the terms Sex and Gender and will provide for collection and exchange of both types of information.
Accepted: September 1, 2006
The Social Security Number standard consists of a data element and a value domain that describes the format of the number as nine digits with no separators (#########). The standard package also includes a template CDE that can be used to create Social Security Number data elements using Affiliation/Role qualifiers.
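A format check for the `#########` pattern can be sketched as below, assuming each `#` stands for one ASCII digit and no separators are allowed. The function name is hypothetical, and the check covers format only; it does not verify that a number has actually been issued.

```python
import re

# Nine ASCII digits, no hyphens or spaces.
SSN_PATTERN = re.compile(r"[0-9]{9}")


def is_valid_ssn_format(value: str) -> bool:
    """Return True if value consists of exactly nine digits."""
    return SSN_PATTERN.fullmatch(value) is not None
```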
Accepted: January 25, 2007
The Telephone Number standard consists of data elements that describe the components of telephone number use and application to provide guidance for recording and exchanging this type of information. The first generic data element describes the format for telephone number. The recommended format is based in part on the MS OFFICE Standard Format without spaces: +CCC(AAA)LLLLLLL/XXXXX, where +CCC is the country code, (AAA) is the area code, LLLLLLL is the local code, and XXXXX is the extension number.
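The recommended layout can be illustrated with a small formatter and parser. This sketch assumes that the letter counts in `+CCC(AAA)LLLLLLL/XXXXX` are maximum widths and that the extension is optional; neither assumption is stated in the standard description, and the function names are hypothetical.

```python
import re

# +CCC(AAA)LLLLLLL with an optional /XXXXX extension; widths treated as maxima.
PHONE_PATTERN = re.compile(r"\+(\d{1,3})\((\d{1,3})\)(\d{1,7})(?:/(\d{1,5}))?")


def format_phone(country: str, area: str, local: str, extension: str = "") -> str:
    """Compose a telephone number string in the recommended layout."""
    text = f"+{country}({area}){local}"
    if extension:
        text += f"/{extension}"
    if PHONE_PATTERN.fullmatch(text) is None:
        raise ValueError(f"components do not fit the expected layout: {text!r}")
    return text


def parse_phone(text: str) -> dict:
    """Split a formatted telephone number back into its components."""
    match = PHONE_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not in +CCC(AAA)LLLLLLL/XXXXX layout: {text!r}")
    country, area, local, extension = match.groups()
    return {"country_code": country, "area_code": area,
            "local_code": local, "extension": extension}
```

For example, `format_phone("1", "301", "5550100", "123")` yields `+1(301)5550100/123`, and `parse_phone` recovers the four components from that string.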
Candidate Data Standards Under caBIG®-NCI Review
Modifications to CDE Standards Under Review
A spreadsheet lists the Modifications to CDE Standards that are under review during the period beginning on August 4, 2008 and ending on September 3, 2008.
The VCDE Workspace has reviewed and evaluated existing caBIG® Common Data Element (CDE) Standards according to the caBIG® CDE Standard Maintenance process and is now proposing modifications for caBIG® Community Review. This maintenance review arose from a previous comprehensive review of the standards for UML modeling recommendations in 2007. Modifications to thirteen CDE Standards are under review:
- Sex & Gender
- Person Age
- Email Address
- Social Security Number
- Telephone Number
- Person Name
- Performance Function Status
- Mailing Address
To submit feedback on the proposed modifications, please use the CDE Standard Review Tracker https://gforge.nci.nih.gov/tracker/?group_id=109.
Units of Measure
Requires Further Review (Oct 5, 2006) - Discussions Ongoing
The Units of Measure data standard is based on The Unified Code for Units of Measure (UCUM), written by Gunther Schadow and Clement J. McDonald of The Regenstrief Institute for Health Care and Indiana University School of Medicine. The code system is based on the ISO 2955-1983, ANSI X3.50-1986, and HL7 specifications and consists of a basic set of terminal symbols for units, called unit atoms. It also contains a set of multiplier prefixes, business rules for expression syntax, and algebraic terms that enable users to create units encompassing all units used in international clinical science. UCUM is used by HL7 and LOINC and has been adopted by the Consolidated Health Informatics (CHI) Initiative, DICOM, and the OpenGIS consortium.