NIH | National Cancer Institute | NCI Wiki

What is the Goal of NLP?

The goal of the Open Health Natural Language Processing Consortium is to establish an open source consortium that promotes past and current development efforts and encourages participation in advancing future efforts. The consortium aims to facilitate and encourage new annotator and pipeline development, to exchange insights and collaborate on novel biomedical natural language processing systems, and to develop gold-standard corpora for development and testing. The Consortium promotes the open source Unstructured Information Management Architecture (UIMA) framework and SDK as the basis for biomedical NLP systems. Applications created within UIMA consist of software components (referred to as annotators) and their associated configuration files and external resources. Within the framework, one can also create complete pipelines composed of a sequence of annotators and the data flow between them.
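
The annotator-and-pipeline idea can be illustrated with a minimal sketch. This is plain Python, not the UIMA API (UIMA itself is a Java and C++ framework); the names `Annotation`, `Document`, `sentence_annotator`, `token_annotator` and `run_pipeline` are hypothetical and exist only for illustration. Each annotator reads the document, adds typed annotations, and passes the enriched document to the next annotator.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    kind: str   # annotation type, e.g. "Sentence" or "Token"
    begin: int  # start offset in the document text
    end: int    # end offset (exclusive)


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered_text(self, ann: Annotation) -> str:
        """Return the span of text an annotation refers to."""
        return self.text[ann.begin:ann.end]


# An annotator is any function that adds annotations to a document.
Annotator = Callable[[Document], None]


def sentence_annotator(doc: Document) -> None:
    # Naive rule (an assumption): a sentence ends at '.', '!' or '?'.
    for m in re.finditer(r"[^.!?]+[.!?]", doc.text):
        leading_ws = len(m.group()) - len(m.group().lstrip())
        doc.annotations.append(Annotation("Sentence", m.start() + leading_ws, m.end()))


def token_annotator(doc: Document) -> None:
    # Consumes the Sentence annotations produced earlier in the pipeline.
    sentences = [a for a in doc.annotations if a.kind == "Sentence"]
    for s in sentences:
        for m in re.finditer(r"\w+", doc.text[s.begin:s.end]):
            doc.annotations.append(Annotation("Token", s.begin + m.start(), s.begin + m.end()))


def run_pipeline(doc: Document, annotators: List[Annotator]) -> Document:
    # The pipeline is the ordered sequence of annotators.
    for annotate in annotators:
        annotate(doc)
    return doc


doc = run_pipeline(Document("Patient has cough. No fever."),
                   [sentence_annotator, token_annotator])
sentences = [doc.covered_text(a) for a in doc.annotations if a.kind == "Sentence"]
tokens = [doc.covered_text(a) for a in doc.annotations if a.kind == "Token"]
```

The order of annotators matters in this sketch: `token_annotator` finds nothing unless `sentence_annotator` has already run, which mirrors the data flow between annotators described above.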

Why use NLP?

The clinical and research medical community creates, manages and uses a wide variety of semi-structured and unstructured textual documents. Performing research, improving standards of care and evaluating treatment outcomes all require easy, and ideally automated, access to the content of these documents. The knowledge contained in unstructured textual documents (e.g., pathology reports, clinical notes) is critical to achieving all of these goals. For instance, clinical research usually requires the identification of cohorts that satisfy precisely defined patient- and disease-related inclusion and exclusion criteria. Biomedical NLP systems extract structured information from textual reports, facilitating searching, comparison and summarization.

What is NLP?

Natural language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. Natural language generation systems convert information from computer databases into readable human language. Natural language understanding systems convert samples of human language into more formal representations, such as parse trees or first-order logic, that are easier for computer programs to manipulate.

NLP is used to classify, extract, encode and summarize information from text documents. An NLP application makes the content of free text available for uses such as decision support, outbreak detection and quality review.

NLP applications in the biomedical domain include:

  • mining of information from biomedical documents and publications
  • retrieval of information from large, unorganized collections
  • communication with clinicians, patients, and scientists through natural language

Examples of NLP tasks are:

  • classifying chief complaints into syndrome categories, for example assigning a chief complaint of cough or shortness of breath (SOB) to the respiratory category
  • extracting a problem list from a patient's history and physical examination
  • determining the change in tumor size over a period of time (an example of encoding)
  • summarizing pages of past clinical notes such as family history, chronic conditions, new complaints and test results
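
The first task in the list above, classifying chief complaints into syndrome categories, can be sketched with a simple keyword lookup. The lexicon `SYNDROME_KEYWORDS` and the function `classify_chief_complaint` are hypothetical and far smaller than anything a production system would use; this only shows the shape of the task.

```python
import re
from typing import List

# Hypothetical keyword lexicon (an assumption for illustration only).
SYNDROME_KEYWORDS = {
    "respiratory": {"cough", "sob", "shortness of breath", "wheezing"},
    "gastrointestinal": {"nausea", "vomiting", "diarrhea"},
}


def classify_chief_complaint(complaint: str) -> List[str]:
    """Return every syndrome category whose keywords occur in the complaint."""
    # Lowercase and keep only alphabetic words so punctuation does not block matches.
    words = re.findall(r"[a-z]+", complaint.lower())
    padded = " " + " ".join(words) + " "
    return sorted(
        category
        for category, keywords in SYNDROME_KEYWORDS.items()
        # Pad with spaces so "sob" matches the word, not a substring of another word.
        if any(f" {kw} " in padded for kw in keywords)
    )


result = classify_chief_complaint("Cough and SOB x3 days")
```

A lookup like this misses misspellings, abbreviations outside the lexicon, and negated mentions, which is one reason real systems combine richer symbolic knowledge with statistical models.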

Example applications of NLP include:

  • identifying unreported MRSA infections
  • extracting information on pacemaker implantation procedures
  • identifying family history diagnoses
  • generating a problem list
  • identifying adverse events
  • matching patients to clinical trials

There are two main approaches to NLP: the symbolic approach and the statistical approach.

Symbolic NLP includes:

  • Morphological Knowledge (how words are created)
  • Lexical Knowledge (string matching)
  • Syntactic Knowledge (how words can be combined to form sentences)
  • Semantic Knowledge (what words mean)
  • Pragmatic Knowledge (how sentences are used in different situations)
  • Discourse Knowledge (how the preceding sentences affect the interpretation of the sentences that follow)
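
As a rough illustration of the first two kinds of knowledge in the list above, the sketch below pairs a naive morphological rule (stripping plural endings) with a lexical lookup. The `LEXICON` entries, the concept labels, and the functions `normalize` and `lookup` are all hypothetical; real systems rely on much larger, curated vocabularies and far more careful morphology.

```python
import re
from typing import List, Tuple

# Hypothetical lexicon mapping base word forms to concept types (an assumption).
LEXICON = {
    "infection": "Disorder",
    "pacemaker": "Device",
    "tumor": "Finding",
}


def normalize(word: str) -> str:
    """Morphological knowledge: reduce simple plurals to a base form (naive rules)."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def lookup(text: str) -> List[Tuple[str, str, str]]:
    """Lexical knowledge: match normalized words against the lexicon."""
    hits = []
    for raw in re.findall(r"[A-Za-z]+", text):
        base = normalize(raw)
        if base in LEXICON:
            hits.append((raw, base, LEXICON[base]))
    return hits


hits = lookup("Two pacemakers were implanted; no infections.")
```

Note that the second hit is returned even though the text says "no infections": recognizing negation requires syntactic and semantic knowledge that this sketch deliberately leaves out.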

Statistical NLP includes:

  • Modeling document content as bag-of-words (if “cough” appears → fluid)
  • Modeling probabilistic relationships among words and phrases (“purulent discharge” → fluid; “upon discharge” → release)
  • Modeling probabilistic relationships between words and concepts (caries, cavity, abrasion → caries)
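
The statistical ideas above can be sketched with a bag-of-words representation and a small naive Bayes classifier that chooses between the two senses of "discharge" mentioned in the list. Naive Bayes is used here as one simple, well-known probabilistic model, not as the method of any particular system; the four training sentences are made up for illustration, and the names `bag_of_words`, `train` and `predict` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def bag_of_words(text):
    """Represent a text as word counts, ignoring word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical labeled examples (made up for this sketch).
TRAINING = [
    ("purulent discharge from the wound", "fluid"),
    ("yellow discharge noted at incision site", "fluid"),
    ("upon discharge the patient was stable", "release"),
    ("discharge home with follow up in two weeks", "release"),
]


def train(examples):
    word_counts = defaultdict(Counter)  # per-label word frequencies
    label_counts = Counter()            # number of examples per label
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(bag_of_words(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict(text, model):
    word_counts, label_counts, vocab = model
    total_examples = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        n_words = sum(word_counts[label].values())
        # Log prior plus log likelihood of each known word (add-one smoothing).
        score = math.log(label_counts[label] / total_examples)
        for word, count in bag_of_words(text).items():
            if word not in vocab:
                continue  # skip words never seen in training
            p = (word_counts[label][word] + 1) / (n_words + len(vocab))
            score += count * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
sense_a = predict("purulent discharge", model)
sense_b = predict("upon discharge", model)
```

With this toy training set, the neighboring word ("purulent" or "upon") is what tips the decision, which is the kind of probabilistic relationship among words that the second bullet describes.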

Where do I find out more?