NOTE: For the latest version of cTAKES, see Apache cTAKES (incubating) and follow the install instructions there.

These instructions are for end users of cTAKES 2.5. With them you can install cTAKES 2.5, configure it, and use it to process text (typically text associated with a medical record). If you plan to extend, change, or modify the code within cTAKES, refer to the cTAKES 2.5 Developer Install Instructions.

These instructions cover installation and a test of the main product, including trained models for sentence detection and part-of-speech tagging, dictionaries from a subset of the UMLS, and a very small subset of the full LVG resource, among other resources. Optional components are also described.

Once you have finished installing cTAKES, you will be able to see what it is capable of. Taking fuller advantage of the software may require a few additional steps concerning which dictionaries are used. Those steps come last in these instructions.




1. Make sure you have Java 1.6 or higher. Most systems come with Java already installed.
Run this command to check your version.

java -version

If you do not have Java 1.6 or higher, you can install Java from

For example, Windows:

C:\>java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)
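
The version check in step 1 can also be scripted. The following is a minimal Python sketch, not part of cTAKES, that parses `java -version` output like the sample above and compares it with the 1.6 minimum. The function names are illustrative assumptions. It relies on the java launcher printing its version banner to standard error rather than standard output.

```python
import re
import subprocess


def parse_java_version(version_output):
    """Return (major, minor) from `java -version` output, e.g. (1, 6)."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no quoted version string found")
    return int(match.group(1)), int(match.group(2) or 0)


def meets_minimum(version_output, minimum=(1, 6)):
    """True if the reported version is at least `minimum`."""
    return parse_java_version(version_output) >= minimum


def installed_java_output():
    """Run `java -version`; the banner is written to stderr, not stdout."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return result.stderr


# The sample output shown above, used here instead of a live call.
sample = 'java version "1.6.0_20"\nJava(TM) SE Runtime Environment (build 1.6.0_20-b02)'
print(meets_minimum(sample))
```

To check the machine you are on, pass `installed_java_output()` to `meets_minimum` instead of the sample string.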

Install cTAKES



1. Navigate to the source downloads for a released version on SourceForge

2. Download the file.
Save the file to a temporary location on your machine.

3. Unzip (extract the contents of) the compressed file you downloaded into a directory that you want to be the cTAKES install location.
For example, Windows:




We will call this folder <cTAKES_HOME>. You will need to refer to this directory later.
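
Step 3 can be scripted as well. The sketch below assumes the download is a .zip archive. The names `extract_release`, `archive_path`, and `install_dir`, and the demo file names, are made up for illustration. The real archive is whatever file you saved in step 2.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_release(archive_path, install_dir):
    """Extract a downloaded .zip archive into install_dir and return that path."""
    install_dir = Path(install_dir)
    install_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(install_dir)
    return install_dir


# Demonstration with a throwaway archive; every name here is hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "download.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("README.txt", "placeholder")
    ctakes_home = extract_release(archive_path, Path(tmp) / "install")
    extracted = sorted(p.name for p in ctakes_home.iterdir())
print(extracted)
```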

Process documents using cTAKES

This version lets you test most components bundled in cTAKES in two ways:

  1. Use the cTAKES CAS Visual Debugger (CVD) to view results stored as XCAS files or to run the annotators.
  2. Use the cTAKES collection processing engine (CPE) to process documents in the cTAKES_HOME/testdata directory.

CAS Visual Debugger (CVD)



1. Open a command prompt and change to the cTAKES_HOME directory.

Windows:

cd \cTAKES-2.5

Linux:

cd /usr/bin/cTAKES-2.5


cTAKES_HOME must be your current directory unless you are skilled at setting paths on your machine.
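
Because the commands that follow assume cTAKES_HOME is the current directory, a quick sanity check can save some confusion. The sketch below is an illustration only. It assumes that cTAKES.jar and the testdata, resources, and cTAKESdesc directories (all referenced elsewhere in these instructions) sit directly under cTAKES_HOME. The function name `missing_entries` is made up.

```python
import tempfile
from pathlib import Path

# Entries these instructions refer to relative to cTAKES_HOME (an assumption
# about the layout, not an official manifest).
EXPECTED_ENTRIES = ("cTAKES.jar", "testdata", "resources", "cTAKESdesc")


def missing_entries(directory, expected=EXPECTED_ENTRIES):
    """Return the expected names that do not exist under `directory`."""
    directory = Path(directory)
    return [name for name in expected if not (directory / name).exists()]


# Demonstration against a temporary directory holding only two of the entries.
with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp)
    (home / "cTAKES.jar").write_text("")
    (home / "testdata").mkdir()
    demo_missing = missing_entries(home)
print(demo_missing)
```

Calling `missing_entries(Path.cwd())` from your command prompt's directory should return an empty list if that directory really is cTAKES_HOME under these assumptions.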

2. Start the CAS Visual Debugger by running this command:



The application may take a minute to start on slower hardware.

3. An analysis engine (AE) needs to be loaded in order to process text.
Use the Run -> Load AE menu bar command. Navigate to the file


Click Open.

4. Copy the example text below and paste it into the Text section of CVD, replacing the text that is already there.
This example file can also be found in test data:


Dr. Nutritious

Medical Nutrition Therapy for Hyperlipidemia

Referral from: Julie Tester, RD, LD, CNSD
Phone contact: (555) 555-1212
Height: 144 cm Current Weight: 45 kg Date of current weight: 02-29-2001
Admit Weight: 53 kg BMI: 18 kg/m2
Diet: General
Daily Calorie needs (kcals): 1500 calories, assessed as HB + 20% for activity.
Daily Protein needs: 40 grams, assessed as 1.0 g/kg.
Pt has been on a 3-day calorie count and has had an average intake of 1100 calories.
She was instructed to drink 2-3 cans of liquid supplement to help promote weight gain.
She agrees with the plan and has my number for further assessment. May want a Resting
Metabolic Rate as well. She takes an aspirin a day for knee pain.

5. From the menu bar, click Run -> Run AggregatePlaintextProcessor.

You'll get a list of all the annotations in the Analysis Results frame.

6. Named entities are now recognized in this clinical document. Annotations of type MedicationEventMention and EntityMention have been created. To find one, in the Analysis Results frame, click on the key in front of:

Then select edu.mayo.bmi.uima.core.type.textsem.EntityMention or edu.mayo.bmi.uima.core.type.textsem.EventMention.MedicationEventMention. This will show an Annotation Index in the lower frame. Select any annotation in that lower frame and you will see the discovered text in the Text frame on the right. You may close CVD if you wish.

Collection processing engine (CPE)



1. Open a command prompt and change to the cTAKES_HOME directory:

Windows:

cd C:\cTAKES2.5

Linux:

cd /usr/bin/cTAKES2.5


Note that cTAKES_HOME must be your current directory unless you are skilled at setting paths on your machine.

2. Start the collection processing engine by running this command:



The application may take a minute to start on slower hardware.

3. This will bring up the Collection Processing Engine Configurator. In the menu bar, click File -> Open CPE Descriptor.

4. Navigate to the file


Click Open.

5. Click the Play button (green/blue play arrow near the bottom).

6. You should see that one document was processed. You have now processed a collection of documents; in this case the collection contained only one document, just to show how it is done. Close the results window.

7. Close the CPE application. You may be prompted to save changes. Since this was just a test, you may click the No button.

8. Open a new command prompt and change to the <cTAKES_HOME> directory.

9. To test the results, use the comparison tool, which helps show that the results match expectations. Its syntax is:

java -cp cTAKES.jar edu.mayo.bmi.utils.xcas_comparison.Compare <First File> <Second File> <diff-html>

Where: <First File> is the first file to compare; <Second File> is the second file to compare; <diff-html> is the file the results are written to.

Copy and paste the example below, in which our example files have already been substituted, into a command prompt to run it. We have shipped an example of what the output should be for you to compare against.


Windows:

java -cp cTAKES.jar edu.mayo.bmi.utils.xcas_comparison.Compare ^
"testdata\cdptest\testoutput\plaintext\sample_note_plaintext.xml" ^
"testdata\cdptest\testsampleoutput\plaintext\sample_note_plaintext.xml" ^

Linux:

java -cp cTAKES.jar edu.mayo.bmi.utils.xcas_comparison.Compare \
"/usr/bin/cTAKES2.5/testdata/cdptest/testoutput/plaintext/sample_note_plaintext.xml" \
"/usr/bin/cTAKES2.5/testdata/cdptest/testsampleoutput/plaintext/sample_note_plaintext.xml" \

10. The resulting file will open for you. Look at the comparison to see the annotations resulting from this pipeline.
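
If you prefer to drive the comparison from a script, the command in step 9 can be assembled as an argument list, which sidesteps the differences between Windows (^) and Linux (\) line continuation and quoting. This is a minimal sketch, not part of cTAKES. The function names and the output file name diff.html are assumptions, while the class name, jar name, and input paths come from the instructions above.

```python
import subprocess

COMPARE_CLASS = "edu.mayo.bmi.utils.xcas_comparison.Compare"


def build_compare_command(first_file, second_file, diff_html, jar="cTAKES.jar"):
    """Build the argument list for the XCAS comparison tool."""
    return ["java", "-cp", jar, COMPARE_CLASS,
            str(first_file), str(second_file), str(diff_html)]


def run_compare(ctakes_home, first_file, second_file, diff_html):
    """Run the comparison with cTAKES_HOME as the working directory."""
    command = build_compare_command(first_file, second_file, diff_html)
    return subprocess.run(command, cwd=ctakes_home, check=True)


command = build_compare_command(
    "testdata/cdptest/testoutput/plaintext/sample_note_plaintext.xml",
    "testdata/cdptest/testsampleoutput/plaintext/sample_note_plaintext.xml",
    "diff.html",  # hypothetical output file name
)
print(" ".join(command))
```

Only `build_compare_command` is exercised here. `run_compare` needs a real cTAKES installation and is shown for completeness.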




Using the same CVD and CPE programs in the manner described above, you can test all the other components. The analysis engines and collection processing engines shipped with cTAKES for some of the annotators are described in the following table.




Table columns: Example Analysis Engine (AE), Example Collection Processing Engine (CPE), Example test data

  • Clinical Document Pipeline: the complete cTAKES pipeline, used to obtain the majority of cTAKES annotations
  • obtain cTAKES chunking annotations
  • Dependency Parser: obtain the dependency parse tree
  • Drug NER: the annotator to obtain drug annotations
  • Dictionary Lookup: mapping cTAKES annotations to dictionaries (e.g., SNOMED_CT or RxNorm)
  • PAD Term Spotter: identifying terms related to PAD
  • Smoking Status: the annotator to obtain document- or patient-level smoking status
  • Side Effect: the annotator to find side effect mentions and sentences from clinical documents

Next Steps

The cTAKES 2.5 Component Use Guide will help you to understand in great detail each of the cTAKES components that have been installed. In some cases you can learn how to improve the components. However, before you go on to process text in production you will need to consider dictionaries and models.


Bundled UMLS Dictionaries

cTAKES includes the complete UMLS (SNOMED-CT and RxNorm) dictionaries.

  • An rxnorm_index database (a Lucene index) containing drug names from RxNorm
  • A UMLS database (using two hsqldb tables) containing anatomical sites, procedures, signs/symptoms, and disorders/diseases from SNOMED-CT (umls_ms_2011ab)

To use them, you must have a UMLS username and password, and an Internet connection.


If you do not have a UMLS username and password, you may request one at UMLS Terminology Services.

In order to use the UMLS dictionaries shipped with cTAKES, you will need to do two things:

(1) Change the UMLSUser and UMLSPW <nameValuePair> strings in these descriptor files to your UMLS username and password.

  • Dictionary Lookup: <cTAKES_HOME>/cTAKESdesc/lookupdesc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml
  • (optional) Drug NER: <cTAKES_HOME>/cTAKESdesc/drugnerdesc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml

The following shows where in the files you would make the changes. (Do not change the <configurationParameters> by the same name.)
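
As an illustration of the same edit done programmatically, here is a hedged Python sketch. It assumes the descriptors follow the usual UIMA layout, in which each <nameValuePair> holds a <name> and a <value><string>...</string></value>, and that they use the UIMA resourceSpecifier namespace. The sample XML is a made-up fragment, not the shipped file, and the function names are hypothetical. Only <nameValuePair> entries are touched, so <configurationParameters> entries of the same name are left alone.

```python
import xml.etree.ElementTree as ET

# Assumed UIMA descriptor namespace; used only to keep the output unprefixed.
UIMA_NS = "http://uima.apache.org/resourceSpecifier"


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def set_name_value_pairs(xml_text, new_values):
    """Set <string> values of matching <nameValuePair> entries; return (xml, count)."""
    ET.register_namespace("", UIMA_NS)
    root = ET.fromstring(xml_text)
    changed = 0
    for element in root.iter():
        if local_name(element.tag) != "nameValuePair":
            continue
        name, string_node = None, None
        for node in element.iter():
            if local_name(node.tag) == "name":
                name = (node.text or "").strip()
            elif local_name(node.tag) == "string":
                string_node = node
        if name in new_values and string_node is not None:
            string_node.text = new_values[name]
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Hypothetical fragment for demonstration only.
sample = """<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <configurationParameters>
    <configurationParameter><name>UMLSUser</name><type>String</type></configurationParameter>
  </configurationParameters>
  <configurationParameterSettings>
    <nameValuePair><name>UMLSUser</name><value><string>PLACEHOLDER_USER</string></value></nameValuePair>
    <nameValuePair><name>UMLSPW</name><value><string>PLACEHOLDER_PW</string></value></nameValuePair>
  </configurationParameterSettings>
</analysisEngineDescription>"""

updated, count = set_name_value_pairs(
    sample, {"UMLSUser": "my_user", "UMLSPW": "my_password"})
print(count)
```

Editing the two descriptor files by hand in a text editor achieves the same result. A script is mainly useful if you maintain several installations.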


(2) Include the DictionaryLookupAnnotatorUMLS.xml Analysis Engine within your aggregate Analysis Engine, or switch to the ones provided by cTAKES. cTAKES provides duplicates of the shipped Analysis Engine descriptors, with UMLS in the name and DictionaryLookupAnnotatorUMLS.xml placed within them, for these components:

  • Dictionary Lookup
  • Clinical Documents pipeline
  • Drug NER
  • Side Effect

So you simply need to switch to using those descriptors. For example, if you were using AggregateCdaProcessor.xml in the Clinical Documents pipeline, you would switch to AggregateCdaUMLSProcessor.xml instead, which hooks into the complete dictionaries.

You can, of course, modify your own aggregate Analysis Engine files and place the DictionaryLookupAnnotatorUMLS.xml Analysis Engine within them.
Since this is an in-memory database implementation, please be patient during the initial load as it could take approximately 20-30 seconds for the database to initialize.

If you would like to go back to using the small sample dictionaries that do not require a UMLS username, use the DictionaryLookupAnnotator.xml (UMLS is not in the file name) Analysis Engine descriptor in your aggregate. Just removing your password from the DictionaryLookupAnnotatorUMLS.xml files will not switch you back to the small sample dictionaries.


We have successfully tested the 2008 release of the full LVG data. In order to use this release of the full LVG data you should:

  1. Download either the full version or the lite version from NIH Lexical Tools
  2. Extract the TGZ file that you downloaded to a temporary directory with a tool like 7-zip (available online). On some operating systems, like Windows, this may need to be done in two steps: 1) decompress the gzip layer and 2) extract the resulting tar archive.
  3. Replace the directory <cTAKES_HOME>/resources/lvgresources/lvg/data/HSqlDb with data/HSqlDb from your extracted download. Replacing the entire directory is appropriate.
  4. In the future, you can upgrade to later versions of LVG by editing the <cTAKES_HOME>/resources/lvgresources/lvg/data/config/ file, replacing "lvg2008" with the name of the new release.
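
Steps 3 and 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not a supported tool. The function names are made up, the demo file names are placeholders, the config line DB_NAME=lvg2008 and the release name lvg2012 are hypothetical, and the config file itself is left unnamed because the instructions above do not name it.

```python
import shutil
import tempfile
from pathlib import Path


def replace_hsqldb(ctakes_home, extracted_download):
    """Replace the bundled LVG HSqlDb directory with the downloaded one (step 3)."""
    target = Path(ctakes_home) / "resources" / "lvgresources" / "lvg" / "data" / "HSqlDb"
    source = Path(extracted_download) / "data" / "HSqlDb"
    if not source.is_dir():
        raise FileNotFoundError(f"expected directory not found: {source}")
    if target.exists():
        shutil.rmtree(target)  # the whole directory is replaced, per step 3
    shutil.copytree(source, target)
    return target


def retarget_release(config_text, new_release, old_release="lvg2008"):
    """Swap the release name inside the config file's text (step 4)."""
    return config_text.replace(old_release, new_release)


# Demonstration in a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp) / "cTAKES_HOME"
    old_db = home / "resources" / "lvgresources" / "lvg" / "data" / "HSqlDb"
    old_db.mkdir(parents=True)
    (old_db / "old.data").write_text("old")
    download = Path(tmp) / "download"
    (download / "data" / "HSqlDb").mkdir(parents=True)
    (download / "data" / "HSqlDb" / "new.data").write_text("new")
    replaced = replace_hsqldb(home, download)
    names = sorted(p.name for p in replaced.iterdir())
print(names)
print(retarget_release("DB_NAME=lvg2008", "lvg2012"))
```

Consider backing up the original HSqlDb directory before running anything like this against a real installation, since the replacement deletes it.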

Building Your Own Dictionaries

To install customized dictionaries for RxNorm, SNOMED-CT, or other vocabularies that are available through the UMLS, see the following posts on the cTAKES forums:


Some models included in cTAKES may not represent your data distribution well. If you want to build or train your own models, please read the cTAKES 2.5 Component Use Guide, particularly:

  • Training a sentence detector model
  • Training a Part of Speech (POS) tagger model (Building a model - Obtaining training data)
  • Creating a Part of Speech (POS) tag dictionary (Building a tag dictionary)
  • Training a chunker model (Building a model - Prepare GENIA training data)
  • Training a dependency parser (Dependency Parser)