National Cancer Institute U.S. National Institutes of Health www.cancer.gov

cTAKES 1.3.1 Developer Install Instructions

These instructions are for developers. With these instructions you can set up cTAKES within your development environment and then change or extend the code, compile and deploy. If you simply want to be a user of the software, refer to cTAKES 1.3.1 User Install Instructions.

Once you have completed this install you will have all the source code and be able to compile and deploy it as needed. Information about what the components do is not supplied by these install instructions. That is found in the cTAKES 1.3 User Guide. There is no training or documentation (except for code comments) on the code itself. You must familiarize yourself with the components and then study the code on your own to be able to extend it.

To modify the source code for a cTAKES component, you must first download the code. You can then edit it in an IDE, such as Eclipse, or in another editor of your choice. Compiles are performed in Eclipse or with Ant (using a command line).
Follow the instructions in the sections below that match your developer preferences.

Prerequisites

To complete these instructions as a developer you will need the following:

  • Sun's distribution of the Java JDK version 1.5+
  • Apache UIMA 2.3.1+ and Eclipse 3.7+
  • Ant 1.7.1+

    1. Open a command prompt window.

    2. Install the JDK (not the runtime environment) of Java 1.5+.

    This software can be downloaded from java.com.

    You need two things here. One is the proper version and the other is the SDK, not just the Java Runtime environment.

    To check if you have the SDK, look in the lib directory of the Java install and see if the file tools.jar is there. If there is a lib directory and there is a file by that name then you have the SDK.

    To check that you have the proper version, enter the command java -version on any command line (Windows and Linux) to see what version you have now. For example:

    C:\>java -version
    java version "1.6.0_20"
    Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
    Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)

    Note

    This is the SDK and not only the Java Runtime Environment.
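
On Linux, the tools.jar check described above can be scripted. The following is only a sketch for a POSIX shell; is_jdk is a hypothetical helper name and is not part of Java or cTAKES.

```shell
# Sketch: report whether a directory looks like a JDK (it has lib/tools.jar)
# rather than a runtime-only install. is_jdk is a hypothetical helper name.
is_jdk() (
    java_dir=$1
    [ -f "$java_dir/lib/tools.jar" ]
)

# Example use, assuming JAVA_HOME is already set:
#   if is_jdk "$JAVA_HOME"; then echo "JDK found"; else echo "JRE only"; fi
```

The function only looks for the file named in this step, so it says nothing about the Java version; use java -version for that.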

    3. Some commands and programs may find the Java runtime you want on their own, but it is best to set the JAVA_HOME environment variable. Set the value of JAVA_HOME to the absolute path of the root of the Java installation that you want UIMA to use. On Windows, right-click on My Computer > Properties > Advanced tab > Environment Variables button > New button for System variables. Keep clicking OK until you are out of the dialog series. On Linux, in bash or another Bourne-style shell, use the command
    export JAVA_HOME=<path>

    4. Navigate to the UIMA Java framework & SDK from Apache UIMA 2.3.1+.

    Go to the Apache UIMA Project site

    5. Download the UIMA Java framework & SDK

    Select the file to download based on your operating system:
    Windows: Download the Binary ZIP file
    Linux: Download the Binary TAR.GZ file

    Save the file to a temporary location on your machine.

    6. Unzip the compressed file you downloaded.

    On Windows, launch (double-click) the file and extract the files to a directory like c:\uimaj-2.3.1-bin\apache-uima
    On Linux, run the tar command and extract the files to a directory like /usr/bin/uimaj-2.3.1-bin/apache-uima

    7. (recommended) Rename the base directory to indicate a cTAKES install. For example, on Windows: rename uimaj-2.3.1-bin cTAKES1.3.1. On Linux: mv uimaj-2.3.1-bin cTAKES1.3.1

    All of the example commands after this point will use the modified directory name. This root directory we will call <cTAKES_HOME>
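
On Linux, steps 6 and 7 could be combined along the following lines. This is a sketch, not a supported installer: install_uima is a hypothetical helper name, and it assumes the downloaded archive unpacks to a top-level apache-uima directory, as the example paths in step 6 suggest.

```shell
# install_uima is a hypothetical helper: unpack the UIMA archive into
# <parent>/uimaj-2.3.1-bin, then rename that directory to cTAKES1.3.1.
install_uima() (
    archive=$1      # path to the downloaded Binary TAR.GZ file
    parent=$2       # directory that will hold the install
    mkdir -p "$parent/uimaj-2.3.1-bin" &&
    tar -xzf "$archive" -C "$parent/uimaj-2.3.1-bin" &&
    mv "$parent/uimaj-2.3.1-bin" "$parent/cTAKES1.3.1"
)

# Example use (the archive location is an assumption):
#   install_uima /tmp/uimaj-2.3.1-bin.tar.gz /usr/bin
```

Writing under /usr/bin normally requires root privileges, so you may prefer a parent directory that you own.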

    8. Set the UIMA_HOME environment variable. UIMA requires a special environment variable for its commands to run.

    Use UIMA_HOME for the name of the variable and the absolute path to the <cTAKES_HOME> directory in the previous step as the value.

    On Windows, right-click on My Computer > Properties > Advanced tab > Environment Variables button > New button for System variables. Keep clicking OK until you are out of the dialog series. On Linux, in bash or another Bourne-style shell, use the command export UIMA_HOME=<path>

    Note

    There is an underscore in the name of the variable. You cannot have spaces in the variable name nor in the path represented by the variable.

    9. An environment variable called PATH already exists. Modify that environment variable to add <cTAKES_HOME>/bin on the end of the value. For example, on Windows: ;c:\cTAKES1.3.1\apache-uima\bin
    on Linux: :/usr/bin/cTAKES1.3.1/apache-uima/bin

    Note

    There is a semi-colon (Windows) or colon (Linux) between the existing value of the PATH and the directory you are placing on the end.
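
On Linux, steps 8 and 9 might look like the following in a POSIX shell. This is a sketch under assumptions: the install path is this guide's Linux example, and append_to_path is a hypothetical helper that avoids adding the same directory twice.

```shell
# Sketch for a POSIX shell; the install path is this guide's Linux example.
UIMA_HOME=/usr/bin/cTAKES1.3.1/apache-uima
export UIMA_HOME

# Append a directory to PATH only if it is not already listed.
# append_to_path is a hypothetical helper name.
append_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
    export PATH
}

append_to_path "$UIMA_HOME/bin"
```

These settings last only for the current shell session. To keep them, you could add the same lines to a startup file such as ~/.profile.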

    10. Open a new command prompt (in order to pick up the environment variable changes). In your command prompt change to the cTAKES_HOME directory and run the command to set paths. The script is in the bin directory that you added to PATH in the previous step.
    On Windows: adjustExamplePaths.bat
    On Linux: adjustExamplePaths.sh

    11. (for developers using Eclipse) Install Eclipse and the UIMA plug-ins. That procedure is not documented here. You must follow the install instructions provided with UIMA for Eclipse.

    Note

    There are UIMA plug-ins that need to be installed. Do not skip the installation of these plug-ins. Refer to the documentation on apache.org.

    12. (for developers using command line compile) Navigate to the Ant download site on apache.org.

    13. (for developers using command line compile) Download Ant 1.7.1 or later. Unzip the compressed file you downloaded. We will call the resulting directory <ANT_HOME>. Follow the instructions for installing Ant on apache.org. This will include setting the ANT_HOME environment variable and changing PATH.

    Tip

    You need to install Ant only if you will compile source code from a command line instead of using Eclipse.

    The documents upon which you can run cTAKES will take many forms. We will cover an example of doing this in the Testing section.

Download source code

Since cTAKES is an open source tool, you can get the version that is currently in development through SVN. This is not recommended unless you know what you are doing. To get the latest stable release, follow the same instructions as for users. Go to the Install cTAKES user instructions now, perform those instructions and then come back here.

If you know what you are doing with the cTAKES code and you must get the latest code currently under development, then you need to use an SVN connection to retrieve the code. The pre-release versions are available from the SVN code repository on SourceForge.

Eclipse

Configure Eclipse

These instructions require the UIMA plug-ins. This was part of the prerequisites at the start of these instructions.

1. Open Eclipse to a new workspace.

File > Switch Workspace > Other

For the workspace location navigate to <cTAKES_HOME>

2. Add a new user library.

Window > Preferences > Java > Build Path > User Libraries

Click New...

Type UIMA (this name is required because the pre-built projects link to a library by this name).
Click OK.

3. Add UIMA JAR files to the library.

Click Add JARs...

Navigate to <cTAKES_HOME>\lib

Select all the JAR files and click Open.

4. Close the User Libraries dialog.

Click OK.

Although creating a new workspace is not required, we recommend you create one to separate cTAKES projects from your existing Eclipse projects.

Import and build projects

After this section is complete you will have all the makings of a development environment with the cTAKES code. You'll be able to modify the code, set breakpoints, check variable values in real time, and other fun things that programmers do.

1. In Eclipse use File -> Import...

Select Existing Projects into Workspace under General.

Click Next >.

2. For Select root directory, navigate to <cTAKES_HOME>.

All the pre-built cTAKES Eclipse projects will show up in the Projects list. You should notice that there is one for each PEAR file. You may deselect the uimaj-examples project if you wish as it is not needed.

Leave the rest of the projects selected and click Finish.

3. Change to the Java perspective (if Eclipse is not already there):
Window -> Open Perspective -> Java

4. (optional) Each project should already have UIMA linked as a user library. If you want to check: Right-click on any project. Select Properties. Select the Java Build Path then Libraries tab and expand the + sign by UIMA.

5. Most new workspaces in Eclipse are going to be set to build automatically. Check this by selecting the Project menu item. If Build Automatically has a checkmark by it, then you are ready to move on. If not then select Project -> Build All.

SVN

If you checked out source files from the SVN repository, you will need to generate the type system from the following type system descriptors:

  • chunker/desc/TypeSystem.xml
  • clinical documents pipeline/desc/analysis_engine/TypeSystem.xml
  • context dependent tokenizer/src/edu/mayo/bmi/uima/cdt/type/CdtTypeSystem.xml
  • core/src/edu/mayo/bmi/uima/core/type/TypeSystem.xml
  • dictionary lookup/src/edu/mayo/bmi/uima/lookup/type/DictionaryLookupTypeSystem.xml
  • document preprocessor/desc/CDAToTextTypeSystem.xml
  • NE contexts/desc/TypeSystem.xml
  • POS tagger/desc/TypeSystem.xml
  • Drug NER/desc/type_system/NERTypeSystem.xml (optional component)
  • dependency parser/desc/TypeSystem.xml (optional component)
  • smoking status/desc/type_system/SmokingProductionTypeSystem.xml (optional component)
  • SideEffect/desc/type_system/SideEffectTypeSystem.xml (optional component)
  • PAD term spotter/desc/type_system/PADSiteAndTerm.xml (optional component)
  • Constituency Parser/desc/TypeSystem.xml (optional component)
  • coref-resolver/desc/type-system/VecInst.xml (optional component)
  • coref-resolver/desc/type-system/CorefTypes.xml (optional component)
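
Before generating the type system, it can help to confirm that the required descriptors listed above are present in your checkout. The following POSIX shell sketch does only that check; check_descriptors is a hypothetical helper name, and the optional components are left out on purpose.

```shell
# check_descriptors is a hypothetical helper: print every required
# descriptor that is missing under the given <cTAKES_HOME>.
# Exit status is 0 when all are present and 1 otherwise.
check_descriptors() (
    home=$1
    missing=0
    while IFS= read -r rel; do
        if [ ! -f "$home/$rel" ]; then
            echo "missing: $rel"
            missing=1
        fi
    done <<'EOF'
chunker/desc/TypeSystem.xml
clinical documents pipeline/desc/analysis_engine/TypeSystem.xml
context dependent tokenizer/src/edu/mayo/bmi/uima/cdt/type/CdtTypeSystem.xml
core/src/edu/mayo/bmi/uima/core/type/TypeSystem.xml
dictionary lookup/src/edu/mayo/bmi/uima/lookup/type/DictionaryLookupTypeSystem.xml
document preprocessor/desc/CDAToTextTypeSystem.xml
NE contexts/desc/TypeSystem.xml
POS tagger/desc/TypeSystem.xml
EOF
    exit "$missing"
)

# Example use, assuming UIMA_HOME points at <cTAKES_HOME>:
#   check_descriptors "$UIMA_HOME" && echo "all required descriptors found"
```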

To generate the type system from Eclipse:

  1. Select the file in the Package Explorer or Navigator
  2. Open the file in Component Descriptor Editor (right click on the file > Open with > Component Descriptor Editor)
  3. Click the tab Type System
  4. Click the JCasGen button (in the center)
  5. Repeat the above steps for each type system descriptor
  6. Click Project > Build All to build all the projects, unless you already have Build Automatically selected in the Project menu

To generate the type system using Ant:

  1. Copy the build.xml ant script from the temporary directory to <cTAKES_HOME>
  2. Run the generate_types target within the build.xml ant script
  3. Refresh all projects within Eclipse to pick up the newly-generated files
  4. Select Project > Build All to build all the projects, unless you already have Build Automatically selected in the Project menu

Process a sample clinical note

You can now launch or debug the cTAKES components that you have built. You could run them from a command prompt, as described in the testing section of the user install instructions, but you can now launch them from within Eclipse instead.

1. Launch the CAS Visual Debugger (CVD).
Run > Run Configurations...

2. Expand Java Applications and select UIMA_CVD--clinical_documents_pipeline
Click Run.

3. Load an analysis engine.
Menu > Run > Load AE

4. Navigate to and select the file <cTAKES_HOME>/clinical documents pipeline/desc/analysis_engine/AggregatePlaintextProcessor.xml

5. Copy the example text below and paste it into the Text section of CVD, replacing the text that is already there.
This example file can also be found in test data: <cTAKES_HOME>/clinical documents pipeline/test/data/plaintext/testpatient_plaintext_1.txt

Dr. Nutritious

Medical Nutrition Therapy for Hyperlipidemia

Referral from: Julie Tester, RD, LD, CNSD
Phone contact: (555) 555-1212
Height: 144 cm Current Weight: 45 kg Date of current weight: 02-29-2001
Admit Weight: 53 kg BMI: 18 kg/m2
Diet: General
Daily Calorie needs (kcals): 1500 calories, assessed as HB + 20% for activity.
Daily Protein needs: 40 grams, assessed as 1.0 g/kg.
Pt has been on a 3-day calorie count and has had an average intake of 1100 calories.
She was instructed to drink 2-3 cans of liquid supplement to help promote weight gain.
She agrees with the plan and has my number for further assessment. May want a Resting
Metabolic Rate as well. She takes an aspirin a day for knee pain.

6. From the menu bar, click Run > Run AggregatePlaintextProcessor.

You'll get a list of all the annotations in the Analysis Results frame.

7. Named entities are now recognized in this clinical document. To find one, in the Analysis Results frame, click on the key (the expand handle) in front of each of the following nodes in turn:
AnnotationIndex
  uima.tcas.Annotation
    edu.mayo.bmi.uima.core.type.IdentifiedAnnotation
      edu.mayo.bmi.uima.core.type.NamedEntity

Then select edu.mayo.bmi.uima.core.type.NamedEntity itself. This will show an Annotation Index in the lower frame. Select any NamedEntity in that frame and you will see the text discovered in the Text frame on the right. Double-click the NamedEntity in the lower left frame to see the NamedEntity's attributes.

Command Line

Even if you are not using Eclipse's GUI, it is recommended that you have a viable Eclipse installation on hand. This section mostly assumes that case.

Prepare the compiling environment

We ship an Apache Ant build file and a build.properties file in the stable release. Before using these, modify the build.properties file by supplying your machine's configuration. Although the build.properties file is not strictly required, we recommend using it: it can ease debugging efforts, because environment variables may be changed without your awareness, and it helps insert the cTAKES version number into the generated Javadoc files. Here are the steps.

1. Copy the build files into cTAKES_HOME.

copy <temp location>/build* <cTAKES_HOME>, for example:

Windows: copy c:\stuff\cTAKES-1.3.1-pear\build* c:\cTAKES1.3.1\apache-uima

Linux: cp /tmp/cTAKES-1.3.1-pear/build* /usr/bin/cTAKES1.3.1

2. Edit the file: <cTAKES_HOME>\build.properties

You must place your path values in for the empty variables. The version variable is filled in for you. Remove the comments in front of the variable names. You should end up with a file similar to the one on the right. This example does have both Windows and Linux lines. The Linux example lines are currently commented out with the # sign.

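Windows users: Java properties files treat a backslash as an escape character, so please use escaped (doubled) backslashes in paths, for example c:\\cTAKES1.3.1\\apache-uima, or forward slashes, for example c:/cTAKES1.3.1/apache-uima.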
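
If you prepare Windows paths for build.properties from a Unix-style shell (for example under Cygwin, which is an assumption here), the two conversions can be sketched as follows. Both function names are hypothetical helpers, not part of cTAKES or Ant.

```shell
# Convert every backslash in a path to a forward slash.
to_forward_slashes() {
    printf '%s\n' "$1" | tr '\\' '/'
}

# Double every backslash so the path survives properties-file parsing.
escape_backslashes() {
    printf '%s\n' "$1" | sed 's/\\/\\\\/g'
}

# Example use with this guide's Windows install path:
#   to_forward_slashes 'c:\cTAKES1.3.1\apache-uima'
#   escape_backslashes 'c:\cTAKES1.3.1\apache-uima'
```

Either form should be read back by Ant as the same directory. Forward slashes are usually easier to read.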

3. Check or change UIMA jcasgen_merge script.

As of this writing, one utility script from Apache UIMA 2.2.2 calls Eclipse's startup.jar. Since Eclipse 3.3, however, this file has been moved to ECLIPSE_HOME/plugins/org.eclipse.equinox.launcher_VERSION.jar. If you're using Eclipse 3.3 or higher, make the modifications below.

Windows: Modify cTAKES_HOME/bin/jcasgen_merge.bat

"%UIMA_JAVA_CALL%" "%logger%" -cp "%ECLIPSE_HOME%\startup.jar" "-Duima.datapath=%UIMA_DATAPATH%" org.eclipse.core.launcher.Main %ARGS%
Change this line so that startup.jar is replaced with the file in plugins whose name matches plugins\org.eclipse.equinox.launcher_VERSION.jar, for example:
"%UIMA_JAVA_CALL%" "%logger%" -cp "%ECLIPSE_HOME%\plugins\org.eclipse.equinox.launcher_1.1.0.v20100507.jar" "-Duima.datapath=%UIMA_DATAPATH%" org.eclipse.core.launcher.Main %ARGS%

Linux: Modify cTAKES_HOME/bin/jcasgen_merge.sh

ES="$ECLIPSE_HOME/startup.jar"
should become
ES="$ECLIPSE_HOME/plugins/org.eclipse.equinox.launcher_VERSION.jar"

Another option, if you're not running Windows, is to create a symbolic link named startup.jar that points to the launcher JAR. If you don't use Eclipse, or you use a version earlier than 3.3, please ignore this step.
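
Because the VERSION part of the launcher file name differs between Eclipse releases, a small lookup can save some guessing. The sketch below assumes a POSIX shell and an ECLIPSE_HOME variable that you have set yourself; find_launcher_jar is a hypothetical helper name.

```shell
# find_launcher_jar is a hypothetical helper: print the path of the first
# org.eclipse.equinox.launcher_*.jar found under <ECLIPSE_HOME>/plugins.
# Exit status is 1 when no such file exists.
find_launcher_jar() (
    eclipse_home=$1
    for jar in "$eclipse_home"/plugins/org.eclipse.equinox.launcher_*.jar; do
        if [ -f "$jar" ]; then
            printf '%s\n' "$jar"
            exit 0
        fi
    done
    exit 1
)

# Symbolic-link option (not on Windows), assuming ECLIPSE_HOME is set:
#   ln -s "$(find_launcher_jar "$ECLIPSE_HOME")" "$ECLIPSE_HOME/startup.jar"
```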

Compile

To compile cTAKES, change to the cTAKES_HOME directory and simply run: ant

If everything goes smoothly, you'll get a fully-functioning set of cTAKES components in a few minutes.

If you are running without Eclipse installed, you will have to manually run cTAKES_HOME/bin/jcasgen.{bat|sh} on the Type System files. This is documented in the cTAKES 1.3.1 Developer Install Instructions for SVN.

Process a sample clinical note at the command line

This is very similar to testing a clinical note above when called from Eclipse. This time, without Eclipse, you must launch the component from a command line or, in this case, by calling an ant target that has been provided.

Open a command prompt and change to <cTAKES_HOME>. Run the command: ant testrun

This will set the Java runtime classpath and bring up the UIMA CAS Visual Debugger (CVD). Once that is up you can proceed just as in the Eclipse section called Process a sample clinical note.

Process a collection of documents

Processing text by cutting and pasting into a GUI like the CAS Visual Debugger is not going to be sufficient for real work. The underlying framework, UIMA, provides the Collection Processing Engine (CPE) to process multiple documents at once. Here we take you through a sample of processing a list. For your production work you will need to have access to clinical documents of your own.

1. Launch the Collection Processing Engine (CPE).

Run > Run Configurations...

2. Expand Java Applications and select UIMA_CPE_GUI--clinical_documents_pipeline

Click Run.

(for command line compile) Use the ant target testcpe to launch the CPE, for example: ant testcpe

3. This will bring up the Collection Processing Engine Configurator. In the Menu bar click File > Open CPE Descriptor

4. Navigate to the example file

<cTAKES_HOME>/clinical documents pipeline/desc/collection_processing_engine/test1.xml and click the Open button.

5. Click the Play button (green/blue play arrow near the bottom).

6. You should see that one document was processed. You did process a collection of documents. In this case the collection only contained one just to show how to do it. Close the results window.

7. Close the CPE application. You may be prompted to save changes. Since this was just a test you may click the No button.

8. Open a new command prompt and change to the <cTAKES_HOME>/utils/bin directory

 

9. To test the results (which you cannot see using the CPE), there is a comparison tool that helps show that the results match expectations. It has the following syntax: java edu.mayo.bmi.utils.xcas_comparison.Compare <First File> <Second File> <diff-html>
Where:
<First File> is the first file to compare
<Second File> is the second file to compare
<diff-html> is where the results are written

Copy and paste the command below, which has our example files already substituted, into a command prompt to run it.

Windows
java edu.mayo.bmi.utils.xcas_comparison.Compare ^
 "C:\cTAKES1.3.1\apache-uima\clinical documents pipeline\test\data\testpatient_cn_1.xml" ^
 "C:\cTAKES1.3.1\apache-uima\clinical documents pipeline\test\data\output\testpatient_cn_1.xml.xml" ^
 c:\stuff\diff-html.html

Linux
java edu.mayo.bmi.utils.xcas_comparison.Compare \
"/usr/bin/cTAKES1.3.1/apache-uima/clinical documents pipeline/test/data/testpatient_cn_1.xml" \
"/usr/bin/cTAKES1.3.1/apache-uima/clinical documents pipeline/test/data/output/testpatient_cn_1.xml.xml" \
/tmp/diff-html.html

10. The resulting file will open for you. Look at the comparison to see the annotations resulting from this pipeline.
Windows: c:\stuff\diff-html.html
Linux: /tmp/diff-html.html

Optional components

Optional components may have already been downloaded in the install section. If you chose to skip the optional components during the cTAKES install and you want to install them now, please go back to the install section for instructions on doing so and then return here.

You can test any of the components now just as we did in the sections Process a sample clinical note and Process a collection of documents, using the CVD or CPE. Follow the same steps there but use a test file from any other component. You can launch these from Eclipse or the command line.

Most components will have an analysis engine to load, like:
<cTAKES_HOME>/<component name>/desc/analysis_engine/<CVD files>

and a CPE directory like:
<cTAKES_HOME>/<component name>/desc/collection_processing_engine/<CPE files>

For example:
Test the dependency parser:
<cTAKES_HOME>/dependency parser/desc/analysis_engine/ClearParserPlaintextAggregate.xml
<cTAKES_HOME>/dependency parser/desc/collection_processing_engine/ClearParserTestCPE.xml

Test Drug NER:
<cTAKES_HOME>/Drug NER/desc/analysis_engine/DrugAggregatePlaintextProcessor.xml
<cTAKES_HOME>/Drug NER/desc/collection_processing_engine/DrugNER_PlainText_CPE.xml
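
To see which descriptors a given component actually ships, you could list the two directories named above. This is a sketch for a POSIX shell; list_descriptors is a hypothetical helper name, and it assumes the component follows the desc/analysis_engine and desc/collection_processing_engine layout.

```shell
# list_descriptors is a hypothetical helper: print the XML descriptors in a
# component's analysis_engine and collection_processing_engine directories.
list_descriptors() (
    ctakes_home=$1
    component=$2    # directory name of the component, for example "Drug NER"
    for kind in analysis_engine collection_processing_engine; do
        dir="$ctakes_home/$component/desc/$kind"
        [ -d "$dir" ] || continue       # component has no such directory
        for f in "$dir"/*.xml; do
            if [ -f "$f" ]; then
                printf '%s\n' "$f"
            fi
        done
    done
    exit 0
)

# Example use, assuming UIMA_HOME points at <cTAKES_HOME>:
#   list_descriptors "$UIMA_HOME" "dependency parser"
```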

Next steps

The cTAKES 1.3 User Guide will help you to understand in great detail each of the cTAKES components that have been installed. In some cases you can learn how to improve the components. However, before you go on to process text in production you will need to consider dictionaries and models.

Dictionaries

Bundled UMLS Dictionaries

cTAKES 1.3 includes the complete UMLS (SNOMED-CT and RxNorm) dictionaries.

  • An rxnorm_index database (a Lucene index) containing drug names from RxNorm
  • A UMLS database (using 2 hsqldb tables) containing anatomical sites, procedures, signs/symptoms, and disorders/diseases from SNOMED-CT (umls_ms_2011ab)

To use them, you must have a UMLS username and password, and an Internet connection.

Note

If you do not have a UMLS username and password, you may request one at UMLS Terminology Services.

In order to use the complete UMLS dictionaries shipped with cTAKES you will need to do two things:
1. Update the DictionaryLookupAnnotatorUMLS.xml Analysis Engine file with your UMLS username and password. Change the UMLSUser and UMLSPW <nameValuePair> strings in the descriptor files below to your UMLS username and password.

  • Dictionary Lookup: <cTAKES_HOME>/dictionary lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml
  • (optional) Drug NER: <cTAKES_HOME>/Drug NER/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml

The following shows where in the files you would make the changes. (Do not change the <configurationParameters> by the same name.)

<nameValuePair>
  <name>UMLSUser</name>
  <value>
    <string>YOUR_UMLS_USERNAME_HERE</string>
  </value>
</nameValuePair>
<nameValuePair>
  <name>UMLSPW</name>
  <value>
    <string>YOUR_UMLS_PASSWORD_HERE</string>
  </value>
</nameValuePair>
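
If you prefer to script this edit, the sketch below fills in the two values with sed. It rests on assumptions: that your copy of the descriptor still contains the literal placeholder strings shown above, and that your username and password contain no characters that need XML escaping (such as & or <). set_umls_credentials is a hypothetical helper name.

```shell
# set_umls_credentials is a hypothetical helper. It replaces the two
# placeholder strings in a descriptor and prints the result to stdout.
set_umls_credentials() (
    file=$1
    # Escape characters that are special in a sed replacement: \ & and |
    user=$(printf '%s' "$2" | sed 's/[\\&|]/\\&/g')
    pass=$(printf '%s' "$3" | sed 's/[\\&|]/\\&/g')
    sed -e "s|YOUR_UMLS_USERNAME_HERE|$user|" \
        -e "s|YOUR_UMLS_PASSWORD_HERE|$pass|" "$file"
)

# Example use (write to a new file, review it, then replace the original):
#   set_umls_credentials DictionaryLookupAnnotatorUMLS.xml myuser mypass > new.xml
```

Writing to a separate file first leaves the shipped descriptor untouched until you have checked the result.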

2. Include the DictionaryLookupAnnotatorUMLS.xml Analysis Engine within your aggregate Analysis Engine or switch to the ones provided by cTAKES. cTAKES has provided duplicates of shipped Analysis Engine descriptors, put UMLS in the name, and placed DictionaryLookupAnnotatorUMLS.xml within them for these components:

  • Dictionary Lookup
  • Clinical Documents pipeline
  • Drug NER
  • Side Effect

So you simply need to switch to using those descriptors. For example, if you were using AggregateCdaProcessor.xml in Clinical Documents pipeline you would switch to using AggregateCdaUMLSProcessor.xml instead and you will now hook into the complete dictionaries.

You can, of course, modify your own aggregate Analysis Engine files and place the DictionaryLookupAnnotatorUMLS.xml Analysis Engine within them.
Since this is an in-memory database implementation, please be patient during the initial load as it could take approximately 20-30 seconds for the database to initialize.

If you would like to go back to using the small sample dictionaries that do not require a UMLS username, use the DictionaryLookupAnnotator.xml (UMLS is not in the file name) Analysis Engine descriptor in your aggregate. Removing your password from the DictionaryLookupAnnotatorUMLS.xml files will not work.

LVG

We have successfully tested the 2008 release of the full LVG data. In order to use this release of the full LVG data you should:

  1. Download either the full version or the lite version from NIH Lexical Tools
  2. Extract the TGZ file that you downloaded with a tool like 7-zip (available online) to a temporary directory. On some operating systems, like Windows, this may need to be done in 2 steps: 1) uncompress the file and 2) unpack the resulting archive.
  3. Replace the directory <cTAKES_HOME>/LVG/resources/lvg/data/HSqlDb with data/HSqlDb from your extracted download. Replacing the entire directory is appropriate.
  4. In the future, you can upgrade to later versions of LVG by editing the <cTAKES_HOME>/LVG/resources/lvg/data/config/lvg.properties file, replacing "lvg2008" with the name of the new release.
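
On Linux, step 3 could be scripted roughly as follows. This is a sketch, not a tested upgrade procedure: replace_lvg_data is a hypothetical helper name, and keeping the old directory as HSqlDb.bak is an extra precaution that these instructions do not require.

```shell
# replace_lvg_data is a hypothetical helper: swap in data/HSqlDb from an
# extracted LVG download, keeping the old directory as HSqlDb.bak.
replace_lvg_data() (
    ctakes_home=$1
    extracted=$2        # directory that contains data/HSqlDb
    target="$ctakes_home/LVG/resources/lvg/data/HSqlDb"
    if [ ! -d "$extracted/data/HSqlDb" ]; then
        echo "no data/HSqlDb under $extracted" >&2
        exit 1
    fi
    if [ -e "$target.bak" ]; then
        echo "backup $target.bak already exists" >&2
        exit 1
    fi
    if [ -d "$target" ]; then
        mv "$target" "$target.bak" || exit 1
    fi
    cp -R "$extracted/data/HSqlDb" "$target"
)

# Example use (the extraction directory is an assumption):
#   replace_lvg_data "$UIMA_HOME" /tmp/lvg2008
```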

Building Your Own Dictionaries

To install customized dictionaries for RxNorm, SNOMED-CT, or other vocabularies that are available through the UMLS, see the posts on this topic on the cTAKES forums.

Models

Some models included in cTAKES may not represent your data distribution well. If you want to build or train your own models, please read the cTAKES 1.3 User Guide, particularly:

  • Training a sentence detector model
  • Building a part-of-speech (POS) tagger model (Building a model, Obtaining training data)
  • Building a part-of-speech (POS) tag dictionary (Building a tag dictionary)
  • Building a chunker model (Building a model, Prepare GENIA training data)
  • Training a dependency parser (Dependency Parser (optional))