
IMPORTANT: MAF files can be submitted to the DCC only by following the procedure described here.

Document Information

Specification for Mutation Annotation Format
Version 2.4
March 6, 2013

Current version changes

This current revision is version 2.4 of the Mutation Annotation Format (MAF) specification.

The following items in the specification were added or modified in version 2.4 from version 2.3:

  • Header for MAF file is "#version 2.4"
  • Allowed "Validation_Status" values are now "Untested, Inconclusive, Valid, and Invalid."
  • "Mutation_Status" values are now aligned with the VCF VLS field and are "None, Germline, Somatic, LOH, Post-transcriptional modification, Unknown" (addition of the enumerated value of "Post-transcriptional modification")
    • The VCF VLS value of Wild Type will be replaced by None
  • The value in the "Validation_Status" field determines what values are allowed in the "Mutation_Status" field
  • The "Validation_Status" field will no longer accept NULL values
  • The values allowed in the "Sequence_Source" column have been changed to be a subset of the SRA 1.5 library_strategy values.
  • All Version 2.4 MAF files will be required to contain UUID columns (Tumor_Sample_UUID and Matched_Norm_Sample_UUID)
  • For the "Variant_Classification" field, the values of De_novo_Start_InFrame and De_novo_Start_OutOfFrame are no longer allowed.
  • The values for the "dbSNP_Val_Status" field will be enforced. "none" will no longer be allowed.
  • "Somatic" is the only acceptable value for 'Mutation_Status' for a somatic MAF (named .somatic.maf). Protected MAF (named .protected.maf) has no such restriction and can contain Somatic, Germline, Unknown, LOH, Post-transcriptional modification, None for Mutation_Status.
  • For a somatic MAF, the following rule must be satisfied:
    SOMATIC = A AND (B OR C OR D)
    A: Mutation_Status == "Somatic"
    B: Validation_Status == "Valid"
    C: Verification_Status == "Verified"
    D: Variant_Classification is not {Intron, 5'UTR, 3'UTR, 5'Flank, 3'Flank, IGR}, which implies that Variant_Classification can only be {Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, RNA, Targeted_Region}.
  • If Validation_Status == "Valid" then Tumor_Validation_Allele1, Tumor_Validation_Allele2, Match_Norm_Validation_Allele1, Match_Norm_Validation_Allele2 cannot be null
  • MAF 2.3 Validation rule 8a was superseded by:
    • If Validation_Status == "Invalid" then Tumor_Validation_Allele1, Tumor_Validation_Allele2, Match_Norm_Validation_Allele1, Match_Norm_Validation_Allele2 cannot be null AND Tumor_Validation_Allele1 == Match_Norm_Validation_Allele1 AND Tumor_Validation_Allele2 == Match_Norm_Validation_Allele2 (Added as a replacement for 8a as a result of breakdown)
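
The somatic MAF rule above, SOMATIC = A AND (B OR C OR D), can be sketched as a small predicate. This is only an illustration, not the DCC validator: the function name `passes_somatic_rule` and the use of a plain dictionary keyed by MAF column header are assumptions made for the example.

```python
# Variant_Classification values excluded by condition D of the rule.
NON_CODING = {"Intron", "5'UTR", "3'UTR", "5'Flank", "3'Flank", "IGR"}


def passes_somatic_rule(row: dict) -> bool:
    """Evaluate SOMATIC = A AND (B OR C OR D) for one MAF row.

    `row` is a hypothetical mapping from MAF column header to cell value.
    """
    a = row["Mutation_Status"] == "Somatic"
    b = row["Validation_Status"] == "Valid"
    c = row["Verification_Status"] == "Verified"
    d = row["Variant_Classification"] not in NON_CODING
    return a and (b or c or d)
```

Under this reading, an unvalidated, unverified somatic call in an intron is rejected, while the same call is accepted once Validation_Status is "Valid".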
       


       

About MAF specifications

Mutation annotation files should be transferred to the DCC. Those files should be formatted using the mutation annotation format (MAF) that is described below. The file naming convention is also described below.

The following categories of somatic mutations are reported in MAF files:

  • Missense and nonsense
  • Splice site, defined as a SNP within 2 bp of the splice junction
  • Silent mutations
  • Indels that overlap the coding region or splice site of a gene or the targeted region of a genetic element of interest.
  • Frameshift mutations
  • Mutations in regulatory regions

Definition of open access MAF data

A large proportion of MAFs are submitted as discovery data and sites labeled as somatic in these files overlap with known germline variants. In order to minimize germline contamination in putative (unvalidated) somatic calls, certain filtering criteria have been imposed. Based on current policy, open access MAF data should:

  • include all validated somatic mutation calls
  • include all unvalidated somatic mutation calls that overlap with a coding region or splice site
  • exclude all other types of mutation calls (i.e., non-somatic calls (validated or not), unvalidated somatic calls that are not in coding region or splice sites, and dbSNP sites that are not annotated as somatic in dbSNP, COSMIC or OMIM)

 

Somatic MAF vs. Protected MAF

Centers will submit to the DCC MAF archives that contain Somatic MAF (named .somatic.maf) for open access data and an all-inclusive Protected MAF (named .protected.maf) that does not filter any data out and represents the original super-set of mutation calls. The files will be formatted using the Mutation Annotation Format (MAF).

The following table lists some of the critical attributes of somatic and protected MAF files and provides a comparison.

Attribute: File naming
  • Somatic MAF: Somatic MAFs should be named *.somatic.maf and cannot contain 'germ' or 'protected' in the file name.
  • Protected MAF: Protected MAFs should be named *.protected.maf and should not contain 'somatic' in the file name.

Attribute: Mutation category
  • Somatic MAF: Somatic MAFs can only contain entries where Mutation_Status is "Somatic". If any other value is assigned to the field, the archive will fail. Experimentally validated or unvalidated (see next attribute) somatic mutations can be included in the file.
  • Protected MAF: There is no such restriction for protected MAF. The file should contain all mutation calls including those from which .somatic.maf is derived.

Attribute: Filtering criteria
  • Somatic MAF: In order to minimize germline contamination, somatic MAFs can contain unvalidated somatic mutations only from coding regions and splice sites, which implies: If Validation_Status is "Unknown", Variant_Classification cannot be 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR, or Intron. Variant_Classification can only be {Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, RNA, Targeted_Region, De_novo_Start_InFrame, De_novo_Start_OutOfFrame}. There is no such constraint for experimentally validated (Validation_Status is "Valid") somatic mutations. dbSNP sites that are not annotated as somatic in dbSNP, COSMIC or OMIM must be removed from somatic MAFs.
  • Protected MAF: There are no such constraints for mutations in protected MAF.

Attribute: Access level
  • Somatic MAF: These files are deployed as open access data.
  • Protected MAF: These files are deployed as protected data.
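
The file naming constraints in the comparison above can be checked with a short function. This is a minimal sketch: the name `maf_kind` is hypothetical, and comparing file names case-insensitively is an assumption, since the specification does not say whether the check is case sensitive.

```python
def maf_kind(file_name: str) -> str:
    """Classify a MAF file name as 'somatic' or 'protected'.

    Raises ValueError when the name breaks the naming constraints.
    Lower-casing the name first is an assumption of this sketch.
    """
    name = file_name.lower()
    if name.endswith(".somatic.maf"):
        if "germ" in name or "protected" in name:
            raise ValueError("somatic MAF name contains 'germ' or 'protected'")
        return "somatic"
    if name.endswith(".protected.maf"):
        if "somatic" in name:
            raise ValueError("protected MAF name contains 'somatic'")
        return "protected"
    raise ValueError("name ends with neither .somatic.maf nor .protected.maf")
```

For example, a made-up name such as `study.somatic.maf` is classified as somatic, while `study_germ.somatic.maf` is rejected.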

MAF file fields

The format of a MAF file is tab-delimited columns. Those columns are described in Table 1 and are required in every MAF file. The order of the columns will be validated by the DCC. Column headers and values are case sensitive where specified. Columns may allow null values (i.e., blank cells) and/or have enumerated values. The validator looks for a header stating the version of the specification to validate against (e.g. #version 2.4). If that header is missing, validation fails. Any columns that come after the columns described in Table 1 are optional. Optional columns are not validated by the DCC and can be in any order.
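
As an illustration of the header requirements just described, the sketch below checks for the version line and for the 34 required column headers in order. The column names are taken from Table 1; the function name `check_header`, the treatment of every line starting with "#" as a comment line, and the assumption that the first non-comment line is the column header row are choices made for this example rather than documented validator behavior.

```python
REQUIRED_COLUMNS = [
    "Hugo_Symbol", "Entrez_Gene_Id", "Center", "NCBI_Build", "Chromosome",
    "Start_Position", "End_Position", "Strand", "Variant_Classification",
    "Variant_Type", "Reference_Allele", "Tumor_Seq_Allele1",
    "Tumor_Seq_Allele2", "dbSNP_RS", "dbSNP_Val_Status",
    "Tumor_Sample_Barcode", "Matched_Norm_Sample_Barcode",
    "Match_Norm_Seq_Allele1", "Match_Norm_Seq_Allele2",
    "Tumor_Validation_Allele1", "Tumor_Validation_Allele2",
    "Match_Norm_Validation_Allele1", "Match_Norm_Validation_Allele2",
    "Verification_Status", "Validation_Status", "Mutation_Status",
    "Sequencing_Phase", "Sequence_Source", "Validation_Method", "Score",
    "BAM_File", "Sequencer", "Tumor_Sample_UUID", "Matched_Norm_Sample_UUID",
]


def check_header(lines: list, version: str = "2.4") -> list:
    """Return a list of problems found in the header lines of a MAF file."""
    problems = []
    comments = [ln.strip() for ln in lines if ln.startswith("#")]
    if f"#version {version}" not in comments:
        problems.append("missing version header")
    # Assumption: the first non-blank, non-comment line holds the column headers.
    header = next((ln for ln in lines if ln.strip() and not ln.startswith("#")), None)
    if header is None:
        problems.append("no column header row")
        return problems
    columns = header.rstrip("\r\n").split("\t")
    # Optional columns may follow, so only the leading columns are compared.
    if columns[:len(REQUIRED_COLUMNS)] != REQUIRED_COLUMNS:
        problems.append("required columns missing or out of order")
    return problems
```

The comparison is exact, so a header that differs only in letter case is reported as out of order, which matches the case-sensitive header rule.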

 

Table 1 - File column headers

 

Index

MAF Column Header

Description of Values


Example

Case Sensitive

Null

Enumerated

1

Hugo_Symbol

HUGO symbol for the gene (HUGO symbols are always in all caps). If no gene exists within 3kb enter "Unknown".
Source: http://genenames.org

EGFR

Yes

No

Set or Unknown

2

Entrez_Gene_Id

Entrez gene ID (an integer). If no gene exists within 3kb enter "0".
Source: http://ncbi.nlm.nih.gov/sites/entrez?db=gene

1956

No

No

Set

3

Center

Genome sequencing center reporting the variant. If multiple institutions report the same mutation, separate the list using semicolons. Non-GSC centers will also be supported if the center name is an accepted center name.

hgsc.bcm.edu;genome.wustl.edu

Yes

No

Set

4

NCBI_Build

Any TCGA-accepted genome identifier. Can be a string, an integer, or a float.

hg18, hg19, GRCh37, GRCh37-lite, 36, 36.1, 37


No

No

Set and Enumerated.

5

Chromosome

Chromosome number without "chr" prefix that contains the gene.

X, Y, M, 1, 2, etc.

Yes

No

Set

6

Start_Position

Lowest numeric position of the reported variant on the genomic reference sequence. Mutation start coordinate (1-based coordinate system).

999

No

No

Set

7

End_Position

Highest numeric genomic position of the reported variant on the genomic reference sequence. Mutation end coordinate (inclusive, 1-based coordinate system).

1000

No

No

Set

8

Strand

Genomic strand of the reported allele. Variants should always be reported on the positive genomic strand. (Currently, only the positive strand is an accepted value).

+

No

No

+

9

Variant_Classification

Translational effect of variant allele.

Missense_Mutation

Yes

No

Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR (footnote 1), Intron, RNA, Targeted_Region

10

Variant_Type

Type of mutation. TNP (tri-nucleotide polymorphism) is analogous to DNP but for 3 consecutive nucleotides. ONP (oligo-nucleotide polymorphism) is analogous to TNP but for consecutive runs of 4 or more.

INS

Yes

No

SNP, DNP, TNP, ONP, INS, DEL, or Consolidated (footnote 2)

11

Reference_Allele

The plus strand reference allele at this position. Include the sequence deleted for a deletion, or "-" for an insertion.

A

Yes

No

A,C,G,T and/or -

12

Tumor_Seq_Allele1

Primary data genotype. Tumor sequencing (discovery) allele 1. "-" for a deletion represents a variant. "-" for an insertion represents wild-type allele. Novel inserted sequence for insertion should not include flanking reference bases.

C

Yes

No

A,C,G,T and/or -

13

Tumor_Seq_Allele2

Primary data genotype. Tumor sequencing (discovery) allele 2. "-" for a deletion represents a variant. "-" for an insertion represents wild-type allele. Novel inserted sequence for insertion should not include flanking reference bases.

G

Yes

No

A,C,G,T and/or -

14

dbSNP_RS

Latest dbSNP rs ID (dbSNP_ID) or "novel" if there is no dbSNP record. source: http://ncbi.nlm.nih.gov/projects/SNP/

rs12345

Yes

Yes

Set or "novel"

15

dbSNP_Val_Status

dbSNP validation status. Semicolon-separated list of validation statuses.

by2Hit2Allele;byCluster

No

Yes

by1000genomes; by2Hit2Allele; byCluster; byFrequency; byHapMap; byOtherPop; bySubmitter; alternate_allele (footnote 3). Note that "none" will no longer be an acceptable value.

16

Tumor_Sample_Barcode

BCR aliquot barcode for the tumor sample including the two additional fields indicating plate and well position. i.e. TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID. The full TCGA Aliquot ID.

TCGA-02-0021-01A-01D-0002-04

Yes

No

Set

17

Matched_Norm_Sample_Barcode

BCR aliquot barcode for the matched normal sample including the two additional fields indicating plate and well position. i.e. TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID. The full TCGA Aliquot ID; e.g. TCGA-02-0021-10A-01D-0002-04 (compare SampleID '10A' for the normal sample with '01A' for the tumor sample).

TCGA-02-0021-10A-01D-0002-04

Yes

No

Set

18

Match_Norm_Seq_Allele1

Primary data. Matched normal sequencing allele 1. "-" for deletions; novel inserted sequence for INS not including flanking reference bases.

T

Yes

Yes

A,C,G,T and/or -

19

Match_Norm_Seq_Allele2

Primary data. Matched normal sequencing allele 2. "-" for deletions; novel inserted sequence for INS not including flanking reference bases.

ACGT

Yes

Yes

A,C,G,T and/or -

20

Tumor_Validation_Allele1

Secondary data from orthogonal technology. Tumor genotyping (validation) for allele 1. "-" for deletions; novel inserted sequence for INS not including flanking reference bases.

-

Yes

Yes

A,C,G,T and/or -

21

Tumor_Validation_Allele2

Secondary data from orthogonal technology. Tumor genotyping (validation) for allele 2. "-" for deletions; novel inserted sequence for INS not including flanking reference bases.

A

Yes

Yes

A,C,G,T and/or -

22

Match_Norm_Validation_Allele1

Secondary data from orthogonal technology. Matched normal genotyping (validation) for allele 1. "-" for deletions; novel inserted sequence for INS not including flanking reference bases.

C

Yes

Yes

A,C,G,T and/or -

23

Match_Norm_Validation_Allele2

Secondary data from orthogonal technology. Matched normal genotyping (validation) for allele 2. "-" for deletions; novel inserted sequence for INS not including flanking reference bases.

G

Yes

Yes

A,C,G,T and/or -

24

Verification_Status (footnote 4)

Second pass results from independent attempt using same methods as primary data source. Generally reserved for 3730 Sanger Sequencing.


Verified

Yes

Yes

Verified, Unknown

25

Validation_Status (footnote 4)

Second pass results from orthogonal technology.

Valid

Yes

No

  • Untested
  • Inconclusive
  • Valid
  • Invalid

26

Mutation_Status

Updated to reflect validation or verification status and to be in agreement with the VCF VLS field. The values allowed in this field are constrained by the value in the Validation_Status field.

Somatic

Yes

No

Validation_Status value

Allowed Mutation_Status values (footnote 5)

Untested
  • None
  • Germline
  • Somatic
  • LOH
  • Post-transcriptional modification
  • Unknown
Inconclusive
  • None
  • Germline
  • Somatic
  • LOH
  • Post-transcriptional modification
  • Unknown
Valid
  • Germline
  • Somatic
  • LOH
  • Post-transcriptional modification
  • Unknown
Invalid
  • None

 

 

27

Sequencing_Phase

TCGA sequencing phase. Phase should change under any circumstance that the targets under consideration change.

Phase_I

No

Yes

No

28

Sequence_Source

Molecular assay type used to produce the analytes used for sequencing. Allowed values are a subset of the SRA 1.5 library_strategy field values. This subset matches those used at CGHub.

WGS;WXS

Yes

No

  • Common TCGA values
    • WGS
    • WGA
    • WXS
    • RNA-Seq
    • miRNA-Seq
    • Bisulfite-Seq
    • VALIDATION
    • Other
  • Other allowed values (per SRA 1.5)
    • ncRNA-Seq
    • WCS
    • CLONE
    • POOLCLONE
    • AMPLICON
    • CLONEEND
    • FINISHING
    • ChIP-Seq
    • MNase-Seq
    • DNase-Hypersensitivity
    • EST
    • FL-cDNA
    • CTS
    • MRE-Seq
    • MeDIP-Seq
    • MBD-Seq
    • Tn-Seq
    • FAIRE-seq
    • SELEX
    • RIP-Seq
    • ChIA-PET

29

Validation_Method

The assay platforms used for the validation call. Examples: Sanger_PCR_WGA, Sanger_PCR_gDNA, 454_PCR_WGA, 454_PCR_gDNA; separate multiple entries using semicolons.

Sanger_PCR_WGA;Sanger_PCR_gDNA

No

NO. If Validation_Status = Untested then "none"

No

30

Score

Not in use.

NA

No

Yes

No

31

BAM_File

Not in use.

NA

No

Yes

No

32

Sequencer

Instrument used to produce primary data. Separate multiple entries using semicolons.

Illumina GAIIx;SOLID

Yes

No

  • Illumina GAIIx
  • Illumina HiSeq
  • SOLID
  • 454
  • ABI 3730xl
  • Ion Torrent PGM
  • Ion Torrent Proton
  • PacBio RS
  • Illumina MiSeq
  • Illumina HiSeq 2500
  • 454 GS FLX Titanium
  • AB SOLiD 4 System

33

Tumor_Sample_UUID

BCR aliquot UUID for tumor sample

550e8400-e29b-41d4-a716-446655440000

Yes

No

34

Matched_Norm_Sample_UUID

BCR aliquot UUID for matched normal

567e8487-e29b-32d4-a716-446655443246

Yes

No
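
The barcode and UUID columns in Table 1 lend themselves to a quick structural check. The sketch below splits an aliquot barcode into the fields named in the table (TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID) and tests whether a value is a canonical hyphenated UUID string. The function names and the "Project" label for the leading "TCGA" token are hypothetical, and the check is purely structural: it does not confirm that a barcode or UUID exists at the BCR.

```python
import uuid

# "Project" is a label invented here for the leading "TCGA" token.
BARCODE_FIELDS = ("Project", "SiteID", "PatientID", "SampleID",
                  "PortionID", "PlateID", "CenterID")


def split_barcode(barcode: str) -> dict:
    """Split a full aliquot barcode into its seven dash-separated fields."""
    parts = barcode.split("-")
    if len(parts) != len(BARCODE_FIELDS) or parts[0] != "TCGA":
        raise ValueError(f"not a full TCGA aliquot barcode: {barcode!r}")
    return dict(zip(BARCODE_FIELDS, parts))


def is_uuid(value: str) -> bool:
    """True if value is a UUID written in the canonical 8-4-4-4-12 form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False
```

With the example values from Table 1, the tumor and matched normal barcodes differ only in the SampleID field ("01A" versus "10A").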

Footnotes
Ref Notes
1 Intergenic Region
2 'Consolidated' is used to indicate a site that was initially reported as a variant but subsequently removed from further analysis because it was consolidated into a new variant. For example, a SNP variant incorporated into a TNP variant.
3 Used when the discovered variant differs from that of dbSNP
4 These MAF headers describe the technology that was used to confirm a mutation, whether the same technology (“verification”) or a different technology (“validation”) is used to prove that a variant is germline or a somatic mutation.
5 Explanation of some Validation Status-Mutation Status combinations:

  • Validation Status "Valid", Mutation Status "Unknown": a valid variant with unknown somatic status due to lack of data from matched normal tissue.
  • Validation Status "Invalid", Mutation Status "None": validation attempted, tumor and normal are homozygous reference (formerly described as Wildtype).
  • Validation Status "Inconclusive", Mutation Status "Unknown": validation failed, neither the genotype nor its somatic status is certain due to lack of data from matched normal tissue.
  • Validation Status "Inconclusive", Mutation Status "None": validation failed, tumor genotype appears to be homozygous reference.
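
The dependency between Validation_Status and Mutation_Status listed in Table 1 can be expressed as a lookup table. The values below are copied from the Mutation_Status row of Table 1; the names `ALLOWED_MUTATION_STATUS` and `status_pair_ok` are hypothetical and only serve this sketch.

```python
_ALL = {"None", "Germline", "Somatic", "LOH",
        "Post-transcriptional modification", "Unknown"}

# Allowed Mutation_Status values for each Validation_Status value.
ALLOWED_MUTATION_STATUS = {
    "Untested": set(_ALL),
    "Inconclusive": set(_ALL),
    "Valid": _ALL - {"None"},
    "Invalid": {"None"},
}


def status_pair_ok(validation_status: str, mutation_status: str) -> bool:
    """True if the pair is permitted; unknown Validation_Status values fail."""
    return mutation_status in ALLOWED_MUTATION_STATUS.get(validation_status, set())
```

All four combinations explained above pass this check, while a pair such as Validation_Status "Invalid" with Mutation_Status "Somatic" does not.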

Important Criteria

The Index column indicates the order in which the columns are expected. All headers are case sensitive. The Case Sensitive column specifies which values are case sensitive. The Null column indicates which MAF columns are allowed to have null values. The Enumerated column indicates which MAF columns have specified values: an Enumerated value of "No" indicates that there are no specified values for that column; other values list the specific values allowed; a value of "Set" indicates that the MAF column values come from a specified set of known values (e.g. HUGO gene symbols).

 

MAF file checks

The DCC Archive Validator checks the integrity of a MAF file. Validation will fail if any of the below are not true for a MAF file:

  1. Column header text (including case) and order must match specification (Table 1) exactly
  2. Values under column headers listed in the specification (Table 1) as not null must have values
  3. Values in columns that are specified in Table 1 as Case Sensitive must match the expected case exactly.
  4. If column headers are listed in the specification as having enumerated values (i.e. a "Yes" in the "Enumerated" column), then the values under those columns must come from the enumerated values listed under "Enumerated".
  5. If column headers are listed in the specification as having set values (i.e. a "Set" in the "Enumerated" column), then the values under those columns must come from the enumerated values of that domain (e.g. HUGO gene symbols).
  6. All Allele-based columns must contain - (deletion), or a string composed of the following capitalized letters: A, T, G, C.
  7. If Validation_Status == "Untested" then Tumor_Validation_Allele1, Tumor_Validation_Allele2, Match_Norm_Validation_Allele1, Match_Norm_Validation_Allele2 can be null.
    1. If Validation_Status == "Inconclusive" then Tumor_Validation_Allele1, Tumor_Validation_Allele2, Match_Norm_Validation_Allele1, Match_Norm_Validation_Allele2 can be null.
  8. If Validation_Status == "Valid", then Tumor_Validation_Allele1 and Tumor_Validation_Allele2 must be populated (A, C, G, T, and/or -)
    1. If Validation_Status == "Valid" then Tumor_Validation_Allele1, Tumor_Validation_Allele2, Match_Norm_Validation_Allele1, Match_Norm_Validation_Allele2 cannot be null
    2. If Validation_Status == "Invalid" then Tumor_Validation_Allele1, Tumor_Validation_Allele2, Match_Norm_Validation_Allele1, Match_Norm_Validation_Allele2 cannot be null AND Tumor_Validation_Allele1 == Match_Norm_Validation_Allele1 AND Tumor_Validation_Allele2 == Match_Norm_Validation_Allele2 (Added as a replacement for 8a as a result of breakdown)
  9. Check allele values against Mutation_Status and Validation_Status:
    1. If Mutation_Status == "Germline" and Validation_Status == "Valid", then Tumor_Validation_Allele1 == Match_Norm_Validation_Allele1 and Tumor_Validation_Allele2 == Match_Norm_Validation_Allele2.

    2. If Mutation_Status == "Somatic" and Validation_Status == "Valid", then Match_Norm_Validation_Allele1 == Match_Norm_Validation_Allele2 == Reference_Allele and (Tumor_Validation_Allele1 or Tumor_Validation_Allele2) != Reference_Allele

    3. If Mutation_Status == "LOH" and Validation_Status=="Valid", then Tumor_Validation_Allele1 == Tumor_Validation_Allele2 and Match_Norm_Validation_Allele1 != Match_Norm_Validation_Allele2 and Tumor_Validation_Allele1 == (Match_Norm_Validation_Allele1 or Match_Norm_Validation_Allele2).

  10. Check that Start_Position <= End_Position
  11. Check Start_Position and End_Position against Variant_Type:
    1. If Variant_Type == "INS", then (End_Position - Start_Position + 1 == length(Reference_Allele) or End_Position - Start_Position == 1) and length(Reference_Allele) <= length(Tumor_Seq_Allele1 and Tumor_Seq_Allele2)
    2. If Variant_Type == "DEL", then End_Position - Start_Position + 1 == length(Reference_Allele) and length(Reference_Allele) >= length(Tumor_Seq_Allele1 and Tumor_Seq_Allele2)
    3. If Variant_Type == "SNP", then length(Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) == 1 and (Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) != "-"
    4. If Variant_Type == "DNP", then length(Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) == 2 and (Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) do not contain "-"
    5. If Variant_Type == "TNP", then length(Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) == 3 and (Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) do not contain "-"
    6. If Variant_Type == "ONP", then length(Reference_Allele) == length(Tumor_Seq_Allele1) == length(Tumor_Seq_Allele2) > 3 and (Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) do not contain "-"
  12. Validation for UUID-based files:
    1. Column #33 must be Tumor_Sample_UUID containing UUID of the BCR aliquot for tumor sample
    2. Column #34 must be Matched_Norm_Sample_UUID containing UUID of the BCR aliquot for matched normal sample
    3. Metadata represented by Tumor_Sample_Barcode and Matched_Norm_Sample_Barcode should correspond to the UUIDs assigned to Tumor_Sample_UUID and Matched_Norm_Sample_UUID respectively
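
Checks 6, 10 and 11 above can be sketched for a single row as follows. This is not the DCC Archive Validator: the function name `check_row` is hypothetical, only the reference and tumor sequencing alleles are examined for check 6, and expressions such as length(Tumor_Seq_Allele1 and Tumor_Seq_Allele2) are interpreted here as "the condition holds for both alleles", which is an assumption about the intended meaning.

```python
import re

# Check 6: an allele is "-" or a string of capital A, C, G, T.
ALLELE_RE = re.compile(r"-|[ACGT]+")
FIXED_LENGTH = {"SNP": 1, "DNP": 2, "TNP": 3}


def check_row(row: dict) -> list:
    """Return a list of problems for one MAF row (checks 6, 10 and 11)."""
    problems = []
    ref = row["Reference_Allele"]
    tumor = [row["Tumor_Seq_Allele1"], row["Tumor_Seq_Allele2"]]
    start, end = int(row["Start_Position"]), int(row["End_Position"])
    vtype = row["Variant_Type"]

    if not all(ALLELE_RE.fullmatch(a) for a in [ref] + tumor):
        problems.append("allele is not '-' or a string of A, C, G, T")
    if start > end:
        problems.append("Start_Position > End_Position")

    span = end - start + 1
    lengths = [len(a) for a in tumor]
    has_dash = any("-" in a for a in [ref] + tumor)
    if vtype == "INS":
        ok = (span == len(ref) or end - start == 1) and all(len(ref) <= n for n in lengths)
    elif vtype == "DEL":
        ok = span == len(ref) and all(len(ref) >= n for n in lengths)
    elif vtype in FIXED_LENGTH:
        ok = not has_dash and all(n == FIXED_LENGTH[vtype] for n in [len(ref)] + lengths)
    elif vtype == "ONP":
        ok = not has_dash and len(ref) > 3 and all(n == len(ref) for n in lengths)
    else:
        ok = True  # e.g. "Consolidated": no position rule is stated for it
    if not ok:
        problems.append(f"positions or alleles inconsistent with Variant_Type {vtype}")
    return problems
```

Returning a list of messages instead of raising on the first failure lets a caller report every problem in a row at once; a real validator would also apply checks 7 to 9 and 12.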

MAF naming convention

In archives uploaded to the DCC, the MAF file name should relate to the containing archive name in the following way:

If the archive has the name

then a somatic MAF file with the archive should be named according to

and a protected MAF with the archive should be named according to

The <optional_tag> may consist of alphanumeric characters, dash, and underscore; no spaces or periods; or it may be left out altogether. The purpose of the optional tag is to impart some brief annotation.
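
The character rule for the optional tag can be sketched with a regular expression. The helper name `is_valid_optional_tag` is hypothetical, and restricting "alphanumeric" to ASCII letters and digits is an assumption of this example.

```python
import re

# ASCII letters, digits, dash and underscore; spaces and periods never match.
TAG_RE = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_optional_tag(tag: str) -> bool:
    """True if tag is empty (left out) or uses only the allowed characters."""
    return tag == "" or TAG_RE.fullmatch(tag) is not None
```

A made-up tag such as `batch_2-rerun` passes, while `batch 2` and `v1.0` are rejected because of the space and the period.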

Example

For the archive

the following are examples of valid maf names
