Mutations are discovered by aligning DNA sequences derived from tumor samples to sequences derived from normal samples and to a reference sequence. A MAF file identifies, for each sample, the discovered putative or validated mutations, categorizes them by type (SNP, deletion, or insertion) and by origin (somatic, originating in the tissue, or germline, originating from the germline), and records the annotation for each mutation.
This format is not to be confused with the UCSC Multiple Alignment Format (MAF).
MAF File Content and Use
As with trace ID-to-sample relationship files, mutation annotation format (MAF) files contain aliquot UUIDs and associated metadata. Those UUIDs enable researchers to associate sample IDs with assay results.
To create a MAF file, GSCs compare a participant's normal chromosomal sequence with the tumor chromosomal sequence and a template reference sequence. Any abnormal differences among the three sequences are captured in the mutation file.
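The three-way comparison can be illustrated with a toy sketch. It assumes the reference, normal, and tumor sequences are already aligned, are the same length, and differ only by single-base substitutions. The function name `classify_differences` and the tuple layout are hypothetical, and real pipelines work from read-level evidence and also handle indels, which this sketch ignores.

```python
def classify_differences(reference, normal, tumor):
    """Compare three pre-aligned sequences position by position.

    Returns (position, from_base, to_base, origin) tuples, with 1-based positions.
    A base where normal differs from the reference is labeled "germline";
    a base where tumor differs from normal is labeled "somatic".
    """
    if not (len(reference) == len(normal) == len(tumor)):
        raise ValueError("sequences must be pre-aligned to the same length")

    calls = []
    for index, (ref_base, normal_base, tumor_base) in enumerate(
        zip(reference, normal, tumor)
    ):
        position = index + 1
        if normal_base != ref_base:
            # Present in the participant's normal tissue: inherited difference.
            calls.append((position, ref_base, normal_base, "germline"))
        if tumor_base != normal_base:
            # Present only in the tumor: arose in the tissue.
            calls.append((position, normal_base, tumor_base, "somatic"))
    return calls


calls = classify_differences("ACGTAC", "ACGTTC", "ATGTTC")
for call in calls:
    print(call)
```

In this example, position 2 differs only in the tumor and is labeled somatic, while position 5 already differs in the normal sample and is labeled germline.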
GSCs transfer mutation annotation data to the DCC in two types of files: those that contain only somatic mutations (frequently having the extension somatic.maf) and those that contain both somatic and germline mutations (frequently having the extension protected.maf). A "protected.maf" file is a superset of all mutations detected for a given disease by a given GSC and is available in the controlled access part of the Data Portal. Frequently the GSC also submits an accompanying "somatic.maf" file for the same disease; it contains the somatic mutation subset of the partner "protected.maf" file and is available in the open access part of the Data Portal.
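The subset relationship between the two file types can be sketched as a simple filter. This is a minimal illustration, not the procedure GSCs actually use: it assumes a tab-delimited file with a header row, optional comment lines starting with `#`, and a column named `Mutation_Status` whose value is `Somatic` for somatic mutations. The column names and sample rows are assumptions for the example, so check them against the MAF specification before relying on them.

```python
import csv

def somatic_subset(protected_maf_text, status_column="Mutation_Status"):
    """Return (fieldnames, rows) for records whose status column says somatic."""
    # Skip comment lines; everything else is header or data.
    data_lines = (
        line for line in protected_maf_text.splitlines()
        if line and not line.startswith("#")
    )
    reader = csv.DictReader(data_lines, delimiter="\t")
    rows = [
        row for row in reader
        if row[status_column].strip().lower() == "somatic"
    ]
    return reader.fieldnames, rows


# Made-up records standing in for the contents of a protected.maf file.
EXAMPLE_PROTECTED = (
    "Hugo_Symbol\tVariant_Type\tMutation_Status\n"
    "TP53\tSNP\tSomatic\n"
    "BRCA2\tDEL\tGermline\n"
    "KRAS\tSNP\tSomatic\n"
)

fieldnames, somatic_rows = somatic_subset(EXAMPLE_PROTECTED)
print([row["Hugo_Symbol"] for row in somatic_rows])
```

Every row kept by the filter also appears in the input, which mirrors the statement that a somatic.maf file is a subset of its partner protected.maf file.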
A MAF file identifies, for each sample, the discovered putative or validated mutations and categorizes those mutations (SNP, deletion, or insertion) as somatic (originating in the tissue) or germline (originating from the germline). The reported mutations fall into the following subcategories; the last two list items state which SNPs are included or excluded:
- Missense and nonsense
- Splice site, defined as a SNP within 2 bp of the splice junction
- Silent mutations
- Indels that overlap the coding region or splice site of a gene or the targeted region of a genetic element of interest
- Frameshift mutations
- Mutations in regulatory regions
- Any germline SNP with validation status "unknown" is included.
- SNPs already validated in dbSNP are not included since they are unlikely to be involved in cancer.
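Three of the rules in the list above are concrete enough to sketch as predicates. The sketch below is illustrative only. It assumes a splice junction can be represented by a single coordinate, and it uses hypothetical record keys (`variant_type`, `mutation_status`, `validation_status`, `dbsnp_validated`) that are not MAF column names. The text does not say how the rules interact when more than one applies, so each predicate encodes exactly one rule.

```python
def is_splice_site_snp(snp_position, junction_position, window_bp=2):
    """True when a SNP lies within window_bp bases of the splice junction."""
    return abs(snp_position - junction_position) <= window_bp


def included_by_germline_rule(record):
    """True for a germline SNP whose validation status is 'unknown'."""
    return (
        record["variant_type"] == "SNP"
        and record["mutation_status"].lower() == "germline"
        and record["validation_status"].lower() == "unknown"
    )


def excluded_by_dbsnp_rule(record):
    """True for a SNP that is already validated in dbSNP."""
    return record["variant_type"] == "SNP" and bool(record["dbsnp_validated"])


# Hypothetical record, for illustration only.
example = {
    "variant_type": "SNP",
    "mutation_status": "Germline",
    "validation_status": "Unknown",
    "dbsnp_validated": False,
}

print(is_splice_site_snp(102, 100))        # 2 bp from the junction
print(is_splice_site_snp(103, 100))        # 3 bp from the junction
print(included_by_germline_rule(example))
print(excluded_by_dbsnp_rule(example))
```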
The Mutation Annotation Format (MAF) Specification provides a current and in-depth description of MAF File Validation and Format.