Introduction
The Genome Analysis Toolkit (GATK), developed by the Broad Institute, provides a set of methods for accurate variant calling from next-generation sequencing data. This guide describes the GATK Best Practices for germline short variant discovery (single nucleotide polymorphisms and insertions/deletions), covering data preprocessing, variant calling, and quality assurance for human whole-genome and exome sequencing data.
Why GATK Best Practices Matter
GATK workflows minimize false positives/negatives through systematic data preprocessing, statistical modeling, and filtering. Key principles:
- Preprocessing: Alignment artifacts cause 80% of variant calling errors
- Joint genotyping: Improves accuracy across samples
- Machine learning filtering: VQSR outperforms hard filters
- Functional equivalence: Pipelines produce interoperable results across labs
Core GATK Variant Calling Pipeline
This pipeline follows the Broad Institute's official Germline Short Variant Discovery workflow.
1. Data Preprocessing (Essential)
Convert uBAM → aligned BAM → recalibrated BAM:
↓ MarkDuplicates (Picard)
Coordinate-sorted BAM
↓ Base Quality Score Recalibration (BQSR)
Recalibrated BAM (CRAM)
BQSR Example Commands:
gatk BaseRecalibrator \
-I input.bam \
-R reference.fasta \
--known-sites known_indels.vcf.gz \
-O recal_data.table
gatk ApplyBQSR \
-I input.bam \
-bqsr recal_data.table \
-O recalibrated.bam
2. Variant Calling (HaplotypeCaller)
Active region discovery + haplotype assembly → local de novo calling:
gatk HaplotypeCaller \
-R reference.fasta \
-I recalibrated.bam \
-O output.g.vcf.gz \
--emit-ref-confidence GVCF
3. Joint Genotyping
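Combine gVCFs → cohort VCF. GenomicsDBImport takes each gVCF through its own -V argument (or a --sample-name-map file) and requires at least one interval via -L; the sample file names and intervals.list below are placeholders:
gatk GenomicsDBImport \
-R ref.fasta \
-V input/sample1.g.vcf.gz \
-V input/sample2.g.vcf.gz \
-L intervals.list \
--genomicsdb-workspace-path my_database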
gatk GenotypeGVCFs \
-R ref.fasta \
-V gendb://my_database \
-O cohort.vcf.gz
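For larger cohorts, listing every gVCF by hand becomes impractical, and GenomicsDBImport can instead read a tab-separated sample map through `--sample-name-map`. The following is a minimal Python sketch of how such a map could be generated. The helper name `build_sample_map` and the example file names are hypothetical, and the sketch assumes that each sample name equals the file name without its `.g.vcf.gz` suffix; in a real run the name should match the sample recorded in the gVCF header.

```python
from pathlib import Path

GVCF_SUFFIX = ".g.vcf.gz"


def build_sample_map(gvcf_paths):
    """Return sample-map text: one 'sample<TAB>path' line per gVCF.

    Assumption (hypothetical convention): the sample name is the file
    name with the '.g.vcf.gz' suffix removed.
    """
    lines = []
    for path in sorted(Path(p) for p in gvcf_paths):
        if not path.name.endswith(GVCF_SUFFIX):
            raise ValueError(f"not a gVCF file: {path}")
        sample = path.name[: -len(GVCF_SUFFIX)]
        lines.append(f"{sample}\t{path.as_posix()}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example file names are placeholders.
    text = build_sample_map(["input/NA12891.g.vcf.gz", "input/NA12878.g.vcf.gz"])
    print(text, end="")
```

The resulting text can be written to a file and passed to GenomicsDBImport in place of the repeated -V arguments.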
4. Variant Quality Score Recalibration (VQSR)
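ML-based filtering using known variant sites. Each --resource label must be followed by the path to its VCF, VariantRecalibrator must write a tranches file alongside the recalibration file, and ApplyVQSR needs an output path. The hapmap.vcf.gz file name and the 99.0 sensitivity level below are illustrative:
gatk VariantRecalibrator -R ref.fasta \
-V cohort.vcf.gz \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
-an QD -an MQ -an FS -an SOR -an MQRankSum -an ReadPosRankSum \
-mode SNP -O snp.recal --tranches-file snp.tranches
gatk ApplyVQSR -R ref.fasta -V cohort.vcf.gz \
--recal-file snp.recal --tranches-file snp.tranches \
--truth-sensitivity-filter-level 99.0 \
-mode SNP -O cohort.snp_filtered.vcf.gz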
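The idea behind filtering at a truth-sensitivity level can be illustrated with a short Python sketch. This is a simplified illustration, not GATK's implementation: it assumes each variant already carries a model score (VQSLOD in GATK's terminology), picks the lowest score cutoff that still retains a chosen fraction of the truth-set variants, and then labels every variant against that cutoff. The function names and the example scores are made up for illustration.

```python
import math


def score_cutoff(truth_scores, sensitivity):
    """Lowest cutoff that keeps at least `sensitivity` of the truth variants."""
    if not truth_scores:
        raise ValueError("truth_scores must not be empty")
    if not 0 < sensitivity <= 1:
        raise ValueError("sensitivity must be in (0, 1]")
    ranked = sorted(truth_scores, reverse=True)
    keep = math.ceil(sensitivity * len(ranked))  # number of truth variants to retain
    return ranked[keep - 1]


def label_variants(scores, cutoff):
    """Mark each variant PASS if its score reaches the cutoff, else FILTERED."""
    return ["PASS" if s >= cutoff else "FILTERED" for s in scores]


if __name__ == "__main__":
    truth = [5.0, 3.0, 1.0, -2.0]        # illustrative scores at truth sites
    cutoff = score_cutoff(truth, 0.75)   # keep 3 of the 4 truth variants
    print(cutoff, label_variants([4.0, 0.5, 1.0], cutoff))
```

A higher sensitivity level lowers the cutoff, so more true variants pass at the cost of admitting more false positives.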
Complete Workflow Summary
| Step | Tool | Purpose | Input | Output |
|---|---|---|---|---|
| MarkDuplicates | Picard | Mark PCR/optical duplicates | Aligned BAM | Dedup BAM |
| BQSR | GATK BaseRecalibrator | Correct systematic errors | Dedup BAM | Recal table |
| ApplyBQSR | GATK ApplyBQSR | Apply recalibration | Dedup BAM | Recal BAM |
| HaplotypeCaller | GATK HaplotypeCaller | Local assembly + calling | Recal BAM | gVCF |
| GenomicsDBImport | GATK GenomicsDBImport | Cohort database creation | gVCFs | GenomicsDB |
| GenotypeGVCFs | GATK GenotypeGVCFs | Joint genotyping | GenomicsDB | Raw cohort VCF |
| VQSR | GATK VariantRecalibrator + ApplyVQSR | ML variant filtering | Raw VCF | Filtered VCF |
DRAGEN-GATK Mode (2026 update): Hardware-optimized alignment + recalibration-free calling for functional equivalence.
Performance Optimization Tips
Preprocessing
Efficient preprocessing ensures that your downstream variant calling is accurate and free from technical artifacts. Key steps include:
- Alignment: Use DRAGMAP (DRAGEN) or BWA-MEM to map reads to the reference genome.
- Base Quality Score Recalibration (BQSR): Always run BQSR to correct systematic errors, unless you are using DRAGEN mode which handles this natively.
- Validation: Use `gatk ValidateSamFile` to ensure your BAM files are structurally sound before proceeding.
Calling
To maximize speed and accuracy during the variant discovery phase, follow these practices:
- Scatter calling intervals: Break the genome into 100-1000 intervals to process them in parallel.
- GVCF mode: Always output in GVCF format to facilitate efficient joint genotyping later.
- Ploidy settings: Ensure `--sample-ploidy 2` (the default) is set for standard human diploid samples.
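Scattering can be pictured as cutting each contig into contiguous, near-equal pieces that are then passed to separate HaplotypeCaller jobs through -L. Below is a minimal Python sketch of that splitting. The function name is hypothetical, and the sketch cuts at arbitrary positions; production pipelines typically rely on GATK's SplitIntervals or curated interval lists instead.

```python
def scatter_intervals(contig, length, n_parts):
    """Split 1..length into n_parts contiguous, 1-based inclusive intervals.

    Returns strings in 'contig:start-end' form.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    if length < n_parts:
        raise ValueError("contig is shorter than the number of parts")
    base, extra = divmod(length, n_parts)
    intervals = []
    start = 1
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first parts
        end = start + size - 1
        intervals.append(f"{contig}:{start}-{end}")
        start = end + 1
    return intervals


if __name__ == "__main__":
    # Toy contig length chosen for readability.
    for interval in scatter_intervals("chr20", 10, 3):
        print(interval)
```

Each interval string can be handed to one calling job, and the per-interval gVCFs are merged afterwards.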
Compute Requirements
Allocating the right resources prevents bottlenecks and "Out of Memory" crashes:
- Preprocessing: Requires approximately 32GB RAM and 16 cores per sample.
- Calling: Requires 16GB RAM and 4-8 cores per scatter interval.
- Joint Genotyping: This stage is computationally expensive and scales linearly with your cohort size.
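As a back-of-the-envelope check, the per-interval figures above can be turned into an estimate of how many calling jobs fit on one node and how many rounds a scatter needs. The Python sketch below uses the 16 GB and 4-core figures from the list as defaults; the node sizes in the example and the function names are assumptions for illustration.

```python
def concurrent_calling_jobs(node_ram_gb, node_cores, ram_per_job_gb=16, cores_per_job=4):
    """Number of calling jobs that fit on a node, limited by RAM or cores."""
    return min(node_ram_gb // ram_per_job_gb, node_cores // cores_per_job)


def rounds_needed(n_intervals, jobs_at_once):
    """Rounds required to process all intervals when jobs_at_once run in parallel."""
    if jobs_at_once < 1:
        raise ValueError("node cannot fit a single job")
    return -(-n_intervals // jobs_at_once)  # integer ceiling division


if __name__ == "__main__":
    jobs = concurrent_calling_jobs(node_ram_gb=128, node_cores=32)  # hypothetical node
    print(jobs, rounds_needed(100, jobs))
```

Whichever resource runs out first (RAM or cores) sets the limit, which is why the helper takes the minimum of the two quotients.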
Expected Metrics
After your pipeline runs, validate your results against these standard genomic benchmarks:
- Ti/Tv ratio: Should be roughly 2.1 for non-coding regions and 3.0 for coding regions.
- Het/Hom ratio: Typically falls between 1.5 and 2.0 for human samples.
- Analysis: Use `bcftools stats` to check transition/transversion counts and, with `-s -`, per-sample heterozygous and homozygous genotype counts.
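To make the two ratios concrete, here is a small Python sketch of how they are computed from SNP alleles and genotype strings. It is an illustration of the definitions only, not a replacement for bcftools; the function names and the toy inputs are invented for this example.

```python
# A<->G and C<->T changes are transitions; every other SNP is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """snps: iterable of (ref, alt) pairs; non-SNP records are skipped."""
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed")
    return ti / tv


def het_hom_ratio(genotypes):
    """genotypes: GT strings such as '0/1' or '1|1'; counts het vs. hom-alt calls."""
    het = hom_alt = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid calls
        if alleles[0] != alleles[1]:
            het += 1
        elif alleles[0] != "0":
            hom_alt += 1
    if hom_alt == 0:
        raise ValueError("no homozygous-alternate calls observed")
    return het / hom_alt


if __name__ == "__main__":
    print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
    print(het_hom_ratio(["0/1", "0|1", "1/1", "0/0", "./."]))
```

Homozygous-reference and missing genotypes are ignored in the Het/Hom count, so only variant calls contribute to the ratio.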
Common Pitfalls and Solutions
| Issue | Cause | Solution |
|---|---|---|
| Low Ti/Tv | Poor alignment | Realign with BWA-MEM/DRAGMAP |
| Mapping bias | Reference mismatch | Use ALT contigs (hg38) |
| Batch effects | Different sequencers | Joint genotype all samples |
| Memory errors | Large BAMs | Scatter by intervals |
SyncBio Bioinformatics Implementation
SyncBio integrates GATK Best Practices into production pipelines to ensure high-quality genomic data processing.
GenomeFlow-Prod Pipeline:
- Snakemake prototyping → Nextflow production
- WGS/WES germline calling (1,000+ samples/month)
- AWS Batch + GPU nodes for ML integration
- DRAGEN-GATK mode for functional equivalence
Key Results:
- 95%+ precision/recall vs. truth sets
- 40% compute cost reduction via scattering
- Reproducible across local HPC/cloud
- Supports VariantML-Pipe ML workflows
SyncBio Workflow Template (Nextflow):
The following process demonstrates the standardized implementation of the GATK HaplotypeCaller within our Nextflow architecture:
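process GATK_HAPLOTYPECALLER {
    // Each input needs its own declaration line.
    // The BAM index and the reference .fai/.dict files must also be staged.
    input:
    path bam
    path ref

    output:
    path "*.g.vcf.gz"

    script:
    // Name the gVCF after the input BAM so per-sample outputs do not collide.
    """
    gatk HaplotypeCaller -R $ref -I $bam -O ${bam.baseName}.g.vcf.gz \
    --emit-ref-confidence GVCF --standard-min-confidence-threshold-for-calling 20
    """
}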
This standardized approach powers SyncBio's personalized medicine projects and EU collaborations, delivering publication-ready variant calls.