Variant Calling with GATK: Best Practices

Introduction

The Genome Analysis Toolkit (GATK), developed by the Broad Institute, provides a set of tools and recommended workflows for accurate variant calling from next-generation sequencing data. This guide describes the GATK Best Practices for germline short variant discovery (SNPs and indels), covering data preprocessing, variant calling, and quality control for human whole-genome and exome sequencing data.

Why GATK Best Practices Matter

GATK workflows minimize false positives/negatives through systematic data preprocessing, statistical modeling, and filtering. Key principles:

  • Preprocessing: Alignment and base-quality artifacts are a major source of variant calling errors, so they are corrected before calling
  • Joint genotyping: Improves accuracy across samples
  • Machine learning filtering: VQSR generally outperforms hard filters when the cohort is large enough to train the model
  • Functional equivalence: Pipelines produce interoperable results across labs

Core GATK Variant Calling Pipeline

This pipeline follows the Broad Institute's official Germline Short Variant Discovery workflow.

1. Data Preprocessing (Essential)

Convert uBAM → aligned BAM → recalibrated BAM:

Unmapped BAM (uBAM)
↓ Alignment (BWA-MEM or DRAGMAP) + MergeBamAlignment
Aligned BAM
↓ MarkDuplicates (Picard) + SortSam
Coordinate-sorted, duplicate-marked BAM
↓ Base Quality Score Recalibration (BQSR)
Recalibrated BAM (or CRAM)
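
The flow above, up to the duplicate-marked BAM, can be sketched as a short shell function. This is a minimal sketch under stated assumptions, not the official Broad workflow: the file names are placeholders, the aligner is assumed to be BWA-MEM with an already indexed reference, and run is a hypothetical helper that only prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print each command instead of running it unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Take one sample from uBAM to a duplicate-marked BAM.
# Assumption: the input is named <sample>.unmapped.bam.
preprocess_sample() {
  sample="$1"
  ref="$2"
  # 1. uBAM -> interleaved FASTQ for the aligner
  run gatk SamToFastq -I "${sample}.unmapped.bam" \
    --FASTQ "${sample}.fq" --INTERLEAVE true
  # 2. Align (-p: interleaved pairs, -o: write SAM to a file)
  run bwa mem -p -o "${sample}.aligned.sam" "$ref" "${sample}.fq"
  # 3. Merge alignments with the uBAM to restore read groups and other tags;
  #    MergeBamAlignment writes coordinate-sorted output by default
  run gatk MergeBamAlignment --ALIGNED_BAM "${sample}.aligned.sam" \
    --UNMAPPED_BAM "${sample}.unmapped.bam" -R "$ref" -O "${sample}.merged.bam"
  # 4. Mark duplicates and write a metrics file
  run gatk MarkDuplicates -I "${sample}.merged.bam" \
    -O "${sample}.dedup.bam" -M "${sample}.dup_metrics.txt"
}

preprocess_sample sample reference.fasta
```

Printing the commands first makes it easy to review paths and flags before committing compute time; the resulting dedup BAM is the input to the BQSR step.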

BQSR Example Commands:

gatk BaseRecalibrator \
  -I input.bam \
  -R reference.fasta \
  --known-sites known_indels.vcf.gz \
  -O recal_data.table

gatk ApplyBQSR \
  -I input.bam \
  -bqsr recal_data.table \
  -O recalibrated.bam

2. Variant Calling (HaplotypeCaller)

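HaplotypeCaller identifies active regions, reassembles the reads in each region into candidate haplotypes, and calls variants from those haplotypes (local de novo assembly). Run it per sample in GVCF mode: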

gatk HaplotypeCaller \
  -R reference.fasta \
  -I recalibrated.bam \
  -O output.g.vcf.gz \
  --emit-ref-confidence GVCF

3. Joint Genotyping

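Combine the per-sample gVCFs into a GenomicsDB workspace, then genotype the cohort to produce one VCF. GenomicsDBImport takes one -V per gVCF (or a --sample-name-map file) and requires at least one interval (-L):

gatk GenomicsDBImport \
  -V input/sample1.g.vcf.gz \
  -V input/sample2.g.vcf.gz \
  --genomicsdb-workspace-path my_database \
  -L intervals.list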

gatk GenotypeGVCFs \
  -R ref.fasta \
  -V gendb://my_database \
  -O cohort.vcf.gz
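
For larger cohorts, GenomicsDBImport accepts a tab-delimited sample map (sample name, then gVCF path) through --sample-name-map instead of one -V argument per file. The function below is a minimal sketch for building such a map; make_sample_map is a hypothetical helper, and it assumes every file is named <sample>.g.vcf.gz, which may not match your naming scheme.

```shell
# Write "sample<TAB>path" lines for every gVCF in a directory.
# Assumption: each file is named <sample>.g.vcf.gz.
make_sample_map() {
  gvcf_dir="$1"
  for gvcf in "$gvcf_dir"/*.g.vcf.gz; do
    # Skip the unexpanded pattern when the directory holds no gVCFs
    [ -e "$gvcf" ] || continue
    sample=$(basename "$gvcf" .g.vcf.gz)
    printf '%s\t%s\n' "$sample" "$gvcf"
  done
}

# Example use (cohort.sample_map and intervals.list are placeholder names):
#   make_sample_map input > cohort.sample_map
#   gatk GenomicsDBImport --sample-name-map cohort.sample_map \
#     --genomicsdb-workspace-path my_database -L intervals.list
```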

4. Variant Quality Score Recalibration (VQSR)

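VQSR trains a Gaussian mixture model on variant annotations at known, trusted sites and assigns each call a VQSLOD score. Each --resource needs a label, its properties, and the path to the resource VCF; VariantRecalibrator writes a recalibration file and a tranches file, which ApplyVQSR then uses to filter the cohort VCF:

gatk VariantRecalibrator \
  -R ref.fasta \
  -V cohort.vcf.gz \
  --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
  -an QD -an MQ -an FS -an SOR -an MQRankSum -an ReadPosRankSum \
  -mode SNP \
  -O snp.recal \
  --tranches-file snp.tranches

gatk ApplyVQSR \
  -R ref.fasta \
  -V cohort.vcf.gz \
  --recal-file snp.recal \
  --tranches-file snp.tranches \
  --truth-sensitivity-filter-level 99.0 \
  -mode SNP \
  -O cohort.snp_filtered.vcf.gz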

Complete Workflow Summary

Step             | Tool                                  | Purpose                              | Input                   | Output
MarkDuplicates   | Picard MarkDuplicates                 | Mark PCR/optical duplicates          | Aligned BAM             | Dedup BAM
BQSR             | GATK BaseRecalibrator                 | Model systematic base-quality errors | Dedup BAM               | Recal table
ApplyBQSR        | GATK ApplyBQSR                        | Apply recalibration                  | Dedup BAM + recal table | Recal BAM
HaplotypeCaller  | GATK HaplotypeCaller                  | Local assembly + calling             | Recal BAM               | gVCF
GenomicsDBImport | GATK GenomicsDBImport                 | Cohort database creation             | gVCFs                   | GenomicsDB
GenotypeGVCFs    | GATK GenotypeGVCFs                    | Joint genotyping                     | GenomicsDB              | Raw cohort VCF
VQSR             | GATK VariantRecalibrator + ApplyVQSR  | ML variant filtering                 | Raw cohort VCF          | Filtered VCF

DRAGEN-GATK Mode (2026 update): DRAGMAP alignment plus HaplotypeCaller run with --dragen-mode, which skips BQSR, for functional equivalence with Illumina's DRAGEN pipeline.

Performance Optimization Tips

Preprocessing

Efficient preprocessing ensures that your downstream variant calling is accurate and free from technical artifacts. Key steps include:

  • Alignment: Use DRAGMAP (DRAGEN) or BWA-MEM to map reads to the reference genome.
  • Base Quality Score Recalibration (BQSR): Run BQSR to correct systematic base-quality errors, unless you are using DRAGEN-GATK mode, which skips BQSR.
  • Validation: Use gatk ValidateSamFile to ensure your BAM files are structurally sound before proceeding.

Calling

To maximize speed and accuracy during the variant discovery phase, follow these practices:

  1. Scatter calling intervals: Break the genome into 100-1000 intervals to process them in parallel.
  2. GVCF mode: Always output in GVCF format to facilitate efficient joint genotyping later.
  3. Ploidy settings: --sample-ploidy defaults to 2, which is correct for standard human diploid samples; change it only for non-diploid cases such as haploid regions or pooled samples.
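
The scattering step in item 1 can be illustrated with a small helper that cuts each contig into fixed-size windows. This is a simplified sketch: make_windows is a hypothetical function, it reads contig names and lengths from a FASTA index (.fai), and it ignores gaps and uneven coverage. GATK's own SplitIntervals tool is the more robust choice for production runs.

```shell
# Print fixed-size windows as "contig:start-end", one per line.
# Arguments: FASTA index (.fai: contig name in column 1, length in column 2)
# and window size in bases.
make_windows() {
  fai="$1"
  size="$2"
  awk -F '\t' -v size="$size" 'BEGIN { OFS = "" } {
    for (start = 1; start <= $2; start += size) {
      end = start + size - 1
      if (end > $2) end = $2      # clip the last window to the contig length
      print $1, ":", start, "-", end
    }
  }' "$fai"
}

# Example use: each window becomes one HaplotypeCaller job, e.g.
#   gatk HaplotypeCaller -R reference.fasta -I recalibrated.bam \
#     -L chr20:1-50000000 -O chr20_1.g.vcf.gz --emit-ref-confidence GVCF
```

The per-window gVCFs for a sample then need to be merged, or imported per interval, before joint genotyping.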

Compute Requirements

Allocating the right resources prevents bottlenecks and "Out of Memory" crashes:

  • Preprocessing: Requires approximately 32GB RAM and 16 cores per sample.
  • Calling: Requires 16GB RAM and 4-8 cores per scatter interval.
  • Joint Genotyping: This stage is computationally expensive and scales linearly with your cohort size.
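
One common cause of "Out of Memory" failures is giving the JVM the full job allocation and leaving nothing for native memory. The sketch below derives a heap flag from the memory granted to a job; heap_flag is a hypothetical helper, and the 80% ratio is an assumed rule of thumb, not a GATK recommendation.

```shell
# Print a JVM heap flag sized at 80% of the job's memory (in whole GB).
# Assumption: the remaining 20% is left for native (off-heap) memory.
heap_flag() {
  job_gb="$1"
  heap_gb=$(( job_gb * 80 / 100 ))
  # Never request less than 1 GB of heap
  [ "$heap_gb" -ge 1 ] || heap_gb=1
  printf '%s%sg\n' '-Xmx' "$heap_gb"
}

# Example use with the gatk wrapper's --java-options argument:
#   gatk --java-options "$(heap_flag 16)" HaplotypeCaller \
#     -R reference.fasta -I recalibrated.bam \
#     -O output.g.vcf.gz --emit-ref-confidence GVCF
```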

Expected Metrics

After your pipeline runs, validate your results against these standard genomic benchmarks:

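  • Ti/Tv ratio: Roughly 2.0-2.1 for whole-genome data and about 3.0 or slightly higher for coding (exome) regions.
  • Het/Hom ratio: Typically falls between 1.5 and 2.0 for human samples, though it varies with ancestry.
  • Analysis: Use bcftools stats to report transition/transversion counts and indel distributions, and add -s - for per-sample het/hom counts. It does not assess phasing quality.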
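
To make the Ti/Tv metric concrete, the sketch below counts transitions (A<->G, C<->T) and transversions among biallelic SNPs in a VCF. It is an illustration of what the ratio measures, not a replacement for bcftools stats: titv is a hypothetical helper, it skips indels and multiallelic sites, and it ignores FILTER status.

```shell
# Print transition and transversion counts plus their ratio for the
# biallelic SNPs in an uncompressed VCF (file argument or stdin).
titv() {
  awk -F '\t' '
    /^#/ { next }                              # skip header lines
    $4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT]$/ {       # single-base REF and ALT only
      pair = $4 $5
      if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ti++
      else tv++
    }
    END {
      if (tv > 0) printf "ti=%d tv=%d ratio=%.2f\n", ti, tv, ti / tv
      else printf "ti=%d tv=%d ratio=NA\n", ti, tv
    }
  ' "${1:--}"
}

# Example use on a compressed cohort VCF:
#   gzip -dc cohort.vcf.gz | titv
```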

Common Pitfalls and Solutions

Issue          | Cause                | Solution
Low Ti/Tv      | Poor alignment       | Realign with BWA-MEM/DRAGMAP
Mapping bias   | Reference mismatch   | Use ALT contigs (hg38)
Batch effects  | Different sequencers | Joint genotype all samples
Memory errors  | Large BAMs           | Scatter by intervals

SyncBio Bioinformatics Implementation

SyncBio integrates GATK Best Practices into production pipelines to ensure high-quality genomic data processing.

GenomeFlow-Prod Pipeline:

  • Snakemake prototyping → Nextflow production
  • WGS/WES germline calling (1,000+ samples/month)
  • AWS Batch + GPU nodes for ML integration
  • DRAGEN-GATK mode for functional equivalence

Key Results:

  • 95%+ precision/recall vs. truth sets
  • 40% compute cost reduction via scattering
  • Reproducible across local HPC/cloud
  • Supports VariantML-Pipe ML workflows

SyncBio Workflow Template (Nextflow):

The following process demonstrates the standardized implementation of the GATK HaplotypeCaller within our Nextflow architecture:

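process GATK_HAPLOTYPECALLER {
  input:
  path bam   // coordinate-sorted, recalibrated BAM
  path ref   // reference FASTA; its .fai and .dict files must also be staged

  output:
  path "*.g.vcf.gz"

  script:
  // Name the gVCF after the input BAM so per-sample outputs do not collide
  """
  gatk HaplotypeCaller -R $ref -I $bam -O ${bam.baseName}.g.vcf.gz \
  --emit-ref-confidence GVCF --standard-min-confidence-threshold-for-calling 20
  """
}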

This standardized approach powers SyncBio's personalized medicine projects and EU collaborations, delivering publication-ready variant calls.
