What is snpQT?

snpQT (pronounced snip-cutie) makes your single-nucleotide polymorphisms cute. It also supports processing human genomic variants for:

  • human genome build conversion
  • sample quality control
  • population stratification
  • variant quality control
  • pre-imputation quality control
  • local imputation
  • post-imputation quality control
  • genome-wide association studies

All of these steps run within an automated Nextflow pipeline. We run a collection of versioned bioinformatics software in Singularity and Docker containers, or in Anaconda and Environment Modules environments, to improve reliability and reproducibility.

Who is snpQT for?

snpQT might be useful for you if:

  • you want a clean genomic dataset using a reproducible, fast and comprehensive pipeline
  • you are interested in identifying significant SNP associations with a trait
  • you want to identify and remove outliers based on their ancestry
  • you wish to perform imputation locally
  • you wish to prepare your genomic dataset for imputation on an external server (following comprehensive QC and pre-imputation QC preparation)

What do you need to get started?

  • you have already called your variants using human genome build 37 or 38
  • your variants are in VCF or plink bfile format
  • your variants have "rs" ids
  • your samples have either a binary or a quantitative phenotype

If this sounds like you, check out the installation guide.
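
As a quick sanity check against the input requirements listed above, the following is a minimal sketch (not part of snpQT) that counts how many variant IDs in a PLINK .bim file start with "rs". It assumes the standard .bim layout, where the variant ID is the second whitespace-separated column. The function name `count_rs_ids` and the example records are hypothetical and used only for illustration.

```python
from typing import Iterable, Tuple


def count_rs_ids(bim_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (n_rs, n_total) for the variant IDs found in PLINK .bim lines.

    Assumes the standard .bim layout: chromosome, variant ID, genetic
    position, base-pair position, allele 1, allele 2.
    """
    n_rs = 0
    n_total = 0
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        n_total += 1
        if fields[1].startswith("rs"):
            n_rs += 1
    return n_rs, n_total


# Hypothetical example records; real input would come from open("mydata.bim").
example = [
    "1\trs12345\t0\t752566\tA\tG",
    "1\t1:768448\t0\t768448\tA\tG",
    "2\trs67890\t0\t10133\tT\tC",
]

n_rs, n_total = count_rs_ids(example)
print(f"{n_rs} of {n_total} variants have rs IDs")
```

If a large share of variants lack "rs" ids, the dataset probably needs its variant IDs annotated before it meets the requirements above.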

snpQT definitely won't be useful for you if:

  • you want to do quality control on raw sequence reads
  • you want to call variants from raw sequence reads
  • you are working on family GWAS data
  • you're not working with human genomic data


If you find snpQT useful please cite:

Vasilopoulou C, Wingfield B, Morris AP and Duddy W. snpQT: flexible, reproducible, and comprehensive quality control and imputation of genomic data [version 1; peer review: 2 approved with reservations]. F1000Research 2021, 10:567 https://doi.org/10.12688/f1000research.53821.1

License and third-party software

snpQT is distributed under an MIT license. Our pipeline wouldn't be possible without the following amazing third-party software:

| Software | Version | Reference | License |
|---|---|---|---|
| EIGENSOFT | 7.2.1 | Price, Alkes L., et al. "Principal components analysis corrects for stratification in genome-wide association studies." Nature genetics 38.8 (2006): 904-909. | Custom open source |
| impute5 | 1.1.4 | Rubinacci, Simone, Olivier Delaneau, and Jonathan Marchini. "Genotype imputation using the positional burrows wheeler transform." PLoS Genetics 16.11 (2020): e1009049. | Academic use only |
| nextflow | 21.04.3 | Di Tommaso, Paolo, et al. "Nextflow enables reproducible computational workflows." Nature biotechnology 35.4 (2017): 316-319. | GPL3 |
| picard | 2.24.0 | | MIT |
| PLINK | 1.90b6.18 | Purcell, Shaun, et al. "PLINK: a tool set for whole-genome association and population-based linkage analyses." The American journal of human genetics 81.3 (2007): 559-575. | GPL3 |
| PLINK2 | 2.00a2.3 | Chang CC, Chow CC, Tellier LCAM, Vattikuti S, Purcell SM, Lee JJ (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4. | GPL3 |
| samtools | 1.11 | Danecek, Petr et al. "Twelve years of SAMtools and BCFtools." GigaScience, 10(2), 1-4, 2021 | MIT |
| bcftools | 1.9 | Danecek, Petr et al. "Twelve years of SAMtools and BCFtools." GigaScience, 10(2), 1-4, 2021 | MIT |
| shapeit4 | 4.1.3 | Delaneau, Olivier, et al. "Accurate, scalable and integrative haplotype estimation." Nature communications 10.1 (2019): 1-10. | MIT |
| snpflip | 0.0.6 | https://github.com/biocore-ntnu/snpflip | MIT |

We also rely on many other pieces of software, such as R and the R tidyverse.