Workflow descriptions
Build conversion
--convert_build
Build conversion workflow options:
Mandatory:
--vcf
--fam
Optional:
--input_build [38/37]
--output_build [37/38]
--mem
snpQT
assumes that your genomic data are built on human genome build 37. Even though b37 is not the most recent build, it is the most frequently used build among current public, reference genomic datasets (e.g. 1,000 Genome data, Haplotype Reference Concortium panel) and available SNP-array datasets, and for this reason we decided that snpQT
should support b37 for the most of the workflows. However, if you wish to impute your data using an external online server that uses a reference panel in b38 like TOPMed, we provide input parameters to convert your snpQT
clean dataset from b37 back to b38 (using --input_build 37
and --output_build 38
).
This workflow is intented for those users whose data are built on b38 (default) or b37 and they wish to convert them to b37 or b38, accordingly, using Picard
's LiftoverVcf
utility, so that their data are compatible with subsequent snpQT
workflows and also the users can convert their data in their initial build if they wish to. If your data are built neither on b38 nor b37 then visit our public git repo where you can run the code of this workflow yourselves, using alternative .chain.gz files depending on your current build. Lastly, --mem
parameter controls the memory size that the LiftoverVCF
utility can use.
Warning
Be aware that a small portion of SNPs is expected to be removed between conversion. A portion of SNPs are different among different hgb versions (e.g. not present, merged or relabelled).
Quality control
--qc
Quality control workflow options:
Mandatory:
--bed
--bim
--fam
Optional:
--mind [0-1]
--indep_pairwise [e.g. "50 5 0.2"]
--variant_geno [0-1]
--hwe [0-1]
--maf [0-1]
--missingness [0-1]
--sexcheck [true/false]
--rm_missing_pheno [false/true]
--keep_sex_chroms [true/false]
--heterozygosity [true/false]
--king_cutoff [0-1]
--pca_covars [1-20]
--linear [false/true]
Sample quality control
Below we list the checks which are followed in the Sample QC workflow:
-
Missing variant call rate check: Remove very poor quality SNPs based on call rate. This way we avoid removing samples based on very poor quality SNPs (which we would remove later in Variant QC workflow anyway).
-
Missing sample call rate check: Remove samples with lower than 98% call rate and then visualize the distribution for all samples using histograms and scatterplots before and after the applied threshold. This threshold can be modified using the
--mind
parameter. -
Check for sex discrepancies: Remove problematic samples for which (1) pedigree sex does not agree with the predicted sex based on sex chromosome homozygosity or (2) there is no sex phenotype. This step is important to avoid potential sample mix-ups and/or DNA contaminations.
Warning
This step will not run if sex chromosomes are not included in the .bim file. In this case, use the --sexcheck false
parameter to skip this step.
-
Removal of non-autosomal SNPs: The default mode of
snpQT
is to keep the sex chromosomes. If the user wishes to remove the sex chromosomes, use--keep_sex_chroms false
-
Heterozygosity check: Identify and remove heterozygosity outliers (samples that deviate more than 3 units of Standard Deviation from the mean heterozygosity). The distribution of the samples' heterozygosity is vizualized through a histogram and scatterplot. Extreme heterozygosity implies inbreeding and/or DNA contamination. This step can be skipped using
--heterozygosity false
. This can be useful in occasions where the dataset has been already through heterozygosity once, so the user does not wish to remove samples again based on this metric. -
Check for cryptic relatedness and duplicates: Check for cryptic pairs of relatives using
plink2
's relationship-based pruning threshold. Relatedness is defined as 3rd degree or closer. This threshold can be changed using the--king_cutoff
parameter. -
Removal of samples with a missing phenotype: Remove samples with missing phenotypes. As missing phenotype here we refer to case/control status (i.e. the last column in your PLINK .fam file). The default option in
snpQT
is to skip this step.
At the end of Sample QC a .log file is generated listing the number of samples, variants, phenotypes and working directory for each step where the intermediate files are stored, as well as an .html report containing before-and-after the chosen thesholds plots for the vast majority of the aforementioned steps.
Variant quality control
The Variant QC workflow is the second part of the –-qc
utility of snpQT
. It is good practice to first filter low quality samples in order to reduce the risk of removing a potentially high-risk variant during Variant QC. For this reason, the population stratification workflow (if chosen to run by the user), which is essentially a Sample QC step, is designed to run in between Sample QC and Variant QC. Below we list the checks which are followed in the Variant QC workflow:
-
Missing variant call rate check : Remove poor quality SNPs based on a more strict call rate threshold. This threshold can be changed using the parameter
–-variant_geno
. -
Hardy-Weinberg equilibrium (HWE) deviation check: Remove SNPs that significantly deviate from the Hardy-Weinberg equilibrium (HWE), indicating a genotyping error, and visualize the distribution of SNPs with extreme deviation. This threshold can be changed using the parameter
–-hwe
. -
Minor Allele Frequency (MAF) check: Remove SNPs with low MAF and visualize the MAF distribution. This threshold can be modified using the
–-maf
paramater insnpQT
. Rare SNPs (having a very low MAF) are usually considered as false-positives and need to be excluded from further analysis. -
Missingness in case/ control status check: Remove SNPs with a statistically significant association of missingness (low call rate) and case/control status. The threshold can be modified using the parameter
–-missingness
.
Warning
Missingness in case/ control status check can not be performed in quantitative data. If you do not have binary data use the --linear
parameter to skip this step.
- Generate covariates using the first X Principal Components of each sample: The number of Principal Components (PCs) to account for inner population structure can be tailored using the
--pca_covars
parameter which can take as input a number from 1 to 20, with 1 starting from the first Principal Component of the PCA. Output from this step is used in the--gwas
workflow, in order to account for potential inner population structure (it is best used along with--pop_strat
).
At the end of Variant QC we provide an .html report containing the following:
-
Before-and-after applying the chosen thesholds plots for all the aforementioned steps.
-
Three 2D PCA plots (PC1vsPC2, PC1vsPC3 and PC2vsPC3) of the user's data annotated with case/control status.
-
3D interactive PCA plot of only the user's data annotated with case/control status.
We also provide all the figures, logs and binary plink files in separate directories ./qc/figures/
, ./qc/logs/
and ./qc/bfiles/
, respectively.
Population stratification
--pop_strat
Population stratification options:
Mandatory:
--qc
Optional:
--variant_geno [0-1]
--indep-pairwise [e.g. 50 5 0.2]
--popfile [super/sub]
--parfile [false/parfile.txt]
--popcode [EUR/AFR/SAS]
This workflow aims to identify and remove outliers using EBI's phased latest release 1,000 human genome reference panel aligned in human genome build 37, as well as account for potential inner population structure in the later GWAS workflow. Population stratification is an essential step in QC analysis, since it minimizes the possibility that the difference in the allele frequencies is caused by the different ancestry of the samples.
Note
Population stratification is a sample QC step and for this reason it is designed to run between the Sample and Variant QC --qc
workflows.
The first step in --pop_strat
is to prepare and merge 1,000 human genome data with the user's dataset following a number of QC steps using plink2
. The processing steps for 1,000 human genome data include:
-
Removing sex chromosomes
-
Checking that all alleles are on the forward strand
-
Removing duplicated SNPs
-
Removing multi-allelic variants
-
Removing very poorly genotyped variants
-
Removing samples with low call rate (>98%)
-
Removing poorly genotyped variants in a second more stringent threshold (
--variant_geno
) -
Removing rare variants
-
Keeping only highly independent SNPs by excluding regions with high linkage equilibrium and pruning using
--indep-pairwise
.
The next step is to prepare the user's dataset (samples having already been processed through the Sample QC workflow):
-
Removing poorly genotyped variants in a second more stringent threshold (
--variant_geno
) -
Removing rare variants
-
Keeping only highly independent SNPs by excluding regions with high linkage equilibrium and pruning using
--indep-pairwise
. -
Checking that all alleles are on the forward strand
-
Removing ambiguous SNPs
-
Removing duplicated SNPs
When both datasets are prepared, the next step is merging: keeping only mutual SNPs shared by both the user's dataset and the 1,000 human genome data.
Then we make a popfile summarizing the samples of the merged dataset adding a third column labelling the population origin of each sample. User's samples are automatically labelled as "OWN". The population label for 1,000 human genome data is defined by the --popfile
flag, having as a default to use super population labels (e.g. EUR, AFR, AMR). If you want to use subpopulation labels then you should add the --popfile sub
parameter.
When the popfile and the merged dataset are ready, it is time to run Eigensoft's smartpca
software. smartpca
needs a set of parameters in order to run, which are in the form of a file (parfile). We provide the option to change this parfile according to the users' needs (you can have a look at the parameters of the parfile here). Our suggested parameters for the parfile are the following:
genotypename: C6_indep.bed
snpname: C6_indep.bim
indivname: C6_indep.pedind
evecoutname: eigenvec
evaloutname: eigenval
outlieroutname: excluded_outliers.txt
numthreads: 2
poplistname: poplist.txt
numoutlierevec: 7
autoshrink: YES
The user can change/add/remove parameters by passing a file to the --parfile [file]
parameter. However, the following parameters are mandatory for snpQT
to run, and therefore should not be altered or removed by the user:
genotypename: C6_indep.bed
snpname: C6_indep.bim
indivname: C6_indep.pedind
evecoutname: eigenvec
evaloutname: eigenval
outlieroutname: excluded_outliers.txt
poplistname: poplist.txt
A parfile.txt example would look something like this:
numthreads: 10
numoutlierevec: 4
When --pop_strat
has finished, we provide the following PCA results in a .html report:
-
3D interactive PCA plot of merged dataset containing all samples before and after outlier removal.
-
2D PCA plots of merged dataset containing all samples.
The last and most important step in --pop_strat
is the:
- Removal of any samples that are outliers.
We also provide all the figures, logs and binary plink files in separate directories ./pop_strat/figures/
, ./pop_strat/logs/
and ./pop_strat/bfiles/
, respectively.
Pre-Imputation quality control
--pre_impute
Pre-imputation workflow options:
Mandatory:
--qc
In pre-imputation workflow the user's dataset is prepared for phasing and imputation, either locally using snpqt
or for the purpose to upload a clean vcf.gz file to an external imputation server.
The main aims of this workflow are:
- Flip SNPs that are on the reverse strand
- Remove ambiguous SNPs
- Remove one of each pair of duplicated SNPs
- Fix and swap the reference allele using
bcftools
+fixref
plug-in.
A clean and properly prepared for imputation vcf file and a index file are stored in ./pre_imputation/files/
directory.
Warning
- Pre-Imputation workflow can not be combined with
--gwas
, as its purpose is to upload a properly prepared VCF file to an external imputation server - Pre-imputation and post-imputation workflows are run internally when
--impute
is used to run imputation locally. So, you do not have to combine--impute
with--pre_impute
and--post_impute
workflow parameters. If you do,snpQT
is designed to throw an error.
Phasing & Imputation
--impute
Imputation workflow options:
Mandatory:
--qc
Optional:
--impute_maf [0-1]
--info [0-1]
--impute_chroms [1-23]
This workflow has three main parts:
-
Phasing: Perform phasing using
shapeit4
and index all phased chromosomes. -
Imputation: Perfom local imputation using
impute5
, following the steps below:- Convert reference genome (1,000 human genome) into a
.imp5
format for each chromosome, usingimp5Converter
. - Run
impute5
using the converted reference genome, genetic maps and user's prepared phased data. - Using the parameter
--impute_chroms [1-23]
you can control the number of chromosomes that are imputed at the same time. As higher the number of chromosomes is, the more RAM your machine will need to use.
- Convert reference genome (1,000 human genome) into a
Post-imputation quality control
--post_impute
Post-imputation workflow options:
Mandatory:
--vcf
--fam
Optional:
--impute_maf [0-1]
--info [0-1]
Tha main aims of this workflow are:
- Merge all imputed chromosomes with
bcftools
.
Warning
We use -n
parameter during concatenation which makes the process run much faster but assumes that your .vcf.gz files are sorted.
-
Filter variants: We filter all poorly imputed variants based on info score (default is 0.7 - if you want to change the threshold use
--info
) and filter based on MAF (default is 0.01 - if you want to change the threshold use--maf
). -
Annotate missing SNP ids: The annotation of the missing SNPs is in this format Chromosome:Position:Reference_Allele:Alternative_Allele.
-
Handle all categories of "duplicated" SNP ids:
-
Remove exact duplicates
-
Remove multi-allelics
-
Annotate merged SNPs: Merged SNPs are a special category of duplicated SNP ids with different position and ref/alt alleles. You can read more about this category here.
-
Note
Even though Post-Imputation workflow is nested under the --impute
flag, it is also designed to run independently as some users might prefer running imputation using online servers, or have already imputed data and they wish to proceed with a Post-Imputation QC.
Warning
The input --vcf input.vcf.gz
file should contain the same samples as the input --fam input.fam
file. In case, they contain different samples snpQT
will output an error.
Genome-Wide Association Study
--gwas
GWAS workflow options:
Mandatory:
--qc
Optional:
--pop_strat
--impute
--covar_file [false/covar.txt]
--pca_covars [1-20]
--linear [false/true]
The Genome-Wide Association Studies (GWAS) workflow aims to identify markers with a statistically significant association with the trait of interest. This workflow performs logistic or linear regression (depending on the phenotype) with and without covariates (calculated at the end of the ---qc
workflow and passed to the GWAS workflow). The ---gwas
workflow illustrates the results of Generalized Linear Regression (plink2
's --glm
) in the forms of a Manhattan plot and a Q-Q plot. The main processes of this workflow include:
-
Run logistic or linear regression:
-
Adjusting for covariates accounting for fine-scale population structure. The covariates can either be imported by the user using the
--covar_file covar.txt
parameter or can be generated in--qc
workflow (for better results combine with--pop_strat
) using the first X Principal Components of the generated PCA using the--pca_covars
parameter. -
Not adjusting for covariates.
-
Warning
- The
--covar_file
parameter can not be combined with--pca_covars
. - For the format of the covar.txt file please advise plink2.
- Since we are using
plink2
's--glm
, the covariates file can not contain columns for the sex of the samples. Sex is automatically accounted for in--glm
.
We added these two processes for two main reasons. The first one is to make --gwas
useful for users who do not wish to run --pop_strat
or use covariates or even inspect/compare the effect of the used covariates. In this case, the first process will not produce an output (designed so as to not produce an error), but the second process will provide the expected results (along with a Manhattan plot and a Q-Q plot). The second reason is for users that wish to run --pop_strat
, use --pop_strat
covariates or insert their own covariates and it would be helpful for them to compare their GWAS results in the output plots with and without covariates.
-
Illustrate a Q-Q (Quantile-Quantile) plot: A plot to inspect the lambda genomic inflation parameter (calculated as median 1df chi-square stat / 0.456), expressing the relationship between the observed and the expected quantile p-values of the sample cohort under a normal distribution.
-
Illustrate Manhattan plot: A plot where the association p-values are represented on the y-axis, and the positions of the tested variants are represented on the x-axis.
Log steps table
The table below links every step with a log code which can be useful for users who want to inspect the log plots. The logs are generated automatically and show the removed samples and variants for each step.
Code | Step | Flag | Workflow |
---|---|---|---|
A1 | Download and process reference files | --download_db |
Database set up |
A2 | QC in reference files | --download_db |
Database set up |
A3 | Decompress reference files | --download_db |
Database set up |
A4 | Remove duplicate records in reference files | --download_db |
Database set up |
A5 | Unzip genetic maps for phasing | --download_db |
Database set up |
A6 | Annotate variants to a "chr:pos:ref:alt" format | --download_db |
Database set up |
B1 | Create a dictionary file | --convert_build |
Build Conversion |
B2 | Change the chromosome ids | --convert_build |
Build Conversion |
B3 | Run LiftOver to map genome build | --convert_build |
Build Conversion |
B4 | Change the chromosome ids | --convert_build |
Build Conversion |
B5 | Convert VCF to PLINK format and update phenotypes | --convert_build |
Build Conversion |
C1 | Missing variant call rate check | --qc |
Sample QC |
C2 | Missing sample call rate check | --qc |
Sample QC |
C3 | Check for sex discrepancies | --qc |
Sample QC |
C4 | Removal of non-autosomal SNPs | --qc |
Sample QC |
C5 | Heterozygosity check | --qc |
Sample QC |
C6 | Check for cryptic relatedness | --qc |
Sample QC |
C7 | Removal of samples with missing phenotype | --qc |
Sample QC |
E8 | Missing variant call rate check | --qc |
Variant QC |
E9 | Hardy-Weinberg equilibrium (HWE) deviation check | --qc |
Variant QC |
E10 | Minor Allele Frequency (MAF) check | --qc |
Variant QC |
E11 | Missingness in case/ control status check | --qc |
Variant QC |
E12 | Create a covariate file for GWAS and perform PCA | --qc |
Variant QC |
E13 | Combine all .log files into one file | --qc |
Variant QC, Sample QC |
E14 | Create a .html report | --qc |
Variant QC, Sample QC |
D3 | QC and preparation of user's data | --pop_strat |
PopStrat |
D4 | Fix strand errors and remove ambiguous SNPs | --pop_strat |
PopStrat |
D5 | Align reference allele according to reference genome | --pop_strat |
PopStrat |
D6 | Merge user's dataset with 1,000 Human Genome data | --pop_strat |
PopStrat |
D7 | Create a population file | --pop_strat |
PopStrat |
D8 | Principal Component Analysis & smartpca |
--pop_strat |
PopStrat |
D9 | Remove outliers from user's dataset | --pop_strat |
PopStrat |
F1 | Set appropriate chromosome codes | --impute , --pre_impute |
Pre-Imputation |
F2/D4 | Check for strand issues and remove ambiguous SNPs | --impute , --pre_impute |
Pre-Imputation |
F3 | Remove one of each pair of duplicated SNPs | --impute , --pre_impute |
Pre-Imputation |
F4 | Convert Plink file into VCF | --impute , --pre_impute |
Pre-Imputation |
F5 | Convert VCF file into BCF | --impute , --pre_impute |
Pre-Imputation |
F6 | Check and fix the REF allele bcftools +fixref |
--impute , --pre_impute |
Pre-Imputation |
F7 | Sort the BCF, Convert .bcf file to .vcf.gz and index |
--impute , --pre_impute |
Pre-Imputation |
A6 | Annotate user's variants to a "chr:pos:ref:alt" format | --impute |
Imputation |
G1 | Split vcf.gz file in chromosomes and index them | --impute |
Imputation |
G2 | Perform phasing using shapeit4 |
--impute |
Imputation |
G3 | Index phased chromosomes | --impute |
Imputation |
G4 | Tabix reference files | --impute |
Imputation |
G5 | Convert vcf.gz reference genome chromosomes into .imp5 format | --impute |
Imputation |
G6 | Perform imputation using impute5 | --impute |
Imputation |
G7 | Merge all imputed chromosomes | --impute |
Imputation |
E13 | Combine all .log files into one file | --impute |
Imputation |
H1 | Filter all poorly imputed variants using info score | --impute , --post_impute |
Post-Imputation |
H2 | Filter imputed variants using MAF | --impute , --post_impute |
Post-Imputation |
H3 | Remove duplicate variants | --impute , --post_impute |
Post-Imputation |
H4 | Remove multi-allelics variants | --impute , --post_impute |
Post-Imputation |
H5 | Identify merged variants | --impute , --post_impute |
Post-Imputation |
H6 | Update sample ids information | --impute , --post_impute |
Post-Imputation |
H7 | Update phenotype information | --impute , --post_impute |
Post-Imputation |
E13 | Combine all .log files into one file | --impute , --post_impute |
Imputation |
I1 | Perform logistic regression with and without covariates and plot results | --gwas |
GWAS |
E13 | Combine all .log files into one file | --gwas |
GWAS |
E14 | Create a .html report | --gwas |
GWAS |