eQTL-Catalogue/qtlmap

public 1yr ago Version: Version 1 0 bookmarks

View Workflow

eQTL-Catalogue/qtlmap

Portable eQTL analysis and statistical fine mapping workflow used by the eQTL Catalogue

Introduction

eQTL-Catalogue/qtlmap is a bioinformatics analysis pipeline used for QTL Analysis.

The workflow takes phenotype count matrix (normalized and quality controlled) and genotype data as input, and finds associations between them with the help of sample metadata and phenotype metadata files (See Input formats and preparation for required input file details). To map QTLs, pipeline uses QTLTools's PCA and RUN methods. For manipulation of files BcfTools , Tabix and custom Rscript scripts are used.

The pipeline is built using Nextflow , a bioinformatics workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker / singularity containers making installation trivial and results highly reproducible.

Documentation

The eQTL-Catalogue/qtlmap pipeline comes with documentation about the pipeline, found in the docs/ directory:

Installation
Pipeline configuration
- Local installation
- Adding your own system
Input formats and preparation
Running the pipeline
Troubleshooting

Pipeline Description

Mapping QTLs is a process of finding statistically significant associations between phenotypes and genetic variants located nearby (within a specific window around phenotype, a.k.a cis window) This pipeline is designed to perform QTL mapping. It is intended to add this pipeline to the nf-core framework in the future. High level representation of the pipeline is shown below:

Results

The output directory of the workflow contains the following subdirectories:

PCA - genotype and gene expression PCA values used as covariates for QTL analysis.
sumstats - QTL summary statistics from nominal and permutation passes.
susie - SuSiE fine mapping credible sets.
susie_full - full set of susie results for all tested variants (very large files).
susie_merged - susie credible sets merged with summary statistics from univariate QTL analysis.

Column names of the output files are explained here .

Contributors

Nurlan Kerimov
Kaur Alasoo
Masahiro Kanai
Ralf Tambets

Code Snippets

"""
csvtk -t cut -f molecular_trait_id $credible_sets | csvtk -t uniq > cc_signals_phenotypes.txt
zcat $qtl_ss | csvtk -t grep -P cc_signals_phenotypes.txt | bgzip > ${qtl_subset}.cc.tsv.gz
tabix -s2 -b3 -e3 -S1 -f ${qtl_subset}.cc.tsv.gz
"""

NextFlow tabix csvtk From line 16 of modules/extract_cc_signal.nf

"""
bcftools view -S $sample_names $genotype_vcf -Oz -o ${sample_names.simpleName}_extract.vcf.gz
bcftools +fill-tags ${sample_names.simpleName}_extract.vcf.gz -Oz -o ${sample_names.simpleName}_extract_filltags.vcf.gz
bcftools view -i 'AN[0]*MAF[0]>5 & MAF[0]>0.01' ${sample_names.simpleName}_extract_filltags.vcf.gz -Oz -o ${sample_names.simpleName}.vcf.gz
tabix -p vcf ${sample_names.simpleName}.vcf.gz
"""

NextFlow BCFtools tabix From line 14 of modules/extract_samples_from_vcf.nf

"""
set +o pipefail; bcftools +fill-tags $vcf | bcftools query -f '%CHROM\\t%POS\\t%ID\\t%REF\\t%ALT\\t%TYPE\\t%AC\\t%AN\\t%MAF\\t%R2\\n' | gzip > ${qtl_subset}.variant_information.txt.gz
"""

NextFlow BCFtools From line 15 of modules/extract_variant_info.nf

"""
set +o pipefail; bcftools +fill-tags $vcf | bcftools query -f '%CHROM\\t%POS\\t%ID\\t%REF\\t%ALT\\t%TYPE\\t%AC\\t%AN\\t%MAF\\tNA\\n' | gzip > ${qtl_subset}.variant_information.txt.gz
"""

NextFlow BCFtools From line 19 of modules/extract_variant_info.nf

"""
QTLtools cis --vcf $vcf --bed $bed --cov $covariate --chunk $batch_index ${params.n_batches} --out ${qtl_subset}.permutation.batch.${batch_index}.${params.n_batches}.txt --window ${params.cis_window} --permute ${params.n_permutations} --grp-best
"""

NextFlow QTL-ALL From line 16 of modules/map_qtls.nf

"""
cat ${batch_file_names.join(' ')} | csvtk space2tab | sort -k11n -k12n > merged.txt
cut -f 1,6,7,8,10,11,12,18,20,21,22 merged.txt | csvtk add-header -t -n molecular_trait_object_id,molecular_trait_id,n_traits,n_variants,variant,chromosome,position,pvalue,beta,p_perm,p_beta | gzip > ${qtl_subset}.permuted.tsv.gz
"""

NextFlow csvtk From line 36 of modules/map_qtls.nf

    """
	fastQTL --vcf $vcf --bed $fastqtl_bed --cov $covariate \\
        --chunk $batch_index ${params.n_batches} \\
        --out ${qtl_subset}.nominal.batch.${batch_index}.${params.n_batches}.txt \\
        --window ${params.cis_window} \\
        --ma-sample-threshold 1
    """

NextFlow FastQTL From line 57 of modules/map_qtls.nf

"""
cat ${batch_file_names.join(' ')} | \\
    csvtk space2tab -T | \\
    csvtk sep -H -t -f 2 -s "_" | \\
    csvtk replace -t -H -f 10 -p ^chr | \\
    csvtk cut -t -f1,10,11,12,13,2,4,5,6,7,8,9 | \\
    bgzip > ${qtl_subset}.nominal.tab.txt.gz
"""

NextFlow csvtk From line 80 of modules/map_qtls.nf

"""
gzip -dc $nominal_merged | LANG=C sort -k2,2 -k3,3n -S${params.sumstat_sort_mem} --parallel=${params.sumstat_sort_cores} | uniq | \\
    bgzip > ${qtl_subset}.nominal.sorted.norsid.tsv.gz
"""

NextFlow From line 106 of modules/map_qtls.nf

"""
Rscript $baseDir/bin/prepare_molecular_traits.R \\
    -p "$phenotype_metadata" \\
    -s "$sample_metadata" \\
    -e "$expression_matrix" \\
    -v "$vcf_variant_info" \\
    -o "." \\
    -c ${params.cis_window} \\
    -m ${params.mincisvariant} \\
    -a ${params.covariates}

#Merge phenotype covariates together
head -n ${params.n_pheno_pcs + 1} phenoPCA.tsv > ${qtl_subset}.pheno_cov.txt
tail -n+2 additional_covariates.tsv >> ${qtl_subset}.pheno_cov.txt
"""

NextFlow From line 20 of modules/prepare_molecular_traits.nf

"""
bgzip ${bed_file} && tabix -p bed ${bed_file}.gz

csvtk cut -C\$ -t -f -strand,-group_id ${bed_file}.gz | bgzip > ${bed_file.baseName}.fastQTL.bed.gz
tabix -p bed ${bed_file.baseName}.fastQTL.bed.gz
"""

NextFlow tabix csvtk From line 49 of modules/prepare_molecular_traits.nf

"""
plink2 --vcf $vcf --vcf-half-call h --indep-pairwise 50000 200 0.05 --out ${qtl_subset}_pruned_variants --threads ${task.cpus} --memory ${task.memory.mega} --const-fid 
plink2 --vcf $vcf --vcf-half-call h --extract ${qtl_subset}_pruned_variants.prune.in --make-bed --out ${qtl_subset}_pruned --const-fid 
plink2 -bfile ${qtl_subset}_pruned --pca ${params.n_geno_pcs} header tabs
cat plink.eigenvec \\
    | sed '1s/IID/genotype_id/' \\
    | sed '1s/PC/geno_PC/g' \\
    | csvtk cut -t -f -"FID" \\
    | csvtk transpose -t > ${qtl_subset}.geno.pca
cat $phenotype_cov > ${qtl_subset}.covariates.txt    
set +o pipefail; tail -n+2 ${qtl_subset}.geno.pca | head -n ${params.n_geno_pcs} >> ${qtl_subset}.covariates.txt
"""

NextFlow plink2 csvtk From line 72 of modules/prepare_molecular_traits.nf

"""
$baseDir/bin/join_variant_info.py \\
    -v $var_info \\
    -r $rsid_map \\
    -o ${var_info.simpleName}.var_info_rsid.tsv.gz
"""

NextFlow From line 14 of modules/reformat_sumstats.nf

"""
$baseDir/bin/join_variant_info.py \
    -s $summ_stats \
    -v $var_info \
    -r $rsid_map \
    -p $phenotype_metadata \
    -m $median_tpm \
    -o ${qtl_subset}.nominal.sorted.tsv.gz
"""

NextFlow From line 33 of modules/reformat_sumstats.nf

"""
zcat $sumstats_file | bgzip > ${qtl_subset}.nominal.sorted.bgzip.tsv.gz
mv ${qtl_subset}.nominal.sorted.bgzip.tsv.gz ${qtl_subset}.all.tsv.gz
tabix -s2 -b3 -e3 -S1 -f ${qtl_subset}.all.tsv.gz
"""

NextFlow tabix From line 59 of modules/reformat_sumstats.nf

"""
Rscript $baseDir/bin/run_susie.R --expression_matrix ${expression_matrix}\
 --phenotype_meta ${phenotype_meta}\
 --sample_meta ${sample_meta}\
 --phenotype_list ${phenotype_list}\
 --covariates ${covariates}\
 --genotype_matrix ${genotype_matrix}\
 --chunk '${batch_index} ${params.n_batches}'\
 --cisdistance ${params.cis_window}\
 --out_prefix '${qtl_subset}.${batch_index}_${params.n_batches}'\
 --eqtlutils null\
 --write_full_susie ${params.write_full_susie}
"""

NextFlow From line 12 of modules/susie.nf

"""
awk 'NR == 1 || FNR > 1{print}' ${in_cs_variant_batch_names.join(' ')} | gzip -c > ${qtl_subset}.txt.gz
awk 'NR == 1 || FNR > 1{print}' ${credible_set_batch_names.join(' ')} | gzip -c > ${qtl_subset}.cred.txt.gz
awk 'NR == 1 || FNR > 1{print}' ${variant_batch_names.join(' ')} | gzip -c > ${qtl_subset}.snp.txt.gz
awk 'NR == 1 || FNR > 1{print}' ${lbf_variable_batch_names.join(' ')} | gzip -c > ${qtl_subset}.lbf_variable.txt.gz
"""

NextFlow From line 41 of modules/susie.nf

"""
gunzip -c ${merged_susie_output} > susie_merged.txt
(head -n 1 susie_merged.txt && tail -n +2 susie_merged.txt | sort -k3 -k4n ) | gzip > ${qtl_subset}.purity_filtered.txt.gz
"""

NextFlow From line 61 of modules/susie.nf

"""
#Extract variant coordinates from the credible set file
csvtk cut -t -T -f chromosome,position ${credible_sets} | tail -n +2 | sort -k1n -k2n | uniq > selected_regions.tsv

#Extract variants from the summary stats file
set +o pipefail; zcat ${qtl_ss} | head -n1 | gzip > header.txt.gz
set +o pipefail; tabix -R selected_regions.tsv ${qtl_ss} | gzip > filtered_sumstats.tsv.gz
set +o pipefail; zcat header.txt.gz filtered_sumstats.tsv.gz | gzip > ${qtl_subset}.extracted_sumstats.tsv.gz
"""

NextFlow tabix csvtk From line 77 of modules/susie.nf

"""
Rscript $baseDir/bin/susie_merge_cs.R --cs_results ${credible_sets}\
 --sumstats ${sumstats}\
 --out ${qtl_subset}.credible_sets.tsv.gz
"""

NextFlow From line 99 of modules/susie.nf

"""
bcftools annotate --set-id 'chr%CHROM\\_%POS\\_%REF\\_%FIRST_ALT' $vcf -Oz -o ${vcf.simpleName}_renamed.vcf.gz
"""

NextFlow BCFtools From line 12 of modules/vcf_set_variant_ids.nf

"""
#Extract header
printf 'CHROM\\nPOS\\nREF\\nALT\\n' > 4_columns.tsv
bcftools query -l ${vcf} > sample_list.tsv
cat 4_columns.tsv sample_list.tsv > header.tsv
csvtk transpose header.tsv -T | gzip > header_row.tsv.gz

#Extract dosage and merge
bcftools query -f "%CHROM\\t%POS\\t%REF\\t%ALT[\\t%DS]\\n" ${vcf} | gzip > dose_matrix.tsv.gz
zcat header_row.tsv.gz dose_matrix.tsv.gz | bgzip > ${vcf.simpleName}.dose.tsv.gz
tabix -s1 -b2 -e2 -S1 ${vcf.simpleName}.dose.tsv.gz
"""

NextFlow BCFtools tabix csvtk From line 12 of modules/vcf_to_dosage.nf

"""
#Extract header
printf 'CHROM\\nPOS\\nREF\\nALT\\n' > 4_columns.tsv
bcftools query -l ${vcf} > sample_list.tsv
cat 4_columns.tsv sample_list.tsv > header.tsv
csvtk transpose header.tsv -T | gzip > header_row.tsv.gz

#Extract dosage and merge
bcftools +dosage ${vcf} -- -t GT | tail -n+2 | gzip > dose_matrix.tsv.gz
zcat header_row.tsv.gz dose_matrix.tsv.gz | bgzip > ${vcf.simpleName}.dose.tsv.gz
tabix -s1 -b2 -e2 -S1 ${vcf.simpleName}.dose.tsv.gz
"""