DNA Methylation Analysis Pipeline with Snakemake

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

Use snakefile as main pipeline and defines paths for files and directives

Conda environment

Run pipeline in Conda specified env

Install the conda environment found under envs/rbio.yaml (conda has to be installed on the system):

conda env create -f envs/rbio.yaml

Activate env after installation

conda activate rbio

Start R terminal
Install 1) devtools + MatrixEQTL and 2) annotation template

# matrixEQTL
install.packages("devtools")
devtools::install_github("andreyshabalin/MatrixEQTL")
# annotation package
BiocManager::install("IlluminaHumanMethylationEPICanno.ilm10b2.hg19")

Configuration

You may change the thresholds and cut-offs at configs/workflow.json

maf : the minor allele frequency threshold

Converting impute2 genotypes to dosages.

post_maf : the minor allele frequency threshold to be applied after matrixEQTL has been run.

Useful to define different summaries.

cisDist : the maximal distance between SNP-CpGs pairs until which it counts as a 'cis-pairs'.

Cis-pairs will be reported in a separate file.

pvThresholdCis : The P value threshold to be applied for cis-meQTLs.

Only associations with P value < pvThresholdCis will be reported

pvThresholdTrans : The p-value threshold to be applied for trans-meQTLs.

Only associations with P value < pvThresholdTrans will be reported

significant_pv_cutoff : For final results files.

P value significant threshold e.g.: 1e-14

Execution

Execute snakemake in the project directory to run the full pipeline
Manually call the all_dosage_to_matrixEQTL and the all_methylation_to_matrixEQTL rules

Run snakemake with parallel task processing

./scripts/run_pipeline.sh <num-parallel-tasks>

Configure the script above to your VM capacity with: [scripts/run_pipeline.sh](scripts/run_pipeline.sh)

Covariates

meth ~ geno + sex + age + bmi + wbc + PCs1..20 + houseman estimates (except gran) + plate + batch

Code Snippets

shell:
	"""
	cat {input.snp_pos} > {output}
	"""

SnakeMake From line 38 of main/Snakefile

shell:
	"""
	zcat {input} | ./scripts/impute2dosage.py -m {params.maf} |\
	  sed -E 's/^[a-zA-Z0-9-]+\t/{wildcards.chr}\t/' |\
	  bgzip -c > {output}
	"""

SnakeMake From line 72 of main/Snakefile

script:
	"scripts/annotate_snps_maf.R"

SnakeMake From line 101 of main/Snakefile

script:
	"scripts/prepare_covariates.R"

SnakeMake From line 127 of main/Snakefile

script:
	"scripts/methylation_to_matrixEQTL.R"

SnakeMake From line 145 of main/Snakefile

shell:
	"""
	cat {input.cpg_pos} | grep -v "start" > {output}
	"""

SnakeMake From line 156 of main/Snakefile

	shell:
		"""
		echo "Extracting SNP ids."
		zcat {input.dosage_file} | cut -f 2 > {output.snp_ids}

		echo "Getting SNP locations"
		zcat {input.dosage_file} | cut -d "\t" -f 2,3 | \
                  awk '{{OFS="\t"}} {{print $1,"{wildcards.chr}",$2}}' >> \
		  {output.snp_positions}

		echo "Remove additional columns and order by given ordering, add SNP ids"
		key=$(cat {input.ordering} | tr '\n' ':' | sed 's/.$//')

		zcat {input.dosage_file} | cut -f 1,2,3,4,5 --complement | \
		  awk -v key="$key" -f scripts/reorder.awk | \
		  paste -d "\t" {output.snp_ids} - > {output.genotypes_matrixEQTL}

		echo "Splitting matrixEQTL input files into chunks"
		split -l {params.chunk_size} {output.genotypes_matrixEQTL} {params.geno_split_prefix}
		split -l {params.chunk_size} {output.snp_positions} {params.snp_pos_split_prefix}

		"""

SnakeMake From line 181 of main/Snakefile

script:
	"scripts/run_matrixEQTL.R"

SnakeMake From line 229 of main/Snakefile

	shell:
		"""
		# reset the matrixEQTL header, change 'gene' to 'cpg'

		header='SNP\tcpg\tbeta\tt-stat\tp-value'
                pattern={params.input_dir}{wildcards.chr}

		(echo -e ${{header}} ; awk '{{if($5+0 <= {params.pv_cutoff}) print;}}' ${{pattern}}_*_cis_associations.out) > {output.cis}
		(echo -e ${{header}} ; awk '{{if($5+0 <= {params.pv_cutoff}) print;}}' ${{pattern}}_*_trans_associations.out) > {output.trans}

		"""

SnakeMake From line 264 of main/Snakefile

shell:
	"""
	cat {input.cis} > {output.cis}
	cat {input.trans} > {output.trans}
	"""

SnakeMake From line 283 of main/Snakefile

shell:
	"""
	header="SNP\tcpg\tbeta\tt-stat\tp-value"

	(echo -e ${{header}} ; fgrep -w -f {input.st1} {input.cis} ) > {output.cis_in_st1}
	(echo -e ${{header}} ; fgrep -w -f {input.st1} {input.trans} ) > {output.trans_in_st1}
	(echo -e ${{header}} ; grep -h -v SNP {output.cis_in_st1} {output.trans_in_st1} ) > {output.combined}
	"""

SnakeMake From line 301 of main/Snakefile

shell:
	"""
	header="SNP\tcpg\tbeta\tt-stat\tp-value"

	(echo -e ${{header}} ; awk '{{ if($5+0 <= {params.pv_cutoff} ) print; }}' {input.cis} {input.trans} ) > {output.combined}

	"""

SnakeMake From line 322 of main/Snakefile

shell:
	"""
	cut -f 2 {input.common_probes} > {output.temp_probes}
	header="SNP\tcpg\beta\t-stat\p-value"
	( echo -e ${{header}} ; fgrep -v -w -f {output.temp_probes} {input.combined} ) > {output.epic_specific}
	"""

SnakeMake From line 337 of main/Snakefile

script:
	"scripts/create_full_result_table.R"

SnakeMake From line 368 of main/Snakefile

script:
	"scripts/extract_common_probes.R"

SnakeMake From line 381 of main/Snakefile

script:
	"scripts/compare_450k_to_EPIC.R"

SnakeMake From line 409 of main/Snakefile

shell:
	"""
	fgrep -w -h -f {input.snps} {params.geno_pattern} > {output.geno}
	fgrep -w -h -f {input.snps} {params.snp_pos_pattern} > {output.snp_pos}
	fgrep -w -h -f {input.cpgs} {params.meth_pattern} > {output.meth}
	fgrep -w -h -f {input.cpgs} {params.cpg_pos_pattern} > {output.cpg_pos}
	"""

SnakeMake From line 429 of main/Snakefile

script:
	"scripts/run_matrixEQTL.R"

SnakeMake From line 462 of main/Snakefile

	shell:
		"""
		 (head -n 1 {input.cis} ; \
                   fgrep -w -f {input.cis_pairs} {input.cis} ; \
                   fgrep -w -f {input.longrange_pairs} {input.longrange} ; \
                   fgrep -w -f {input.trans_pairs} {input.trans} ) > {output}
		"""

SnakeMake From line 479 of main/Snakefile

script:
        "scripts/create_full_result_table.R"

SnakeMake From line 512 of main/Snakefile

script:
	"scripts/create_effect_comparison_plot.R"

SnakeMake From line 533 of main/Snakefile

shell:
	"""
	zcat {input} | sort -k1,1 -k3,3n | bgzip > {output} 
	# index using tabix, 1: chr; 3: pos
	tabix -s 1 -b 3 -e 3 {output}
	"""

SnakeMake tabix From line 545 of main/Snakefile

script:
	"scripts/residualize_methylation.R"

SnakeMake From line 567 of main/Snakefile

script:
	"scripts/match_epic_with_sentinels.R"

SnakeMake From line 593 of main/Snakefile

	shell:
		"""
		fgrep -w -h -f {input.cosmo_pairs} {input.epic_associations_cis} {input.epic_associations_trans} > {output.matched}

		cut -f 1,2 {output.matched} > {output.matched_pairs}

                fgrep -w -v -f {output.matched_pairs} {input.cosmo_pairs} > {output.missing}
		"""