Bisulphite-sequencing based methylation pipeline


This is the home of the bisulphite-sequencing methylation pipeline, methyl-seek. Its long-term goals: to accurately call CpG sites within whole-genome and cell-free DNA fragments, to perform deconvolution of CpG calls within cell-free DNA fragments, and to boldly identify differentially methylated regions like no pipeline before!

Overview

Welcome to methyl-seek's documentation! This guide is the main source of documentation for users getting started with the methylation pipeline.

The ./methyl-seek pipeline is composed of several inter-related sub commands for setting up and running different types of analysis. Each of the available sub commands performs a different function (an example invocation sketch follows this list):

  • methyl-seek run: Identify CpG sites from whole-genome or cell-free DNA bisulphite sequencing data.

  • methyl-seek dmr: Determine differentially methylated regions populated with CpG sites.

  • methyl-seek dcv: Identify cells/tissue of origin for cell-free DNA.
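
For orientation, the sketch below shows how the sub commands might be chained together for a cell-free DNA project. The option names are illustrative assumptions only, not the documented command-line interface; consult the usage section of each sub command for the authoritative flags.

  # Hypothetical end-to-end invocation -- every flag below is an assumption
  # for illustration, not methyl-seek's documented interface.
  ./methyl-seek run --input rawdata/*.R?.fastq.gz --output results/ --genome hg38   # call CpG sites
  ./methyl-seek dmr --output results/ --groups groups.tsv                           # differentially methylated regions
  ./methyl-seek dcv --output results/ --atlas reference_atlas.csv                   # cfDNA deconvolution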

methyl-seek is a comprehensive bisulphite-sequencing based methylation pipeline. It relies on technologies like Singularity1 to maintain the highest level of reproducibility. The pipeline consists of a series of data processing and quality-control steps orchestrated by Snakemake2, a flexible and scalable workflow management system that submits jobs to a cluster.

The pipeline is compatible with data generated from Illumina short-read sequencing technologies. As input, it accepts a set of FastQ files and can be run locally on a compute instance or on-premises on a cluster. The user defines the method or mode of execution; when run on a cluster, the pipeline submits jobs through a job scheduler such as SLURM.
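
To make the two execution modes concrete: under the hood the workflow is a Snakemake pipeline, so they roughly correspond to the standard local and cluster invocations sketched below. methyl-seek wraps these details for you; the sketch only illustrates the idea, and the resource values are placeholders.

  # Minimal sketch of local vs. cluster execution, assuming Snakemake and
  # Singularity are available on the host; values are placeholders.
  snakemake --use-singularity --cores 16                          # local execution
  snakemake --use-singularity --jobs 100 \
      --cluster "sbatch --cpus-per-task={threads} --mem=32g"      # dispatch jobs via SLURM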

Before getting started, we highly recommend reading through the usage section of each available sub command.

For more information about issues or troubleshooting a problem, please check out our FAQ prior to opening an issue on GitHub.

Contribute

This site is a living document, created for and by members like you. methyl-seek is maintained by the members of OpenOmics and is improved by continuous feedback! We encourage you to contribute new content and make improvements to existing content via pull requests to our GitHub repository.

References

1. Kurtzer GM, Sochat V, Bauer MW (2017). Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459.
2. Koster J, Rahmann S (2018). Snakemake - a scalable bioinformatics workflow engine. Bioinformatics 34(20): 3600.

Code Snippets

Stage the input FastQ files by symlinking them into the pipeline working directory (main/Snakefile, line 194):
shell:
  """
  mkdir -p {params.dir}
  ln -s {input} {output}
  """

Run FastQC quality control on the raw reads (main/Snakefile, line 212):
shell:
  """
  module load fastqc/0.11.9
  mkdir -p {params.dir}
  fastqc -o {params.dir} -f fastq --threads {threads} --extract {input}
  """

Adapter- and quality-trim the paired-end reads with Trim Galore, run from a Singularity container, and FastQC the trimmed output (main/Snakefile, line 238):
shell:
  """
  module load singularity
  mkdir -p {params.dir}
  singularity exec -B {params.workdir} /data/OpenOmics/SIFs/trimgalore_v0.1.0.sif trim_galore --paired --cores {threads} {params.command} --basename {params.tag} --output_dir {params.dir} --fastqc_args "--outdir {params.fastqcdir}"  {input.F1} {input.F2}
  """

Detect the Phred quality-score encoding and estimate the insert-size distribution by overlapping read pairs with BBMerge (main/Snakefile, line 255):
shell: """
  # Get encoding of Phred Quality Scores
  module load python
  encoding=$(python {params.script_dir}/phred_encoding.py {input.R1})
  echo "Detected Phred+${{encoding}} ASCII encoding"

  module load bbtools/38.87
  bbtools bbmerge-auto in1={input.R1} in2={input.R2} qin=${{encoding}} \
  ihist={output} k=62 extend2=200 rem ecct -Xmx900G
      """

Screen the reads against reference and contaminant databases with FastQ Screen, using two configuration files (main/Snakefile, line 286):
shell:
  """
  module load fastq_screen/0.14.1
  module load bowtie/2-2.3.4
  module load perl/5.24.3
  fastq_screen --conf {params.fastq_screen_config} --outdir {params.outdir} \
    --threads {threads} --subset 1000000 \
    --aligner bowtie2 --force {input.file1} {input.file2}
  fastq_screen --conf {params.fastq_screen_config2} --outdir {params.outdir2} \
    --threads {threads} --subset 1000000 \
    --aligner bowtie2 --force {input.file1} {input.file2}
  """

Taxonomically classify read pairs with Kraken2, using a database staged to local scratch, and summarise the classifications in a Krona report (main/Snakefile, line 314):
shell:
  """
  module load kraken
  module load kronatools/2.7
  mkdir -p {params.dir}
  cd /lscratch/$SLURM_JOBID;
  cp -rv {params.bacdb} /lscratch/$SLURM_JOBID/;

  kdb_base=$(basename {params.bacdb})
  kraken2 --db /lscratch/$SLURM_JOBID/${{kdb_base}} \
    --threads {threads} --report {output.krakentaxa} \
    --output {output.krakenout} \
    --gzip-compressed \
    --paired {input.fq1} {input.fq2}
  # Generate Krona Report
  cut -f2,3 {output.krakenout} | ktImportTaxonomy - -o {output.kronahtml}

  #kdb_base=$(basename {params.bacdb})
  #kraken --db /lscratch/$SLURM_JOBID/`echo {params.bacdb}|awk -F "/" '{{print \$NF}}'` --fastq-input --gzip-compressed --threads {threads} --output /lscratch/$SLURM_JOBID/{params.prefix}.krakenout --preload--paired {input.fq1} {input.fq2}
  #kraken-translate --mpa-format --db /lscratch/$SLURM_JOBID/`echo {params.bacdb}|awk -F "/" '{{print \$NF}}'` /lscratch/$SLURM_JOBID/{params.prefix}.krakenout |cut -f2|sort|uniq -c|sort -k1,1nr > /lscratch/$SLURM_JOBID/{params.prefix}.krakentaxa
  #cut -f 2,3 /lscratch/$SLURM_JOBID/{params.prefix}.krakenout | ktImportTaxonomy - -o /lscratch/$SLURM_JOBID/{params.prefix}.kronahtml
  """

Build the bisulphite-converted Bismark index of the reference genome (main/Snakefile, line 355):
shell:
  """
  module load bismark/0.23.0
  mkdir -p {params.dir}
  cd {params.dir}
  #cp reference_genome {params.dir}/genome.fa
  bismark_genome_preparation --verbose --parallel {threads} {params.dir} #--single_fasta
  """

Align read pairs to the bisulphite-converted genome with Bismark and collect samtools flagstat metrics (main/Snakefile, line 386):
shell:
  """
  module load bismark/0.23.0 samtools
  mkdir -p {params.dir}
  bismark --multicore {threads} --temp_dir /lscratch/$SLURM_JOBID/ {params.command} --output_dir {params.dir} --genome {params.genome_dir} -1 {input.F1} -2 {input.F2}
  mv {params.outbam} {output.B1}
  mv {params.R1} {params.R2}
  samtools flagstat -@ {threads} {output.B1} > {output.B2}
  """

Name-sort the alignments, remove duplicates with deduplicate_bismark, convert the deduplicated BAM to CRAM, and collect flagstat metrics (main/Snakefile, line 411):
shell:
  """
  module load bismark/0.23.0
  module load samtools/1.15
  cd {params.dir}
  samtools view -hb {input.F1} | samtools sort -n -@ {threads} -O BAM -o {output.T1}
  deduplicate_bismark --paired --bam --outfile {params.prefix} {output.T1}
  samtools view -T {params.genome} -@ 16 -h -C {output.B1} > {output.C1}
  samtools flagstat -@ {threads} {output.B1} > {output.B2}
  """

Extract per-cytosine methylation calls, bedGraph output, and a genome-wide cytosine report with bismark_methylation_extractor (main/Snakefile, line 433):
shell:
  """
  mkdir -p {params.outdir}
  module load bismark samtools bowtie
  bismark_methylation_extractor --paired-end --no_overlap --multicore 8 --gzip --report --bedGraph --counts --buffer_size 100G --no_header --cytosine_report --output {params.outdir} --scaffolds --genome_folder {params.bismark_index} {input.bam}
  """

Copy the reference FASTA into place and build a Bismark bisulphite index for it (main/Snakefile, line 452):
shell:
  """
  module load bismark/0.23.0
  mkdir -p {params.dir}
  cd {params.dir}
  cp reference_genome {params.dir}/genome.fa
  bismark_genome_preparation --verbose --parallel {threads} {params.dir} #--single_fasta
  """

Align read pairs against that index with Bismark (main/Snakefile, line 475):
shell:
  """
  module load bismark/0.23.0
  mkdir -p {params.dir}
  bismark --multicore {threads} --temp_dir /lscratch/$SLURM_JOBID/ {params.command} --output_dir {params.dir} --genome {params.genome_dir} -1 {input.F1} -2 {input.F2}
  """

Assess strandedness and read distribution over genomic features with RSeQC (main/Snakefile, line 493):
shell:
  """
  module load rseqc/4.0.0
  mkdir -p {params.dir}
  infer_experiment.py -r {params.bedref} -i {input.file1} -s 1000000 > {output.out1}
  read_distribution.py -i {input.file1} -r {params.bedref} > {output.out2}
  """

Estimate the inner distance between read pairs with RSeQC (main/Snakefile, line 511):
shell:
  """
  module load rseqc/4.0.0
  mkdir -p {params.dir}
  inner_distance.py -i {input.bam} -r {params.genemodel} -k 10000000 -o {params.prefix}
  """

Collect Picard CollectRnaSeqMetrics and samtools flagstat metrics (main/Snakefile, line 530):
shell:
  """
  module load python/3.8 samtools/1.15 picard/2.26.9
  java -Xmx110g -jar ${{PICARDJARPATH}}/picard.jar CollectRnaSeqMetrics REF_FLAT={params.refflat} I={input.file1} O={output.outstar1} RIBOSOMAL_INTERVALS={params.rrnalist} STRAND_SPECIFICITY=SECOND_READ_TRANSCRIPTION_STRAND TMP_DIR=/lscratch/$SLURM_JOBID  VALIDATION_STRINGENCY=SILENT;
  sed -i 's/CollectRnaSeqMetrics/picard.analysis.CollectRnaSeqMetrics/g' {output.outstar1}
  samtools flagstat {input.file1} > {output.outstar2};
  ## python3 {params.statscript} {input.file1} >> {output.outstar2} ## does require "Rstat" not available in biowulf
  """

Aggregate the Bismark reports with bismark2report and bismark2summary, then build an interactive MultiQC report (main/Snakefile, line 555):
shell:
  """
  module load multiqc/1.9 bismark
  cd {params.bis_dir}
  bismark2report
  bismark2summary
  mv bismark_summary_report.html {params.dir}/
  mv bismark_summary_report.txt {params.dir}/
  cd {params.dir}
  multiqc --ignore '*/.singularity/*' -f --interactive .
  """

Run get_methy.R on each sample's methylation calls with the configured cutoff (main/Snakefile, line 582):
shell:
  """
  mkdir -p {params.dir1}
  mkdir -p {params.dir2}
  module load R
  Rscript {params.script_dir}/get_methy.R {input} {wildcards.samples} {params.cutoff} {output}
  """

Map the methylation calls onto CpG IDs using a reference mapping table, keeping only the configured chromosomes (main/Snakefile, line 598):
run:
  df_ref=pd.read_csv(params.map_table,sep='\t',header=None)
  df_ref.columns=['chromosome','start','end','cgid']
  df_ref=df_ref.loc[(df_ref['chromosome'].isin(CHRS)),]
  dfm=pd.read_csv(input[0])
  dfm=pd.merge(df_ref,dfm,on=['chromosome','start','end'],how='inner')
  dfm=dfm.drop(labels=['chromosome','start','end'],axis=1)
  dfm=dfm.set_index('cgid')
  dfm.to_csv(output[0])

Deconvolve cell/tissue contributions with deconvolve2.py against the reference atlas (main/Snakefile, line 618):
shell:
  """
  module load python
  cd {params.dir}
  python {params.script_dir}/deconvolve2.py --atlas_path {params.ref} --residuals {input}  > {output}  2>&1
  """

Merge the per-sample CpG tables into a single matrix keyed on cgid (main/Snakefile, line 632):
run:
  dfm=pd.read_csv(input[0])
  for f in input[1:]:
  	df=pd.read_csv(f)
  	dfm=pd.merge(dfm,df,on='cgid',how='outer')
  dfm.to_csv(output[0],index=False)

Run deconvolve2.py on the merged matrix against the reference atlas (main/Snakefile, line 650):
shell:
  """
  module load python
  cd {params.dir}
  python {params.script_dir}/deconvolve2.py --atlas_path {params.ref} --residuals {input}
  """

Strip the header from the bedGraph and coordinate-sort it with bedtools (main/Snakefile, line 666):
shell:
  """
  module load bedtools
  zcat {input.bed} | tail -n +2 | bedtools sort -i - > {output.bed}
  """

Lift the sorted bedGraph over to the target genome build with CrossMap (main/Snakefile, line 680):
shell:
  """
  module load bedtools crossmap
  crossmap bed {params.lift_file} {input.bed} {output.graph}
  """

Intersect the lifted positions with the reference marker regions, keeping single-base overlaps (main/Snakefile, line 694):
shell:
  """
  module load bedtools
  bedtools sort -i {input.graph} | bedtools intersect -wo -a {params.markers} -b stdin -sorted | awk '$6-$5==1 {{print $0}}' | awk 'NF{{NF-=1}};1' > {output.sort}
  """

Aggregate methylation over the marker regions with aggregate_over_regions.R (main/Snakefile, line 708):
shell:
  """
  module load R
  Rscript {params.script_dir}/aggregate_over_regions.R {input.sort} {output.tsv}
  """

Estimate tissues of origin against the reference markers with tissues_of_origin_v2.R (main/Snakefile, line 725):
shell:
  """
  module load R
  Rscript {params.script_dir}/tissues_of_origin_v2.R \
  {input.bed}  \
  {params.reference_markers} \
  {output.tsv} \
  {params.reference_IDs} \
  {params.sampleName} \
  TRUE \
  FALSE \
  TRUE \
  colon_5/hsc_5/hsc_2/spleen_1/spleen_2/spleen_3/spleen_4/
  """

Run bsseq_lm.R per chromosome on the methylation data (main/Snakefile, line 758):
shell:
  """
  module load R
  mkdir -p {params.dir}
  Rscript {params.script_dir}/bsseq_lm.R {params.chr} {input.bizfile} {output.bed1} {params.sample_prop} {params.cov}
  """

Call differentially methylated regions with comb-p and annotate them with HOMER's annotatePeaks.pl (main/Snakefile, line 782):
shell:
  """
  mkdir -p {params.dir}
  module load combp bedtools
  comb-p pipeline -c 4 --dist {params.dist} --step {params.step} --seed 0.01 -p {params.dir}/{params.groups} --region-filter-p 0.05 --region-filter-n 3 {input}

  module load homer
  mkdir -p {params.homer_dir}
  cp {params.annStat} {output.homerAnn}
  cat {output.RP} | sed "1,1d" | awk '{{print$1"\t"$2"\t"$3"\t"$1"_"$2"_"$3"\t"$4"|"$5"|"$6"|"$7"\t""*"}}' > {output.homerInput}
  annotatePeaks.pl {output.homerInput} hg38 -annStats {output.homerAnn} > {output.homerOutput}
  awk 'NR==FNR{{a[$4]=$5;next}}NR!=FNR{{c=$1; if(c in a){{print $0"\t"a[c]}}}}' {output.homerInput} {output.homerOutput} > {output.homerOutput2}
  """

Generate Miami and violin plots of the DMR results with mvp_plots.R (main/Snakefile, line 811):
shell:
  """
  mkdir -p {params.dir}
  module load R/4.1
  Rscript {params.script_dir}/mvp_plots.R {input.dir} {params.fdr_cutoff} {params.highlight} {output.miami} {output.violin}
  ## {output.pie_cpg} {output.pie_genic}
  """

Concatenate the comb-p region files and generate Manhattan and GREAT outputs with combp_Manhatten.R and combp_GREAT.R (main/Snakefile, line 833):
shell:
  """
  module load R
  mkdir -p {params.dir}
  zcat {params.prefix}*.regions-p.bed.gz | grep -v "^#" > {output.p1}
  Rscript {params.script_dir}/combp_Manhatten.R {output.p1} {output.f3}
  Rscript {params.script_dir}/combp_GREAT.R {output.p1} {output.f1} {output.o1} {output.o2}
  """

List the per-chromosome betas/p-value files for each group and generate a Manhattan plot with bsseq_Manhatten.R (main/Snakefile, line 852):
shell:
  """
  mkdir -p {params.dir}
  ls {params.dir}/{wildcards.group}_*_betas_pval.bed > {output.LIST}
  module load R
  Rscript {params.script_dir}/bsseq_Manhatten.R {output.LIST} {output.MAN}
  """

Name: methyl-seek
Version: 1
URL: https://openomics.github.io/methyl-seek/
License: MIT License