Bisulphite-sequencing based methylation pipeline


This is the home of the bisulphite-sequencing methylation pipeline, methyl-seek. Its long-term goals: to accurately call CpG sites within whole-genome and cell-free DNA fragments, to perform deconvolution of CpG calls within cell-free DNA fragments, and to boldly identify differentially methylated regions like no pipeline before!

Overview

Welcome to methyl-seek's documentation! This guide is the main source of documentation for users getting started with the methylation pipeline.

The ./methyl-seek pipeline is composed of several inter-related sub commands for setting up and running different types of analysis. Each of the available sub commands performs a different function (an example invocation sketch follows this list):

  • methyl-seek run: Identify CpG sites from whole-genome or cell-free DNA bisulphite sequencing data.

  • methyl-seek dmr: Determine differentially methylated regions populated with CpG sites.

  • methyl-seek dcv: Identify cells/tissue of origin for cell-free DNA.
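
For orientation, the sketch below shows how the sub commands might be chained together for a cell-free DNA project. The option names are illustrative assumptions only, not the documented command-line interface; consult the usage section of each sub command for the authoritative flags.

  # Hypothetical end-to-end invocation -- every flag below is an assumption
  # for illustration, not methyl-seek's documented interface.
  ./methyl-seek run --input rawdata/*.R?.fastq.gz --output results/ --genome hg38   # call CpG sites
  ./methyl-seek dmr --output results/ --groups groups.tsv                           # differentially methylated regions
  ./methyl-seek dcv --output results/ --atlas reference_atlas.csv                   # cfDNA deconvolution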

methyl-seek is a comprehensive bisulphite-sequencing based methylation pipeline. It relies on technologies like Singularity1 to maintain the highest level of reproducibility. The pipeline consists of a series of data processing and quality-control steps orchestrated by Snakemake2, a flexible and scalable workflow management system that submits jobs to a cluster.

The pipeline is compatible with data generated from Illumina short-read sequencing technologies. As input, it accepts a set of FastQ files and can be run locally on a compute instance or on-premises on a cluster. The user defines the method or mode of execution; when run on a cluster, the pipeline submits jobs through a job scheduler such as SLURM.
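
To make the two execution modes concrete: under the hood the workflow is a Snakemake pipeline, so they roughly correspond to the standard local and cluster invocations sketched below. methyl-seek wraps these details for you; the sketch only illustrates the idea, and the resource values are placeholders.

  # Minimal sketch of local vs. cluster execution, assuming Snakemake and
  # Singularity are available on the host; values are placeholders.
  snakemake --use-singularity --cores 16                          # local execution
  snakemake --use-singularity --jobs 100 \
      --cluster "sbatch --cpus-per-task={threads} --mem=32g"      # dispatch jobs via SLURM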

Before getting started, we highly recommend reading through the usage section of each available sub command.

For more information about issues or troubleshooting a problem, please check out our FAQ prior to opening an issue on GitHub.

Contribute

This site is a living document, created for and by members like you. methyl-seek is maintained by the members of OpenOmics and is improved by continuous feedback! We encourage you to contribute new content and make improvements to existing content via pull requests to our GitHub repository.

References

1. Kurtzer GM, Sochat V, Bauer MW (2017). Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459.
2. Koster J, Rahmann S (2018). Snakemake - a scalable bioinformatics workflow engine. Bioinformatics 34(20): 3600.

Code Snippets

Stage the input FastQ files by symlinking them into the pipeline working directory (main/Snakefile, line 194):
shell:
  """
  mkdir -p {params.dir}
  ln -s {input} {output}
  """

Run FastQC quality control on the raw reads (main/Snakefile, line 212):
shell:
  """
  module load fastqc/0.11.9
  mkdir -p {params.dir}
  fastqc -o {params.dir} -f fastq --threads {threads} --extract {input}
  """

Adapter- and quality-trim the paired-end reads with Trim Galore, run from a Singularity container, and FastQC the trimmed output (main/Snakefile, line 238):
shell:
  """
  module load singularity
  mkdir -p {params.dir}
  singularity exec -B {params.workdir} /data/OpenOmics/SIFs/trimgalore_v0.1.0.sif trim_galore --paired --cores {threads} {params.command} --basename {params.tag} --output_dir {params.dir} --fastqc_args "--outdir {params.fastqcdir}"  {input.F1} {input.F2}
  """

Detect the Phred quality-score encoding and estimate the insert-size distribution by overlapping read pairs with BBMerge (main/Snakefile, line 255):
shell: """
  # Get encoding of Phred Quality Scores
  module load python
  encoding=$(python {params.script_dir}/phred_encoding.py {input.R1})
  echo "Detected Phred+${{encoding}} ASCII encoding"

  module load bbtools/38.87
  bbtools bbmerge-auto in1={input.R1} in2={input.R2} qin=${{encoding}} \
  ihist={output} k=62 extend2=200 rem ecct -Xmx900G
      """

Screen the reads against reference and contaminant databases with FastQ Screen, using two configuration files (main/Snakefile, line 286):
shell:
  """
  module load fastq_screen/0.14.1
  module load bowtie/2-2.3.4
  module load perl/5.24.3
  fastq_screen --conf {params.fastq_screen_config} --outdir {params.outdir} \
    --threads {threads} --subset 1000000 \
    --aligner bowtie2 --force {input.file1} {input.file2}
  fastq_screen --conf {params.fastq_screen_config2} --outdir {params.outdir2} \
    --threads {threads} --subset 1000000 \
    --aligner bowtie2 --force {input.file1} {input.file2}
  """

Taxonomically classify read pairs with Kraken2, using a database staged to local scratch, and summarise the classifications in a Krona report (main/Snakefile, line 314):
shell:
  """
  module load kraken
  module load kronatools/2.7
  mkdir -p {params.dir}
  cd /lscratch/$SLURM_JOBID;
  cp -rv {params.bacdb} /lscratch/$SLURM_JOBID/;

  kdb_base=$(basename {params.bacdb})
  kraken2 --db /lscratch/$SLURM_JOBID/${{kdb_base}} \
    --threads {threads} --report {output.krakentaxa} \
    --output {output.krakenout} \
    --gzip-compressed \
    --paired {input.fq1} {input.fq2}
  # Generate Krona Report
  cut -f2,3 {output.krakenout} | ktImportTaxonomy - -o {output.kronahtml}

  #kdb_base=$(basename {params.bacdb})
  #kraken --db /lscratch/$SLURM_JOBID/`echo {params.bacdb}|awk -F "/" '{{print \$NF}}'` --fastq-input --gzip-compressed --threads {threads} --output /lscratch/$SLURM_JOBID/{params.prefix}.krakenout --preload--paired {input.fq1} {input.fq2}
  #kraken-translate --mpa-format --db /lscratch/$SLURM_JOBID/`echo {params.bacdb}|awk -F "/" '{{print \$NF}}'` /lscratch/$SLURM_JOBID/{params.prefix}.krakenout |cut -f2|sort|uniq -c|sort -k1,1nr > /lscratch/$SLURM_JOBID/{params.prefix}.krakentaxa
  #cut -f 2,3 /lscratch/$SLURM_JOBID/{params.prefix}.krakenout | ktImportTaxonomy - -o /lscratch/$SLURM_JOBID/{params.prefix}.kronahtml
  """

Build the bisulphite-converted Bismark index of the reference genome (main/Snakefile, line 355):
shell:
  """
  module load bismark/0.23.0
  mkdir -p {params.dir}
  cd {params.dir}
  #cp reference_genome {params.dir}/genome.fa
  bismark_genome_preparation --verbose --parallel {threads} {params.dir} #--single_fasta
  """

Align read pairs to the bisulphite-converted genome with Bismark and collect samtools flagstat metrics (main/Snakefile, line 386):
shell:
  """
  module load bismark/0.23.0 samtools
  mkdir -p {params.dir}
  bismark --multicore {threads} --temp_dir /lscratch/$SLURM_JOBID/ {params.command} --output_dir {params.dir} --genome {params.genome_dir} -1 {input.F1} -2 {input.F2}
  mv {params.outbam} {output.B1}
  mv {params.R1} {params.R2}
  samtools flagstat -@ {threads} {output.B1} > {output.B2}
  """

Name-sort the alignments, remove duplicates with deduplicate_bismark, convert the deduplicated BAM to CRAM, and collect flagstat metrics (main/Snakefile, line 411):
shell:
  """
  module load bismark/0.23.0
  module load samtools/1.15
  cd {params.dir}
  samtools view -hb {input.F1} | samtools sort -n -@ {threads} -O BAM -o {output.T1}
  deduplicate_bismark --paired --bam --outfile {params.prefix} {output.T1}
  samtools view -T {params.genome} -@ 16 -h -C {output.B1} > {output.C1}
  samtools flagstat -@ {threads} {output.B1} > {output.B2}
  """

Extract per-cytosine methylation calls, bedGraph output, and a genome-wide cytosine report with bismark_methylation_extractor (main/Snakefile, line 433):
shell:
  """
  mkdir -p {params.outdir}
  module load bismark samtools bowtie
  bismark_methylation_extractor --paired-end --no_overlap --multicore 8 --gzip --report --bedGraph --counts --buffer_size 100G --no_header --cytosine_report --output {params.outdir} --scaffolds --genome_folder {params.bismark_index} {input.bam}
  """

Copy the reference FASTA into place and build a Bismark bisulphite index for it (main/Snakefile, line 452):
shell:
  """
  module load bismark/0.23.0
  mkdir -p {params.dir}
  cd {params.dir}
  cp reference_genome {params.dir}/genome.fa
  bismark_genome_preparation --verbose --parallel {threads} {params.dir} #--single_fasta
  """

Align read pairs against that index with Bismark (main/Snakefile, line 475):
shell:
  """
  module load bismark/0.23.0
  mkdir -p {params.dir}
  bismark --multicore {threads} --temp_dir /lscratch/$SLURM_JOBID/ {params.command} --output_dir {params.dir} --genome {params.genome_dir} -1 {input.F1} -2 {input.F2}
  """

Assess strandedness and read distribution over genomic features with RSeQC (main/Snakefile, line 493):
shell:
  """
  module load rseqc/4.0.0
  mkdir -p {params.dir}
  infer_experiment.py -r {params.bedref} -i {input.file1} -s 1000000 > {output.out1}
  read_distribution.py -i {input.file1} -r {params.bedref} > {output.out2}
  """

Estimate the inner distance between read pairs with RSeQC (main/Snakefile, line 511):
shell:
  """
  module load rseqc/4.0.0
  mkdir -p {params.dir}
  inner_distance.py -i {input.bam} -r {params.genemodel} -k 10000000 -o {params.prefix}
  """

Collect Picard CollectRnaSeqMetrics and samtools flagstat metrics (main/Snakefile, line 530):
shell:
  """
  module load python/3.8 samtools/1.15 picard/2.26.9
  java -Xmx110g -jar ${{PICARDJARPATH}}/picard.jar CollectRnaSeqMetrics REF_FLAT={params.refflat} I={input.file1} O={output.outstar1} RIBOSOMAL_INTERVALS={params.rrnalist} STRAND_SPECIFICITY=SECOND_READ_TRANSCRIPTION_STRAND TMP_DIR=/lscratch/$SLURM_JOBID  VALIDATION_STRINGENCY=SILENT;
  sed -i 's/CollectRnaSeqMetrics/picard.analysis.CollectRnaSeqMetrics/g' {output.outstar1}
  samtools flagstat {input.file1} > {output.outstar2};
  ## python3 {params.statscript} {input.file1} >> {output.outstar2} ## does require "Rstat" not available in biowulf
  """

Aggregate the Bismark reports with bismark2report and bismark2summary, then build an interactive MultiQC report (main/Snakefile, line 555):
shell:
  """
  module load multiqc/1.9 bismark
  cd {params.bis_dir}
  bismark2report
  bismark2summary
  mv bismark_summary_report.html {params.dir}/
  mv bismark_summary_report.txt {params.dir}/
  cd {params.dir}
  multiqc --ignore '*/.singularity/*' -f --interactive .
  """

Run get_methy.R on each sample's methylation calls with the configured cutoff (main/Snakefile, line 582):
shell:
  """
  mkdir -p {params.dir1}
  mkdir -p {params.dir2}
  module load R
  Rscript {params.script_dir}/get_methy.R {input} {wildcards.samples} {params.cutoff} {output}
  """

Map the methylation calls onto CpG IDs using a reference mapping table, keeping only the configured chromosomes (main/Snakefile, line 598):
run:
  df_ref=pd.read_csv(params.map_table,sep='\t',header=None)
  df_ref.columns=['chromosome','start','end','cgid']
  df_ref=df_ref.loc[(df_ref['chromosome'].isin(CHRS)),]
  dfm=pd.read_csv(input[0])
  dfm=pd.merge(df_ref,dfm,on=['chromosome','start','end'],how='inner')
  dfm=dfm.drop(labels=['chromosome','start','end'],axis=1)
  dfm=dfm.set_index('cgid')
  dfm.to_csv(output[0])

Deconvolve cell/tissue contributions with deconvolve2.py against the reference atlas (main/Snakefile, line 618):
shell:
  """
  module load python
  cd {params.dir}
  python {params.script_dir}/deconvolve2.py --atlas_path {params.ref} --residuals {input}  > {output}  2>&1
  """

Merge the per-sample CpG tables into a single matrix keyed on cgid (main/Snakefile, line 632):
run:
  dfm=pd.read_csv(input[0])
  for f in input[1:]:
  	df=pd.read_csv(f)
  	dfm=pd.merge(dfm,df,on='cgid',how='outer')
  dfm.to_csv(output[0],index=False)

Run deconvolve2.py on the merged matrix against the reference atlas (main/Snakefile, line 650):
shell:
  """
  module load python
  cd {params.dir}
  python {params.script_dir}/deconvolve2.py --atlas_path {params.ref} --residuals {input}
  """

Strip the header from the bedGraph and coordinate-sort it with bedtools (main/Snakefile, line 666):
shell:
  """
  module load bedtools
  zcat {input.bed} | tail -n +2 | bedtools sort -i - > {output.bed}
  """

Lift the sorted bedGraph over to the target genome build with CrossMap (main/Snakefile, line 680):
shell:
  """
  module load bedtools crossmap
  crossmap bed {params.lift_file} {input.bed} {output.graph}
  """

Intersect the lifted positions with the reference marker regions, keeping single-base overlaps (main/Snakefile, line 694):
shell:
  """
  module load bedtools
  bedtools sort -i {input.graph} | bedtools intersect -wo -a {params.markers} -b stdin -sorted | awk '$6-$5==1 {{print $0}}' | awk 'NF{{NF-=1}};1' > {output.sort}
  """

Aggregate methylation over the marker regions with aggregate_over_regions.R (main/Snakefile, line 708):
shell:
  """
  module load R
  Rscript {params.script_dir}/aggregate_over_regions.R {input.sort} {output.tsv}
  """

Estimate tissues of origin against the reference markers with tissues_of_origin_v2.R (main/Snakefile, line 725):
shell:
  """
  module load R
  Rscript {params.script_dir}/tissues_of_origin_v2.R \
  {input.bed}  \
  {params.reference_markers} \
  {output.tsv} \
  {params.reference_IDs} \
  {params.sampleName} \
  TRUE \
  FALSE \
  TRUE \
  colon_5/hsc_5/hsc_2/spleen_1/spleen_2/spleen_3/spleen_4/
  """

Run bsseq_lm.R per chromosome on the methylation data (main/Snakefile, line 758):
shell:
  """
  module load R
  mkdir -p {params.dir}
  Rscript {params.script_dir}/bsseq_lm.R {params.chr} {input.bizfile} {output.bed1} {params.sample_prop} {params.cov}
  """

Call differentially methylated regions with comb-p and annotate them with HOMER's annotatePeaks.pl (main/Snakefile, line 782):
shell:
  """
  mkdir -p {params.dir}
  module load combp bedtools
  comb-p pipeline -c 4 --dist {params.dist} --step {params.step} --seed 0.01 -p {params.dir}/{params.groups} --region-filter-p 0.05 --region-filter-n 3 {input}

  module load homer
  mkdir -p {params.homer_dir}
  cp {params.annStat} {output.homerAnn}
  cat {output.RP} | sed "1,1d" | awk '{{print$1"\t"$2"\t"$3"\t"$1"_"$2"_"$3"\t"$4"|"$5"|"$6"|"$7"\t""*"}}' > {output.homerInput}
  annotatePeaks.pl {output.homerInput} hg38 -annStats {output.homerAnn} > {output.homerOutput}
  awk 'NR==FNR{{a[$4]=$5;next}}NR!=FNR{{c=$1; if(c in a){{print $0"\t"a[c]}}}}' {output.homerInput} {output.homerOutput} > {output.homerOutput2}
  """

Generate Miami and violin plots of the DMR results with mvp_plots.R (main/Snakefile, line 811):
shell:
  """
  mkdir -p {params.dir}
  module load R/4.1
  Rscript {params.script_dir}/mvp_plots.R {input.dir} {params.fdr_cutoff} {params.highlight} {output.miami} {output.violin}
  ## {output.pie_cpg} {output.pie_genic}
  """

Concatenate the comb-p region files and generate Manhattan and GREAT outputs with combp_Manhatten.R and combp_GREAT.R (main/Snakefile, line 833):
shell:
  """
  module load R
  mkdir -p {params.dir}
  zcat {params.prefix}*.regions-p.bed.gz | grep -v "^#" > {output.p1}
  Rscript {params.script_dir}/combp_Manhatten.R {output.p1} {output.f3}
  Rscript {params.script_dir}/combp_GREAT.R {output.p1} {output.f1} {output.o1} {output.o2}
  """

List the per-chromosome betas/p-value files for each group and generate a Manhattan plot with bsseq_Manhatten.R (main/Snakefile, line 852):
shell:
  """
  mkdir -p {params.dir}
  ls {params.dir}/{wildcards.group}_*_betas_pval.bed > {output.LIST}
  module load R
  Rscript {params.script_dir}/bsseq_Manhatten.R {output.LIST} {output.MAN}
  """

Name: methyl-seek
Version: 1
URL: https://openomics.github.io/methyl-seek/
License: MIT License