Snakemake workflow: biseps


This is a Snakemake pipeline for bisulfite sequencing data. It implements:

  1. Adapter trimming and quality checks

  2. Quality reports and statistics (FastQC + MultiQC)

  3. Methylation extraction with Bismark (Bowtie2 or HISAT2 as aligner)

  4. DMR identification with methylKit (in all contexts)

Authors

  • Skander Hatira (@skanderhatira)

Usage

If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).

Step 1: Obtain a copy of this workflow

Clone the repository to your local system, into the place where you want to perform the data analysis.

git clone https://forgemia.inra.fr/irhs-bioinfo/biseps.git

Step 2: Configure workflow

Configure the workflow according to your needs by editing the files in the config/ folder. Adjust config.yaml to configure the workflow execution, and samples.tsv and units.tsv to specify your sample setup.
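A minimal sketch of this step, assuming the three files sit directly under config/ as named above and using whatever text editor you prefer:

${EDITOR:-nano} config/config.yaml config/samples.tsv config/units.tsv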

Step 3: Install Snakemake

Install Snakemake using conda:

conda create -c bioconda -c conda-forge -n snakemake snakemake

For installation details, see the instructions in the Snakemake documentation.

Step 4: Execute workflow

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via:

snakemake --profile config/profiles/local -n
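Once the dry-run looks correct, dropping the -n flag launches the actual run with the same profile (a sketch; the local profile is assumed to define cores and other resources):

snakemake --profile config/profiles/local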

See the Snakemake documentation for further details.

Step 5: Investigate results

After successful execution, you can create a self-contained interactive .html report with all results via:

snakemake --report report.html

Build Docker image

A Docker image of this workflow can be built from the repository by running this command:

docker build -t biseps --build-arg USER_ID=$(id -u) --build-arg GROUP_ID=$(id -g) --build-arg USERNAME=$USER .

To run this container with your data, you need to bind volumes specifying the raw data, configuration files, and any necessary resources.
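A hedged sketch of such an invocation, with hypothetical host-side paths for the raw data and configuration; the container-side destinations assume /biseps is the image's working directory, as in the testing example below:

docker run \
  --mount type=bind,src="$(pwd)/data",dst=/biseps/data,readonly \
  --mount type=bind,src="$(pwd)/config",dst=/biseps/config \
  biseps --profile config/profiles/local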

Testing

The .test directory contains subsampled .fastq files for two samples (multi-lane + biological replicates) and a .fasta file containing a genome sequence from NCBI. It also contains CX reports from DMRcaller to test the DMR identification pipeline.

To test the pipeline, you need conda available on your machine's $PATH:

Testing alignment:

  • You can either use the preconfigured .yaml profile:

    snakemake --profile config/profiles/test

  • Or pass the command-line arguments directly:

    snakemake --cores $N --use-conda --configfile .test/config/config.yaml

Testing differential methylation extraction:

  • You can either use the preconfigured .yaml profile:

    snakemake --profile config/profiles/testComparison

  • Or pass the command-line arguments directly:

    snakemake --cores $N --use-conda --configfile .test/comparison/config.yaml

Alternatively, on a Docker-enabled machine, build and run the image with a mounted folder containing the necessary data and configuration files pointing to that data:

docker run --mount type=bind,src="$(pwd)/.test",dst=/biseps/.test,readonly biseps \
--profile config/profiles/test

Note that your output data won't be accessible, as it isn't mounted or stored in a Docker volume; refer to the Docker documentation on best practices for persisting data in running containers.
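For instance, a sketch that persists outputs by also bind-mounting a host directory as the results location (the container-side results path is hypothetical and must match the output directory set in your configuration):

mkdir -p results
docker run \
  --mount type=bind,src="$(pwd)/.test",dst=/biseps/.test,readonly \
  --mount type=bind,src="$(pwd)/results",dst=/biseps/results \
  biseps --profile config/profiles/test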

Code Snippets

shell:
	"bismark --score_min {params.score_min} -N {params.N} -L {params.L} --{params.aligner}  --bam {params.aligner_options} {params.flags} {params.extra} "
	"--temp_dir {output.tempdir}  -o {params.outdir} --parallel {params.instances} {params.genome} -1 {input.r1} -2 {input.r2} 2> {log}; "
shell:
	"mv {input.bam} {output.bam}; "
	"mv {input.report} {output.report}; "
	"mv {input.nucl} {output.nucl}; "
shell:
	"samtools view -h -@ {resources.cpus} -o {output.sam} {input} ;"
shell:
	"deduplicate_bismark -p {input} -o {params.basename} --output_dir {params.outdir} --sam 2> {log};"
	"samtools sort {output.dedup} -o {output.sort_bam} ;"
	"samtools index {output.sort_bam}"
shell:
	"bamCoverage -b {input.bam} -o {output} 2> {log}"
shell:
	"bash workflow/scripts/parallel_commands.sh \'bismark_genome_preparation  {params.genome} --{params.aligner} --parallel {resources.cpus} --genomic_composition {params.extra}\' \'samtools faidx {params.genome_file} \' 2> {log} "
shell:
	"bismark_methylation_extractor  {input}  --genome_folder {params.genome} {params.flags} -p  --parallel {params.instances} -o {params.out_dir}  {params.extra} 2> {log} "
shell:
	"python workflow/scripts/methcalls2cgmap.py -n {input} -f bismark &> {log}"
shell:
	"sort -k1,1 -k2,2n {input.cx} > {output.sort};"
	"python3 workflow/scripts/bismark_to_bigwig_pe.py {input.index} {output.sort} &> {log}"				
shell:
	"gunzip {input.bedgraph};"
	"sed -i '1d' {output.unzip};"
	"sort -k1,1 -k2,2n {output.unzip} > {output.sort}; "
	"bedGraphToBigWig   {output.sort} {input.index} {output.bw}  2> {log}"				
shell:
	"mv {input.cg} {output.cg};"
	"mv {input.chg} {output.chg};"
	"mv {input.chh} {output.chh};"
shell:
	"mkdir -p {output} ; "
	"fastqc {input} -o {output} {params.extra} 2> {log}"
shell:
	"multiqc {input.fqc} {params.aligndir} {params.methdir} -n {output.file} 2> {log}"
shell:
	"bismark2report --alignment_report {input.align} "
	"--dedup_report {input.dedup} "
	"--nucleotide_report {input.nucl} "
	"--splitting_report {input.split} "
	"--mbias_report {input.mbias} --dir {params.outdir} -o {params.filename} 2> {log}"	
run:
	if os.path.splitext(input.r1)[1] == '.gz':
		shell("gunzip -c {input.r1} > {output.r1}; ")
	else:
		shell("ln -sr {input.r1} {output.r1}; ")
	if os.path.splitext(input.r2)[1] == '.gz':
		shell("gunzip -c {input.r2} > {output.r2}; ")
	else:
		shell("ln -sr {input.r2} {output.r2}; ")
shell:
	"cat {input.r1} > {output.r1};"
	"cat {input.r2} > {output.r2}"
shell:
	"seqtk sample -s {params.seed} {input.r1} {params.size} > {output.r1} ; "
	"seqtk sample -s {params.seed} {input.r2} {params.size} > {output.r2}   "
shell:
	"trimmomatic PE -phred33 -threads {resources.cpus} "
	" {input} "
	" {output} "
	" {params.trimmer}:{params.adapters}:{params.trimmeropts} {params.extra}"
import sys, math, multiprocessing, subprocess, os

# Usage: python3 bismark_to_bigwig_pe.py [-keep] [-sort] [-all] [-L=labels] [-p=num_proc] <chrm_sizes>  <allC_file> [allC_file]*

# NOTE: allc file contains the methylation information for all chromosomes

# Steps:
# 1. allC to bedGraph
# 2. sort bedGraph if necessary
# 3. bedGraph to BigWig
# 4. remove temporary files

NUMPROC=1

def processInputs( allCFileAr, chrmFileStr, keepTmp, labelsAr, outID, numProc, isSort, useAll ):
	print( 'Keep temp files: {:s}; Sort bedGraph: {:s}; Use all positions: {:s}'.format( str( keepTmp), str (isSort), str(useAll)))
	print( 'Begin processing files with {:d} processors'.format( numProc ) )
	pool = multiprocessing.Pool( processes=numProc )
	results = [ pool.apply_async( processFile, args=(allCFileAr[i], chrmFileStr, labelsAr[i], outID, keepTmp, isSort, useAll) ) for i in range(len(allCFileAr)) ]
	suc = [ p.get() for p in results ]

	print( 'Done' )


def processFile( allCFileStr, chrmFileStr, label, outID, keepTmp, isSort, useAll ):
	if outID == None and label == None:
		outID = allCFileStr.replace( '.tsv','' ).replace( 'allc_','' )
	elif outID == None:
		outID = label
	elif label == None:
		outID += '_' + allCFileStr.replace( '.tsv','' ).replace( 'allc_','' )
	else:
		outID += '_' + label

	print( 'Reading allC file {:s}'.format( allCFileStr ) )
	# allC to bedGraphs
	bedGraphStr =  outID + '.bedGraph'
	bedGraphAr = [bedGraphStr + '.' + x for x in ['cg','chg','chh'] ]
	readAllC( allCFileStr, bedGraphAr, useAll )

	if isSort:
		print( 'Sorting bedGraph files' )
		for b in bedGraphAr:
			sortBedFile( b )

	print( 'Converting {:s} files to BigWig'.format(bedGraphStr ) )
	# bedGraph to bigWig
	for b in bedGraphAr:
		processBedGraph( b, chrmFileStr )

	# remove temporary
	if keepTmp == False:
		print( 'Removing temporary files...' )
		for b in bedGraphAr:
			os.remove( b )
	print( 'BigWig finished for {:s}.bw.*'.format( outID ) )

def readAllC( allCFileStr, outFileAr, useAll ):

	allCFile = open( allCFileStr, 'r' )
	outFileAr = [open( outFileStr, 'w' ) for outFileStr in outFileAr]

	mTypes = [ 'CG', 'CHG', 'CHH' ]

	for line in allCFile:
		lineAr = line.rstrip().split('\t')
		# (0) chr (1) pos (2) strand (3) mC (4) C (5) Cctxt
		# (6) trintctxt
		if len(lineAr) < 7:
			continue
		elif ( useAll or int(lineAr[3])> 0 ):
			chrm = lineAr[0]
			pos = int( lineAr[1] ) - 1
			methType = decodeMethType( lineAr[5] )
			try:
				mInd = mTypes.index( methType )
			except ValueError:
				continue
			value = float( lineAr[3] ) / (float( lineAr[4] ) + float( lineAr[3]))
			# adjust for negative strand
			if lineAr[2] == '-':
				value = value * -1

			# (0) chrm (1) start (2) end (3) value
			outStr = '{:s}\t{:d}\t{:d}\t{:f}\n'.format( chrm, pos, pos+1, value )
			outFile = outFileAr[mInd]
			outFile.write( outStr )
		# end if
	# end for
	allCFile.close()
	[ outFile.close() for outFile in outFileAr ]

def decodeMethType( mStr ):

	if mStr.startswith( 'CG' ):
		return 'CG'
	elif mStr.endswith( 'G' ):
		return 'CHG'
	elif mStr == 'CNN':
		return 'CNN'
	else:
		return 'CHH'

def sortBedFile( bedFileStr ):
	command = 'bedSort {:s} {:s}'.format( bedFileStr, bedFileStr )
	subprocess.call( command, shell=True )

def processBedGraph( bedGraphStr, chrmFileStr ):

	bigWigStr = bedGraphStr.replace( '.bedGraph', '.bw' )
	#print( bigWigStr )
	# bedGraphToBigWig in.bedGraph chrom.sizes out.bw
	command = 'bedGraphToBigWig {:s} {:s} {:s}'.format( bedGraphStr, chrmFileStr, bigWigStr )
	subprocess.call( command, shell=True)


def parseInputs( argv ):
	# Usage: python3 bismark_to_bigwig_pe.py [-keep] [-sort] [-all] [-L=labels] [-p=num_proc] <chrm_sizes>  <allC_file> [allC_file]*

	keepTmp = False
	labelsAr = []
	numProc = NUMPROC
	isSort = False
	useAll = False
	outID = None
	startInd = 0

	for i in range( min(5, len(argv)-2) ):
		if argv[i] == '-keep':
			keepTmp = True
			startInd += 1
		elif argv[i] == '-sort':
			isSort = True
			startInd += 1
		elif argv[i] == '-all':
			useAll = True
			startInd += 1
		elif argv[i].startswith( '-L=' ):
			labelsAr = argv[i][3:].split( ',' )
			startInd += 1
		elif argv[i].startswith( '-o=' ):
			outID = argv[i][3:]
			startInd += 1
		elif argv[i].startswith( '-p=' ):
			try:
				numProc = int( argv[i][3:] )
				startInd += 1
			except ValueError:
				print( 'ERROR: number of processors must be an integer' )
				exit()
		elif argv[i] in [ '-h', '--help', '-help']:
			printHelp()
			exit()
		elif argv[i].startswith( '-' ):
			print( 'ERROR: {:s} is not a valid parameter'.format( argv[i] ) )
			exit()

	chrmFileStr = argv[startInd]
	allCFileAr = []
	for j in range(startInd+1, len(argv) ):
		allCFileAr += [ argv[j] ]

	if len(labelsAr) == 0:
		labelsAr = [None] * len(allCFileAr)
	elif len(labelsAr) != len(allCFileAr):
		print( "ERROR: number of labels doesn't match number of allC files" )
		exit()

	processInputs( allCFileAr, chrmFileStr, keepTmp, labelsAr, outID, numProc, isSort, useAll )

def printHelp():
	print ("Usage: python3 bismark_to_bigwig_pe.py [-keep] [-sort] [-all] [-L=labels] [-o=out_id] [-p=num_proc] <chrm_sizes>  <bismark_file> [bismark_file]*")
	print( 'Converts bismark files to context-specific BigWig files' )
	print( 'Note: bedGraphToBigWig and bedSort programs must be in the path' )
	print( 'Required:' )
	print( 'chrm_file\ttab-delimited file with chromosome names and lengths,\n\t\ti.e. fasta index file' )
	print( 'bismark_file\tbismark file with all chrms and contexts' )
	print( 'Optional:' )
	print( '-keep\t\tkeep intermediate files' )
	print( '-sort\t\tcalls bedSort; add this option if bigwig conversion fails' )
	print( '-all\t\tuse all covered positions including 0s [default only includes positions with mC > 0]' )
	print( '-L=labels\tcomma-separated list of labels to use for the allC files;\n\t\tdefaults to using information from the allc file name' )
	print( '-o=out_id\toptional identifier to be added to the output file names' )
	print( '-p=num_proc\tnumber of processors to use [default 1]' )

if __name__ == "__main__":
	if len(sys.argv) < 3 :
		printHelp()
	else:
		parseInputs( sys.argv[1:] )
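This script is normally invoked from one of the snippets above (sorting the CX report, then converting it); it can also be run by hand. A sketch with hypothetical file names, following the usage string printed by printHelp:

python3 workflow/scripts/bismark_to_bigwig_pe.py -sort -p=2 genome.fa.fai sample.CX_report.sorted.txt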
import pandas as pd
import numpy as np
import traceback
import argparse


def get_parser():
    """
    Create a parser and add arguments
    """
    parser = argparse.ArgumentParser()
    group1 = parser.add_argument_group('Input format')
    group1.add_argument("-n", "--filename", type=str, help="the file name that the users want to convert to CGMap format")
    group1.add_argument("-f", "--format", default="bismark",choices=["bismark", "bsmap", "methylpy", "methimpute"], type=str, help="the type of file to CGmap")
    return parser


def readfile(filename):
    """
    read files: allow the format with or without gz compressed
    """
    if filename[-3:] == ".gz":
        inputfile = pd.read_csv(filename, header = None, sep="\t", compression='gzip')
    else:
        inputfile = pd.read_csv(filename, header = None, sep="\t")
    return inputfile



def trinuc(threeletter):
    """
    function to acquire the methylation context (CG, CHG, CHH) 
    """
    if threeletter[1] == "G":
        context = "CG"
    if threeletter[1] != "G":
        if threeletter[2] == "G":
            context = "CHG"
        if threeletter[2]!="G":
            context = "CHH"
    return context


def bismark2cgmap(bismarkfile):
	"""
	CX report file to CGmap.gz
	"""
	name_bismark = bismarkfile
	bismark = readfile(name_bismark)
	try:
		int(bismark.iloc[0,1])
		header = "No"
	except:
		header = "Yes"
	if header == "Yes":
		bismark = bismark.iloc[1:,:]
	print(bismark.head())
	bismark["nuc"] = np.where(bismark[2]=="+","C","G")
	bismark["dinuc"] = [i[:2] for i in list(bismark[6])]
	bismark["mc_nc"] = bismark[3] + bismark[4]
	bismark["ratio"] = np.round(bismark[3].astype("float")/(bismark[3] + bismark[4]).astype("float"),2)
	cgmap_format = bismark[[0,"nuc",1,5,"dinuc","ratio",3,"mc_nc"]]
	cgmap_format_finish = cgmap_format.loc[(cgmap_format["mc_nc"]!=0)]
	cgmap_format_finish.to_csv(name_bismark + ".CGmap.gz", header = False, index = False, sep="\t", compression='gzip')


def bsmap2cgmap(bsmapfile):
	"""
	the methylation calls generated by methratio.py in BSMAP (v2.73) to CGmap.gz
	"""
	name_bsmap = bsmapfile 
	bsmap = readfile(name_bsmap)
	# the file with or without header can be properly processed
	try:
		int(bsmap.iloc[0,1])
		header = "No"
	except:
		header = "Yes"
	if header == "Yes":
		bsmap = bsmap.iloc[1:,:]
	# to get the items in CGmap
	bsmap["nuc"] = np.where(bsmap[2]=="+","C","G")
	bsmap["dinuc"] = [i[:2] for i in list(bsmap[3])]
	cgmap_format = bsmap[[0,"nuc",1,3,"dinuc",4,6,5]]
	cgmap_format_finish = cgmap_format.loc[(cgmap_format[5]!=0)]
	cgmap_format_finish.to_csv(name_bsmap[:-3] + "CGmap.gz", header = False, index = False, sep="\t", compression='gzip')



def methylpy2cgmap(methylpyfile):
	"""
	the allc files by methylpy to CGmap.gz
	"""
	name_methylpy = methylpyfile
	methylpy = readfile(name_methylpy)
	methylpy["nuc"] = np.where(methylpy[2]=="+","C","G")
	methylpy["dinuc"] = [i[:2] for i in list(methylpy[3])]
	methylpy["context"] = methylpy[3].apply(trinuc)
	methylpy["ratio"] = np.round(methylpy[4].astype("float")/methylpy[5].astype("float"),2)
	cgmap_format = methylpy[[0,"nuc",1,"context","dinuc","ratio",4,5]]
	cgmap_format_finish = cgmap_format.loc[(cgmap_format[5]!=0)]
	cgmap_format_finish.to_csv(name_methylpy + ".CGmap.gz", header = False, index = False, sep="\t", compression='gzip')





def methimpute2cgmap(methimputefile):
	"""
	TSV files exported from the methimpute to CGmap.gz
	"""
	name_methimpute = methimputefile
	methimpute = readfile(name_methimpute)
	methimpute["nuc"] = np.where(methimpute[2]=="+","C","G")
	methimpute["dinuc"] = [i[:2] for i in list(methimpute[3])]
	methimpute["ratio"] = round(methimpute[4].astype("float")/methimpute[5].astype("float"), 2)
	cgmap_format = methimpute[[0, "nuc", 1, 3, "dinuc", "ratio", 4, 5]]
	# keep only covered positions: filter on total counts (column 5), as the other converters do
	cgmap_format_finish = cgmap_format.loc[(cgmap_format[5]!=0)]
	cgmap_format_finish.to_csv(name_methimpute + ".CGmap.gz", header = False, index = False, sep="\t", compression='gzip')




def main():
	parser = get_parser()
	args = parser.parse_args()
	try:
		if args.format == "bismark":
			bismark2cgmap(args.filename)
		if args.format == "bsmap":
			bsmap2cgmap(args.filename)
		if args.format == "methylpy":
			methylpy2cgmap(args.filename)
		if args.format =="methimpute":
			methimpute2cgmap(args.filename)
	except Exception as e:
		print(e)
		print(traceback.format_exc())
		print("please check the input format or the parameter, see methcalls2cgmap.py -h")




if __name__ == '__main__':
    main()
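The converter above is driven from the command line exactly as in the earlier methcalls2cgmap.py snippet (a sketch; the input file name is hypothetical):

python workflow/scripts/methcalls2cgmap.py -n sample.CX_report.txt -f bismark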
for cmd in "$@"; do {
  echo "Process \"$cmd\" started";
  $cmd & pid=$!
  PID_LIST+=" $pid";
} done

trap "kill $PID_LIST" SIGINT

echo "Parallel processes have started";

wait $PID_LIST

echo
echo "All processes have completed";
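The helper above simply backgrounds each quoted command and waits for all of them, as done in the genome-preparation snippet earlier; a minimal standalone sketch:

bash workflow/scripts/parallel_commands.sh 'echo first task' 'echo second task'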
shell:
	"touch {output}"
URL: https://github.com/SkanderHatira/biseps_workflow
License: MIT License