Snakemake workflow: biseps


This is a Snakemake pipeline for bisulfite sequencing data. It implements:

  1. Adapter trimming and quality checks

  2. Quality reports and statistics (FastQC + MultiQC)

  3. Methylation extraction with Bismark (Bowtie2 or HISAT2 as aligner)

  4. DMR identification with methylKit (in all contexts)

Authors

  • Skander Hatira (@skanderhatira)

Usage

If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).

Step 1: Obtain a copy of this workflow

Clone the repository to your local system, into the place where you want to perform the data analysis.

git clone https://forgemia.inra.fr/irhs-bioinfo/biseps.git

Step 2: Configure workflow

Configure the workflow according to your needs by editing the files in the config/ folder. Adjust config.yaml to configure the workflow execution, and samples.tsv and units.tsv to specify your sample setup.
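A minimal sketch of this step, assuming the three files sit directly under config/ as named above and using whatever text editor you prefer:

${EDITOR:-nano} config/config.yaml config/samples.tsv config/units.tsv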

Step 3: Install Snakemake

Install Snakemake using conda:

conda create -c bioconda -c conda-forge -n snakemake snakemake

For installation details, see the instructions in the Snakemake documentation.

Step 4: Execute workflow

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via:

snakemake --profile config/profiles/local -n
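Once the dry-run looks correct, dropping the -n flag launches the actual run with the same profile (a sketch; the local profile is assumed to define cores and other resources):

snakemake --profile config/profiles/local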

See the Snakemake documentation for further details.

Step 5: Investigate results

After successful execution, you can create a self-contained interactive .html report with all results via:

snakemake --report report.html

Build Docker image

A Docker image of this workflow can be built from the repository by running this command:

docker build -t biseps --build-arg USER_ID=$(id -u) --build-arg GROUP_ID=$(id -g) --build-arg USERNAME=$USER .

To run this container with your data, you need to bind volumes specifying the raw data, configuration files, and any necessary resources.
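A hedged sketch of such an invocation, with hypothetical host-side paths for the raw data and configuration; the container-side destinations assume /biseps is the image's working directory, as in the testing example below:

docker run \
  --mount type=bind,src="$(pwd)/data",dst=/biseps/data,readonly \
  --mount type=bind,src="$(pwd)/config",dst=/biseps/config \
  biseps --profile config/profiles/local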

Testing

The .test directory contains subsampled .fastq files for two samples (multi-lane + biological replicates) and a .fasta file containing a genome sequence from NCBI. It also contains CX reports from DMRcaller to test the DMR identification pipeline.

To test the pipeline, you need conda available on your machine's $PATH:

Testing alignment:

  • You can either use the preconfigured .yaml profile:

    snakemake --profile config/profiles/test

  • Or pass the command-line arguments directly:

    snakemake --cores $N --use-conda --configfile .test/config/config.yaml

Testing differential methylation extraction:

  • You can either use the preconfigured .yaml profile:

    snakemake --profile config/profiles/testComparison

  • Or pass the command-line arguments directly:

    snakemake --cores $N --use-conda --configfile .test/comparison/config.yaml

Alternatively, on a Docker-enabled machine, build and run the image with a mounted folder containing the necessary data and configuration files pointing to that data:

docker run --mount type=bind,src="$(pwd)/.test",dst=/biseps/.test,readonly biseps \
--profile config/profiles/test

Note that your output data won't be accessible, as it isn't mounted or stored in a Docker volume; refer to the Docker documentation on best practices for persisting data in running containers.
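For instance, a sketch that persists outputs by also bind-mounting a host directory as the results location (the container-side results path is hypothetical and must match the output directory set in your configuration):

mkdir -p results
docker run \
  --mount type=bind,src="$(pwd)/.test",dst=/biseps/.test,readonly \
  --mount type=bind,src="$(pwd)/results",dst=/biseps/results \
  biseps --profile config/profiles/test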

Code Snippets

shell:
	"bismark --score_min {params.score_min} -N {params.N} -L {params.L} --{params.aligner}  --bam {params.aligner_options} {params.flags} {params.extra} "
	"--temp_dir {output.tempdir}  -o {params.outdir} --parallel {params.instances} {params.genome} -1 {input.r1} -2 {input.r2} 2> {log}; "
shell:
	"mv {input.bam} {output.bam}; "
	"mv {input.report} {output.report}; "
	"mv {input.nucl} {output.nucl}; "
shell:
	"samtools view -h -@ {resources.cpus} -o {output.sam} {input} ;"
shell:
	"deduplicate_bismark -p {input} -o {params.basename} --output_dir {params.outdir} --sam 2> {log};"
	"samtools sort {output.dedup} -o {output.sort_bam} ;"
	"samtools index {output.sort_bam}"
shell:
	"bamCoverage -b {input.bam} -o {output} 2> {log}"
shell:
	"bash workflow/scripts/parallel_commands.sh \'bismark_genome_preparation  {params.genome} --{params.aligner} --parallel {resources.cpus} --genomic_composition {params.extra}\' \'samtools faidx {params.genome_file} \' 2> {log} "
shell:
	"bismark_methylation_extractor  {input}  --genome_folder {params.genome} {params.flags} -p  --parallel {params.instances} -o {params.out_dir}  {params.extra} 2> {log} "
shell:
	"python workflow/scripts/methcalls2cgmap.py -n {input} -f bismark &> {log}"
shell:
	"sort -k1,1 -k2,2n {input.cx} > {output.sort};"
	"python3 workflow/scripts/bismark_to_bigwig_pe.py {input.index} {output.sort} &> {log}"				
shell:
	"gunzip {input.bedgraph};"
	"sed -i '1d' {output.unzip};"
	"sort -k1,1 -k2,2n {output.unzip} > {output.sort}; "
	"bedGraphToBigWig   {output.sort} {input.index} {output.bw}  2> {log}"				
shell:
	"mv {input.cg} {output.cg};"
	"mv {input.chg} {output.chg};"
	"mv {input.chh} {output.chh};"
shell:
	"mkdir -p {output} ; "
	"fastqc {input} -o {output} {params.extra} 2> {log}"
shell:
	"multiqc {input.fqc} {params.aligndir} {params.methdir} -n {output.file} 2> {log}"
shell:
	"bismark2report --alignment_report {input.align} "
	"--dedup_report {input.dedup} "
	"--nucleotide_report {input.nucl} "
	"--splitting_report {input.split} "
	"--mbias_report {input.mbias} --dir {params.outdir} -o {params.filename} 2> {log}"	
run:
	if os.path.splitext(input.r1)[1] == '.gz':
		shell("gunzip -c {input.r1} > {output.r1}; ")
	else:
		shell("ln -sr {input.r1} {output.r1}; ")
	if os.path.splitext(input.r2)[1] == '.gz':
		shell("gunzip -c {input.r2} > {output.r2}; ")
	else:
		shell("ln -sr {input.r2} {output.r2}; ")
shell:
	"cat {input.r1} > {output.r1};"
	"cat {input.r2} > {output.r2}"
shell:
	"seqtk sample -s {params.seed} {input.r1} {params.size} > {output.r1} ; "
	"seqtk sample -s {params.seed} {input.r2} {params.size} > {output.r2}   "
shell:
	"trimmomatic PE -phred33 -threads {resources.cpus} "
	" {input} "
	" {output} "
	" {params.trimmer}:{params.adapters}:{params.trimmeropts} {params.extra}"
import sys, math, multiprocessing, subprocess, os

# Usage: python3 bismark_to_bigwig_pe.py [-keep] [-sort] [-all] [-L=labels] [-p=num_proc] <chrm_sizes>  <allC_file> [allC_file]*

# NOTE: allc file contains the methylation information for all chromosomes

# Steps:
# 1. allC to bedGraph
# 2. sort bedGraph if necessary
# 3. bedGraph to BigWig
# 4. remove temporary files

NUMPROC=1

def processInputs( allCFileAr, chrmFileStr, keepTmp, labelsAr, outID, numProc, isSort, useAll ):
	print( 'Keep temp files: {:s}; Sort bedGraph: {:s}; Use all positions: {:s}'.format( str( keepTmp), str (isSort), str(useAll)))
	print( 'Begin processing files with {:d} processors'.format( numProc ) )
	pool = multiprocessing.Pool( processes=numProc )
	results = [ pool.apply_async( processFile, args=(allCFileAr[i], chrmFileStr, labelsAr[i], outID, keepTmp, isSort, useAll) ) for i in range(len(allCFileAr)) ]
	suc = [ p.get() for p in results ]

	print( 'Done' )


def processFile( allCFileStr, chrmFileStr, label, outID, keepTmp, isSort, useAll ):
	if outID == None and label == None:
		outID = allCFileStr.replace( '.tsv','' ).replace( 'allc_','' )
	elif outID == None:
		outID = label
	elif label == None:
		outID += '_' + allCFileStr.replace( '.tsv','' ).replace( 'allc_','' )
	else:
		outID += '_' + label

	print( 'Reading allC file {:s}'.format( allCFileStr ) )
	# allC to bedGraphs
	bedGraphStr =  outID + '.bedGraph'
	bedGraphAr = [bedGraphStr + '.' + x for x in ['cg','chg','chh'] ]
	readAllC( allCFileStr, bedGraphAr, useAll )

	if isSort:
		print( 'Sorting bedGraph files' )
		for b in bedGraphAr:
			sortBedFile( b )

	print( 'Converting {:s} files to BigWig'.format(bedGraphStr ) )
	# bedGraph to bigWig
	for b in bedGraphAr:
		processBedGraph( b, chrmFileStr )

	# remove temporary
	if keepTmp == False:
		print( 'Removing temporary files...' )
		for b in bedGraphAr:
			os.remove( b )
	print( 'BigWig finished for {:s}.bw.*'.format( outID ) )

def readAllC( allCFileStr, outFileAr, useAll ):

	allCFile = open( allCFileStr, 'r' )
	outFileAr = [open( outFileStr, 'w' ) for outFileStr in outFileAr]

	mTypes = [ 'CG', 'CHG', 'CHH' ]

	for line in allCFile:
		lineAr = line.rstrip().split('\t')
		# (0) chr (1) pos (2) strand (3) mC (4) C (5) Cctxt
		# (6) trintctxt
		if len(lineAr) < 7:
			continue
		elif ( useAll or int(lineAr[3])> 0 ):
			chrm = lineAr[0]
			pos = int( lineAr[1] ) - 1
			methType = decodeMethType( lineAr[5] )
			try:
				mInd = mTypes.index( methType )
			except ValueError:
				continue
			value = float( lineAr[3] ) / (float( lineAr[4] ) + float( lineAr[3]))
			# adjust for negative strand
			if lineAr[2] == '-':
				value = value * -1

			# (0) chrm (1) start (2) end (3) value
			outStr = '{:s}\t{:d}\t{:d}\t{:f}\n'.format( chrm, pos, pos+1, value )
			outFile = outFileAr[mInd]
			outFile.write( outStr )
		# end if
	# end for
	allCFile.close()
	[ outFile.close() for outFile in outFileAr ]

def decodeMethType( mStr ):

	if mStr.startswith( 'CG' ):
		return 'CG'
	elif mStr.endswith( 'G' ):
		return 'CHG'
	elif mStr == 'CNN':
		return 'CNN'
	else:
		return 'CHH'

def sortBedFile( bedFileStr ):
	command = 'bedSort {:s} {:s}'.format( bedFileStr, bedFileStr )
	subprocess.call( command, shell=True )

def processBedGraph( bedGraphStr, chrmFileStr ):

	bigWigStr = bedGraphStr.replace( '.bedGraph', '.bw' )
	#print( bigWigStr )
	# bedGraphToBigWig in.bedGraph chrom.sizes out.bw
	command = 'bedGraphToBigWig {:s} {:s} {:s}'.format( bedGraphStr, chrmFileStr, bigWigStr )
	subprocess.call( command, shell=True)


def parseInputs( argv ):
	# Usage: python3 bismark_to_bigwig_pe.py [-keep] [-sort] [-all] [-L=labels] [-p=num_proc] <chrm_sizes>  <allC_file> [allC_file]*

	keepTmp = False
	labelsAr = []
	numProc = NUMPROC
	isSort = False
	useAll = False
	outID = None
	startInd = 0

	for i in range( min(5, len(argv)-2) ):
		if argv[i] == '-keep':
			keepTmp = True
			startInd += 1
		elif argv[i] == '-sort':
			isSort = True
			startInd += 1
		elif argv[i] == '-all':
			useAll = True
			startInd += 1
		elif argv[i].startswith( '-L=' ):
			labelsAr = argv[i][3:].split( ',' )
			startInd += 1
		elif argv[i].startswith( '-o=' ):
			outID = argv[i][3:]
			startInd += 1
		elif argv[i].startswith( '-p=' ):
			try:
				numProc = int( argv[i][3:] )
				startInd += 1
			except ValueError:
				print( 'ERROR: number of processors must be an integer' )
				exit()
		elif argv[i] in [ '-h', '--help', '-help']:
			printHelp()
			exit()
		elif argv[i].startswith( '-' ):
			print( 'ERROR: {:s} is not a valid parameter'.format( argv[i] ) )
			exit()

	chrmFileStr = argv[startInd]
	allCFileAr = []
	for j in range(startInd+1, len(argv) ):
		allCFileAr += [ argv[j] ]

	if len(labelsAr) == 0:
		labelsAr = [None] * len(allCFileAr)
	elif len(labelsAr) != len(allCFileAr):
		print( "ERROR: number of labels doesn't match number of allC files" )
		exit()

	processInputs( allCFileAr, chrmFileStr, keepTmp, labelsAr, outID, numProc, isSort, useAll )

def printHelp():
	print ("Usage: python3 bismark_to_bigwig_pe.py [-keep] [-sort] [-all] [-L=labels] [-o=out_id] [-p=num_proc] <chrm_sizes>  <bismark_file> [bismark_file]*")
	print( 'Converts bismark files to context-specific BigWig files' )
	print( 'Note: bedGraphToBigWig and bedSort programs must be in the path' )
	print( 'Required:' )
	print( 'chrm_file\ttab-delimited file with chromosome names and lengths,\n\t\ti.e. fasta index file' )
	print( 'bismark_file\tbismark file with all chrms and contexts' )
	print( 'Optional:' )
	print( '-keep\t\tkeep intermediate files' )
	print( '-sort\t\tcalls bedSort; add this option if bigwig conversion fails' )
	print( '-all\t\tuse all covered positions including 0s [default only includes positions with mC > 0]' )
	print( '-L=labels\tcomma-separated list of labels to use for the allC files;\n\t\tdefaults to using information from the allc file name' )
	print( '-o=out_id\toptional identifier to be added to the output file names' )
	print( '-p=num_proc\tnumber of processors to use [default 1]' )

if __name__ == "__main__":
	if len(sys.argv) < 3 :
		printHelp()
	else:
		parseInputs( sys.argv[1:] )
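This script is normally invoked from one of the snippets above (sorting the CX report, then converting it); it can also be run by hand. A sketch with hypothetical file names, following the usage string printed by printHelp:

python3 workflow/scripts/bismark_to_bigwig_pe.py -sort -p=2 genome.fa.fai sample.CX_report.sorted.txt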
import pandas as pd
import numpy as np
import traceback
import argparse


def get_parser():
    """
    Create a parser and add arguments
    """
    parser = argparse.ArgumentParser()
    group1 = parser.add_argument_group('Input format')
    group1.add_argument("-n", "--filename", type=str, help="the file name that the users want to convert to CGMap format")
    group1.add_argument("-f", "--format", default="bismark",choices=["bismark", "bsmap", "methylpy", "methimpute"], type=str, help="the type of file to CGmap")
    return parser


def readfile(filename):
    """
    read files: allow the format with or without gz compressed
    """
    if filename[-3:] == ".gz":
        inputfile = pd.read_csv(filename, header = None, sep="\t", compression='gzip')
    else:
        inputfile = pd.read_csv(filename, header = None, sep="\t")
    return inputfile



def trinuc(threeletter):
    """
    function to acquire the methylation context (CG, CHG, CHH) 
    """
    if threeletter[1] == "G":
        context = "CG"
    if threeletter[1] != "G":
        if threeletter[2] == "G":
            context = "CHG"
        if threeletter[2]!="G":
            context = "CHH"
    return context


def bismark2cgmap(bismarkfile):
	"""
	CX report file to CGmap.gz
	"""
	name_bismark = bismarkfile
	bismark = readfile(name_bismark)
	try:
		int(bismark.iloc[0,1])
		header = "No"
	except:
		header = "Yes"
	if header == "Yes":
		bismark = bismark.iloc[1:,:]
	print(bismark.head())
	bismark["nuc"] = np.where(bismark[2]=="+","C","G")
	bismark["dinuc"] = [i[:2] for i in list(bismark[6])]
	bismark["mc_nc"] = bismark[3] + bismark[4]
	bismark["ratio"] = np.round(bismark[3].astype("float")/(bismark[3] + bismark[4]).astype("float"),2)
	cgmap_format = bismark[[0,"nuc",1,5,"dinuc","ratio",3,"mc_nc"]]
	cgmap_format_finish = cgmap_format.loc[(cgmap_format["mc_nc"]!=0)]
	cgmap_format_finish.to_csv(name_bismark + ".CGmap.gz", header = False, index = False, sep="\t", compression='gzip')


def bsmap2cgmap(bsmapfile):
	"""
	the methylation calls generated by methratio.py in BSMAP (v2.73) to CGmap.gz
	"""
	name_bsmap = bsmapfile 
	bsmap = readfile(name_bsmap)
	# the file with or without header can be properly processed
	try:
		int(bsmap.iloc[0,1])
		header = "No"
	except:
		header = "Yes"
	if header == "Yes":
		bsmap = bsmap.iloc[1:,:]
	# to get the items in CGmap
	bsmap["nuc"] = np.where(bsmap[2]=="+","C","G")
	bsmap["dinuc"] = [i[:2] for i in list(bsmap[3])]
	cgmap_format = bsmap[[0,"nuc",1,3,"dinuc",4,6,5]]
	cgmap_format_finish = cgmap_format.loc[(cgmap_format[5]!=0)]
	cgmap_format_finish.to_csv(name_bsmap[:-3] + "CGmap.gz", header = False, index = False, sep="\t", compression='gzip')



def methylpy2cgmap(methylpyfile):
	"""
	the allc files by methylpy to CGmap.gz
	"""
	name_methylpy = methylpyfile
	methylpy = readfile(name_methylpy)
	methylpy["nuc"] = np.where(methylpy[2]=="+","C","G")
	methylpy["dinuc"] = [i[:2] for i in list(methylpy[3])]
	methylpy["context"] = methylpy[3].apply(trinuc)
	methylpy["ratio"] = np.round(methylpy[4].astype("float")/methylpy[5].astype("float"),2)
	cgmap_format = methylpy[[0,"nuc",1,"context","dinuc","ratio",4,5]]
	cgmap_format_finish = cgmap_format.loc[(cgmap_format[5]!=0)]
	cgmap_format_finish.to_csv(name_methylpy + ".CGmap.gz", header = False, index = False, sep="\t", compression='gzip')





def methimpute2cgmap(methimputefile):
	"""
	TSV files exported from the methimpute to CGmap.gz
	"""
	name_methimpute = methimputefile
	methimpute = readfile(name_methimpute)
	methimpute["nuc"] = np.where(methimpute[2]=="+","C","G")
	methimpute["dinuc"] = [i[:2] for i in list(methimpute[3])]
	methimpute["ratio"] = round(methimpute[4].astype("float")/methimpute[5].astype("float"), 2)
	cgmap_format = methimpute[[0, "nuc", 1, 3, "dinuc", "ratio", 4, 5]]
	# keep only covered positions: filter on total counts (column 5), as the other converters do
	cgmap_format_finish = cgmap_format.loc[(cgmap_format[5]!=0)]
	cgmap_format_finish.to_csv(name_methimpute + ".CGmap.gz", header = False, index = False, sep="\t", compression='gzip')




def main():
	parser = get_parser()
	args = parser.parse_args()
	try:
		if args.format == "bismark":
			bismark2cgmap(args.filename)
		if args.format == "bsmap":
			bsmap2cgmap(args.filename)
		if args.format == "methylpy":
			methylpy2cgmap(args.filename)
		if args.format =="methimpute":
			methimpute2cgmap(args.filename)
	except Exception as e:
		print(e)
		print(traceback.format_exc())
		print("please check the input format or the parameter, see methcalls2cgmap.py -h")




if __name__ == '__main__':
    main()
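The converter above is driven from the command line exactly as in the earlier methcalls2cgmap.py snippet (a sketch; the input file name is hypothetical):

python workflow/scripts/methcalls2cgmap.py -n sample.CX_report.txt -f bismark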
for cmd in "$@"; do {
  echo "Process \"$cmd\" started";
  $cmd & pid=$!
  PID_LIST+=" $pid";
} done

trap "kill $PID_LIST" SIGINT

echo "Parallel processes have started";

wait $PID_LIST

echo
echo "All processes have completed";
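The helper above simply backgrounds each quoted command and waits for all of them, as done in the genome-preparation snippet earlier; a minimal standalone sketch:

bash workflow/scripts/parallel_commands.sh 'echo first task' 'echo second task'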
shell:
	"touch {output}"
URL: https://github.com/SkanderHatira/biseps_workflow
License: MIT License