Files and methodology pertaining to the sequencing and analysis of SARS-CoV-2, the causative agent of COVID-19.

Version: v1.6.2

This is a complete, standardized workflow for the assembly and subsequent analysis of short-read viral sequencing data. The core workflow is compatible with the Illumina ARTIC nf pipeline and produces consensus and variant calls using iVar (1.3) (Grubaugh, 2019) and FreeBayes (Garrison, 2012). However, it performs far more extensive quality control and visualisation of results, including an interactive HTML summary of run results.

Briefly, raw reads undergo QC using FastQC (Andrews) before removal of host-related reads by competitive mapping against a composite human and viral reference with BWA-MEM (0.7.5) (Li, 2013), samtools (1.9) (Li, 2009), and a custom script. This ensures that data as close to raw as possible can be deposited in central databases. Reads then undergo adapter trimming and further QC with Trim Galore (0.6.5) (Martin). Residual TruSeq sequencing adapters are removed by another custom script. Reads are then mapped to the viral reference with BWA-MEM, and amplicon primer sequences are trimmed using iVar (1.3) (Grubaugh, 2019). FastQC is then used to perform a QC check on the reads that map to the viral reference. After this, iVar is used to generate a consensus genome, and variants are called using both ivar variants and breseq (0.35) (Deatherage, 2014). FreeBayes may additionally be run alongside iVar, generating a second set of consensus genome(s) and variant call(s), with comparisons made between iVar and FreeBayes to highlight differences in mutation calls. Coverage statistics are calculated using bedtools before a final QC via quast and a kraken2 taxonomic classification of mapped reads. Finally, data from all samples are collated via a post-processing script into an interactive summary for exploration of results and quality control. Optionally, users can run ncov-tools to generate additional quality control and summary plots and statistics.
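For orientation, the core per-sample steps can be sketched as shell commands. This is a minimal illustration only: file names, thresholds, and the host-removal filter script are placeholders, and the authoritative commands are the Snakefile rules reproduced under Code Snippets below.

# Illustrative sketch; paths, thresholds, and script names are placeholders
bwa mem -t 4 composite_human_viral_index R1.fastq.gz R2.fastq.gz \
    | filter_viral_reads -c MN908947.3 > host_removed.bam             # competitive host removal
trim_galore --paired --fastqc -o trimmed/ host_removed_R1.fastq.gz host_removed_R2.fastq.gz
bwa mem -t 4 viral_reference_index trimmed_R1.fq.gz trimmed_R2.fq.gz \
    | samtools sort -o viral.sorted.bam                               # map to viral reference
ivar trim -e -i viral.sorted.bam -b primer_scheme.bed -p primertrimmed  # trim amplicon primers
samtools sort -o primertrimmed.sorted.bam primertrimmed.bam
samtools mpileup -aa -A -d 0 -Q0 primertrimmed.sorted.bam \
    | ivar consensus -t 0.75 -m 10 -n N -p sample.consensus           # consensus genome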

If you use this software, please cite:

Nasir, Jalees A., Robert A. Kozak, Patryk Aftanas, Amogelang R. Raphenya, Kendrick M. Smith, Finlay Maguire, Hassaan Maan et al. "A Comparison of Whole Genome Sequencing of SARS-CoV-2 Using Amplicon-Based Sequencing, Random Hexamers, and Bait Capture." Viruses 12, no. 8 (2020): 895.
https://doi.org/10.3390/v12080895


Setup:

0. Clone the git repository (--recursive is only needed to run ncov-tools postprocessing)

 git clone --recursive https://github.com/jaleezyy/covid-19-signal

1. Install conda and snakemake (version >5), e.g.:

 wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
 bash Miniconda3-latest-Linux-x86_64.sh # follow instructions
 source $(conda info --base)/etc/profile.d/conda.sh
 conda create -n signal -c conda-forge -c bioconda -c defaults snakemake pandas conda mamba
 conda activate signal

There are some issues with conda failing to install newer versions of snakemake, so alternatively install mamba and use that (snakemake has beta support for it within the workflow):

 conda install -c conda-forge mamba
 mamba create -c conda-forge -c bioconda -n signal snakemake pandas conda mamba
 conda activate signal
 # mamba activate signal is equivalent
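Before proceeding, a quick sanity check that the environment is usable (these are standard conda/snakemake commands, not part of SIGNAL itself):

 conda activate signal
 snakemake --version         # should report a version greater than 5
 python -c "import pandas"   # confirms pandas is importable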

Additional software dependencies are managed directly by snakemake using conda environment files (an illustrative example of their format follows the list below):

  • trim-galore 0.6.5 (docs)

  • kraken2 2.1.1 (docs)

  • quast 5.0.2 (docs)

  • bwa 0.7.17 (docs)

  • samtools 1.7/1.9 (docs)

  • bedtools 2.26.0 (docs)

  • breseq 0.35.0 (docs)

  • ivar 1.3 (docs)

  • freebayes 1.3.2 (docs)

  • pangolin (latest; version can be specified by user) (docs)

  • pangolin-data (latest; version can be specified by user; required for Pangolin v4+) (docs)

  • pangolearn (latest; version can be specified by user) (docs)

  • constellations (latest; version can be specified by user) (docs)

  • scorpio (latest; version can be specified by user) (docs)

  • pango-designation (latest; version can be specified by user) (docs)

  • nextclade (v1.11.0) (docs)

  • ncov-tools postprocessing scripts require additional dependencies (see file).
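These files follow the standard conda environment YAML format that snakemake consumes via --use-conda. A hypothetical example of the format (the file path and version pins here are illustrative; the real environment files ship with the repository):

 # Hypothetical per-rule environment file; the repository's actual files may differ
 mkdir -p envs
 cat > envs/ivar.yaml <<'EOF'
 channels:
   - conda-forge
   - bioconda
   - defaults
 dependencies:
   - ivar=1.3
   - samtools=1.9
 EOF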

SIGNAL Help Screen:

Using the provided signalexe.py script, the majority of SIGNAL functions can be accessed easily.

To display the help screen:

python signalexe.py -h
usage: signalexe.py [-h] [-c CONFIGFILE] [-d DIRECTORY] [--cores CORES] [--config-only] [--remove-freebayes] [--add-breseq] [-neg NEG_PREFIX] [--dependencies] [--data DATA] [-ri] [-ii] [--unlock]
                    [-F] [-n] [--quiet] [--verbose] [-v]
                    [all ...] [postprocess ...] [ncov_tools ...] [install ...]

SARS-CoV-2 Illumina GeNome Assembly Line (SIGNAL) aims to take Illumina short-read sequences and perform consensus assembly + variant calling for ongoing surveillance and research efforts towards
the emergent coronavirus: Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2).

positional arguments:
  all                   Run SIGNAL with all associated assembly rules. Does not include postprocessing. '--configfile' or '--directory' required. The latter will automatically generate a
                        configuration file and sample table. If both are provided, '--configfile' takes priority
  postprocess           Run SIGNAL postprocessing on a completed SIGNAL run. '--configfile' is required but will be generated if '--directory' is provided
  ncov_tools            Generate the configuration file and filesystem setup required, then execute the ncov-tools quality control assessment. Requires the 'ncov-tools' submodule! '--configfile'
                        is required but will be generated if '--directory' is provided
  install               Install individual rule environments and ensure SIGNAL is functional. The only operable parameter is '--data'. Will override all other operations!

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIGFILE, --configfile CONFIGFILE
                        Configuration file (i.e., config.yaml) for SIGNAL analysis
  -d DIRECTORY, --directory DIRECTORY
                        Path to directory containing reads. Will be used to generate the sample table and configuration file
  --cores CORES         Number of cores. Default = 1
  --config-only         Generate the sample table and configuration file (i.e., config.yaml) and exit. '--directory' required
  --remove-freebayes    Configuration file generator parameter. Set flag to DISABLE FreeBayes variant calling (improves overall speed)
  --add-breseq          Configuration file generator parameter. Set flag to ENABLE the optional breseq step (analysis will take longer to complete)
  -neg NEG_PREFIX, --neg-prefix NEG_PREFIX
                        Configuration file generator parameter. Comma-separated list of negative control sample name(s) or prefix(es). For example, 'Blank' will cover Blank1, Blank2, etc.
                        Recommended if running ncov-tools. Will be left empty if not provided
  --dependencies        Download data dependencies (under a created 'data' directory) required for SIGNAL analysis and exit. Note: will override other parameters! (~10 GB storage required)
  --data DATA           SIGNAL install and data dependencies parameter. Set the location for data dependencies. If '--dependencies' is run, a folder will be created in the specified directory.
                        If '--config-only' or '--directory' is used, the value will be applied to the configuration file. (Upcoming feature): when used with 'SIGNAL install', any tests run
                        will use the dependencies located at this directory. Default = 'data'
  -ri, --rerun-incomplete
                        Snakemake parameter. Re-run any incomplete samples from a previously failed run
  -ii, --ignore-incomplete
                        Snakemake parameter. Do not check for incomplete output files
  --unlock              Snakemake parameter. Remove a lock on the working directory after a failed run
  -F, --forceall        Snakemake parameter. Force the re-run of all rules regardless of prior output
  -n, --dry-run         Snakemake parameter. Do not execute anything and only display what would be done
  --quiet               Snakemake parameter. Do not output any progress or rule information. If used with '--dry-run', it will just display a summary of the DAG of jobs
  --verbose             Snakemake parameter. Display snakemake debugging output
  -v, --version         Display version number

Summary:

signalexe.py simplifies the execution of all SIGNAL functions. At its simplest, SIGNAL can be run with one line, provided only the directory of sequencing reads.

# Download dependencies (only needs to be run once; ~10GB of storage required)
# --data flag allows you to rename and relocate dependencies directory
python signalexe.py --data data --dependencies
# Generate configuration file and sample table
# --neg-prefix can be used to note negative controls
# --data can be used to specify location of data dependencies
python signalexe.py --config-only --directory /path/to/reads
# Execute pipeline (step-by-step; --cores defaults to 1 if not provided)
# --data can be used to specify location of data dependencies
python signalexe.py --configfile config.yaml --cores NCORES all
python signalexe.py --configfile config.yaml --cores NCORES postprocess
python signalexe.py --configfile config.yaml --cores NCORES ncov_tools
# ALTERNATIVE
# Execute pipeline (one line)
# --data can be used to specify location of data dependencies
python signalexe.py --configfile config.yaml --cores NCORES all postprocess ncov_tools
# ALTERNATIVE
# Execute pipeline (one line; no prior configuration file or sample table steps)
# --directory can be used in place of --configfile to automatically generate a configuration file
# --data can be used to specify location of data dependencies
python signalexe.py --directory /path/to/reads --cores NCORES all postprocess ncov_tools

Each of the steps in SIGNAL can be run manually by accessing the individual scripts or running snakemake.

# Download dependencies (only needs to be run once; ~10GB of storage required)
bash scripts/get_data_dependencies.sh -d data -a MN908947.3
# Generate sample table
# Modify existing 'example_config.yaml' for your configuration file
bash scripts/generate_sample_table.sh -d /path/to/reads -n sample_table.csv
# Execute pipeline (step-by-step)
snakemake -kp --configfile config.yaml --cores NCORES --use-conda --conda-prefix=$PWD/.snakemake/conda all
snakemake -kp --configfile config.yaml --cores NCORES --use-conda --conda-prefix=$PWD/.snakemake/conda postprocess
snakemake -kp --configfile config.yaml --cores NCORES --use-conda --conda-prefix=$PWD/.snakemake/conda ncov_tools

Detailed setup and execution:

1. Download necessary database files:

The pipeline requires:

  • Amplicon primer scheme sequences

  • SARS-CoV2 reference fasta

  • SARS-CoV2 reference gbk

  • SARS-CoV2 reference gff3

  • kraken2 viral database

  • Human GRCh38 reference fasta (for composite human-viral BWA index)

 python signalexe.py --dependencies
 # defaults to a directory called `data` in the repository root
 # --data can be used to rename and relocate the resultant directory
 # OR
 bash scripts/get_data_dependencies.sh -d data -a MN908947.3
 # -d allows you to rename and relocate the resultant directory
    

    Note: Downloading the database files requires ~10GB of storage, with up to ~35GB required for all temporary downloads!
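Once the download completes, the data directory holds the files listed above. An illustrative listing (the names shown are examples, not authoritative):

 ls data/
 # MN908947.3.fasta  MN908947.3.gbk  MN908947.3.gff3   viral reference files
 # composite_human_viral/                              GRCh38 + viral BWA index
 # Kraken2/                                            kraken2 viral database
 # primer scheme sequences and BED files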

1.5. Prepare per-rule conda environments (optional, but recommended):

SIGNAL uses controlled conda environments for individual steps in the workflow. These are generally produced upon first execution of SIGNAL with input data; however, an option to install the per-rule environments is available through the signalexe.py script.

 python signalexe.py install
 # Will install per-rule environments
 # Later versions of SIGNAL will include a testing module with curated data to ensure function

2. Generate configuration file:

You can use the --config-only flag to generate both config.yaml and sample_table.csv. The directory provided will be used to auto-generate a name for the run.

python signalexe.py --config-only --directory /path/to/reads
# Outputs: 'reads_config.yaml' and 'reads_sample_table.csv'
# --data can be used to specify the location of data dependencies

You can also create the configuration file by modifying example_config.yaml to suit your system.

Note: Regardless of method, double-check your configuration file to ensure the information is correct!
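For orientation, a hypothetical excerpt limited to keys named elsewhere in this README (result_dir, run_freebayes, phylo_include_seqs); consult example_config.yaml for the authoritative key set and defaults:

 # Hypothetical config.yaml excerpt; key selection and paths are illustrative
 cat > config.yaml <<'EOF'
 result_dir: results                     # where SIGNAL writes output
 run_freebayes: True                     # enable the optional FreeBayes branch
 phylo_include_seqs: data/blank.fasta    # fasta of sequences to include in the ncov-tools tree
 EOF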

3. Specify your samples in CSV format:

See the example table example_sample_table.csv for an idea of how to organise this table.

Using the --config-only flag, both the configuration file and sample table will be generated from a given directory of reads (see step 2 above).

Alternatively, you can attempt to use generate_sample_table.sh to circumvent manual creation of the table.

bash scripts/generate_sample_table.sh
Output:
You must specify a data directory containing fastq(.gz) reads.
ASSUMES FASTQ FILES ARE NAMED AS <sample_name>_L00#_R{1,2}*.fastq(.gz)
Flags:
 -d : Path to directory containing sample fastq(.gz) files (Absolute paths preferred for consistency, but can use relative paths)
 -n : Name or file path for final sample table (with extension) (default: 'sample_table.csv') - will overwrite if file exists
 -e : Name or file path for an existing sample table - will append to the end of the provided table
Select one of '-n' (new sample table) or '-e' (existing sample table).
If neither provided, a new sample table called 'sample_table.csv' will be created (or overwritten) by default.

General usage:

# Create new sample table 'sample_table.csv' given path to reads directory
bash scripts/generate_sample_table.sh -d /path/to/reads -n sample_table.csv
# Append to existing sample table 'sample_table.csv' given path to a directory with additional reads
bash scripts/generate_sample_table.sh -d /path/to/more/reads -e sample_table.csv
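For illustration, a hand-written table consistent with the expected read naming might look as follows (the column names here are assumptions; defer to example_sample_table.csv for the real header):

 # Hypothetical sample table; column names may differ from example_sample_table.csv
 cat > sample_table.csv <<'EOF'
 sample,r1_path,r2_path
 sampleA,/path/to/reads/sampleA_L001_R1_001.fastq.gz,/path/to/reads/sampleA_L001_R2_001.fastq.gz
 Blank1,/path/to/reads/Blank1_L001_R1_001.fastq.gz,/path/to/reads/Blank1_L001_R2_001.fastq.gz
 EOF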

4. Execute pipeline:

For the main signalexe.py script, positional arguments determine which rules of the pipeline to execute, with flags supplementing input parameters.

The main rules of the pipeline are as follows:

  • all = Sequencing pipeline, i.e., take a set of paired reads, perform reference-based assembly to generate a consensus, run lineage assignment, etc.

  • postprocess = Summarize the key results, including pangolin lineage, specific mutations, etc., after running all

  • ncov_tools = Create the required conda environment, generate the necessary configuration file, and link needed result files within the ncov-tools directory. ncov-tools will then be executed with output found within the SIGNAL directory.

The generated configuration file from the above steps can be used as input. To run the general pipeline:

python signalexe.py --configfile config.yaml --cores 4 all

is equivalent to running

snakemake -kp --configfile config.yaml --cores 4 --use-conda --conda-prefix=$PWD/.snakemake/conda all

You can run the snakemake command as written above, but note that if --conda-prefix is not set this way (i.e., $PWD/.snakemake/conda), all environments will be reinstalled each time you change the result_dir in config.yaml.

Alternatively, you can skip the above configuration and sample table generation steps by simply providing the directory of reads to the main script (see step 2):

python signalexe.py --directory /path/to/reads --cores 4 all

A configuration file and sample table will automatically be generated prior to running SIGNAL all.

FreeBayes variant calling and breseq mutational analysis are technically optional tools within the workflow. When using the --directory flag, FreeBayes will run by default and breseq will not. These defaults can be changed using the --remove-freebayes and --add-breseq flags, respectively.
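For example, to auto-generate the configuration while disabling FreeBayes, enabling breseq, and flagging negative controls (all flags as documented in the help screen above):

 python signalexe.py --directory /path/to/reads --cores 4 --remove-freebayes --add-breseq -neg Blank all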

5. Postprocessing analyses:

As with the general pipeline, the generated configuration file from the above steps can be used as input. To run postprocess which summarizes the SIGNAL results:

python signalexe.py --configfile config.yaml --cores 1 postprocess

is equivalent to running

snakemake -kp --configfile config.yaml --cores 1 --use-conda --conda-prefix=$PWD/.snakemake/conda postprocess

After postprocessing finishes, you'll see the following summary files:

 - summary.html top-level summary, with links to per-sample summaries
 - {sample_name}/sample.html per-sample summaries, with links for more detailed info
 - {sample_name}/sample.txt per-sample summaries, in text-file format instead of HTML
 - summary.zip zip archive containing all of the above summary files.

Note that pipeline postprocessing (snakemake postprocess) is separated from the rest of the pipeline (snakemake all). This is because in a multi-sample run, it's likely that at least one pipeline stage will fail. The postprocessing script should handle failed pipeline stages gracefully, by substituting placeholder values when expected pipeline output files are absent. However, this confuses snakemake's dependency tracking, so there seems to be no good alternative to separating pipeline processing and postprocessing into all and postprocess targets.

Related: because pipeline stages can fail, we run snakemake all with the -k flag ("Go on with independent jobs if a job fails"), and recommend doing the same if invoking snakemake directly to run SIGNAL.

Additionally, SIGNAL can prepare output for, and execute, @jts' ncov-tools to generate phylogenies and alternative summaries.

python signalexe.py --configfile config.yaml --cores 1 ncov_tools

is equivalent to running

snakemake -kp --configfile config.yaml --cores 1 --use-conda --conda-prefix=$PWD/.snakemake/conda ncov_tools

SIGNAL manages installing the dependencies (within the conda_prefix) and will generate the necessary hard links to the required SIGNAL input files for ncov-tools, provided ncov-tools has been cloned as a submodule (if not found, the script will attempt to pull the submodule) and a fasta containing sequences to include in the tree has been specified using phylo_include_seqs: in the main SIGNAL config.yaml. If run_freebayes is set to True, SIGNAL will attempt to link the FreeBayes consensus FASTA and variant files, if found. Otherwise, the corresponding iVar files will be used instead.

SIGNAL will then execute ncov-tools, and the output will be found within the SIGNAL results directory (specified in SIGNAL's configuration file) under ncov-tools-results. Refer to the ncov-tools documentation for information regarding specific output.

Multiple operations:

Using signalexe.py positional arguments, you can specify SIGNAL to perform multiple rules in succession.

python signalexe.py --configfile config.yaml --cores NCORES all postprocess ncov_tools

In the above command, SIGNAL all , postprocess , and ncov_tools will run using the provided configuration file as input, which links to a sample table.

Note: Regardless of the order of positional arguments, or the placement of other parameter flags, SIGNAL will always run in the set priority order: all > postprocess > ncov_tools!

Note: If install is provided as input, it will override all other positional arguments!

If no configuration file or sample table was generated for a run, you can provide --directory with the path to sequencing reads and SIGNAL will auto-generate both required inputs prior to running any rules.

python signalexe.py --directory /path/to/reads --cores NCORES all postprocess ncov_tools

Overall, this simplifies executing SIGNAL to one line!

Docker:

Alternatively, the pipeline can be deployed using Docker (see resources/Dockerfile_pipeline for the specification). To pull from Docker Hub:

 docker pull finlaymaguire/signal

Download data dependencies into a data directory that already contains your reads (data in this example, but use whatever name you wish):

 mkdir -p data && docker run -v $PWD/data:/data finlaymaguire/signal:latest bash scripts/get_data_dependencies.sh -d /data

Generate your config.yaml and sample_table.csv (with paths to the readsets underneath /data ) and place them into the data directory:

 cp config.yaml sample_table.csv $PWD/data

WARNING: result_dir in config.yaml must be within /data (e.g., /data/results) for results to be automatically copied to your host system. Otherwise, they will be deleted when the container finishes running (unless docker is run interactively).
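A minimal way to satisfy this, assuming your config.yaml sits in the mounted data directory (the sed edit is just one illustrative option):

 # Point result_dir inside the mounted volume so results persist on the host
 sed -i 's|^result_dir:.*|result_dir: /data/results|' data/config.yaml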

Then execute the pipeline:

 docker run -v $PWD/data:/data finlaymaguire/signal conda run -n snakemake snakemake --configfile /data/config.yaml --use-conda --conda-prefix /covid-19-signal/.snakemake/conda --cores 8 all

Data Summaries:

Convenient extraction script:

SIGNAL produces several output files and directories of its own, alongside the output from ncov-tools. Selected files from the output can be copied or moved for easier parsing using a provided convenience bash script:

bash scripts/get_signal_results.sh
Usage:
bash get_signal_results.sh -s <SIGNAL_results_dir> -d <destination_dir> [-m] [-c]
This script aims to copy (rsync by default, or cp) or move (mv) select output from SIGNAL 'all', 'postprocess', and 'ncov_tools'.
The following files will be transferred over to the specified destination directory (if found):
SIGNAL 'all' & 'postprocess':
-> signal-results/<sample>/<sample>_sample.txt
-> signal-results/<sample>/core/<sample>.consensus.fa
-> signal-results/<sample>/core/<sample>_ivar_variants.tsv
-> signal-results/<sample>/freebayes/<sample>.consensus.fasta
-> signal-results/<sample>/freebayes/<sample>.variants.norm.vcf
SIGNAL 'ncov_tools':
-> ncov_tools-results/qc_annotation/<sample>.ann.vcf
-> ncov-tools-results/qc_reports/<run_name>_ambiguous_position_report.tsv
-> ncov-tools-results/qc_reports/<run_name>_mixture_report.tsv
-> ncov-tools-results/qc_reports/<run_name>_ncov_watch_variants.tsv
-> ncov-tools-results/qc_reports/<run_name>_negative_control_report.tsv
-> ncov-tools-results/qc_reports/<run_name>_summary_qc.tsv
Flags:
 -s : SIGNAL results directory
 -d : Directory where summary will be outputted
 -m : Invoke 'mv' move command instead of 'rsync' copying of results. Optional
 -c : Invoke 'cp' copy command instead of 'rsync' copying of results. Optional

The script uses rsync to produce accurate copies of select output files, organized into signal-results and ncov-tools-results within a provided destination directory (which must already exist). If -c is provided, cp will be used instead of rsync to produce copies. Similarly, if -m is provided, mv will be used instead. (WARNING: any interruption during mv could result in data loss.)
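Typical invocations (the destination directory must already exist, as noted above):

 mkdir -p /path/to/summary
 # Default: rsync copies into signal-results/ and ncov-tools-results/ under the destination
 bash scripts/get_signal_results.sh -s results_dir -d /path/to/summary
 # Use plain cp instead of rsync
 bash scripts/get_signal_results.sh -s results_dir -d /path/to/summary -c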

Pipeline details:

For a step-by-step walkthrough of the pipeline, see pipeline/README.md.

A diagram of the workflow is shown below (update pending).

[Workflow diagram: version 8]

Code Snippets

import os, sys
import shutil
import subprocess
import fileinput
import glob

def link_ivar(root, neg, failed, replace=False):
	print("Linking iVar files to ncov-tools!")

	for variants in snakemake.input['variants']:
		sample = variants.split('/')[0]
		if (sample in failed) and (sample not in neg):
			continue
		ln_path = f"{root}/{sample}.variants.tsv"
		try:
			if (not os.path.exists(ln_path)) or (replace is True):
				# remove equivalent link for Freebayes VCF (namely for replace=True)
				if os.path.exists(f"{root}/{sample}.variants.vcf"):
					os.remove(f"{root}/{sample}.variants.vcf")
				os.link(variants, ln_path)
		except FileExistsError: # equal link ~ error
			if replace:
				os.remove(ln_path)
				os.link(variants, ln_path)
			else:
				pass

	for consensus in snakemake.input['consensus']:
		sample = consensus.split('/')[0]
		if (sample in failed) and (sample not in neg):
			continue
		ln_path = f"{root}/{sample}.consensus.fasta"
		try:
			if (not os.path.exists(ln_path)) or (replace is True):
				os.link(consensus, ln_path)
		except FileExistsError: # more likely given same link path
			if replace:
				os.remove(ln_path)
				os.link(consensus, ln_path)
			else:
				pass
		for line in fileinput.input(ln_path, inplace=True):
			if line.startswith(">"):
				new_header = str(">"+sample)
				new_line = line.replace(line, new_header)
				print(new_line, end='\n')
			else:
				print(line, end='\n')

# take sample name from iVar results, redirect to where corresponding FreeBayes should be
# if FreeBayes file cannot be found, break from loop, replace all with iVar
def link_freebayes(root, neg, failed):
	print("Linking FreeBayes files to ncov-tools!")
	vcf_method = True  # default; set to False if an expected FreeBayes file is missing

	for variants in snakemake.input['variants']:
		sample = variants.split('/')[0]
		if (sample in failed) and (sample not in neg):
			continue
		expected_path = os.path.join(sample, 'freebayes', sample+'.variants.norm.vcf')
		if not os.path.exists(expected_path):
			print("Missing FreeBayes variant file! Switching to iVar input!")
			link_ivar(root, neg, failed, replace=True)
			vcf_method = False
			break
		else:
			vcf_method = True
			ln_path = f"{root}/{sample}.variants.vcf"
			# redundant check; however, it may play a role in re-runs
			if os.path.exists(f"{root}/{sample}.variants.tsv"):
				os.remove(f"{root}/{sample}.variants.tsv")
			if not os.path.exists(ln_path):
				os.link(expected_path, ln_path)

	if vcf_method:
		for consensus in snakemake.input['consensus']:
			sample = consensus.split('/')[0]
			if (sample in failed) and (sample not in neg):
				continue
			expected_path = os.path.join(sample, 'freebayes', sample+'.consensus.fasta')
			if not os.path.exists(expected_path):
				print("Missing FreeBayes variant file! Switching to iVar input!")
				link_ivar(root, neg, failed, replace=True)
				break
			else:
				ln_path = f"{root}/{sample}.consensus.fasta"
				if not os.path.exists(ln_path):
					os.link(expected_path, ln_path)
				for line in fileinput.input(ln_path, inplace=True):
					if line.startswith(">"):
						new_header = str(">"+sample)
						new_line = line.replace(line, new_header)
						print(new_line, end='\n')
					else:
						print(line, end='\n')

	return vcf_method

def set_up():
	print("Writing config.yaml for ncov-tools to ncov-tools/config.yaml")
	#vcf_variant = True # set default

	exec_dir = snakemake.params['exec_dir']
	result_dir = os.path.basename(snakemake.params['result_dir']) # basename of SIGNAL result directory

	### Pull pangolin version number (i.e., version '3' or '4')
	pangolin = str(snakemake.params['pangolin']).split(".")[0].lower().strip("v")
	try:
		assert pangolin == "3" or pangolin == "4" # directly supported versions
	except AssertionError:
		# import urllib.request as web
		# commit_url = web.urlopen(f"https://github.com/cov-lineages/pangolin/releases/latest").geturl()
		# pangolin = commit_url.split("/")[-1].split(".")[0].lower().strip("v") 
		# latest version (should ensure temporary compatibility)
		installed_versions = subprocess.run(["pangolin", "--all-versions"],
								check=True,
								stdout=subprocess.PIPE)
		installed_versions = installed_versions.stdout.decode('utf-8')
		installed_ver_dict = {}
		for dep_ver in map(str.strip, installed_versions.split('\n')):
		# skip empty line at end
			if len(dep_ver) == 0:
				continue
			try:
				dependency, version = dep_ver.split(': ')
			except ValueError:
				continue
			if dependency == 'pangolin':
				pangolin = str(version).split(".",1)[0].strip('v')
				break

### Create data directory within ncov-tools
	data_root = os.path.abspath(os.path.join(exec_dir, 'ncov-tools', "%s" %(result_dir)))
	if os.path.exists(data_root):
		shutil.rmtree(data_root)
	os.mkdir(data_root)

### Pull negative samples (based on common identifiers)
	#neg_names = ("Negative", "NEG", "PCR-NEG", "UTM", "BLANK", "Blank", "blank")
	neg_names = tuple(snakemake.params['negative_control_prefix'])
	neg_samples = set()
	with open(snakemake.params['sample_csv_filename']) as fh:
		fh.readline()
		for line in fh:
			id = line.split(",")[0]
			if id.startswith(neg_names):
				neg_samples.add(id)
	neg_list = list(neg_samples)
	print("Negative control samples found include: %s" %(neg_list))

### Pull failed samples (SIGNAL log file: failed_samples.log)
	if os.path.exists(snakemake.params['failed']):
		with open(snakemake.params['failed']) as fail:
			failed_list = [i.strip() for i in fail.readlines()[1:]]
	else:
		failed_list = []
	print("Failed samples found include: %s" %(failed_list))

### config.yaml parameters
	config = {'data_root': f"'{data_root}'",
			  'run_name': f"'{result_dir}'", # name ncov-tools output files with name of SIGNAL results directory (default: "default")
			  'amplicon_bed': f"'{snakemake.params['amplicon_bed']}'", #grab from signal snakemake config
			  'reference_genome': f"'{snakemake.params['viral_reference_genome']}'", #grab from signal snakemake config
			  'platform': 'illumina',
			  'primer_bed': f"'{snakemake.params['primer_bed']}'",
			  'bed_type': "unique_amplicons",
			  'offset': 0,
			  'completeness_threshold': 0.9,
			  'bam_pattern': "'{data_root}/{sample}.bam'", # symlink files following this
			  'primer_trimmed_bam_pattern': "'{data_root}/{sample}.mapped.primertrimmed.sorted.bam'",
			  'consensus_pattern': "'{data_root}/{sample}.consensus.fasta'", # symlink files following this
			  'variants_pattern': "'{data_root}/{sample}.variants.tsv'",
			  'assign_lineages': 'true',
			  'tree_include_consensus': f"'{snakemake.params['phylo_include_seqs']}'",
			  'negative_control_samples': f"{neg_list}",
			  'mutation_set': 'spike_mutations',
			  'output_directory': f"{result_dir}_ncovresults",
			  'pangolin_version': f"'{pangolin}'",
			  'pango_analysis_mode': f"'{snakemake.params['mode']}'"
			  }

	print("Linking alignment BAMs to ncov-tools!")
	for bam in snakemake.input['bams']:
		sample = bam.split('/')[0]
		# if sample failed and not a negative, skip linking
		if (sample in failed_list) and (sample not in neg_list):
			continue
		ln_path = f"{data_root}/{sample}.bam"
		if not os.path.exists(ln_path):
			os.link(bam, ln_path)

	for primer_trimmed_bam in snakemake.input['primertrimmed_bams']:
		sample = primer_trimmed_bam.split('/')[0]
		if (sample in failed_list) and (sample not in neg_list):
			continue
		ln_path = f"{data_root}/{sample}.mapped.primertrimmed.sorted.bam"
		if not os.path.exists(ln_path):
			os.link(primer_trimmed_bam, ln_path)

	if snakemake.params['freebayes_run']:
		vcf_variant = link_freebayes(data_root, neg_list, failed_list)
		if vcf_variant:
			config['variants_pattern'] = "'{data_root}/{sample}.variants.vcf'"
		else:
			pass # keep as TSV
	else:
		link_ivar(data_root, neg_list, failed_list, replace=False)

	with open(os.path.join(exec_dir, 'ncov-tools', 'config.yaml'), 'w') as fh:
		for key, value in config.items():
			fh.write(f"{key}: {value}\n")

	return exec_dir, result_dir, data_root

def run_all():
	os.system(f"snakemake -s workflow/Snakefile --cores {snakemake.threads} all")

def move(cwd, dest, prefix):
	if os.path.exists(os.path.join(cwd, "ncov-tools", "plots")):
		extensions = ["_amplicon_coverage_heatmap.pdf", "_amplicon_covered_fraction.pdf", "_depth_by_position.pdf", "_tree_snps.pdf"]
		for ext in extensions:
			for file in glob.glob(os.path.join(cwd, "ncov-tools", "plots", "default"+ext)):
				try:
					shutil.copy(file, os.path.join(dest, "plots", prefix+ext))
				except IOError:
					print("Missing file %s" %(prefix+ext))
	else:
		print("Missing ncov-tools 'plots' directory")

	if os.path.exists(os.path.join(cwd, "ncov-tools", "lineages")):
		for file in glob.glob(os.path.join(cwd, "ncov-tools", "lineages", "default_lineage_report.csv")):
			try:
				shutil.copy(file, os.path.join(dest, "lineages", prefix+"_lineage_report.csv"))
			except IOError:
				print("Missing file %s_lineage_report.csv" %(prefix))
	else:
		print("Missing ncov-tools 'lineages' directory")

	if os.path.exists(os.path.join(cwd, "ncov-tools", "qc_analysis")):
		extensions = ["_aligned-delim.fasta.log", "_aligned-delim.iqtree.log", "_aligned.fasta", "_aligned.fasta.insertions.csv", \
					"_aligned.fasta.log", "_alleles.tsv", "_tree.nwk", "_tree_raw.nwk", "_consensus.fasta"]
		for ext in extensions:
			for file in glob.glob(os.path.join(cwd, "ncov-tools", "qc_analysis", "default"+ext)):
				try:
					shutil.copy(file, os.path.join(dest, "qc_analysis", prefix+ext))
				except IOError:
					print("Missing file %s" %(prefix+ext))
	else:
		print("Missing ncov-tools 'qc_analysis' directory")

if __name__ == '__main__':
	exec_dir, result_dir, data_root = set_up()
	run_script = os.path.join(exec_dir, 'scripts', 'run_ncov_tools.sh')
	#print("Don't forget to update the config.yaml file as needed prior to running ncov-tools.")
	print("Running ncov-tools using %s cores!" %(snakemake.threads))

	subprocess.run([run_script, '-c', str(snakemake.threads), '-s', str(result_dir)])

	# clean up
	shutil.rmtree(data_root)

	#run_all()
	#move(exec_dir, result_root, result_dir)
shell:
    '{params.postprocess_script_path} {params.sample_csv_filename}'
From line 226 of master/Snakefile
script: "scripts/ncov-tools.py"
From line 252 of master/Snakefile
shell:
    """
    cp {input.origin_config} {output.config}
    cp {input.origin_sample_table} {output.sample_table}
    """
From line 264 of master/Snakefile
shell:
    'ln -s {input} {output}'
From line 278 of master/Snakefile
shell:
    'if [ $(echo {input} | wc -w) -gt 1 ]; then zcat -f {input} | paste - - - - | sort -k1,1 -t " " | tr "\\t" "\\n" | gzip > {output}; else ln -s {input} {output}; fi'
From line 289 of master/Snakefile
shell:
    """
    fastqc -o {params.output_prefix} {input} 2> {log}
    """
shell:
    '(bwa mem -t {threads} {params.composite_index} '
    '{input.raw_r1} {input.raw_r2} | '
    "{params.script_path} -c {params.viral_contig_name} > {output}) 2> {log} || echo '' > {output}"
From line 331 of master/Snakefile
shell:
    """
    samtools view -b {input} | samtools sort -n -@{threads} > {output.bam} 2> {log}
    samtools fastq -1 {output.r1} -2 {output.r2} -s {output.s} {output.bam} 2>> {log} 
    """
shell:
    'trim_galore --quality {params.min_qual} --length {params.min_len} '
    ' -o {params.output_prefix} --cores {threads} --fastqc '
    "--paired {input.raw_r1} {input.raw_r2} 2> {log} || (echo -e 'Total reads processed:  0\nReads written (passing filters):  0 (0.0%)\nTotal basepairs processed:  0 bp\nTotal written (filtered):  0 bp (0.0%)' >> {log}; touch {output})"
shell:
    """
    python {params.script_path} --input_R1 {input.r1} --input_R2 {input.r2}
    """
From line 397 of master/Snakefile
shell:
    'bwa index -p {params.output_prefix} {input} >{log} 2>&1'
shell:
    '(bwa mem -t {threads} {params.ref_prefix} '
    '{input.r1} {input.r2} | '
    'samtools view -bS | samtools sort -@{threads} -o {output}) 2> {log}'
shell:
    'samtools view -F4 -o {output.mapped_bam} {input}; '
    'samtools index {output.mapped_bam}; '
    'ivar trim -e -i {output.mapped_bam} -b {params.scheme_bed} '
    '-m {params.min_len} -q {params.min_qual} '
    '{params.primer_pairs} '
    '-p {params.ivar_output_prefix} 2> {log}; '
    'samtools sort -o {output.sorted_trimmed_mapped_bam} '
    '{output.trimmed_mapped_bam}'
shell:
    """
    fastqc -o {params.output_prefix} {input} 2> {log}
    """
shell:
    """
    samtools sort -n {input} -o {output.bam} 2> {log}
    samtools fastq -1 {output.r1} -2 {output.r2} -s {output.s} {output.bam} 2>> {log} 
    """
shell:
    '(samtools mpileup -aa -A -d {params.mpileup_depth} -Q0 {input} | '
    'ivar consensus -t {params.ivar_freq_threshold} '
    '-m {params.ivar_min_coverage_depth} -n N -p {params.output_prefix}) '
    '2>{log}'
shell:
    'samtools faidx {input}'
shell:
    '(samtools mpileup -aa -A -d 0 --reference {input.reference} -B '
        '-Q 0 {input.read_bam} | '
    'ivar variants -r {input.reference} -m {params.ivar_min_coverage_depth} '
    '-p {params.output_prefix} -q {params.ivar_min_variant_quality} '
    '-t {params.ivar_min_freq_threshold} -g {input.viral_reference_gff}) 2> {log} || '
    " (echo -e 'REGION\tPOS\tREF\tALT\tREF_DP\tREF_RV\tREF_QUAL\tALT_DP\tALT_RV\tALT_QUAL\tALT_FREQ\tTOTAL_DP\tPVAL\tPASS\tGFF_FEATURE\tREF_CODON\tREF_AA\tALT_CODON\tALT_AA' > {output}) "
shell:
    """
    breseq --reference {params.ref} --num-processors {threads} --polymorphism-prediction --brief-html-output --output {params.outdir} {input} > {log} 2>&1 || touch {output}
    """
shell:
    """
    mkdir -p $(dirname {params.out})
    # the sed is to fix the header until a release is made with this fix
    # https://github.com/freebayes/freebayes/pull/549
    freebayes -p 1 \
              -f {input.reference} \
              -F 0.2 \
              -C 1 \
              --pooled-continuous \
              --min-coverage {params.freebayes_min_coverage_depth} \
              --gvcf --gvcf-dont-use-chunk true {input.read_bam} | sed s/QR,Number=1,Type=Integer/QR,Number=1,Type=Float/ > {params.out}.gvcf

    # make depth mask, split variants into ambiguous/consensus
    # NB: this has to happen before bcftools norm or else the depth mask misses any bases exposed during normalization
    python {params.script_path} -d {params.freebayes_min_coverage_depth} \
                    -l {params.freebayes_min_freq_threshold} \
                    -u {params.freebayes_freq_threshold} \
                    -m {params.out}.mask.txt \
                    -v {params.out}.variants.vcf \
                    -c {params.out}.consensus.vcf {params.out}.gvcf 

    # normalize variant records into canonical VCF representation
    bcftools norm -f {input.reference} {params.out}.variants.vcf > {output.variants} || (echo -e '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tunknown' > {output.variants})
    bcftools norm -f {input.reference} {params.out}.consensus.vcf > {params.out}.consensus.norm.vcf

    # split the consensus sites file into a set that should be IUPAC codes and all other bases, using the ConsensusTag in the VCF
    for vt in "ambiguous" "fixed"; do
        cat {params.out}.consensus.norm.vcf | awk -v vartag=ConsensusTag=$vt '$0 ~ /^#/ || $0 ~ vartag' > {params.out}.$vt.norm.vcf
        bgzip -f {params.out}.$vt.norm.vcf  
        tabix -f -p vcf {params.out}.$vt.norm.vcf.gz 
    done

    # apply ambiguous variants first using IUPAC codes. this variant set cannot contain indels or the subsequent step will break
    bcftools consensus -f {input.reference} -I {params.out}.ambiguous.norm.vcf.gz > {params.out}.ambiguous.fasta 
    # apply remaining variants, including indels
    bcftools consensus -f {params.out}.ambiguous.fasta -m {params.out}.mask.txt {params.out}.fixed.norm.vcf.gz | sed s/MN908947\.3.*/{wildcards.sn}/ > {output.consensus}
    """
shell:
    """
    python {params.script_path} -g {input.ivar} -r {input.freebayes} -o vcf > {output}
    """
From line 666 of master/Snakefile
shell:
    'bedtools genomecov -d -ibam {input} > {output}'
shell:
    "python {params.script_path} {input} {output}"
From line 692 of master/Snakefile
shell:
    'cd {params.outdir} '
    '&& kraken2'
        ' --db {params.db}'
        ' --threads {threads}'
        ' --quick --unclassified-out "{params.labelled_unclassified_reads}"'
        ' --classified-out "{params.labelled_classified_reads}"'
        ' --output {params.labelled_output}'
        ' --paired --gzip-compressed'
        ' ../../{input.r1} ../../{input.r2}'
        ' --report {params.labelled_report}'
        ' 2>../../{log} && (cd ../.. && touch {output})'
shell:
     'quast {input} -r {params.genome} -g {params.fcoords} --output-dir {params.outdir} --threads {threads} >{log} && '
     'for f in {params.unlabelled_reports}; do mv $f ${{f/report/{params.sample_name}}}; done'
shell:
     'quast {input} -r {params.genome} -g {params.fcoords} --output-dir {params.outdir} --threads {threads} >{log} && '
     'for f in {params.unlabelled_reports}; do mv $f ${{f/report/{params.sample_name}}}; done'
shell:
    """
    cat {input} > {output}
    sample=''
    count=''
    echo "Samples that failed to assemble:" > failed_samples.log
    while read -r line;
    do
        if [[ $line =~ '>' ]]; then
            sample=$(echo $line | cut -d'.' -f1 | cut -d'_' -f2)
        else
            count=$(echo $line | wc -c)
            if [[ $count -eq 1 ]]; then
                echo $sample >> failed_samples.log
            else
                continue
            fi
        fi
    done < {output}
    """
From line 782 of master/Snakefile
shell:
    "echo -e 'pangolin: {params.pangolin_ver}\nconstellations: {params.constellations_ver}\nscorpio: {params.scorpio_ver}\npangolearn: {params.pangolearn_ver}\npango-designation: {params.designation_ver}\npangolin-data: {params.data_ver}' > {output.pango_ver_out} && "
    "echo -e 'nextclade: {params.nextclade_ver}\nnextclade-dataset: {params.nextclade_data}\nnextclade-include-recomb: {params.nextclade_recomb}' > {output.nextclade_ver_out} && "
    '{params.assignment_script_path} -i {input} -t {threads} -o {output.lin_out} -p {output.pango_ver_out} -n {output.nextclade_ver_out} --mode {params.analysis_mode}'
From line 825 of master/Snakefile
shell:
    """
    cat {input} > {output}
    sample=''
    seq=''
    count=''
    out=''
    if [[ -f 'failed_samples.log' ]]; then
        out='.failed_freebayes_samples.tmp'
        cat failed_samples.log | sed 1,1d > $out
        echo "Samples that failed to assemble:" > failed_samples.log
    else
        out='failed_samples.log'
        echo "Samples that failed to assemble:" > $out
    fi
    while read -r line;
    do
        if [[ $line =~ '>' ]]; then
            if [[ $(echo $seq | wc -c) -eq 1 ]]; then # check if new seq
                count=$(echo $seq | grep -vc 'N')
                if [[ $count -eq 0 ]]; then
                    echo $sample >> $out
                fi
                sample=$(echo $line | cut -d'>' -f2) # start new seq
                seq=''
            else
                sample=$(echo $line | cut -d'>' -f2) # first seq
            fi
        else
            seq+=$line # append seq
        fi
    done < {output}

    if [[ ! $out == 'failed_samples.log' ]]; then
        sort -b -d -f $out | uniq >> failed_samples.log
        rm $out
    fi
    """
From line 835 of master/Snakefile
shell:
    '{params.assignment_script_path} -i {input.consensus} -t {threads} -o {output} -p {input.p_vers} -n {input.n_vers} --mode {params.analysis_mode} --skip'
From line 886 of master/Snakefile
