V-pipe is a pipeline designed for analysing NGS data of short viral genomes
V-pipe is a workflow designed for the analysis of next-generation sequencing (NGS) data from viral pathogens. It produces a number of results in a curated format (e.g., consensus sequences, SNV calls, local/global haplotypes). V-pipe is written using the Snakemake workflow management system.
Usage
Different ways of initializing V-pipe are presented below. We strongly encourage you to deploy it using the quick install script, as this is our preferred method.
To configure V-pipe, refer to the documentation in config/README.md.
V-pipe expects the input samples to be organized in a two-level directory hierarchy, and the sequencing reads must be provided in a sub-folder named raw_data. Further details can be found on the website.
Check the utils subdirectory for mass-importer tools that can assist you in generating this hierarchy.
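For illustration, a minimal two-level layout could look as follows (sample names, dates, and file names are placeholders; consult config/README.md and the website for the exact requirements):

samples/
  sample_A/
    20200101/
      raw_data/
        sample_A_R1.fastq.gz
        sample_A_R2.fastq.gz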
We provide virus-specific base configuration files which contain handy defaults for, e.g., HIV and SARS-CoV-2. Set the virus in the general section of the configuration file:
general:
virus_base_config: hiv
Also see Snakemake's documentation to learn more about the command-line options available when executing the workflow.
Tutorials introducing usage of V-pipe are available in the docs/ subdirectory.
Using quick install script
To deploy V-pipe, use the installation script with the following parameters:
curl -O 'https://raw.githubusercontent.com/cbg-ethz/V-pipe/master/utils/quick_install.sh'
./quick_install.sh -w work
This script will download and install Miniconda, check out the V-pipe git repository (use -b to specify a branch/tag), and set up a work directory (specified with -w) containing an executable script that will run the workflow:
cd work
# edit config.yaml and provide samples/ directory
./vpipe --jobs 4 --printshellcmds --dry-run
Using Docker
Note: the Docker image only ships with the components needed to run the workflow for the HIV and SARS-CoV-2 virus base configurations. Using V-pipe with other viruses or configurations might require internet connectivity to fetch additional software components.
Create config.yaml or vpipe.config, and then populate the samples/ directory.
For example, the following config file could be used:
general:
virus_base_config: hiv
output:
snv: true
local: true
global: false
visualization: true
QA: true
Then execute:
docker run --rm -it -v $PWD:/work ghcr.io/cbg-ethz/v-pipe:master --jobs 4 --printshellcmds --dry-run
Using Snakedeploy
First install mamba, then create and activate an environment with Snakemake and Snakedeploy:
mamba create -c bioconda -c conda-forge --name snakemake snakemake snakedeploy
conda activate snakemake
Snakemake's official workflow installer Snakedeploy can now be used:
snakedeploy deploy-workflow https://github.com/cbg-ethz/V-pipe --tag master .
# edit config/config.yaml and provide samples/ directory
snakemake --use-conda --jobs 4 --printshellcmds --dry-run
Dependencies
- Conda is a cross-platform package management system and an environment manager application. Snakemake uses mamba as its package manager.
- Snakemake is the central workflow and dependency manager of V-pipe. It determines the order in which individual tools are invoked and checks that programs do not exit unexpectedly.
- VICUNA is a de novo assembly software designed for populations with high mutation rates. It is used to build an initial reference for mapping reads with the ngshmmalign aligner when a references/cohort_consensus.fasta file is not provided. Further details can be found in the wiki pages.
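If you already have a suitable initial reference, you can skip the VICUNA assembly step by supplying that file yourself; a minimal sketch (the source FASTA path is a placeholder):

# provide your own cohort consensus so VICUNA is not needed
mkdir -p references
cp /path/to/your_cohort_consensus.fasta references/cohort_consensus.fasta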
Computational tools
Other dependencies are managed using isolated conda environments per rule; below we list some of the computational tools integrated in V-pipe:
- FastQC gives an overview of the raw sequencing data. Flowcells that have been overloaded or that otherwise failed during sequencing can easily be detected with FastQC.
- Trimming and clipping of reads is performed by PRINSEQ. It is currently the most versatile raw-read processor, with many customization options.
- We perform the alignment of the curated NGS data using our custom ngshmmalign, which takes structural variants into account. It produces multiple consensus sequences that include either majority bases or ambiguous bases.
- In order to detect specific cross-contaminations with other probes, the Burrows-Wheeler aligner (BWA) is used. It quickly yields estimates of foreign genomic material in an experiment. Additionally, it can be used as an alternative aligner to ngshmmalign.
- To standardise multiple samples to the same reference genome (say HXB2 for HIV-1), the multiple sequence aligner MAFFT is employed. The multiple sequence alignment helps in determining regions of low conservation and thus makes standardisation of alignments more robust.
- samtools is the Swiss Army knife of alignment postprocessing and diagnostics. bcftools is also used to generate the consensus sequence with indels.
- We perform genomic liftovers to standardised reference genomes using our in-house Python library of utilities for rewriting alignments.
- ShoRAH performs SNV calling and local haplotype reconstruction using Bayesian clustering.
- LoFreq (version 2) is an SNV and indel caller for next-generation sequencing data and can be used as an alternative engine for SNV calling.
- SAVAGE and HaploClique: we use HaploClique or SAVAGE to perform global haplotype reconstruction for heterogeneous viral populations using an overlap graph.
Citation
If you use this software in your research, please cite:
Posada-Céspedes S., Seifert D., Topolsky I., Jablonski K.P., Metzner K.J., and Beerenwinkel N. 2021. "V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput sequencing data." Bioinformatics, January. doi: 10.1093/bioinformatics/btab015.
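For convenience, a BibTeX entry assembled from the information above (the citation key is arbitrary; see the DOI for the complete record, e.g., volume and page numbers):

@article{vpipe2021,
  author  = {Posada-C{\'e}spedes, S. and Seifert, D. and Topolsky, I. and Jablonski, K. P. and Metzner, K. J. and Beerenwinkel, N.},
  title   = {V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput sequencing data},
  journal = {Bioinformatics},
  year    = {2021},
  month   = {January},
  doi     = {10.1093/bioinformatics/btab015}
}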
Contributions
- Tobias Marschall
* software maintainer; ** group leader
Contact
We encourage users to use the issue tracker. For further enquiries, you can also contact the V-pipe Dev Team at [email protected].
Code Snippets
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 | shell: """ CONSENSUS_NAME={wildcards.dataset} CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}" CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}" source {params.FUNCTIONS} ERRFILE=$(basename {log.errfile}) OUTFILE=$(basename {log.outfile}) # 1. copy initial reference for bwa rm -rf {params.WORK_DIR}/ mkdir -p {params.WORK_DIR}/ cp {input.global_ref} {params.WORK_DIR}/consensus.fasta cd {params.WORK_DIR} # 2. create bwa index {params.BWA} index consensus.fasta 2> >(tee $ERRFILE >&2) # 3. create initial alignment if [[ {params.PAIRED_BOOL} == "true" ]]; then {params.BWA} mem -t {threads} consensus.fasta ../preprocessed_data/R{{1,2}}.fastq > first_aln.sam 2> >(tee -a $ERRFILE >&2) else {params.BWA} mem -t {threads} consensus.fasta ../preprocessed_data/R1.fastq > first_aln.sam 2> >(tee -a $ERRFILE >&2) fi rm consensus.fasta.* # 4. remove unmapped reads {params.SAMTOOLS} view -b -F 4 first_aln.sam > mapped.bam 2> >(tee -a $ERRFILE >&2) rm first_aln.sam # 5. extract reads mkdir -p cleaned SamToFastq {params.PICARD} I=mapped.bam FASTQ=cleaned/R1.fastq {params.PAIRED} VALIDATION_STRINGENCY=SILENT 2> >(tee -a $ERRFILE >&2) rm mapped.bam # 6. create config file # NOTE: Tabs are required below if [[ {params.PAIRED_BOOL} == "true" ]]; then cat > vicuna_config.txt <<- _EOF_ minMSize 9 maxOverhangSize 2 Divergence 8 max_read_overhang 2 max_contig_overhang 10 pFqDir cleaned/ batchSize 100000 LibSizeLowerBound 100 LibSizeUpperBound 800 min_output_contig_len 1000 outputDIR ./ _EOF_ else cat > vicuna_config.txt <<- _EOF_ minMSize 9 maxOverhangSize 2 Divergence 8 max_read_overhang 2 max_contig_overhang 10 npFqDir cleaned/ batchSize 100000 min_output_contig_len 1000 outputDIR ./ _EOF_ fi # 7. VICUNA OMP_NUM_THREADS={threads} {params.VICUNA} vicuna_config.txt > $OUTFILE 2> >(tee -a $ERRFILE >&2) rm vicuna_config.txt rm -r cleaned/ # 8. fix broken header sed -e 's:>dg-\([[:digit:]]\+\)\s.*:>dg-\1:g' contig.fasta > contig_clean.fasta # 9. InDelFixer + ConsensusFixer to polish up consensus for i in {{1..3}} do mv consensus.fasta old_consensus.fasta indelFixer {params.INDELFIXER} -i contig_clean.fasta -g old_consensus.fasta >> $OUTFILE 2> >(tee -a $ERRFILE >&2) sam2bam {params.SAMTOOLS} reads.sam >> $OUTFILE 2> >(tee $ERRFILE >&2) consensusFixer {params.CONSENSUSFIXER} -i reads.bam -r old_consensus.fasta -mcc 1 -mic 1 -d -pluralityN 0.01 >> $OUTFILE 2> >(tee $ERRFILE >&2) done sed -i -e "s/>.*/>${{CONSENSUS_NAME}}/" consensus.fasta echo "" >> consensus.fasta # 10. finally, move into place mkdir -p ../references mv {{,../references/vicuna_}}consensus.fasta """ |
shell:
    """
    cat {input} > initial_ALL.fasta
    {params.MAFFT} --nuc --preservecase --maxiterate 1000 --localpair --thread {threads} initial_ALL.fasta > references/initial_aln.fasta 2> >(tee {log.errfile} >&2)
    rm initial_ALL.fasta

    {params.REMOVE_GAPS} references/initial_aln.fasta -o {output} -p 0.5 > {log.outfile} 2> >(tee -a {log.errfile} >&2)
    """
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    mkdir -p {wildcards.dataset}/references/
    {params.EXTRACT_SEQ} {input} -o {output} -s "${{CONSENSUS_NAME}}"
    """
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    mkdir -p {wildcards.dataset}/references/
    cp {input} {output}
    sed -i -e "s/>.*/>${{CONSENSUS_NAME}}/" {output}
    """
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    mkdir -p {wildcards.dataset}/references/
    cp {input} {output}
    sed -i -e "s/>.*/>${{CONSENSUS_NAME}}/" {output}
    """
shell:
    """
    {params.GUNZIP} -c {input} > {output} 2> >(tee {log.errfile} >&2)
    """
314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 | shell: """ CONSENSUS_NAME={wildcards.dataset} CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}" CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}" # 1. clean previous run rm -rf {wildcards.dataset}/alignments rm -f {wildcards.dataset}/references/ref_ambig.fasta rm -f {wildcards.dataset}/references/ref_majority.fasta mkdir -p {wildcards.dataset}/alignments mkdir -p {wildcards.dataset}/references # 2. perform alignment # -l = leave temps {params.NGSHMMALIGN} -v {params.EXTRA} -R {input.initial_ref} -o {output.good_aln} -w {output.reject_aln} -t {threads} -N "${{CONSENSUS_NAME}}" {params.LEAVE_TEMP} {input.FASTQ} > {log.outfile} 2> >(tee {log.errfile} >&2) # 3. move references into place mv {wildcards.dataset}/{{alignments,references}}/ref_ambig.fasta mv {wildcards.dataset}/{{alignments,references}}/ref_majority.fasta """ |
shell:
    """
    cat {input} > ALL_{wildcards.kind}.fasta
    {params.MAFFT} --nuc --preservecase --maxiterate 1000 --localpair --thread {threads} ALL_{wildcards.kind}.fasta > {output} 2> >(tee {log.errfile} >&2)
    rm ALL_{wildcards.kind}.fasta
    """
shell:
    """
    {params.CONVERT_REFERENCE} -t {params.REF_NAME} -m {input.REF_ambig} -i {input.BAM} -o {output} > {log.outfile} 2> >(tee {log.errfile} >&2)
    """
shell:
    """
    echo "Writing BAM file"
    rm -f '{params.sort_tmp}'.[0-9]*.bam
    {params.SAMTOOLS} sort -T "{params.sort_tmp}" -o "{output.BAM}" "{input}" > {log.outfile} 2> >(tee {log.errfile} >&2)
    {params.SAMTOOLS} index "{output.BAM}" >> {log.outfile} 2> >(tee -a {log.errfile} >&2)
    """
shell:
    """
    {params.BWA} index {input} 2> >(tee {log.errfile} >&2)
    """
shell:
    """
    {params.BWA} mem -t {threads} {params.EXTRA} -o "{output.TMP_SAM}" "{input.REF}" {input.FASTQ} 2> >(tee {log.errfile} >&2)
    # Filter alignments: (1) remove unmapped reads (single-end) or keep only reads mapped in proper pairs (paired-end), (2) remove supplementary alignments
    {params.SAMTOOLS} view -h {params.FILTER} -F 2048 -o "{output.REF}" "{output.TMP_SAM}" 2> >(tee -a {log.errfile} >&2)
    """
shell:
    """
    {params.BOWTIE} {input} {input} 2> >(tee {log.errfile} >&2)
    """
shell:
    """
    {params.BOWTIE} -x {input.REF} -1 {input.R1} -2 {input.R2} {params.PHRED} {params.PRESET} -X {params.MAXINS} {params.EXTRA} -p {threads} -S {output.TMP_SAM} 2> >(tee {log.errfile} >&2)
    # Filter alignments: (1) keep only reads mapped in proper pairs, and (2) remove supplementary alignments
    {params.SAMTOOLS} view -h -f 2 -F 2048 -o "{output.REF}" "{output.TMP_SAM}" 2> >(tee -a {log.errfile} >&2)
    """
shell:
    """
    {params.BOWTIE} -x {input.REF} -U {input.R1} {params.PHRED} {params.PRESET} {params.EXTRA} -p {threads} -S {output.TMP_SAM} 2> >(tee {log.errfile} >&2)
    # Filter alignments: (1) remove unmapped reads, and (2) remove supplementary alignments
    {params.SAMTOOLS} view -h -F 4 -F 2048 -o "{output.REF}" "{output.TMP_SAM}" 2> >(tee -a {log.errfile} >&2)
    """
shell:
    """
    rm -rf {params.DIR}/*/*/extracted_data
    """
shell:
    """
    rm -rf {params.DIR}/*/*/preprocessed_data
    """
shell:
    """
    rm -rf {params.DIR}/*/*/initial_consensus
    rm -rf {params.DIR}/*/*/references/vicuna_consensus.fasta
    rm -rf {params.DIR}/*/*/references/initial_consensus.fasta
    rm -rf references/initial_aln.fasta
    rm -rf references/initial_aln_gap_removed.fasta
    rm -rf references/MAFFT_initial_aln.*
    """
shell:
    """
    rm -rf references/ALL_aln_*.fasta
    rm -rf references/MAFFT_*_cohort.*
    """
shell:
    """
    rm -rf {params.DIR}/*/*/alignments
    rm -rf {params.DIR}/*/*/QA_alignments
    rm -rf {params.DIR}/*/*/references/ref_ambig.fasta
    rm -rf {params.DIR}/*/*/references/ref_majority.fasta
    rm -rf {params.DIR}/*/*/references/initial_consensus.fasta
    """
shell:
    """
    rm -f {input}
    rm -rf {params.DIR}/*/*/alignments
    rm -rf {params.DIR}/*/*/references/ref_ambig*.fasta
    rm -rf {params.DIR}/*/*/references/ref_majority*.fasta
    """
shell:
    """
    rm -f {input}
    rm -rf {params.DIR}/*/*/alignments
    rm -rf {params.DIR}/*/*/references/ref_ambig*.fasta
    rm -rf {params.DIR}/*/*/references/ref_majority*.fasta
    """
shell:
    """
    rm -rf {params.DIR}/*/*/variants/SNVs
    """
shell:
    """
    rm -rf {params.DIR}/*/*/variants/global/contigs_stage_?.fasta
    rm -rf {params.DIR}/*/*/variants/global/stage_?
    """
shell:
    """
    rm {params.DIR}/*/*/variants/global/quasispecies.*
    """
shell:
    """
    rm -rf {params.DIR}/*/*/variants/global/predicthaplo/
    """
shell:
    """
    rm -rf {params.DIR}/*/*/visualization
    """
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 | shell: """ {params.bcftools} mpileup \ --threads {threads} \ -Ou \ -f {input.fname_ref} \ --max-depth {params.max_coverage} \ --max-idepth {params.max_coverage} \ --annotate FORMAT/AD,FORMAT/DP,INFO/AD \ {input.fname_bam} \ | {params.bcftools} call \ --threads {threads} \ -Ou \ -mv \ --keep-alts \ | {params.bcftools} norm \ --threads {threads} \ -Ou \ -f {input.fname_ref} \ | {params.bcftools} filter \ --threads {threads} \ -e 'TYPE="INDEL" & INFO/AD[1]<INFO/AD[0]' \ -Ob \ --output {output.fname_temp_bcf} \ 2> >(tee {log.errfile} >&2) #bcftools csq -f {input.fname_ref} -g wheretogetthis.gff3.gz in.vcf -Ob -o out.bcf {params.gunzip} -c {input.fname_cov} | tail -n +2 \ | awk -v base={params.tsvbased} \ '$3 < {params.mask_coverage_threshold} {{printf "%s\\t%d\\t%d\\n", $1, $2 - base, $2 - base + 1}}' \ > {output.fname_mask_lowcoverage} \ 2> >(tee -a {log.errfile} >&2) # preparations {params.enhance_bcf} \ {output.fname_temp_bcf} \ {output.fname_bcf} \ {params.ambiguous_base_coverage_threshold} \ 2> >(tee -a {log.errfile} >&2) {params.bcftools} index {output.fname_bcf} 2> >(tee -a {log.errfile} >&2) common_consensus_params="--fasta-ref {input.fname_ref} --mark-del - --mask {output.fname_mask_lowcoverage} --mask-with n" # majority bases {params.bcftools} consensus \ --output {output.fname_fasta} \ --chain {output.fname_chain} \ $common_consensus_params \ -H A \ -i "INFO/AD[0]<INFO/AD[*]" \ {output.fname_bcf} \ 2> >(tee -a {log.errfile} >&2) # ambiguous bases {params.bcftools} consensus \ --output {output.fname_fasta_ambig} \ --chain {output.fname_chain_ambig} \ $common_consensus_params \ -H I --iupac-codes \ {output.fname_bcf} \ 2> >(tee -a {log.errfile} >&2) """ |
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    {params.EXTRACT_CONSENSUS} -i {input.BAM} -f {input.REF} -c {params.MIN_COVERAGE} -n {params.N_COVERAGE} -q {params.QUAL_THRD} -a {params.MIN_FREQ} -N "${{CONSENSUS_NAME}}" -o {params.OUTDIR} > {log.outfile} 2> >(tee -a {log.errfile} >&2)
    """
shell:
    """
    if tail -n +2 {input.REF_majority_dels} | grep -qE '[^n]'; then
        {params.MATCHER} -asequence {input.REF} -bsequence {input.REF_majority_dels} -outfile {output.REF_matcher} 2> >(tee {log.errfile} >&2)
    else
        touch {output.REF_matcher}
        echo "pure 'nnnn...' consensus, no possible alignment" | tee {log.outfile}
    fi
    """
shell:
    """
    {params.FRAMESHIFT_DEL_CHECKS} -i {input.BAM} -c {input.REF_majority_dels} -f {input.REF_NAME} -g {input.GENES_GFF} --english=true -o {output.FRAMESHIFT_DEL_CHECK_TSV} 2> >(tee {log.errfile} >&2)
    """
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 | shell: """ # 1. cleanup old run rm -f {output.SAM} {output.MSA} # 2. concatenate references mkdir -p {wildcards.dataset}/QA_alignments cat {input.patient_ref} {input.virusmix_ref} > {wildcards.dataset}/QA_alignments/bwa_refs_{wildcards.kind}.fasta # 3. indexing {params.BWA} index {wildcards.dataset}/QA_alignments/bwa_refs_{wildcards.kind}.fasta 2> >(tee {log.errfile} >&2) # 4. align {params.BWA} mem -t {threads} {wildcards.dataset}/QA_alignments/bwa_refs_{wildcards.kind}.fasta {input.FASTQ} > {output.SAM} 2> >(tee -a {log.errfile} >&2) # 5. MSA {params.MAFFT} --nuc --preservecase --maxiterate 1000 --localpair --thread {threads} {wildcards.dataset}/QA_alignments/bwa_refs_{wildcards.kind}.fasta > {output.MSA} 2> >(tee -a {log.errfile} >&2) # 6. cleanup BWA indices rm -f {wildcards.dataset}/QA_alignments/bwa_refs_{wildcards.kind}.fasta.* """ |
79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 | shell: """ CONSENSUS_NAME={wildcards.dataset} CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}" CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}" # 1. clean previous run rm -f {output} mkdir -p {wildcards.dataset}/QA_alignments # 2. collect coverage stats # we only collect statistics in the loop regions # of HIV-1 in order {params.COV_STATS} -t {params.TARGET} -i {input.BAM} -o {output} -m {input.MSA} --select "${{CONSENSUS_NAME}}" > {log.outfile} 2> >(tee {log.errfile} >&2) """ |
35 36 37 38 39 40 41 42 43 44 45 46 47 | shell: """ echo "Keep reject -----------------------------------------------------" echo {params.SAMTOOLS} bam2fq -@ {threads} \ -F 2 \ -1 {output.reject_1} \ -2 {output.reject_2} \ {input.reject_aln} \ 2> >(tee {log.errfile} >&2) echo """ |
79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 | shell: """ echo "Filter out virus' reads -----------------------------------------" echo {params.BWA} mem -t {threads} \ -o {output.tmp_aln} \ {input.global_ref} {input.fastq} \ > {log.outfile} 2> >(tee {log.errfile} >&2) echo echo "Keep reject -----------------------------------------------------" echo {params.SAMTOOLS} bam2fq -@ {threads} \ -F 2 \ -1 {output.reject_1} \ -2 {output.reject_2} \ {output.tmp_aln} \ 2> >(tee -a {log.errfile} >&2) echo """ |
shell:
    """
    curl --output "{output.host_ref}" "{params.host_ref_url}"
    """
197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 | shell: """ # using zcat FILENAME.gz causes issues on Mac, see # https://serverfault.com/questions/570024/ # redirection fixes this: unpack_rawreads() {{ for fq in "${{@}}"; do zcat -f < "${{fq}}" done }} echo echo "Count aligned reads ---------------------------------------------" echo count=$({params.SAMTOOLS} view -@ {threads} -c -f {params.F} -F 2304 {input.host_aln} | tee {output.filter_count} 2> >(tee {log.errfile} >&2) ) if (( count > 0 )); then echo echo "-----------------------------------------------------------------" echo "Needs special care: ${{count}} potential human reads found" echo "-----------------------------------------------------------------" echo echo "Removing identified host reads from raw reads -------------------" echo # get list {params.SAMTOOLS} view -@ {threads} \ -f {params.F} \ {input.host_aln} \ | cut -f 1 > {output.filter_list} 2> >(tee -a {log.errfile} >&2) unpack_rawreads {input.R1:q} \ | {params.remove_reads_script} {output.filter_list} \ | gzip \ > {output.filtered_1} 2> >(tee -a {log.errfile} >&2) & unpack_rawreads {input.R2:q} \ | {params.remove_reads_script} {output.filter_list} \ | gzip \ > {output.filtered_2} 2> >(tee -a {log.errfile} >&2) & wait if (( {params.keep_host} )); then # keep the rejects for further analysis echo echo "Keeping host-aligned virus' rejects ------------------------------" echo # (we compress reference-less, because the reference size is larger # than the contaminant reads) rm -f '{params.sort_tmp}'.[0-9]*.bam FMT=cram,no_ref,use_bzip2,use_lzma,level=9,seqs_per_slice=1000000 {params.SAMTOOLS} view -@ {threads} \ -h -f 2 \ {input.host_aln} \ | {params.SAMTOOLS} sort -@ {threads} \ -T {params.sort_tmp} \ --output-fmt ${{FMT}} \ -o {params.host_aln_cram} \ 2> >(tee -a {log.errfile} >&2) echo echo "Compressing host-depleted raw reads ------------------------------" echo {params.SAMTOOLS} index -@ {threads} {params.host_aln_cram} 2> >(tee -a {log.errfile} >&2) fi else echo echo "No potential human reads found -----------------------------------" echo "Copy raw reads file" echo unpack_rawreads {input.R1} | gzip > {output.filtered_1} 2> >(tee -a {log.errfile} >&2) & unpack_rawreads {input.R2} | gzip > {output.filtered_2} 2> >(tee -a {log.errfile} >&2) & wait touch {output.filter_list} fi echo """ |
314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 | shell: """ echo "Compress filtered sequences --------------------------------------" echo {params.BWA} mem -t {threads} \ -C \ -o {output.cram_sam} \ {input.global_ref} {input.filtered_1} {input.filtered_2} \ > {log.outfile} 2> >(tee {log.errfile} >&2) # HACK handle incompatibilities between: # - Illumina's 'bcl2fastq', which write arbitrary strings # - 'bwa mem' which keep comments in the SAM file verbatim as in the FASTQ file # - 'samtools' which expects comment to be properly marked as 'BC:Z:' # as per SAM format specs REGEXP=\'s{{(?<=\\t)([[:digit:]]:[[:upper:]]:[[:digit:]]:([ATCGN]+(\+[ATCGN]+)?|[[:digit:]]+))$}}{{BC:Z:\\1}}\' FMT=cram,embed_ref,use_bzip2,use_lzma,level=9,seqs_per_slice=1000000 rm -f '{params.sort_tmp}'.[0-9]*.bam perl -p -e ${{REGEXP}} {output.cram_sam} \ | {params.SAMTOOLS} sort -@ {threads} \ -T {params.sort_tmp} \ -M \ --reference {input.global_ref} \ --output-fmt ${{FMT}} \ -o {output.final_cram} \ 2> >(tee -a {log.errfile} >&2) {params.checksum_type}sum {output.final_cram} > {output.checksum} 2> >(tee -a {log.errfile} >&2) echo echo DONE ------------------------------------------------------------- echo """ |
script:
    "../scripts/compute_diversity_measures.py"
script:
    "../scripts/aggregate_diversity.py"
shell:
    """
    {params.HAPLOCLIQUE} {params.EXTRA_PARAMETERS} {params.RELAX} {params.NO_SINGLETONS} {params.NO_PROB0} --limit_clique_size={params.CLIQUE_SIZE_LIMIT} --max_cliques={params.MAX_NUM_CLIQUES} --log={log.outfile} --bam {input} {params.OUTPREFIX} 2> >(tee {log.errfile} >&2)
    """
shell:
    """
    {params.COMPUTE_MDS} -q {params.INPREFIX} -s {params.REGION_START} -e {params.REGION_END} {params.USE_MSA} {params.MSA} -p {output.PDF} -o {params.TSV} > {log.output} 2> >(tee {log.errfile} >&2)
    """
103 104 105 106 107 108 109 110 111 112 113 114 115 | shell: """ # Convert BAM to FASTQ without re-reversing reads - SAVAGE expect all reads in the same direction source {params.FUNCTIONS} SamToFastq {params.PICARD} I={input} FASTQ={output.R1} SECOND_END_FASTQ={output.R2} RC=false 2> >(tee {log.errfile} >&2) # Remove /1 and /2 from the read names sed -i -e "s:/1$::" {output.R1} sed -i -e "s:/2$::" {output.R2} R1=${{PWD}}/{output.R1} R2=${{PWD}}/{output.R2} {params.SAVAGE} -t {threads} --split {params.SPLIT} -p1 ${{R1}} -p2 ${{R2}} -o {params.OUTDIR} 2> >(tee -a {log.errfile} >&2) """ |
shell:
    """
    # Convert BAM to FASTQ without re-reversing reads - SAVAGE expects all reads in the same direction
    source {params.FUNCTIONS}
    SamToFastq {params.PICARD} I={input} FASTQ={output.R1} 2> >(tee {log.errfile} >&2)

    R1=${{PWD}}/{output.R1}
    {params.SAVAGE} -t {threads} --split {params.SPLIT} -s ${{R1}} -o {params.OUTDIR} 2> >(tee -a {log.errfile} >&2)
    """
191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 | shell: """ {params.SAMTOOLS} sort -n {input.fname_bam} -o {output.fname_sam} 2> >(tee {log.errfile} >&2) mkdir -p {output.OUTPREFIX} {params.PREDICTHAPLO} \ --sam {output.fname_sam} \ --reference {input.fname_ref} \ --prefix {output.OUTPREFIX}/ \ --have_true_haplotypes 0 \ --min_length {params.read_min_length} \ 2> >(tee -a {log.errfile} >&2) # TODO: copy over actual haplotypes touch {output.fname_out} """ |
shell:
    """
    {params.ALN2BASECNT} --first "{params.ARRAYBASED}" --basecnt "{output.BASECNT}" --coverage "{output.COVERAGE}" --name "{params.NAME}" --stats "{output.STATS}" "{input.BAM}" > {log.outfile} 2> >(tee {log.errfile} >&2)
    """
run:
    with open(output[0], "w") as out:
        out.write("\n".join(input))
shell:
    """
    {params.GATHER_COVERAGE} --output {output.COVERAGE} --stats {output.COVSTATS} --threads {threads} @{input.COVLIST} > >(tee {log.outfile}) 2> >(tee {log.errfile} >&2)
    """
shell:
    """
    {params.MINORITY_CALLER} -r {input.REF} -c {params.MIN_COVERAGE} -N {params.NAMES} -t {threads} -o {params.OUTDIR} {params.FREQUENCIES} {input.BAM} > >(tee {log.outfile}) 2> >(tee {log.errfile} >&2)
    """
shell:
    """
    echo "Trimming BAM with ivar"
    # iVar will Segfault without this:
    mkdir -p "$(dirname {params.ivar_tmp}"")"
    {params.IVAR} trim -e -i {input.BAM} -b {params.BED_PRIMERS} -p {params.ivar_tmp} > {log.outfile} 2> {log.errfile}

    # samtools complains without that:
    rm -f '{params.sort_tmp}'.[0-9]*.bam

    {params.SAMTOOLS} sort -o {output.BAM} -T {params.sort_tmp} {params.ivar_tmp}.bam 2> >(tee -a {log.errfile} >&2)
    {params.SAMTOOLS} index {output.BAM} 2> >(tee -a {log.errfile} >&2)
    rm -f {params.ivar_tmp}.bam
    """
shell:
    """
    echo "Trimming BAM with samtools"
    # samtools complains without that:
    rm -f '{params.sort_tmp}'.[0-9]*.bam

    {params.SAMTOOLS} ampliconclip -b {params.BED_PRIMERS} -f {output.stats} {input.BAM} | {params.SAMTOOLS} sort -o {output.BAM} -T {params.sort_tmp} 2> >(tee {log.errfile} >&2)
    {params.SAMTOOLS} index {output.BAM} 2> >(tee -a {log.errfile} >&2)
    """
shell:
    """
    {params.script} {params.options} "{output.upload_prepared_touch}" "{params.sample_id}" "{wildcards.dataset}" {input:q}
    """
96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 | shell: """ # using zcat FILENAME.gz causes issues on Mac, see # https://serverfault.com/questions/570024/ # redirection fixes this: unpack_rawreads() {{ for fq in "${{@}}"; do zcat -f < "${{fq}}" done }} echo "Compress un-filtered sequences -----------------------------------" echo {params.BWA} mem -t {threads} \ -C \ -o {output.cram_sam} \ {input.global_ref} \ <(unpack_rawreads {input.R1:q}) \ <(unpack_rawreads {input.R2:q}) \ > {log.outfile} 2> >(tee {log.errfile} >&2) # HACK handle incompatibilities between: # - Illumina's 'bcl2fastq', which write arbitrary strings # - 'bwa mem' which keep comments in the SAM file verbatim as in the FASTQ file # - 'samtools' which expects comment to be properly marked as 'BC:Z:' # as per SAM format specs REGEXP=\'s{{(?<=\\t)([[:digit:]]:[[:upper:]]:[[:digit:]]:([ATCGN]+(\+[ATCGN]+)?|[[:digit:]]+))$}}{{BC:Z:\\1}}\' FMT=cram,embed_ref,use_bzip2,use_lzma,level=9,seqs_per_slice=1000000 rm -f '{params.sort_tmp}'.[0-9]*.bam perl -p -e ${{REGEXP}} {output.cram_sam} \ | {params.SAMTOOLS} sort -@ {threads} \ -T {params.sort_tmp} \ -M \ --reference {input.global_ref} \ --output-fmt ${{FMT}} \ -o {output.final_cram} \ 2> >(tee -a {log.errfile} >&2) {params.checksum_type}sum {output.final_cram} > {output.checksum} 2> >(tee -a {log.errfile} >&2) echo echo DONE ------------------------------------------------------------- echo """ |
shell:
    """
    {params.checksum_type}sum {input} > {output}
    """
shell:
    """
    {params.GUNZIP} -c {input} > {output}
    """
96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 | shell: """ echo "The length cutoff is: {params.LEN_CUTOFF}" > {log.outfile} {params.PRINSEQ} -fastq {input.R1} -fastq2 {input.R2} {params.EXTRA} -out_format 3 -out_good {wildcards.dataset}/preprocessed_data/R -out_bad null -min_len {params.LEN_CUTOFF} -log {log.outfile} 2> >(tee {log.errfile} >&2) # make sure that the lock held prinseq has been effectively released and propagated # on some networked shares this could otherwise lead to confusion or corruption if [[ "$OSTYPE" =~ ^linux ]]; then echo "Waiting for unlocks" >&2 for U in {wildcards.dataset}/preprocessed_data/R_{{1,2}}.fastq; do flock -x -o ${{U}} -c "echo ${{U}} unlocked >&2" done fi mv {wildcards.dataset}/preprocessed_data/R{{_,}}1.fastq mv {wildcards.dataset}/preprocessed_data/R{{_,}}2.fastq rm -f {wildcards.dataset}/preprocessed_data/R_?_singletons.fastq gzip {wildcards.dataset}/preprocessed_data/R1.fastq gzip {wildcards.dataset}/preprocessed_data/R2.fastq """ |
144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 | shell: """ echo "The length cutoff is: {params.LEN_CUTOFF}" > {log.outfile} {params.PRINSEQ} -fastq {input.R1} {params.EXTRA} -out_format 3 -out_good {wildcards.dataset}/preprocessed_data/R -out_bad null -min_len {params.LEN_CUTOFF} {params.EXTRA} -log {log.outfile} 2> >(tee {log.errfile} >&2) # make sure that the lock held prinseq has been effectively released and propagated # on some network shares this could otherwise lead to confusion or corruption if [[ "$OSTYPE" =~ ^linux ]]; then echo "Waiting for unlocks" >&2 for U in {wildcards.dataset}/preprocessed_data/R_{{1,2}}.fastq; do flock -x -o ${{U}} -c "echo ${{U}} unlocked >&2" done fi mv {wildcards.dataset}/preprocessed_data/R{{,1}}.fastq gzip {wildcards.dataset}/preprocessed_data/R1.fastq """ |
shell:
    """
    {params.FASTQC} -o {params.OUTDIR} -t {threads} {params.NOGROUP} {input} 2> >(tee {log.errfile} >&2)
    """
54 55 56 57 58 59 60 61 62 63 64 65 66 | shell: """ TMPTSV=$(mktemp -t XXXXXXXX_cov.tsv) TMPINT=$(mktemp -t XXXXXXXX_int.tsv) {params.GUNZIP} -c {input.TSV} | cut -f'2-' > $TMPTSV mkdir -p "$(dirname "{output}")" {params.EXTRACT_COVERAGE_INTERVALS} -c "{params.COVERAGE}" -w "{params.WINDOW_LEN}" -s "{params.SHIFT}" -N "{params.NAME}" {params.LIBERAL} -b {params.ARRAYBASED} {params.OVERLAP} -t "{threads}" -o $TMPINT "{input.BAM}" > >(tee {log.outfile}) 2>&1 cat $TMPINT >> "{log.outfile}" read name intervals < $TMPINT IFS=',' read -r -a interarray <<< "$intervals" printf "%s\n" "${{interarray[@]}}" > "{output}" rm $TMPTSV $TMPINT """ |
118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 | shell: """ let "WINDOW_SHIFTS=({params.READ_LEN} * 4/5 + {params.SHIFT}) / {params.SHIFT}" let "WINDOW_LEN=WINDOW_SHIFTS * {params.SHIFT}" echo "Windows are shifted by: ${{WINDOW_SHIFTS}} bp" > {log.outfile} echo "The window length is: ${{WINDOW_LEN}} bp" >> {log.outfile} # Get absolute path for input files CWD=${{PWD}} BAM=${{PWD}}/{input.BAM} REF={input.REF}; [[ ${{REF}} =~ ^/ ]] || REF=${{PWD}}/${{REF}} OUTFILE=${{PWD}}/{log.outfile} ERRFILE=${{PWD}}/{log.errfile} WORK_DIR=${{PWD}}/{params.WORK_DIR} if [[ -n "{params.LOCALSCRATCH}" ]]; then # put input files in localscratch rsync -aq "${{BAM}}" "${{REF}}" "{params.LOCALSCRATCH}/" BAM="{params.LOCALSCRATCH}/${{BAM##*/}}" REF="{params.LOCALSCRATCH}/${{REF##*/}}" fi # Run ShoRAH in each of the predetermined regions (regions with sufficient coverage) LINE_COUNTER=0 FILES=( ) FILES_VCF=( ) while read -r region || [[ -n ${{region}} ]] do echo "Running ShoRAH on region: ${{region}}" >> $OUTFILE (( ++LINE_COUNTER )) # Create directory for running ShoRAH in a corresponding region (if doesn't exist) DIR=${{WORK_DIR}}/REGION_${{LINE_COUNTER}} if [[ ! -d "${{DIR}}" ]]; then echo "Creating directory ${{DIR}}" >> $OUTFILE mkdir -p ${{DIR}} else # Results from previous runs if [[ {params.KEEP_FILES} == "true" ]]; then DIR_DST=${{WORK_DIR}}/old echo "Moving results from a previous run to ${{DIR_DST}}" >> $OUTFILE rm -rf ${{DIR_DST}}/REGION_${{LINE_COUNTER}} mkdir -p ${{DIR_DST}} mv -f ${{DIR}} ${{DIR_DST}} mkdir -p ${{DIR}} fi fi # Change to the directory where ShoRAH is to be executed if [[ -z "{params.LOCALSCRATCH}" ]]; then cd ${{DIR}} else # special case: go in local scratch instead rsync -aq "${{DIR}}" "{params.LOCALSCRATCH}/" mkdir -p "{params.LOCALSCRATCH}/REGION_${{LINE_COUNTER}}" cd "{params.LOCALSCRATCH}/REGION_${{LINE_COUNTER}}" fi # NOTE: Execution command for ShoRAH2 valid from v1.99.0 and above {params.SHORAH} -t {threads} -a {params.ALPHA} -w ${{WINDOW_LEN}} -x 100000 {params.IGNORE_INDELS} -p {params.POSTHRESH} -c {params.COVERAGE} -r ${{region}} -R 42 -b ${{BAM}} -f ${{REF}} >> $OUTFILE 2> >(tee -a $ERRFILE >&2) if [[ -n "{params.LOCALSCRATCH}" ]]; then # copyback from localscratch rsync -auq "{params.LOCALSCRATCH}/REGION_${{LINE_COUNTER}}" "${{WORK_DIR}}" cd ${{DIR}} fi if [[ -s ${{DIR}}/reads.fas && -f ${{DIR}}/snv/SNVs_0.010000_final.csv ]]; then # Non empty reads: run had enough data and should have produced SNVs {params.BCFTOOLS} view ${{DIR}}/snv/SNVs_0.010000_final.vcf -Oz -o ${{DIR}}/snv/SNVs_0.010000_final.vcf.gz {params.BCFTOOLS} index ${{DIR}}/snv/SNVs_0.010000_final.vcf.gz FILES+=("${{DIR}}/snv/SNVs_0.010000_final.csv") FILES_VCF+=("${{DIR}}/snv/SNVs_0.010000_final.vcf.gz") elif (( {params.COVINT} == 0 && LINE_COUNTER == 1 )) && [[ -f ${{DIR}}/reads.fas && ( ! 
-s ${{DIR}}/reads.fas ) ]]; then # if we have disabled coverage intervales entirely, the first and only line might have no reads # (e.g.: in negative controls ) echo "No reads while coverage intervals disabled (possible negative control sample)" 2> >(tee -a $ERRFILE >&2) cd ${{CWD}} (( --LINE_COUNTER )) || true # Strict mode : (( 0 )) = fail break else echo "ERROR: unsuccesful execution of ShoRAH" 2> >(tee -a $ERRFILE >&2) exit 1 fi # Change back to working directory cd ${{CWD}} done < {input.TSV} # Aggregate results from different regions if (( ${{#FILES[@]}} )); then echo "Intermediate csv files: ${{FILES[*]}}" >> {log.outfile} echo "Intermediate vcf files: ${{FILES_VCF[*]}}" >> {log.outfile} (head -n 1 "${{FILES[0]}}"; tail -q -n +2 "${{FILES[@]}}" | sort -t, -nk2) > {output.CSV} {params.BCFTOOLS} concat -o ${{WORK_DIR}}/snvs_tmp.vcf "${{FILES_VCF[@]}}" {params.BCFTOOLS} sort ${{WORK_DIR}}/snvs_tmp.vcf -o {output.VCF} rm -f ${{WORK_DIR}}/snvs_tmp.vcf elif (( LINE_COUNTER )); then echo "ERROR: unsuccesful execution of ShoRAH" 2> >(tee -a {log.errfile} >&2) exit 1 else echo "No alignment region reports sufficient coverage" >> {log.outfile} touch {output.CSV} touch {output.VCF} fi """ |
shell:
    """
    {params.SAMTOOLS} faidx {input} -o {output} > {log.outfile} 2> >(tee -a {log.errfile} >&2)
    """
shell:
    """
    # Add qualities to indels
    {params.LOFREQ} indelqual --dindel -f {input.REF} -o {output.BAM} --verbose {input.BAM} > {log.outfile} 2> >(tee {log.errfile} >&2)
    # Index bam file
    {params.SAMTOOLS} index {output.BAM} 2> >(tee {log.errfile} >&2)

    # Run Lofreq
    echo "Running LoFreq" >> {log.outfile}
    {params.LOFREQ} call {params.EXTRA} --call-indels -f {input.REF} -o {output.SNVs} --verbose {output.BAM} >> {log.outfile} 2> >(tee -a {log.errfile} >&2)
    """
shell:
    """
    {params.EXTRACT_COVERAGE_INTERVALS} -b {params.ARRAYBASED} -cf {input.TSV} -c {params.COVERAGE} --no-shorah -N {params.NAMES} -o {output} {input.BAM} > >(tee {log.outfile}) 2>&1
    """
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 | shell: """ SAMPLE_ID="{params.ID}" # Number of input reads LINECOUNT=$( cat {input.R1} | wc -l ) let "INPUT=LINECOUNT / {params.FACTOR}" # Number of reads after QC # For portability reason not using zcat {params.GUNZIP} -c {input.R1_QC} > {params.R1_temp} LINECOUNT=$( cat {params.R1_temp} | wc -l ) let "READCOUNT=LINECOUNT / {params.FACTOR}" rm {params.R1_temp} # Number of aligned reads ALNCOUNT=$( {params.SAMTOOLS} view {input.BAM} | wc -l ) echo -e "${{SAMPLE_ID}}\t${{INPUT}}\t${{READCOUNT}}\t${{ALNCOUNT}}" > {output} """ |
shell:
    """
    cat {input} > stats/temp
    echo -e "ID\tInput\tQC\tAlignments" | cat - stats/temp > {output}
    rm stats/temp
    """
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 | shell: """ # Why a shell directive? # 1) script directive crashes with `VpipeConfig` # 2) run directive does not allow conda envs if [ ! -z "{params.nwk_file}" ] then # generate phylogenetic tree augur align --sequences {input.phylogeny_data} {input.consensus_file} --output {params.alignment_file} augur tree --alignment {params.alignment_file} --output {params.nwk_file} fi # generate read alignment create_datauri {input.reference_file} > {output.reference_uri_file} create_datauri {input.bam_file} > {output.bam_uri_file} # generate html visualization pages python "{params.assemble_visualization_webpage}" \ --consensus "{input.consensus_file}" \ --coverage "{input.coverage_file}" \ --tsvbased "{params.tsvbased}" \ --vcf "{input.vcf_file}" \ --gff "{input.gff_directory}" \ --primers "{input.primers_file}" \ --metainfo "{input.metainfo_file}" \ --snv_calling_template "{params.snv_visualization_template}" \ --alignment_template "{params.alignment_visualization_template}" \ --html_out_snv_calling "{output.snv_html_file}" \ --html_out_alignment "{output.alignment_html_file}" \ --wildcards "{wildcards.dataset}" \ --reference "{input.global_ref}" \ --reference_uri_file "{output.reference_uri_file}" \ --bam_uri_file "{output.bam_uri_file}" \ --nwk "{params.nwk_file}" \ > {log.outfile} 2> {log.errfile} """ |
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | import pandas as pd def main(in_fnames_diversity, in_fnames_shannon, out_diversity_csv, out_shannon_csv): merged_div_csv = pd.concat( [pd.read_csv(path_div) for path_div in in_fnames_diversity] ) merged_div_csv.to_csv(out_diversity_csv) merged_shan_csv = pd.concat( [pd.read_csv(path_shan) for path_shan in in_fnames_shannon] ) merged_shan_csv.to_csv(out_shannon_csv) if __name__ == "__main__": main( snakemake.input.fnames_diversity, snakemake.input.fnames_shannon, snakemake.output.diversity_csv, snakemake.output.shannon_csv, ) |
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 | import sys import os import vcf import pandas as pd import numpy as np import skbio from scipy.stats import sem def number_of_polymorphisms(df_mutations, minor_allele_frequency=0): df_temp = df_mutations[df_mutations["frequency"] >= minor_allele_frequency] variant_positions = df_temp["position"].unique() return len(variant_positions) def list_polymorphic_sites(df_mutations, minor_allele_frequency=0): df_temp = df_mutations[df_mutations["frequency"] >= minor_allele_frequency] variant_positions = df_temp["position"].unique() return variant_positions def number_of_mutations(df_mutations, minimum_frequency=0): df_temp = df_mutations[df_mutations["frequency"] >= minimum_frequency] n_reads_with_mut = df_temp["frequency"] * df_temp["coverage"] n_reads_with_mut = df_temp["rvar"] + df_temp["fvar"] return n_reads_with_mut.sum() def mutation_spectrum(df_mutations, bins): counts = np.zeros(len(bins)) for i in range(len(bins) - 1): df_temp = df_mutations[df_mutations["frequency"] > bins[i]] df_temp = df_temp[df_temp["frequency"] <= bins[i + 1]] n_reads_with_mut = df_temp["frequency"] * df_temp["coverage"] counts[i + 1] = n_reads_with_mut.sum() return list(counts) def population_nucleotide_diversity(df_mutations, length): # only the positions with mutations are needed pi = 0 for position_temp in list_polymorphic_sites(df_mutations, minor_allele_frequency=0): df_temp = df_mutations[df_mutations["position"] == position_temp] N = df_temp["coverage"].unique()[0] if N == 0: continue freq = df_temp["frequency"].to_numpy() ref_freq = 1 - freq.sum() position_pnd = freq**2 postion_pi = (1 - (position_pnd.sum() + ref_freq**2)) * N / (N - 1) pi += postion_pi return float(pi / length) def position_Shannon_entropy(df_mutations, position): df_temp = df_mutations[df_mutations["position"] == position] position_shannon = 0 sum_fraction = 0 var_shannon = df_temp["frequency"].apply(lambda x: x * np.log(x) if x > 0 else 0) sum_fraction = df_temp["frequency"].sum() position_shannon = var_shannon.sum() # add the reference base summand if 1 - sum_fraction > 0: position_shannon += (1 - sum_fraction) * np.log(1 - sum_fraction) return -position_shannon def mean_pos_Shannon_entropy(df_mutations, length): entropy = 0 for position_temp in list_polymorphic_sites(df_mutations, minor_allele_frequency=0): entropy += position_Shannon_entropy(df_mutations, position_temp) return entropy / length def convert_vcf(fname): """Convert VCF to JSON.""" output = [] if os.path.getsize(fname) == 0: print(f'Empty VCF: "{fname}"') return output print(f'Parsing VCF: "{fname}"') with open(fname) as 
fd: vcf_reader = vcf.Reader(fd) # check caller caller_source = vcf_reader.metadata["source"][0].lower() if caller_source.startswith("lofreq"): mode = "lofreq" elif caller_source.startswith("shorah"): mode = "shorah" else: raise RuntimeError(f"Invalid variant caller: {caller_source}") # output dataframe cols_snv = [ "position", "reference", "variant", "frequency", "tvar", "fvar", "rvar", "ftot", "rtot", "coverage", ] df_snv = pd.DataFrame(columns=cols_snv) # parse records for record in vcf_reader: if mode == "lofreq": freq = round(record.INFO["AF"], 3) fvar = record.INFO["DP4"][2] # counts variant forward reads rvar = record.INFO["DP4"][3] # counts variant reverse reads rtot = ( record.INFO["DP4"][1] + record.INFO["DP4"][3] ) # counts reverse reads ftot = ( record.INFO["DP4"][0] + record.INFO["DP4"][2] ) # counts forward reads elif mode == "shorah": freq = round( np.mean( [v for k, v in record.INFO.items() if k.startswith("Freq")] ), 3, ) fvar = record.INFO["Fvar"] # counts variant forward reads rvar = record.INFO["Rvar"] # counts variant reverse reads rtot = record.INFO["Rtot"] # counts reference reverse reads ftot = record.INFO["Ftot"] # counts reference forward reads df_snv = df_snv.append( { "position": record.POS, "reference": record.REF, "variant": [v.sequence for v in record.ALT], "frequency": freq, "tvar": fvar + rvar, "fvar": fvar, "rvar": rvar, "ftot": ftot, "rtot": rtot, "coverage": ftot + rtot, }, ignore_index=True, ) id = record.CHROM return df_snv, id def load_reference_seq(reference_file): for seq in skbio.io.read(reference_file, format="fasta"): return seq def main(fname_snv_in, fname_reference, output_diversity_csv, output_shannon_csv): """ Compute various diversity indices for underlying sample. Writes a csv-file with computations. Parameters ---------- fname_snv_in: Absolute path to snv.vcf created by lofreq or ShoRAH. ref_seq_length: Length of the reference sequence. output_diversity_csv: Absolute path to diversity-csv. output_shannon_csv: Absolute path to position-wise Shannon-entropy csv. Output ------ diversity_measures.csv: Listing all computed diversity measures. position_shannon_entropy.csv: Position-wise Shannon entropy, if non-zero. 
""" # get length of reference sequence ref_seq_length = len(load_reference_seq(fname_reference)) # Parse snv.vcf df_snv, id = convert_vcf(fname_snv_in) # Sample, patient, date information general_info = { "sample": fname_snv_in.split("/variants")[0].split("/")[-3], "patient": fname_snv_in.split("/variants")[0].split("/")[-2], "date": fname_snv_in.split("/variants")[0].split("/")[-1], } # prepare dict collecting all the diversity measures out_dict = {"id": id, "length": ref_seq_length} # number of mutations with different minor allele frequency out_dict.update( { "n_mutations_minFrq_0": number_of_mutations(df_snv, minimum_frequency=0), "n_mutations_minFrq_1": number_of_mutations(df_snv, minimum_frequency=0.01), "n_mutations_minFrq_5": number_of_mutations(df_snv, minimum_frequency=0.05), } ) # sum mutation frequencies out_dict.update({"sum_mutation_frq": df_snv["frequency"].sum()}) # mean mutation frequencies out_dict.update({"mean_mutation_frq": df_snv["frequency"].mean()}) # the standard error of the mean (SEM) mutation frequency out_dict.update({"sem_mutation_frq": sem(df_snv["frequency"].to_numpy())}) # population nucleotide diversity out_dict.update( { "population_nucleotide_diverstiy": population_nucleotide_diversity( df_snv, ref_seq_length ) } ) # mean position-wise Shannon entropy out_dict.update( {"mean_position_shannon": mean_pos_Shannon_entropy(df_snv, ref_seq_length)} ) # mutation spectrum bins = np.arange(0, 1, 0.05) out_dict.update({"mutation_spectrum_bins": list(bins)}) out_dict.update({"mutation_spectrum": mutation_spectrum(df_snv, bins)}) out_dict.update(general_info) # save to dataframe df_diveristy = pd.DataFrame(columns=list(out_dict.keys())) df_diveristy = df_diveristy.append(out_dict, ignore_index=True) df_diveristy.to_csv(output_diversity_csv, index=False) # position-wise Shannon entropy pos_shannon_dict = general_info for i in range(ref_seq_length): if position_Shannon_entropy(df_snv, i) != 0: pos_shannon_dict.update({i: position_Shannon_entropy(df_snv, i)}) df_pos_shannon = pd.DataFrame(columns=list(pos_shannon_dict.keys())) df_pos_shannon = df_pos_shannon.append(pos_shannon_dict, ignore_index=True) df_pos_shannon.to_csv(output_shannon_csv, index=False) if __name__ == "__main__": main( snakemake.input.fnames_samples_snvs_vcf, snakemake.input.fname_ref, snakemake.output.diversity_csv, snakemake.output.shannon_csv, ) |
scripts/compute_diversity_measures.py
Support
- Future updates