ENA SARS-CoV2 sequence analysis workflow


This is the official repository of the SARS-CoV-2 variant surveillance pipeline developed by the Danish Technical University (DTU), Eotvos Lorand University (ELTE), EMBL-EBI, and Erasmus Medical Center (EMC) under the Versatile Emerging infectious disease Observatory (VEO) project, a consortium of 20 European partners funded by the European Commission.

The pipeline has been integrated into the EMBL-EBI infrastructure to automatically process raw SARS-CoV-2 read data, with the results presented in the COVID-19 Data Portal: https://www.covid19dataportal.org/sequences?db=sra-analysis-covid19&size=15&crossReferencesOption=all#search-content.

Architecture

The pipeline supports sequence reads from both Illumina and Nanopore platforms. It is designed to be highly portable, running both on the Google Cloud Platform and on High Performance Computing clusters managed with IBM Spectrum LSF. We have performed secondary and tertiary analysis on millions of public samples, and the pipeline has shown good performance at this production scale.

Component diagram

The pipeline takes raw read data (SRA) from ENA's public FTP and submits analysis objects back to ENA on the fly. Intermediate results and logs are stored in cloud storage buckets or on a high-performance local POSIX file system. Metadata and processing status are tracked and analysed in Google BigQuery. The runtime is built from Docker / Singularity containers and orchestrated with Nextflow.
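To make the runtime concrete, a minimal launch sketch is shown below. It is illustrative only: in production the start scripts described later wrap the actual invocation, and the profile name used here is an assumption rather than something defined in this README.

# Hypothetical manual launch: a container engine, Tower monitoring, and an
# executor profile are chosen at run time. The profile name "lsf" is an
# assumed example; a Google Life Sciences profile would be the GCP equivalent.
export TOWER_ACCESS_TOKEN='...'   # token used by Nextflow Tower monitoring
nextflow run workflow.nf -with-tower -with-singularity -profile lsf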

Process to run the pipelines

The pipeline requires Nextflow Tower for application-level monitoring. A free account for evaluation purposes can be created at https://tower.nf/.

Preparation

  1. Store export TOWER_ACCESS_TOKEN='...' in $HOME/.bash_profile. Restart the current session or source the updated $HOME/.bash_profile. (A consolidated sketch of these preparation steps follows the list.)

  2. Run git clone https://github.com/enasequence/covid-sequence-analysis-workflow .

  3. Create ./covid-sequence-analysis-workflow/data/projects_accounts.csv with the submission_account_id and submission_password, for example:

project_id,center_name,meta_key,submission_account_id,submission_password,ftp_password
PRJEB45555,"European Bioinformatics Institute",public,,,
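A consolidated sketch of the three preparation steps in bash; the credential columns are left blank exactly as in the example row above, and should be filled in for projects that require authenticated submission:

# 1. Make the Tower token available to future sessions, then load it now.
echo "export TOWER_ACCESS_TOKEN='...'" >> $HOME/.bash_profile
source $HOME/.bash_profile

# 2. Fetch the workflow.
git clone https://github.com/enasequence/covid-sequence-analysis-workflow

# 3. Create the project accounts file.
cat > ./covid-sequence-analysis-workflow/data/projects_accounts.csv <<'EOF'
project_id,center_name,meta_key,submission_account_id,submission_password,ftp_password
PRJEB45555,"European Bioinformatics Institute",public,,,
EOF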

Running pipelines

  1. Run ./covid-sequence-analysis-workflow/init.sra_index.sh to initialize or reinitialize the metadata in BigQuery.

  2. Run ./covid-sequence-analysis-workflow/start.lsf.jobs.sh with the proper parameters to start the batch jobs on LSF, or ./covid-sequence-analysis-workflow/start.gls.jobs.sh with the proper parameters to start the batch jobs on GCP (a combined sketch follows this list).
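Putting both steps together for a GCP run; the <parameters> placeholder is an assumption here, so consult the start scripts for the exact arguments they expect:

# Initialise (or reinitialise) the run index in BigQuery.
./covid-sequence-analysis-workflow/init.sra_index.sh

# Launch the batch jobs; use start.lsf.jobs.sh instead on an LSF cluster.
./covid-sequence-analysis-workflow/start.gls.jobs.sh <parameters>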

Error handling

If a job is killed or dies, run the following to update the metadata so that samples which completed successfully are not reprocessed.

  1. Run ./covid-sequence-analysis-workflow/update.receipt.sh <batch_id> to collect the submission receipts and update the submission metadata. The script can be run at any time, but it must be run whenever a batch job was killed rather than allowed to complete.

  2. Run ./covid-sequence-analysis-workflow/set.archived.sh to update the stats for submitted analyses. The script can be run at any time, but it must be run at least once before ending a snapshot to make sure the stats are up to date.

To reprocess failed samples, delete the corresponding records from the sra_processing table, as sketched below.
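A minimal recovery sketch, assuming the metadata lives in BigQuery and the bq CLI is configured for the right project; the dataset name and filter column in the DELETE statement are assumptions and must be checked against the actual sra_processing schema:

# Collect receipts and refresh submission stats after an interrupted batch.
./covid-sequence-analysis-workflow/update.receipt.sh <batch_id>
./covid-sequence-analysis-workflow/set.archived.sh

# Remove the failed sample so the next run picks it up again.
# The dataset ("sarscov2_metadata") and column ("run_id") are illustrative assumptions.
bq query --use_legacy_sql=false \
  "DELETE FROM sarscov2_metadata.sra_processing WHERE run_id = '<run_accession>'"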

Code Snippets

From lines 28-30 of raw/workflow.nf (FastQC on the raw reads):

"""
fastqc -t ${task.cpus} -q ${reads[0]} ${reads[1]}
"""
From lines 46-51 of raw/workflow.nf (read trimming with Trimmomatic):

"""
trimmomatic PE ${reads} ${run_id}_trim_1.fq \
${run_id}_trim_1_un.fq ${run_id}_trim_2.fq ${run_id}_trim_2_un.fq \
-summary ${run_id}_trim_summary -threads ${task.cpus} \
SLIDINGWINDOW:5:30 MINLEN:50
"""
From lines 69-71 of raw/workflow.nf (FastQC on the trimmed reads):

"""
fastqc -t ${task.cpus} -q ${trimmed_reads}
"""
From lines 93-99 of raw/workflow.nf (removal of human reads with Bowtie 2; only reads that do not map to the human index are kept):

"""
bowtie2 --very-sensitive-local -p ${task.cpus} \
-x $index_base --met-file ${run_id}_bowtie_human_summary \
-1 ${trimmed_reads[0]} -2 ${trimmed_reads[2]} \
-U ${trimmed_reads[1]},${trimmed_reads[3]} | \
samtools view -Sb -f 4 > ${run_id}_nohuman.bam
"""
From lines 116-119 of raw/workflow.nf (conversion of the human-depleted BAM back to FASTQ):

"""
samtools bam2fq -1 ${run_id}_nohuman_1.fq -2 ${run_id}_nohuman_2.fq \
-s ${run_id}_nohuman_s.fq ${bam} > ${run_id}_nohuman_3.fq
"""
From lines 142-148 of raw/workflow.nf (alignment to the SARS-CoV-2 reference, sorting, and indexing):

"""
bowtie2 -p ${task.cpus} --no-mixed --no-discordant \
--met-file ${run_id}_bowtie_nohuman_summary -x $index_base \
-1 ${fastq[0]} -2 ${fastq[1]} | samtools view -bST ${sars2_fasta} | \
samtools sort | samtools view -h -F 4 -b > ${run_id}.bam
samtools index ${run_id}.bam
"""
From lines 165-168 of raw/workflow.nf (duplicate removal with Picard MarkDuplicates):

"""
picard MarkDuplicates I=${bam} O=${run_id}_dep.bam REMOVE_DUPLICATES=true \
M=${run_id}_marked_dup_metrics.txt
"""
From lines 186-189 of raw/workflow.nf (pileup generation with samtools mpileup):

"""
samtools mpileup -A -Q 30 -d 1000000 -f ${sars2_fasta} ${bam} > \
${run_id}.pileup
"""
From lines 207-209 of raw/workflow.nf (per-position coverage extracted from the pileup):

"""
cat ${pileup} | awk '{print \$2,","\$3,","\$4}' > ${run_id}.coverage
"""
From lines 230-237 of raw/workflow.nf (variant calling with LoFreq and VCF statistics with bcftools):

"""
samtools index ${bam}
lofreq call-parallel --pp-threads ${task.cpus} -f ${sars2_fasta} \
-o ${run_id}.vcf ${bam}
bgzip ${run_id}.vcf
tabix ${run_id}.vcf.gz
bcftools stats ${run_id}.vcf.gz > ${run_id}.stat
"""
From lines 257-273 of raw/workflow.nf (variant filtering and consensus sequence generation):

"""
bcftools filter -i "DP>50" ${vcf} -o ${run_id}.cfiltered.vcf
bgzip ${run_id}.cfiltered.vcf
tabix ${run_id}.cfiltered.vcf.gz
bcftools filter -i "AF>0.5" ${run_id}.cfiltered.vcf.gz > \
${run_id}.cfiltered_freq.vcf
bgzip -c ${run_id}.cfiltered_freq.vcf > ${run_id}.cfiltered_freq.vcf.gz
bcftools index ${run_id}.cfiltered_freq.vcf.gz
bcftools consensus -f ${sars2_fasta} ${run_id}.cfiltered_freq.vcf.gz > \
${run_id}.cons.fa
sed -i "1s/.*/>${run_id}/" ${run_id}.cons.fa
rm ${run_id}.cfiltered.vcf.gz
rm ${run_id}.cfiltered.vcf.gz.tbi
rm ${run_id}.cfiltered_freq.vcf
rm ${run_id}.cfiltered_freq.vcf.gz.csi
rm ${run_id}.cfiltered_freq.vcf.gz
"""
From lines 290-296 of raw/workflow.nf (depth and allele-frequency filtering for the low-frequency variant set):

"""
bcftools filter -i "DP>50" ${vcf} -o ${run_id}.filtered.vcf
bgzip ${run_id}.filtered.vcf
tabix ${run_id}.filtered.vcf.gz
bcftools filter -i "AF>0.1" ${run_id}.filtered.vcf.gz > \
${run_id}.filtered_freq.vcf
"""
From lines 314-319 of raw/workflow.nf (variant annotation with SnpEff):

"""
cat ${vcf} | sed "s/^NC_045512.2/NC_045512/" > \
${run_id}.newchr.filtered_freq.vcf
java -Xmx4g -jar /data/tools/snpEff/snpEff.jar -v -s ${run_id}.snpEff_summary.html sars.cov.2 \
${run_id}.newchr.filtered_freq.vcf > ${run_id}.annot.n.filtered_freq.vcf
"""

URL: https://github.com/enasequence/covid-sequence-analysis-workflow
Name: ena-sars-cov2-variant-calling
Version: 1
Copyright: Public Domain
License: Boost Software License 1.0