Snakemake pipeline for de novo transcriptome assembly with 454 reads


1. Description

This is a workflow to assemble 454 RNA-seq reads into a de novo transcriptome. The procedure is as follows:

  1. Quality control

    1. Base calling with PyroBayes (if .sff files are provided)

    2. Quality, length and adaptor trimming with SnoWhite

  2. Assembly

    With gsAssembler (Newbler) from Roche's Data Analysis tools
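Sketched as Snakemake rules, the two stages above chain together roughly like this (a hypothetical outline: the rule names, paths and wildcards are illustrative, not the pipeline's actual ones):

```python
# Hypothetical sketch of the QC -> assembly chain; names are illustrative.
rule basecall:   # 1.1 PyroBayes base calling (when .sff files are provided)
    input:  "data/{sample}.sff"
    output: "results/raw/{sample}.fastq.gz"

rule trim:       # 1.2 SnoWhite quality, length and adaptor trimming
    input:  "results/raw/{sample}.fastq.gz"
    output: "results/qc/{sample}.fastq.gz"

rule assemble:   # 2 gsAssembler (Newbler); -cdna switches it to transcriptome mode
    input:  expand("results/qc/{sample}.fastq.gz", sample=SAMPLES)
    output: directory("results/assembly/")
```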

2. First steps

  1. Clone the repo

    git clone https://github.com/jlanga/smsk_454.git # Clone
    cd smsk_454
    
  2. Activate the environment (run deactivate to leave it):

    source bin/activate
    
  3. Install software and packages via pip and Homebrew (edit whatever is necessary):

    bash bin/install/brew.sh
    bash bin/install/from_brew.sh
    bash bin/install/from_pip3.sh
    bash bin/install/from_tarball.sh
    
  4. Additional requirements

    PyroBayes is an accurate base caller for 454 datasets. It used to be available through free registration (the original download link is no longer active), as a file called pyrobayes.unified_release_64bit.tar.gz. If you are able to get a copy, do the following:

    mkdir -p src/
    pushd src/
    cp /path/to/tarball.tar.gz .
    tar xvf pyrobayes.unified_release_64bit.tar.gz
    popd
    cp src/UnifiedRelease/bin/PyroBayes bin/
    

    The same applies to gsAssembler: go to Roche and ask for a copy (it is free but requires registration). You should receive a file called DataAnalysis_2.9_All_20130530_1559.tgz. If you are connecting through ssh, use the -X option to allow graphical interfaces (ssh -X server).

    From here,

    mkdir -p src/
    pushd src/
    cp /path/to/DataAnalysis_2.9_All_20130530_1559.tgz .
    tar xvf DataAnalysis_2.9_All_20130530_1559.tgz
    pushd DataAnalysis_2.9_All/
    bash setup.sh
    popd
    popd
    

    A window will pop up. Select "local installation" and choose the src/ directory of this project as the installation path (in my case /home/jlanga/pipelines/smsk_454/src/454/).

  5. Download sample data from the European Nucleotide Archive (ENA; two sff files and two fastq files):

    bash bin/download_test_data.sh
    
  6. Execute the pipeline (should take up to 10 minutes):

    snakemake -j 24
    

3. File organization

The folder hierarchy is the one described in A Quick Guide to Organizing Computational Biology Projects:

smsk
├── .linuxbrew: brew files
├── bin: scripts, binaries and snakemake-related files.
├── data: raw data, hopefully links to backed-up data.
├── doc: logs.
├── README.md
├── results: processed data, reports, etc.
└── src: additional source code, tarballs, etc.

Code Snippets

from Bio import SeqIO
from Bio.SeqIO.QualityIO import PairedFastaQualIterator
import sys

def convert(fasta_file, qual_file, fastq_file):
    """
    Combine fasta_file and qual_file into fastq_file in PHRED33 qual format.
    """
    SeqIO.write(
        PairedFastaQualIterator(fasta_file, qual_file),
        fastq_file,
        "fastq"
    )


if __name__ == "__main__":

    if len(sys.argv) != 3:
        sys.exit("ERROR! Provide a fasta and a qual file. Output is written to stdout")

    fasta = open(sys.argv[1], "r")
    qual  = open(sys.argv[2], "r")
    fastq = sys.stdout

    convert(fasta, qual, fastq)
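To illustrate what the conversion does: a FASTQ quality line encodes each PHRED score Q as the ASCII character chr(Q + 33). A minimal pure-Python sketch of the FASTA + QUAL merge (assuming single-line records; the Biopython iterator used above also handles wrapped records and validation):

```python
def fasta_qual_to_fastq(fasta_lines, qual_lines):
    """Merge simple single-line FASTA and QUAL records into FASTQ text.

    Each PHRED score Q is encoded as the ASCII character chr(Q + 33).
    """
    fa, qu = iter(fasta_lines), iter(qual_lines)
    out = []
    for header in fa:
        name = header.lstrip(">")
        seq = next(fa)
        qual_name = next(qu).lstrip(">")
        scores = [int(s) for s in next(qu).split()]
        assert name == qual_name, "FASTA and QUAL records out of sync"
        assert len(seq) == len(scores), "sequence/quality length mismatch"
        quals = "".join(chr(q + 33) for q in scores)  # PHRED33 encoding
        out.append(f"@{name}\n{seq}\n+\n{quals}\n")
    return "".join(out)

# One read with qualities 40 40 30 20 -> quality string "II?5"
fastq = fasta_qual_to_fastq(
    [">read1", "ACGT"],
    [">read1", "40 40 30 20"],
)
print(fastq)
```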
shell:
    "./src/454/bin/runAssembly "
        "-cdna "
        "-cpu {threads} "
        "-minlen {params.minimum_sequence_length} "
        "-mi {params.minimum_overlap_identity} "
        "-ml {params.minimum_overlap_length} "
        "-o {params.out_dir} "
        "-vs {input.adaptors} "
        "-vt {input.adaptors} "
        "{params.additional_params} "
        "{input.sff} "
        "{input.fastq} "
    "2> {log} 1>&2"
shell:
    "rm -r results/"
shell: "rm -r {raw_dir}"
shell: "rm -r {qc_dir}"
shell: "rm -r {assembly_dir}"
shell: "rm -r {report_dir}"
shell:
    "pigz "
        "--decompress "
        "--stdout "
        "{input.fq_gz} "
    "> {output.fq} "
    "2> {log}"
shell:
    "tar "
        "xvf "
        "{params.tarball} "
        "-C . "
    "2> {log} 1>&2 ; "
    "mv SnoWhite_2.0.3_release/* . ; "
    "yes | perl snowhite_2.0.3.pl "
        "-f {input.fastq} "
        "-o {params.out_folder} "
        "-v {input.adaptors} "
        "-s {input.adaptors} "
        "-m {params.minimum_sequence_length} "
        "-g {params.delete_temporary_files} "
        "-R {params.output_fastq} "
        "`#-B {params.split_in_subfiles}` "
        "`#-j {params.barcodes_in_end}` "
        "`#-z {params.number_of_mismatches}` "
        "-Q {params.minimum_quality} "
        "`#-E {input.adaptors}` "
        "`#-c {params.bases_to_clip}` "
        "-D {params.execute_tagdust} "
        "-d {params.false_discovery_rate} "
        "-L {params.execute_seqclean} "
        "-p {threads} "
        "-Y {params.execute_poly_trimming} "
        "-l {params.poly_tail_min_repeat_length} "
        "-a {params.poly_a_ends} "
        "-t {params.poly_t_ends} "
        "-b {params.poly_tail_terminal_bases_cap} "
        "-r {params.poly_tail_min_repeat_length} "
        "-i {params.internal_poly_minimum_length} "
        "-k {params.keep_longer_end} "
        "-n {params.ns_as_a_or_t} "
        "-w {params.allow_wobble} "
    "2>> {log} 1>&2 || true ; " #Snowhite returns exit status different from 0 even after success
    "mv "
        "out/FinalOutput/out.clean "
        "{output.fastq} "
    "2>> {log} 1>&2 ; "
    "pigz "
        "--best "
        "--stdout "
        "{output.fastq} "
    "> {output.fastq_gz} "
    "2>> {log}"
shell:
    "trimmomatic SE "
        "-threads {threads} "
        "-phred33 "
        "<(pigz -dc {input.fq_gz}) "
        ">(cut -f 1 -d \" \" | pigz -9 > {output.fq_gz}) "
        "{params.trimmomatic_params} "
    "2> {log}"
shell:
    "pigz "
        "--decompress "
shell:
    "./bin/PyroBayes "
        "--sffFile {input.sff} "
        "--outStub {params.stub} "
    "2> {log} 1>&2"
shell:
    "( python3 bin/fasta_qual_to_fastq.py "
        "{input.fasta} "
        "{input.qual} | "
    "pigz --best > {output.fastq_gz} ) "
    "2> {log} 2>&1"
shell:
    "pigz "
        "--decompress "
        "--stdout "
        "{input.fq_gz} "
    "> {output.fq} "
    "2> {log}"
shell:
    "trimmomatic SE "
        "-threads {threads} "
        "-phred33 "
        "<(pigz -dc {input.fq_gz}) "
        ">(cut -f 1 -d \" \" | pigz -9 > {output.fq_gz}) "
        "{params.trimmomatic_params} "
    "2> {log} ; "
    "sleep 5s"
shell:
    "sleep 5s ; "
    "pigz "
        "--decompress "
shell:
    "ln --symbolic "
        "$(readlink --canonicalize {input.sff}) "
        "{output.sff} "
    "2> {log}"
SnakeMake From line 13 of snakefiles/raw
shell:
    "ln --symbolic "
        "$(readlink --canonicalize {input.fastq}) "
        "{output.fastq} "
    "2> {log}"
SnakeMake From line 33 of snakefiles/raw
shell:
    "(python3 bin/fasta_qual_to_fastq.py "
        "<(./src/454/bin/sffinfo -s {input.sff}) "
        "<(./src/454/bin/sffinfo -q {input.sff}) | "
    "pigz "
        "--best "
    "> {output.fq_gz} ) "
    "2> {log}"
SnakeMake From line 55 of snakefiles/raw
shell:
    "ln --symbolic "
        "$(readlink --canonicalize {input.fa}) "
        "{output.fa} "
    "2> {log}"
SnakeMake From line 80 of snakefiles/raw
shell:
    "fastqc "
        "--nogroup "
        "--outdir {params.outdir} "
        "{input.fastq} "
    "> {log} 2>&1"
shell:
    "echo {input} | "
    "tr -s \" \" | "
    "tr \" \" \"\\n\" "
    "> {params.fofn} 2> {log} ; "
    "multiqc "
        "--fullnames "
        "--title {params.title} "
        "--filename {output.html} "
        "--template {params.template} "
        "--file-list {params.fofn} "
        "--data-format {params.data_format} "
        "{params.additional_params} "
    "2>> {log} 1>&2 ; "

URL: https://github.com/jlanga/smsk_454
Name: smsk_454
Version: 1
License: MIT License