An awesome Oxford Nanopore Pipeline for direct RNA-sequencing


modr 🔬

An Oxford Nanopore direct RNA-sequencing Pipeline


This is the home of the pipeline, modr. Its long-term goals: to accurately estimate known and novel transcript expression, to predict poly-A tail lengths, and to detect RNA modifications in Oxford Nanopore direct RNA-sequencing data like no pipeline before!

Overview

Welcome to modr! Before getting started, we highly recommend reading through modr's documentation.

The ./modr pipeline is composed of several inter-related sub commands for setting up and running the pipeline across different systems. Each of the available sub commands performs a different function.

modr is an awesome pipeline to detect known/novel transcript expression, poly-A tail lengths, and RNA-editing in Oxford Nanopore direct RNA sequencing data. It relies on technologies like Singularity [1] to maintain the highest level of reproducibility. The pipeline consists of a series of data processing and quality-control steps orchestrated by Snakemake [2], a flexible and scalable workflow management system, to submit jobs to a cluster.

The pipeline is compatible with data generated from Nanopore sequencing technologies. As input, it accepts a set of FastQ & Fast5 files and can be run locally on a compute instance or on-premises using a cluster. A user can define the method or mode of execution. The pipeline can submit jobs to a cluster using a job scheduler like SLURM (more coming soon!). This hybrid approach ensures the pipeline is accessible to all users.
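
For illustration, a typical end-to-end run might look like the sketch below. The flag names shown are hypothetical placeholders, not the pipeline's documented interface; run ./modr -h and read the usage docs for the authoritative options.

# Hypothetical invocation: the run sub command exists,
# but these flag names are illustrative placeholders
./modr run \
    --input /path/to/*.fastq.gz \
    --output /path/to/output \
    --mode slurm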

Before getting started, we highly recommend reading through the usage section of each available sub command.

For more information about issues or troubleshooting a problem, please check out our FAQ prior to opening an issue on GitHub.

Dependencies

Requires: singularity>=3.5, snakemake>=6.0, conda/mamba (optional)

Snakemake must be installed on the target system. Snakemake is a workflow manager that orchestrates each step of the pipeline. The second dependency, i.e. singularity OR conda/mamba, handles the downloading/installation of any remaining software dependencies. By default, the pipeline will utilize singularity to guarantee the highest level of reproducibility; however, the --use-conda option of the run sub command can be provided to use conda/mamba instead of singularity. If possible, we recommend using singularity over conda for reproducibility; it is worth noting, however, that singularity and conda produce identical results for this pipeline.
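
Before launching a run, it is worth confirming that the required tools are on your $PATH and meet the minimum versions. A minimal sanity check, using each tool's standard version flag:

# Confirm the hard dependencies meet the minimums
# (snakemake>=6.0, singularity>=3.5)
snakemake --version
singularity --version
# Optional: only needed when running with --use-conda
conda --version || mamba --version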

If you are running the pipeline on Windows, please use the Windows Subsystem for Linux (WSL). Singularity can be installed on WSL following these instructions.

Installation

Please clone this repository to your local filesystem using the following command:

# Clone Repository from Github
git clone https://github.com/OpenOmics/modr.git
# Change your working directory
cd modr/
# Add dependencies to $PATH
# Biowulf users should run
module load snakemake singularity
# Get usage information
./modr -h

For more detailed installation instructions, please see our setup page.

Contribute

This site is a living document, created for and by members like you. modr is maintained by the members of OpenOmics and is improved by continuous feedback! We encourage you to contribute new content and make improvements to existing content via pull request to our GitHub repository.

References

1. Kurtzer GM, Sochat V, Bauer MW (2017). Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459.
2. Köster J, Rahmann S (2018). Snakemake - a scalable bioinformatics workflow engine. Bioinformatics 34(20): 3600.

Code Snippets

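
Differential alternative splicing between two conditions with flair diffSplice: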
shell: """
# Find differential alternative
# splicing events such as skipped
# exons, alternative 5'/3' splice 
# sites, and retained introns 
flair diffSplice \\
    --threads {threads} \\
    --test \\
    --isoforms {input.isoforms} \\
    --counts_matrix {input.counts} \\
    --conditionA {params.group1} \\
    --conditionB {params.group2} \\
    --out_dir {params.outdir} \\
    --out_dir_force
"""
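
Subsetting the counts matrix per contrast and testing differential expression/usage with flair diffExp: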
shell: """
# Subset counts matrix to contain 
# only samples belonging to the 
# contrast of interest; this 
# ensures the correct group is set
# as the baseline for the design
# matrix. By default, the first 
# group encountered in the sample 
# names becomes the intercept. 
# Get the first group's samples
grp1_samples=$(
    awk -F '\\t' -v OFS='_' \\
        '$2=="{params.group1}" {{print $1,$2,$3}}' \\
        {input.manifest}
)
# Get the second group's samples
grp2_samples=$(
    awk -F '\\t' -v OFS='_' \\
        '$2=="{params.group2}" {{print $1,$2,$3}}' \\
        {input.manifest}
)
# Create per-contrast counts matrix,
# extracting samples by column name
{params.script} -c ids ${{grp1_samples}} ${{grp2_samples}} \\
    -i {input.counts} \\
> {output.counts}

# Find differential gene/isoform 
# expression and find differential
# isoform usage
flair diffExp \\
    --threads {threads} \\
    --counts_matrix {output.counts} \\
    --out_dir {params.outdir} \\
    --exp_thresh 10 \\
    --out_dir_force
"""
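
Downloading the primary assembly reference genome sequence: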
shell: 
    """
    # Download primary assembly,
    # reference genome sequence
    wget -O - \\
        {params.uri} \\
    | gzip -d \\
    > {output.genome}
    """
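
Downloading the reference transcript sequences: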
shell: 
    """
    # Download transcript sequences
    wget -O - \\
        {params.uri} \\
    | gzip -d \\
    > {output.transcripts}
    """
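
Downloading the primary (PRI) gene annotation: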
shell: 
    """
    # Download PRI gene annotation
    wget -O - \\
        {params.uri} \\
    | gzip -d \\
    > {output.gtf}
    """
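
Genomic alignment with graphmap2, then sorting and indexing with samtools: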
shell: 
    """
    # Set up a temporary directory for
    # intermediate files, with a built-in
    # mechanism for deletion on exit
    if [ ! -d "{params.tmpdir}" ]; then mkdir -p "{params.tmpdir}"; fi
    tmp=$(mktemp -d -p "{params.tmpdir}")
    trap 'rm -rf "${{tmp}}"' EXIT

    # Convert any U bps to T 
    zcat {input.fq} \\
        | sed 's/U/T/g' \\
    >  {output.fq}

    # Align against genome with graphmap2
    graphmap2 align \\
        -x rnaseq \\
        -t {threads} \\
        -r {input.ref} \\
        -d {output.fq} \\
        -o {params.sam}

    # Sort, convert to BAM, and index
    samtools sort -@{threads} \\
        -T "${{tmp}}" \\
        -O bam \\
        -o {output.bam} \\
        {params.sam}
    samtools index -@{threads} \\
        {output.bam} \\
        {output.bai}
    """
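
Indexing reads and aligning event-level signal to the reference with nanopolish: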
shell: 
    """
    # Build an index mapping
    # basecalled reads to event 
    # signals from the sequencer
    nanopolish index \\
        -d {params.f5} \\
        {input.fq}
    # Overlay event signal to 
    # kmers of reference genome
    nanopolish eventalign \\
        --reads {input.fq} \\
        --bam {input.bam} \\
        --genome {input.ref} \\
        --summary {output.summary} \\
        --print-read-names \\
        --threads {threads} \\
        --scale-events \\
    > {output.events}
    """
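
Building a sequence dictionary and FASTA index for the reference: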
shell: 
    """
    # Build a sequence dictionary
    java -jar ${{PICARDJARPATH}}/picard.jar \\
        CreateSequenceDictionary \\
        R={input.ref} \\
        O={output.dct}
    # Create FASTA index
    samtools faidx \\
        {input.ref}
    """
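
Converting filtered alignments into a raw features TSV with sam2tsv: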
shell: 
    """
    # Convert SAM to raw features TSV
    samtools view \\
        -@{threads} \\
        -h \\
        -F 4 \\
        {input.bam} \\
    | java -jar ${{PICARDJARPATH}}/sam2tsv.jar -r {input.ref} \\
    | awk 'BEGIN{{FS=OFS="\\t"}} ($9 != "S") && ($9 != "H") && ($9 != "N")' - \\
    | awk 'BEGIN{{FS=OFS="\\t"}} \\
        ($7=="."){{$7="-99";}} \\
        ($4=="."){{$4="-99"}} \\
        ($5=="."){{$5="na"}} \\
        ($8=="."){{$8="na"}} \\
        ($9=="D"){{$6=" "}} \\
        ($2==16){{$2="n"}} \\
        ($2==0){{$2="p"}} 1' \\
    > {output.tsv}
    """
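
Combining chunked nanopolish event-level signal (Dinopore combine step):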
shell: 
    """
    # Setup for the combine step,
    # see github repo for more info:
    # https://github.com/darelab2014/Dinopore/blob/main/code/s2.Combine_raw_nnpl.sh
    cd "{params.outdir}"
    head -n 1 {input.events} \\
        > {params.header}
    noline=$(wc -l {input.events} | cut -d " " -f 1)
    tmp="{params.tmp}"
    num=2
    chunk=10000000
    num1=$(expr $num + $chunk)
    num2=$(expr $num1 + 1)
    filecount=1

    # Combine event-level signal information
    while [ $num -le $noline ]; do
        sed -n "$num,${{num1}}p; ${{num2}}q" {input.events} > "$tmp"
        chk=$(wc -l "$tmp" | cut -f 1 -d " ")
        echo "Processing reads $num to $num1 in file $filecount"
        if [ $chk -gt 0 ]; then
            Rscript ${{DINOPORE_CODE}}/s2.Combine_raw_nanopolish.R \\
                -f "$tmp" \\
                -t {threads} \\
                -o {output.events}.part${{filecount}} \\
                -s ${{noline}} \\
                -n ${{num}} \\
                -c ${{chunk}}
            if test -f tmp.eli; then
                eli=$(cat tmp.eli)
                rm "tmp.eli"
            else
                eli=0
            fi
        fi
        num=$(( $num1 - $eli + 1 ))
        num1=$(expr $num + $chunk)
        num2=$(expr $num1 + 1)
        filecount=$(expr $filecount + 1)
    done

    # Combine results of all parts
    cat ${{DINOPORE_CODE}}/misc/nnpl.header > {output.events}
    cat {output.events}.part* | grep -v contig >> {output.events}
    rm -f tmp.nnpl {output.events}.part*
    """
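
Generating the raw feature table from reads common to the alignment TSV and the nanopolish events (Dinopore):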
shell: 
    """
    # Set up a temporary directory for
    # intermediate files, with a built-in
    # mechanism for deletion on exit
    if [ ! -d "{params.tmpdir}" ]; then mkdir -p "{params.tmpdir}"; fi
    tmp=$(mktemp -d -p "{params.tmpdir}")
    trap 'rm -rf "${{tmp}}"' EXIT

    # Setup for the feature table step,
    # see github repo for more info:
    # https://github.com/darelab2014/Dinopore/blob/main/code/S3.Generate_raw_features.sh
    cd "{params.outdir}"
    num=1
    chunk=100000
    num1=$(expr $num + $chunk - 1)
    num2=$(expr $num1 + 1)
    filecount=1

    # Extract common reads between filtered
    # alignment TSV and nanopolish events 
    awk -F '\\t' 'NR!=1 {{print $1}}' {input.tsv} \\
        | LC_ALL=C uniq \\
        | LC_ALL=C sort -S {params.memory} --parallel={threads}  \\
    > {params.tsv}
    awk '{{print $2}}' {input.events} \\
        | LC_ALL=C uniq \\
        | LC_ALL=C sort -S {params.memory} --parallel={threads}  \\
    > {params.events}
    # Prints lines/reads common in both files
    comm -12 \\
        {params.tsv} \\
        {params.events} \\
    > {params.common}
    noline=$(wc -l {params.common} | cut -d " " -f 1)

    # Generate raw features files
    while [ $num -le $noline ]; do
        echo "Processing reads $num to $num1 in file $filecount"
        sed -n "$num,${{num1}}p; ${{num2}}q" {params.common} \\
        > {params.tmp}
        chk=$(wc -l {params.tmp} | cut -f 1 -d " ")
        if [ $chk -gt 0 ]; then
            # The headers can also be found here:
            # https://github.com/darelab2014/Dinopore/tree/main/code/misc
            cat ${{DINOPORE_CODE}}/misc/nnpl.header \\
            > {params.innnpl}
            cat ${{DINOPORE_CODE}}/misc/tsv.header \\
            > {params.intsv}
            LC_ALL=C grep \\
                -h \\
                -F \\
                -w \\
                -f {params.tmp} \\
                {input.tsv} \\
            >> {params.intsv}
            LC_ALL=C grep \\
                -h \\
                -F \\
                -w \\
                -f {params.tmp} \\
                {input.events} \\
            >> {params.innnpl}

            # Create chunked raw features files,
            # Rscript cannot use files with abs paths
            rel_intsv="$(basename {params.intsv})"
            rel_innnpl="$(basename {params.innnpl})"
            Rscript ${{DINOPORE_CODE}}/s3.Generate_raw_feature_table.R \\
                -n ${{rel_innnpl}} \\
                -t ${{rel_intsv}} \\
                -o {params.outname}.part${{filecount}}

            num=$num2
            num1=$(expr $num + $chunk - 1)
            num2=$(expr $num1 + 1)
            filecount=$(expr $filecount + 1)
            rm {params.tmp}
        fi
    done

    # Combine results of all parts
    cat ${{DINOPORE_CODE}}/misc/inae.header \\
    > {output.table}
    cat {params.outname}.part*.positive \\
        | LC_ALL=C sort -T "${{tmp}}" -k1,1 -k3,3n \\
        | grep -v contig \\
    >> {output.table}
    cat {params.outname}.part*.negative \\
        | LC_ALL=C sort -T "${{tmp}}" -k1,1 -k3,3n \\
        | grep -v contig \\
    >> {output.table}

    # Cleanup intermediate files
    rm -f {params.tsv} \\
        {params.events} \\
        {params.common} \\
        {params.intsv} \\
        {params.innnpl} \\
        {params.outname}.part*.positive \\
        {params.outname}.part*.negative
    """
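
Aggregating per-read features at each position within a group (Dinopore):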
shell: 
    """
    # Aggregate features of all
    # reads at the same position
    # across multiple samples in 
    # the same group
    cd "{params.outdir}"
    Rscript ${{DINOPORE_CODE}}/s4.Aggregating_reads_pos.R \\
        -t {threads} \\
        -o {output.table} \\
        -r _{params.group} 
    """
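
Pre-processing the feature matrix into CNN input (Dinopore):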
shell: 
    """
    # Transform/pre-process features
    # file to create input for CNN,
    # regression model
    cd "{params.outdir}"
    Rscript ${{DINOPORE_CODE}}/s5.Preprocess_data_matrix_inputCNN.R \\
        -t {threads} \\
        -i {input.table} \\
        -o {output.rdata} \\
        -c {params.truth}
    """
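
Predicting A-to-I RNA-editing classes (Dinopore):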
shell: 
    """
    # Predict A-to-I RNA-editing
    # Class 0: A (unmodified)
    # Class 1: I (edited)
    # Class 2: G (SNP)
    cd "{params.outdir}"
    DINOPORE_CODE="{params.code}"
    Rscript ${{DINOPORE_CODE}}/s6.Predict_test_data.R \\
        -t {threads} \\
        -i {input.rdata}
    """
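
Splice-aware genomic alignment with minimap2, plus BAM statistics: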
shell: 
    """
    # Set up a temporary directory for
    # intermediate files, with a built-in
    # mechanism for deletion on exit
    if [ ! -d "{params.tmpdir}" ]; then mkdir -p "{params.tmpdir}"; fi
    tmp=$(mktemp -d -p "{params.tmpdir}")
    trap 'rm -rf "${{tmp}}"' EXIT

    # Align against reference genome,
    # minimap2 automatically handles
    # conversion of U to T bps 
    minimap2 \\
        -ax splice \\
        --MD \\
        -uf \\
        -k14 \\
        {input.ref} \\
        {input.fq} \\
    | samtools sort -@{threads} \\
        -T "${{tmp}}" \\
        -O bam \\
        --write-index \\
        -o {output.bam}##idx##{output.bai} \\
        -

    # Gather BAM statistics
    samtools idxstats {output.bam} > {output.idx}
    samtools flagstat {output.bam} > {output.flagstat}
    samtools stats {output.bam} > {output.stats}
    """
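
Transcriptomic alignment with minimap2: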
shell: 
    """
    # Set up a temporary directory for
    # intermediate files, with a built-in
    # mechanism for deletion on exit
    if [ ! -d "{params.tmpdir}" ]; then mkdir -p "{params.tmpdir}"; fi
    tmp=$(mktemp -d -p "{params.tmpdir}")
    trap 'rm -rf "${{tmp}}"' EXIT

    # Align against the reference
    # transcriptome; minimap2 handles
    # conversion of U to T bps. See
    # this issue for the minimap2
    # author's recommendations for
    # aligning direct RNA reads against
    # the transcriptome:
    # https://github.com/lh3/minimap2/issues/340
    minimap2 \\
        -ax map-ont \\
        -N 10 \\
        -k10 \\
        {input.ref} \\
        {input.fq} \\
    | samtools sort -@{threads} \\
        -T "${{tmp}}" \\
        -O bam \\
        --write-index \\
        -o {output.bam}##idx##{output.bai} \\
        -
    """
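
Strand-specific, CPM-normalized BigWig coverage tracks with bamCoverage: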
shell: 
    """
    # CPM normalized BigWig
    # of the forward strand
    bamCoverage \\
        -b {input.bam} \\
        -o {output.fwd} \\
        --samFlagExclude 16 \\
        --normalizeUsing CPM

    # CPM normalized BigWig
    # of the reverse strand
    bamCoverage \\
        -b {input.bam} \\
        -o {output.rev} \\
        --samFlagInclude 16 \\
        --normalizeUsing CPM
    """
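
Indexing basecalled reads against the device's raw signal with nanopolish index: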
shell: """
# Build an index to map basecalled
# reads to ONT device's raw signal 
nanopolish index \\
    --directory={params.f5} \\
    {input.fq}
"""
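
Estimating poly-A tail lengths with nanopolish polya: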
shell: """
# Estimate polyA tail lengths
# for each transcript
nanopolish polya \\
    --threads={threads} \\
    --reads={input.fq} \\
    --bam={input.bam} \\
    --genome={input.ref} \\
> {output.polya}
"""
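
Building a sequencing summary file for pycoQC with Fast5_to_seq_summary: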
shell: """
# Create pycoQC summary file
Fast5_to_seq_summary \\
    --threads {threads} \\
    -f {params.f5path} \\
    -s {output.tsv} \\
    --verbose 2 \\
    --fields barcode_arrangement barcode_full_arrangement \\
        barcode_score calibration_strand_end \\
        calibration_strand_genome_template \\
        calibration_strand_identity calibration_strand_start \\
        called_events channel channel_digitisation \\
        channel_offset channel_range channel_sampling_rate \\
        device_id duration flow_cell_id mean_qscore_template \\
        protocol_run_id read_id read_number run_id sample_id \\
        sequence_length_template skip_prob start_mux \\
        start_time stay_prob step_prob strand_score
"""
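
Rendering the pycoQC report: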
shell: """
pycoQC \\
    -f {input.tsv} \\
    -o {output.html} \\
    -j {output.json} 
"""
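
Per-sample quality control with FastQC: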
shell: """
fastqc \\
    -t {threads} \\
    -o {params.outdir} \\
    {input.fq}
"""
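
Alignment-based QC plots with NanoPlot: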
shell: """
NanoPlot \\
    -t {threads} \\
    --bam {input.bam} \\
    -o {params.outdir} \\
    --no_static \\
    --verbose
"""
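
Alignment summary metrics with NanoStat: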
shell: """
NanoStat \\
    -t {threads} \\
    --bam {input.bam} \\
> {output.metrics}
"""
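
Aggregating QC reports with MultiQC: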
shell: """
multiqc \\
    --ignore '*/.singularity/*' \\
    -f \\
    -c {params.conf} \\
    --interactive \\
    --outdir {params.outdir} \\
    {params.wdir}
"""
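
Transcript abundance estimation with NanoCount: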
shell: """
NanoCount \\
    -i {input.bam} \\
    --extra_tx_info \\
    -e {params.em_iter} \\
    -o {output.counts} 
"""
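
Merging per-sample NanoCount results into raw and TPM counts matrices: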
shell: """
# Estimated counts matrix,
# raw counts 
{params.script} \\
    --input {input.counts} \\
    --output {output.est_count} \\
    --join-on transcript_name \\
    --extract est_count \\
    --clean-suffix '.nanocount.transcripts.tsv' \\
    --nan-values 0.0
# TPM counts matrix,
# normalized counts
{params.script} \\
    --input {input.counts} \\
    --output {output.tpm} \\
    --join-on transcript_name \\
    --extract tpm \\
    --clean-suffix '.nanocount.transcripts.tsv' \\
    --nan-values 0.0
"""
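
Converting BAM to BED12 and correcting misaligned splice sites with flair correct: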
shell: """
# Convert BAM into BED12
bam2Bed12 \\
    -i {input.bam} \\
> {output.bed12}
# Correct misaligned splice
# sites with flair correct 
flair correct \\
    -q {output.bed12} \\
    -g {input.ref} \\
    -f {input.gtf} \\
    -t {threads} \\
    -o {params.prefix} \\
    --nvrna
"""
From line 102 of rules/quant.smk
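
Collapsing corrected reads into high-confidence isoforms with flair collapse: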
shell: """
# Set up a temporary directory for
# intermediate files, with a built-in
# mechanism for deletion on exit
if [ ! -d "{params.tmpdir}" ]; then mkdir -p "{params.tmpdir}"; fi
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

# Merge the corrected BED files
cat {input.corrected} > {output.merged}

# Find high-confidence isoforms.
# Flair does not like transcript 
# IDs reported by GENCODE  
flair collapse \\
    --temp_dir "${{tmp}}" \\
    --threads {threads} \\
    --gtf {input.gtf} \\
    --genome {input.genome} \\
    --annotation_reliant generate \\
    --check_splice \\
    --stringent \\
    --support 1 \\
    --reads {input.reads} \\
    --query {output.merged} \\
    --output {params.prefix}
"""
From line 157 of rules/quant.smk
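
Building the sample manifest and an isoform-by-sample counts matrix with flair quantify: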
shell: """
# Set up a temporary directory for
# intermediate files, with a built-in
# mechanism for deletion on exit
if [ ! -d "{params.tmpdir}" ]; then mkdir -p "{params.tmpdir}"; fi
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

# Create the sample manifest file:
# a TSV containing each sample's
# basename, group, batch, and the
# path to its FastQ file. The first
# three columns of the manifest
# cannot contain any underscores,
# and hyphens should not be used
# in the groups/condition column.
awk -F '\\t' -v OFS='\\t' \\
    '{{print \\
        $1, \\
        $3, \\
        "batch1", \\
        "{params.workdir}/"$1"/fastqs/"$1".filtered.fastq.gz" \\
    }}' {params.groups} \\
    | awk -F '\\t' -v OFS='\\t' \\
        '{{gsub("_","-",$1); \\
            gsub("_","",$2); \\
            gsub("_","-",$3); \\
            print \\
        }}' \\
>  {output.manifest}

# Create isoform-by-sample counts 
# matrix that can be used in the 
# diffExp and diffSplice modules
flair quantify \\
    -r {output.manifest} \\
    -i {input.fa} \\
    --isoform_bed {input.bed} \\
    --check_splice \\
    --stringent \\
    --temp_dir "${{tmp}}" \\
    --threads {threads} \\
    --tpm \\
    --output {params.prefix}
"""
From line 224 of rules/quant.smk
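
Structural and quality annotation of collapsed isoforms with SQANTI3: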
shell: """
# Structural and quality annotation
# of collapsed transcript from flair
sqanti3_qc.py \\
    {input.fa} \\
    {input.gtf} \\
    {input.genome} \\
    --aligner_choice=minimap2 \\
    --cpus {threads} \\
    --output {params.prefix} \\
    --dir {params.outdir} \\
    --CAGE_peak {params.cage_peak} \\
    --polyA_motif_list {params.polya_motif} \\
    --isoAnnotLite \\
    --force_id_ignore \\
    --fasta
"""
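
Filtering isoforms with SQANTI3's machine-learning filter: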
shell: """
# Applies ML filter on selected
# SQANTI3 QC classification file
sqanti3_filter.py ML \\
    --output {params.prefix} \\
    --dir {params.outdir} \\
    --gtf {input.gtf} \\
    --isoforms {input.fa} \\
    {input.txt}
"""
From line 102 of rules/sqanti.smk
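
Read trimming (the command is assembled from rule parameters):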
shell: 
    """
    # Trim reads; the full command is
    # assembled from the rule's prefix
    # and suffix parameters
    {params.prefix} {input.fq} {params.suffix} {output.fq}
    """
From line 55 of rules/trim.smk
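
Quality filtering with NanoFilt: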
shell: 
    """
    # Nanofilt requires uncompressed input
    gunzip -c {input.fq} \\
        | NanoFilt -q {params.qual_filt} \\
        | gzip \\
    > {output.flt}
    """

URL: https://openomics.github.io/modr/
Name: modr
Version: v0.5.0
Copyright: Public Domain
License: MIT License