RNA-seq Pipeline Comparison and Analyses

A Snakemake workflow for running TALON, FLAIR, and pipeline-nanopore-ref-isoforms. Performs a comparative analysis of the results using tools such as GffCompare.

Flowchart

[Flowchart image: see the repository README]

Dependencies

  1. Snakemake 7.3.1
    A full Snakemake installation is recommended (an example environment setup is sketched below).

  2. Singularity 3.7.0
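A minimal sketch of installing the pinned dependencies into a fresh conda environment, assuming conda is available with the conda-forge and bioconda channels (the environment name and channel setup are assumptions, not part of the original instructions):

# Assumes conda with the conda-forge and bioconda channels; "rpca" is a hypothetical name.
conda create -n rpca -c conda-forge -c bioconda snakemake=7.3.1 singularity=3.7.0
conda activate rpca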

Installation

Clone the repository (https://github.com/bayraktar1/RPCA) to a desired location.
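For example:

git clone https://github.com/bayraktar1/RPCA.git
cd RPCA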

How to run

  1. Set parameters in config.yaml

  2. Run: snakemake -p --use-singularity --singularity-prefix "resources" --singularity-args "--bind *" --use-conda -j ** all --configfile "config/config.yaml"

Note * : Provide your own data directory to the --bind option so that the data is accessible from inside the Singularity containers.
Note ** : Specify the number of available threads here.
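For example, a concrete invocation assuming the input data lives under /data/rnaseq (a hypothetical path) and 10 threads are available:

snakemake -p --use-singularity --singularity-prefix "resources" \
    --singularity-args "--bind /data/rnaseq" --use-conda -j 10 all \
    --configfile "config/config.yaml"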

Snakemake report

You can run snakemake --report report.html AFTER the workflow has finished to create a report containing the results.

Notes

The GTF files located in the 03_combined and 05_matched_transcripts directories have an attribute called TPM. This is actually the raw number of counts, not transcripts per million; the attribute is hijacked to pass counts to GffCompare.
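For illustration, a made-up transcript line from one of these GTF files (coordinates and IDs are hypothetical; only the trailing TPM attribute is added by the workflow):

chr1  StringTie  transcript  11869  14409  .  +  .  gene_id "G1"; transcript_id "T1"; TPM "42";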

When testing the workflow, it took about 18 hours on 10 threads with 100 GB of memory to process 6 human samples. Running with a much smaller RNA-virus dataset, it took about 8 hours for 6 samples.

The main bottleneck is TranscriptClean, which takes many hours and a large amount of memory to correct all samples.

Troubleshooting

TranscriptClean

TranscriptClean requires each header in the reference genome FASTA file to contain only a single string (no whitespace). In order to run TranscriptClean you must edit the headers first, for example as sketched below.
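A minimal sketch that truncates each FASTA header at the first whitespace (file names are hypothetical):

# Keep only the first whitespace-separated token of each header line.
awk '/^>/ {print $1; next} {print}' genome.fa > genome_fixed.fa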

Conda environment fails to build

There seems to be an issue with Snakemake 7.3.1 when building conda environments. If a timeout error occurs, you can try running the workflow with an older version of Snakemake, such as version 5.3.2.
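For example, a sketch of installing the older release into a separate virtual environment with pip (the venv route and names are assumptions; 5.3.2 is the version suggested above):

python -m venv smk-legacy
. smk-legacy/bin/activate
pip install snakemake==5.3.2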

License

MIT, see LICENSE

Code Snippets

import logging

# The 'snakemake' object used below is injected automatically when this
# script is run through the Snakemake 'script' directive.

# Setup logging
logging.basicConfig(filename=snakemake.output.removed_transcripts,
                    level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s %(name)s %(message)s')
logger = logging.getLogger(__name__)


def format_oxford(abundance):
    """
    Takes oxford abundance file and produces dictionary with counts per
    transcript.

    :param abundance: File containing transcript counts
    :return: Dictionary: key = transcript id, value = counts
    """
    ox_dict = {}
    with open(abundance, 'r') as file:
        # Skip header
        next(file)
        for line in file:
            line = line.strip("\n").split(',')
            transcript_id = line[0]
            # Can contain an empty string instead of 0
            counts = sum([int(count) for count in line[1:] if count != ''])
            ox_dict[transcript_id] = counts
    return ox_dict


def format_talon(abundance):
    """
    Takes talon abundance file and produces dictionary with counts per
    transcript.

    :param abundance: File containing transcript counts
    :return: Dictionary: key = transcript id, value = counts
    """
    talon_dict = {}
    with open(abundance, 'r') as file:
        # Skip header
        next(file)
        for line in file:
            line = line.strip('\n').split('\t')
            transcript_id = line[3]
            # Counts are always whole numbers
            counts = sum(map(int, line[11:]))
            talon_dict[transcript_id] = counts
    return talon_dict


def format_flair(abundance):
    """
    Takes flair abundance file and produces dictionary with counts per
    transcript.

    :param abundance: File containing transcript counts
    :return: Dictionary: key = transcript id, value = counts
    """
    flair_dict = {}
    with open(abundance, 'r') as file:
        # Skip header
        next(file)
        for line in file:
            line = line.strip('\n').split('\t')
            transcript_id = line[0]
            # Counts are whole numbers but stored as floats (e.g. 12.0)
            counts = sum(map(float, line[1:]))
            flair_dict[transcript_id] = counts
    return flair_dict


def combine(gtf_file, counts, output, missing_output):
    """
    Loops through the GTF file and checks the counts dictionary
    to add counts to the line and then writes it to a new file.

    This function is used per GTF file.

    Flair combines the transcript id with the gene id for known
    transcripts using '_' as a delimiter. This can cause problems
    when the names in the annotation file contain '_'.

    Flair can have transcripts present in the GTF file that are missing
    from the abundance file. This is caused by minimap2 mapping
    ambiguity. If there are any missing transcripts these will be
    written to the missing transcript log and extracted to a separate
    GTF.

    Stringtie adds transcripts with 0 count to the GTF file in
    annotation-guided mode. These 0 count transcripts get removed and
    extracted to a separate GTF file. The IDs of the removed transcripts
    are written to the log files.

    :param missing_output: GTF file with the removed transcripts
    :param gtf_file: The GTF file produced by a pipeline.
    :param counts: Dictionary: key = transcript id, value = counts
    :param output: Output directory and file name
    :return: Writes a file to the output param
    """
    # Read the GTF file
    with open(gtf_file) as GTF, open(output, 'w') as outfile, open(missing_output, 'a') as removed_out:
        origin_file = gtf_file.split('/')[-1]
        write = True
        for line in GTF:
            # Check if line is a transcript
            if line.split()[2] == 'transcript':
                try:
                    # Check for transcript id in counts
                    transcript_id = line.split()[11][1:-2]
                    count = counts[transcript_id]
                    # if count is 0 transcript is removed
                    if count == 0:
                        logger.warning(f'ORIGIN: {origin_file} 0 count: {transcript_id}')
                        removed_out.write(f'{line[:-1]}\n')
                        write = False
                        continue
                except KeyError:
                    # Transcript is missing from the abundance file
                    try:
                        # This is for the flair '_' delimiter issue
                        gene_id = line.split()[9][1:-2]
                        new_transcript_id = f'{transcript_id}_{gene_id}'
                        count = counts[new_transcript_id]
                    except KeyError:
                        # transcript is actually missing
                        # Add missing transcript id to missing transcripts.log
                        logger.error(f'ORIGIN: {origin_file} FIRST TRY: {transcript_id} SECOND TRY: {new_transcript_id}')
                        removed_out.write(f'{line[:-1]}\n')
                        write = False
                        continue

                # transcript is not missing from the abundance and has at least 1 count.
                # Add transcript count to end of line and write to file
                # counts are not in TPM, hijacking the column for use with GffCompare
                write = True
                outfile.write(f'{line[:-1]} TPM "{count}";\n')
            else:
                # Line is not a transcript
                if write:
                    # Line is not an exon from a removed transcript
                    outfile.write(line)
                else:
                    # Line is an exon from a removed transcript
                    removed_out.write(line)


def main():
    """
    Takes abundance file and passes it to a format function. The
    resulting dictionary is passed to combine. Combine produces a new
    GTF that has counts for transcripts.
    :return: -
    """
    flair_abundance = snakemake.input.flair_count
    flair_gtf = snakemake.input.flair_transcripts
    flair_dict = format_flair(flair_abundance)
    combine(flair_gtf, flair_dict, snakemake.output.flair_combined, snakemake.output.flair_removed)
    flair_dict.clear()

    ox_abundance = snakemake.input.oxford_count
    ox_gtf = snakemake.input.oxford_transcript
    ox_dict = format_oxford(ox_abundance)
    combine(ox_gtf, ox_dict, snakemake.output.oxford_combined, snakemake.output.oxford_removed)
    ox_dict.clear()

    talon_abundance = snakemake.input.talon_count
    talon_gtf = snakemake.input.talon_transcripts
    talon_dict = format_talon(talon_abundance)
    combine(talon_gtf, talon_dict, snakemake.output.talon_combined, snakemake.output.talon_removed)


main()

from collections import defaultdict
import numpy as np
import matplotlib.pyplot as plt


def tracking(gfftracking):
    """
    Read the data from the GffCompare gffcmp.tracking file and
    format as a dictionary.

    Only takes three-way matches from the file.

    :return: tcons_XXXX (key) : [transcript_id 1,
                                 transcript_id 2,
                                 transcript_id 3] (value)
    """
    tcons = {}
    with open(gfftracking) as file:
        for line in file:
            line = line.split()
            transcripts = line[4::]
            temp_list = []
            # '-' means that at least one pipeline did not have a matching transcript
            # only take lines with three transcripts
            if '-' not in transcripts:
                for transcript in transcripts:
                    temp_list.append(transcript.split('|')[1])
                tcons[line[0]] = temp_list
    return tcons


def gtf(file):
    """
    Read transcript data from a GTF file and format as dictionary.
    :param file: GTF file
    :return: transcript_id (key): [(exon_start, exon_end),
                                   (exon_start, exon_end),
                                   (exon_start, exon_end) etc...](value)
    """
    gtf_dict = defaultdict(list)
    with open(file) as GTF:
        for line in GTF:
            # skip comments
            if line.startswith("#"):
                continue
            line = line.split()
            if line[2] == 'exon':
                transcript_id = line[11][1:-2]
                start = int(line[3])
                end = int(line[4])
                exon = (start, end)
                gtf_dict[transcript_id].append(exon)
    return gtf_dict


def sort_gtf(gtf_dict):
    """
    Sorts the exons of each transcript by position. The exons in the
    TALON GTF file are not sorted correctly (a known TALON issue).
    :param gtf_dict: transcript_id (key): [(exon_start, exon_end),
                                   (exon_start, exon_end),
                                   (exon_start, exon_end) etc...](value)
    :return: gtf_dict but with sorted exons
    """
    for key, value in gtf_dict.items():
        gtf_dict[key] = sorted(value)
    return gtf_dict


def retrieve_terminal_exon(tcons, oxford, flair, talon):
    """
    Places terminal exon positions of transcripts into dictionary by
    matching transcript id.

    :param tcons: tcons_XXXX (key) : [transcript_id 1,
                                      transcript_id 2,
                                      transcript_id 3] (value)

    :param oxford: key = transcript_id, value = [(exon_start, exon_end)]
    :param flair: key = transcript_id, value = [(exon_start, exon_end)]
    :param talon: key = transcript_id, value = [(exon_start, exon_end)]

    :return: tcons_XXXX(key):[[(exon_start, exon_end), (exon_start, exon_end)],
                            [(exon_start, exon_end), (exon_start, exon_end)],
                            [(exon_start, exon_end), (exon_start, exon_end)]] (value)
    """
    tcons_exon = {}
    for tcon_id, transcript_ids in tcons.items():
        # no need to select terminal exons for single-exon transcripts
        # data only has three-matches and all transcripts have the same number of exons
        if len(oxford[transcript_ids[0]]) == 1:
            # Get exons that belong to transcript_id
            oxford_exons = oxford[transcript_ids[0]]
            flair_exons = flair[transcript_ids[1]]
            talon_exons = talon[transcript_ids[2]]
            # add exons to tcon_exon dictionary
            tcons_exon[tcon_id] = [oxford_exons, flair_exons, talon_exons]
        # Slice terminal exons for multi-exon transcripts
        else:
            oxford_terminal_exons = [oxford[transcript_ids[0]][0],
                                     oxford[transcript_ids[0]][-1]]
            flair_terminal_exons = [flair[transcript_ids[1]][0],
                                    flair[transcript_ids[1]][-1]]
            talon_terminal_exons = [talon[transcript_ids[2]][0],
                                    talon[transcript_ids[2]][-1]]
            tcons_exon[tcon_id] = [oxford_terminal_exons, flair_terminal_exons, talon_terminal_exons]
    return tcons_exon


def calculate_difference(transcript_exons):
    """
    Calculate the differences between exon positions
    :param transcript_exons: tcons_XXXX (key): [[(exon_start, exon_end)],
                                                [(exon_start, exon_end)],
                                                [(exon_start, exon_end)]]
    :return: Differences between terminal exon positions
    """
    oxford_flair = []
    oxford_talon = []
    flair_talon = []
    for key, value in transcript_exons.items():
        value = np.array(value, dtype=object)
        oxford = value[0]
        flair = value[1]
        talon = value[2]
        a = np.subtract(oxford, flair)
        b = np.subtract(oxford, talon)
        c = np.subtract(flair, talon)
        oxford_flair.append(a)
        oxford_talon.append(b)
        flair_talon.append(c)
    return oxford_flair, oxford_talon, flair_talon


def start_end(compare):
    """
    Sorts the calculated differences into six dictionaries:
    start exon - start and end position,
    end exon - start and end position,
    single exon - start and end position.
    :param compare: List of calculated position differences.
    :return: Six dictionaries with start and end position differences.
    """
    start_exon_start = defaultdict(list)
    start_exon_end = defaultdict(list)
    end_exon_start = defaultdict(list)
    end_exon_end = defaultdict(list)
    single_exon_start = defaultdict(list)
    single_exon_end = defaultdict(list)
    for positions in compare:
        # single exon transcripts
        if len(positions) == 1:
            # start position
            single_exon_start['0'].append(abs(positions[0][0]))
            # end position
            single_exon_end['0'].append(abs(positions[0][1]))
        else:
            # multi exon transcripts
            for number, position in enumerate(positions):
                # start exon
                if number == 0:
                    # start exon - start
                    start_exon_start[number].append(abs(position[0]))
                    # start exon - end
                    start_exon_end[number].append(abs(position[1]))
                    # end exon
                else:
                    # end exon start
                    end_exon_start[number].append(abs(position[0]))
                    # end exon end
                    end_exon_end[number].append(abs(position[1]))
    return start_exon_start, start_exon_end, end_exon_start, end_exon_end, single_exon_start, single_exon_end


def plot(start_exon_start, start_exon_end, end_exon_start, end_exon_end, single_exon_start, single_exon_end, output):
    """
    Plots the exon position differences for one pairwise comparison
    into a single figure with 6 subplots (called once per comparison,
    giving 3 figures in total).
    """
    plt.rcParams["figure.figsize"] = [15, 15]
    plt.rcParams["figure.autolayout"] = True
    plt.rcParams.update({'font.size': 15})

    main_figure, axs = plt.subplots(nrows=3, ncols=2)
    title = output.split('/')[-1].split('.')[0]
    main_figure.suptitle(f'Comparison of {title} start and end exons from three-way matched transcripts')

    # One subplot per (exon type, position) combination, in row-major order
    panels = [
        (single_exon_start, 'Single-exon, start position differences'),
        (single_exon_end, 'Single-exon, end position differences'),
        (start_exon_start, 'Start exon, start position differences'),
        (start_exon_end, 'Start exon, end position differences'),
        (end_exon_start, 'End exon, start position differences'),
        (end_exon_end, 'End exon, end position differences'),
    ]
    for ax, (data, subtitle) in zip(axs.flat, panels):
        ax.boxplot(data.values(), showfliers=True)
        ax.set(xlabel="Exon number", ylabel="Number of bases")
        ax.set_title(subtitle)
        ax.set_xticklabels(data.keys(), rotation=45)

    plt.savefig(output, dpi=200)


def main():
    tcons = tracking(snakemake.input.tracking)
    oxford = gtf(snakemake.input.oxford)
    flair = gtf(snakemake.input.flair)
    talon = gtf(snakemake.input.talon)
    talon_sorted = sort_gtf(talon)

    transcript_exons = retrieve_terminal_exon(tcons, oxford, flair, talon_sorted)

    oxford_flair, oxford_talon, flair_talon = calculate_difference(transcript_exons)

    start_exon_start, start_exon_end, end_exon_start, end_exon_end, single_exon_start, single_exon_end = start_end(oxford_flair)
    plot(start_exon_start, start_exon_end, end_exon_start, end_exon_end, single_exon_start, single_exon_end, snakemake.output.oxford_flair)

    start_exon_start, start_exon_end, end_exon_start, end_exon_end, single_exon_start, single_exon_end = start_end(oxford_talon)
    plot(start_exon_start, start_exon_end, end_exon_start, end_exon_end, single_exon_start, single_exon_end, snakemake.output.oxford_talon)

    start_exon_start, start_exon_end, end_exon_start, end_exon_end, single_exon_start, single_exon_end = start_end(flair_talon)
    plot(start_exon_start, start_exon_end, end_exon_start, end_exon_end, single_exon_start, single_exon_end, snakemake.output.flair_talon)


main()

from collections import defaultdict


def categorize_tracking(tracking_file):
    """
    Gets the transcript IDs for each match in the tracking file and
    sorts them into three dictionaries (three-way match, two-way
    match, and no match).

    :param tracking_file: tracking file from GffCompare
    :return: three dictionaries, each mapping tcons_XXXX (key) to
             [transcript_id, transcript_id, transcript_id] (value)
    """
    with open(tracking_file, 'r') as tracking:
        three_match, two_match, no_match = {}, {}, {}
        for line in tracking:
            line = line.split()
            # section containing the transcript match information
            transcripts = line[4::]
            transcript_ids = []

            count = 0
            for transcript in transcripts:
                # No transcript present in position
                if transcript == '-':
                    count += 1
                    transcript_id = transcript
                # transcript present
                else:
                    transcript_id = transcript.split('|')[1]
                transcript_ids.append(transcript_id)

            # 1 transcript was not present
            if count == 1:
                two_match[line[0]] = transcript_ids
            # 2 transcripts were not present
            elif count == 2:
                no_match[line[0]] = transcript_ids
            # all transcripts are present
            else:
                three_match[line[0]] = transcript_ids
    return three_match, two_match, no_match


def gtf(file):
    """
    Reads a GTF file and collects each transcript together with its
    exons in a dictionary.
    :param file: GTF file.
    :return: Key: transcript_id Value: lines that contain the ID.
    """
    gtf_dict = defaultdict(list)
    with open(file) as GTF:
        for line in GTF:
            # skip comments
            if line.startswith("#"):
                continue
            # skip genes
            if line.split()[2] == "gene":
                continue
            transcript_id = line.split()[11][1:-2]
            gtf_dict[transcript_id].append(line.strip('\n'))
    return gtf_dict


def write(tracking_dict, oxford, flair, talon, outfile):
    """
    Creates new sorted GTF file containing all matched transcripts.
    :param tracking_dict: Dictionary with transcript id's
    :param oxford: GTF file in dictionary by transcript id
    :param flair: GTF file in dictionary by transcript id
    :param talon: GTF file in dictionary by transcript id
    :param outfile: name of outfile
    :return: -
    """
    with open(outfile, 'w') as out:
        for tcons_id, transcript_ids in tracking_dict.items():
            # transcript order: q1 oxford, q2 flair, q3 talon
            # writes out one of the three transcripts
            if transcript_ids[0] != '-':
                for line in oxford[transcript_ids[0]]:
                    out.write(f'{line} match_id "{tcons_id}";\n')
            # if no oxford transcript, write flair transcript
            elif transcript_ids[1] != '-':
                for line in flair[transcript_ids[1]]:
                    out.write(f'{line} match_id "{tcons_id}";\n')
            # if no oxford or flair transcript, write talon transcript
            else:
                for line in talon[transcript_ids[2]]:
                    out.write(f'{line} match_id "{tcons_id}";\n')


def main():
    """
    Creates dictionaries from GTF files and tracking file and creates
    three sorted GTF files containing matched transcripts.
    :return:
    """
    tracking_three, tracking_two, tracking_one = categorize_tracking(snakemake.input.tracking)
    oxford = gtf(snakemake.input.oxford)
    flair = gtf(snakemake.input.flair)
    talon = gtf(snakemake.input.talon)
    write(tracking_three, oxford, flair, talon, snakemake.output.tracking_three)
    write(tracking_two, oxford, flair, talon, snakemake.output.tracking_two)
    write(tracking_one, oxford, flair, talon, snakemake.output.tracking_one)


main()

shell:
    '''
    NanoPlot \
    -t {threads} \
    --fastq {input.fq} \
    --plots dot kde \
    -p {params.prefix} \
    -o {params.outdir}
    '''

shell:
    '''
    minimap2 \
        -t {threads} \
        -ax splice \
        {params.opts} \
        --MD \
        {input.genome} \
        {input.fq} > {output.sam_files}
    '''

shell:
    '''
    samtools view -Sb {input.sam} | samtools sort -@ {threads} -o {output.bam}
    samtools index {output.bam}
    '''

shell:
    '''
    get_SJs_from_gtf \
        --f {input.annotation} \
        --g {input.genome} \
        --o {output.splicejns}
    '''

shell:
    '''
    TranscriptClean \
        --sam {input.sam_files} \
        --genome {input.genome} \
        -t {threads} \
        --spliceJns {input.splicejns} \
        --outprefix {params.outdir}
    '''

shell:
    '''
    talon_label_reads \
        --f {input.clean_sam}\
        --g {input.genome} \
        --t {threads} \
        --ar 20 \
        --deleteTmp \
        --o {params.outdir}
    '''

run:
    for label, name in zip(input.labels, params.datasetnames):
        with open(output.config, 'a+') as config:
            config.write("%s,%s,ONT,%s\n" % (name, name, label))

shell:
    '''

    talon_initialize_database \
        --f {input.annotation} \
        --a {params.annotation_name} \
        --g {params.genome_name} \
        --5p {params.end5} \
        --3p {params.end3} \
        --o {params.outdir}
    echo database init done

    talon \
        --f {input.config} \
        --db {params.dbloc} \
        --build {params.genome_name} \
        --o {params.outdir} \
        -t {threads}
    echo database annotation done
    '''

shell:
    '''
    talon_filter_transcripts \
        --db {input.db} \
        --datasets {params.datasetnames} \
        -a {params.annotation_name} \
        --maxFracA 0.5 \
        --minCount {params.mincount} \
        --minDatasets {params.mindatasets} \
        --o {output.filtered_transcripts}
    '''

shell:
    '''
    talon_abundance \
        --db {input.db} \
        --whitelist {input.filter} \
        -a {params.annotation_name} \
        --build {params.genome_name} \
        --o {params.outdir}
    '''

shell:
    '''
    talon_create_GTF \
        --db {input.db} \
        --whitelist {input.filter} \
        -a {params.annotation_name} \
        --build {params.genome_name} \
        --o {params.outdir}
    '''

shell:
    '''
    {params.scriptpath} --input_bam {input.bam} > {output.bed12}
    '''

shell:
    '''
    flair.py correct \
        --genome {input.genome} \
        --query {input.bed} \
        --gtf {input.annotation} \
        --nvrna \
        --threads {threads} \
        --window {params.window} \
        --output {params.outdir}
    '''

shell:
    '''
    cat {input.bed_corrected} >> {output.bed_concatenated}
    '''

shell:
    '''
    flair.py collapse \
        --genome {input.genome} \
        --gtf {input.annotation} \
        --reads {params.reads} \
        --query {input.bed_concatenated} \
        --temp_dir {params.temp_dir} \
        --generate_map \
        --threads {threads} \
        --window {params.window} \
        --quality {params.quality} \
        --support {params.support} \
        --max_ends {params.max_ends} \
        {params.opts} \
        --output {params.outdir}
    '''

run:
    for read, name in zip(input.reads, params.datasetnames):
        with open(output.config, 'a+') as config:
            config.write("%s\tcondition\tbatch\t%s\n" % (name, read))

shell:
    '''
    flair.py quantify \
        --reads_manifest {input.manifest} \
        --isoforms {input.coll_fasta} \
        --threads {threads} \
        --tpm \
        --quality {params.quality} \
        --output {output.abundance}
    '''

shell:
    '''
    samtools stats \
        --reference {input.reference} \
        --threads {threads} \
        {input.bam} > {output.stats}
    '''

shell:
    '''
    samtools view -q {params.min_mq} -F 2304 -b {input.bam} \
    | seqkit bam -j {threads} -x -T '{params.flt}' \
    | samtools sort -@ {threads} -o {output.bam_filtered}
    samtools index {output.bam_filtered}
    '''

shell:
    '''
    stringtie --rf -G {input.annotation} -L -p {threads} {params.opts} -o {output.gff} {input.bam_filtered}
    '''

shell:
    '''
    stringtie \
        --merge {input.gff_files} \
        -G {input.annotation} \
        {params.opts} \
        -o {output.merged_gff}
    '''

shell:
    '''
    stringtie \
        -G {input.merged_gtf} \
        -e \
        -L \
        -p {threads} \
        -o {output.count_gtf} \
        {input.bam}
    '''

run:
    for read, name in zip(input.count_gtf, params.datasetnames):
        with open(output.config, 'a+') as config:
            config.write("%s\t%s\n" % (name, read))

shell:
    '''
    # Sum the average read lengths reported by samtools stats
    # (line 33, field 4 of each per-sample stats file).
    declare -i calc=0;
    for line in {input.stats}
    do
       numb=$(awk 'NR==33 {{print $4}}' < ${{line}});
       calc+=$((${{numb}}));
    done;
    # Divide by the number of stats files to get the mean read length.
    words=$(echo {input.stats} | wc -w);
    avg_len=$((${{calc}} / ${{words}}));

    prepDE.py \
        -i {input.config_file} \
        -l ${{avg_len}} \
        -g {output.genes} \
        -t {output.transcripts}
    '''

shell:
    '''
    sqanti3_qc.py \
        {input.gtf_input} \
        {input.annotation} \
        {input.genome} \
        --force_id_ignore \
        --output {params.outdir} \
        --cpus {threads} \
        --report both
    '''

shell:
    '''
    featureCounts \
        {input.bams} \
        -T {threads} \
        -a {input.annotation} \
        -L \
        -M \
        -S 1 \
        -o {output.counts}
    '''

script:
    '''scripts/combine.py'''

shell:
    '''
    gffcompare \
        -r {input.annotation} \
        -e 100 \
        -d 100 \
        --no-merge \
        -T \
        -o {params.outdir} \
        {input.oxford} {input.flair} {input.talon}
    '''

script:
    '''scripts/extract_overlap.py'''

script:
    '''scripts/exon_comparison.py'''

shell:
    '''
    multiqc \
    {params.nanoplot_dir} \
    {params.samtools_subread_dir} \
    --outdir {params.outdir}
    '''