RNA-Seq Analysis Workflow for Single-End Data using qproject

public public 1yr ago Version: 7 0 bookmarks

RNA-Seq workflow for single-end data. The workflow can be downloaded and run on a cluster environment using the tool qproject provided on github: https://github.com/qbicsoftware/qproject

The workflow uses a module system to load the required software. Be sure to check the jobscript.sh file to see which software modules are required. The modules are loaded automatically when using qproject run to start the workflow, otherwise they have to be loaded manually.

One should use qproject to download the files. This also creates all folders necessary for the workflow.

qproject create -t . -w github:qbicsoftware/rnaseq

Be sure to add a config file "params.json" in etc which should look like this:

{
 "gtf": "path/to/gtf",
 "indexed_genome": "path/to/genome/basename",
 "stranded": "no",
 "overlap_mode": "union",
 "feature_type": "exon",
 "gff_attribute": "gene_id",
 "normalize_counts": "deseq2"
}

where indexed_genome and gtf are paths relative to ref .

indexed_genome is the basename of a bowtie2 index. gtf is the .gtf file of the reference genome

The parameters stranded , overlap_mode , feature_type and gff_attribute are explained in the htseq documentation.

Members of QBiC can download the data for analysis using qpostman : https://github.com/qbicsoftware/postman-cli

It should be installed on the computing stations and can be loaded with:

module load qbic/qpostman/0.1.2.3

To download the data navigate to the data folder and either provide a QBiC ID

java -jar qpostman.jar -i <QBiC ID> -u <your_qbic_username>

or a file containing the project QBiC IDs:

postman-0.1.2.3 -f sample_IDs.csv -u user-name

If you're not using qpostman just put the relevant files in the data folder (formats supported: .fastq , .fastq.gz ).

To run the workflow navigate to the src folder. Using snakemake -n one can display the operations the workflow will perform. Using the --dag parameter and piping it to dot one can create a .pdf version of the directed acyclic graph used by snakemake to inspect the behavious of the workflow on a local machine.

module load qbic/anaconda
cd src/
snakemake -n
snakemake --dag | dot -Tpdf > dag.pdf

To run the workflow:

qproject run -t ..

While running one can inspect the log files (e.g. in a different screen session) for the progress and errors generated by the workflow:

cd logs/
tail snake.err -f

And to check the jobs on the computing cluster one can use qstat .

Alternatively to using qproject run one could use snakemake -j to run the workflow, but then be sure to check the jobscript.sh to load the required modules manually and also note that this would also not use qsub to submit the jobs.

Code Snippets

111
112
113
114
115
116
117
118
run:
    out = os.path.abspath(str(output))
    if glob.glob(data("*.sha256sum")):
        shell("cd %s; "
              "sha256sum -c *.sha256sum && "
              "touch %s" % (data('.'), out))
    else:
        shell("touch %s" % out)
SnakeMake From line 111 of master/Snakefile
123
shell: "ln -s {input} {output}"
SnakeMake From line 123 of master/Snakefile
128
shell: "zcat {input} > {output}"
SnakeMake From line 128 of master/Snakefile
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
run:
    with open(str(input)) as infile, open(str(output), 'w') as outfile:
        num_ok = 0
        num_filtered = 0
        line = infile.readline()
        while line:
            assert line.startswith('@')
            body = ''.join(infile.readline() for _ in range(3))
            if all(':Y:' not in part for part in line.split(' ')[1:]):
                num_ok += 1
                outfile.write(line)
                outfile.write(body)
            else:
                num_filtered += 1
            line = infile.readline()
    num_all = num_ok + num_filtered
    if num_filtered / num_all > .1:
        raise ValueError("More than 10% of reads were filtered in "
            "PreFilterReads. This probably indicates a bug.")
SnakeMake From line 133 of master/Snakefile
156
shell: 'mkdir -p {output} && (fastqc {input} -o {output} --extract || (rm -rf {output} && exit 1))'
SnakeMake From line 156 of master/Snakefile
161
shell: "cp {input}/{wildcards.name}_fastqc.zip {output}"
SnakeMake From line 161 of master/Snakefile
166
shell: 'mkdir -p {output} && (fastqc {input} -o {output} --extract || (rm -rf {output} && exit 1))'
SnakeMake From line 166 of master/Snakefile
171
shell: "cp {input}/{wildcards.name}_fastqc.zip {output}"
SnakeMake From line 171 of master/Snakefile
176
177
178
179
180
181
182
183
184
185
186
187
188
run:
    f = open(str(input) + "/" + wildcards['name'] + "_fastqc/fastqc_data.txt")
    out = open(str(output), "w")
    sw = False
    for line in f:
        if line.startswith(">>Overrepresented"):
            sw = True
        if not line.startswith(">>END_MODULE") and sw:
            out.write(line)
        else:
            sw = False
    f.close()
    out.close()
SnakeMake From line 176 of master/Snakefile
193
194
195
196
197
198
199
200
201
202
203
204
205
run:
    f = open(str(input))
    out = open(str(output), "w")
    for line in f:
        if not (line.startswith("#") or line.startswith(">>")):
             tokens1 = line.split("\t")
             print(tokens1)
             fseq = tokens1[0]
             fheader = '>' + tokens1[3]
             out.write(fheader)#+'\n')#it just happens that is the last column
             out.write(fseq+'\n')
    f.close()
    out.close()
SnakeMake From line 193 of master/Snakefile
210
shell: "cat {input} > {output}"
SnakeMake From line 210 of master/Snakefile
215
216
217
218
shell:
    """
   awk '/^>/ {{P=index($0,"No Hit")==0}} {{if(P) print}} ' {input} > {output}
    """
SnakeMake From line 215 of master/Snakefile
223
224
225
226
227
228
229
run:
    with open(str(input[0])) as f:
        skip = not bool(f.read(1))
    if skip:
        os.symlink(os.path.abspath(str(input[1])), str(output))
    else:
        shell('cutadapt --discard-trimmed -a file:{input[0]} -o {output} {input[1]}')
234
235
236
237
238
run:
    gtf = parameters['gtf']
    genome = parameters['indexed_genome']
    shell('tophat --no-coverage-search -o {output} -p 2 -G %s %s {input}'
          % (gtf, genome))
SnakeMake From line 234 of master/Snakefile
243
244
245
246
247
run:
    sam_command = "samtools view {input}/accepted_hits.bam"
    htseq = ("htseq-count -i {gff_attribute} -t {feature_type} "
             "-m {overlap_mode} -s {stranded} - {gtf}").format(**parameters)
    shell("%s | %s > {output}" % (sam_command, htseq))
254
255
256
257
258
259
260
261
262
263
264
265
266
run:
    import pandas as pd
    import re

    pattern = "HTSeqCounts_([0-9a-zA-Z_\- ]*).txt"
    names = [re.search(pattern, str(name)).groups()[0] for name in input]
    data = {}
    for name, file in zip(names, input):
        file = str(file)
        data[name] = pd.Series.from_csv(file, sep='\t')
    df = pd.DataFrame(data)
    df.index.name = parameters['gff_attribute']
    df.to_csv(str(output))
271
shell: "samtools index {input}/accepted_hits.bam && mv -f {input}/accepted_hits.bam.bai  {output}"
276
shell: "cp {input}/align_summary.txt {output}"
SnakeMake From line 276 of master/Snakefile
281
shell: "samtools depth {input}/accepted_hits.bam > {output}"
286
shell: '''dc -e "$(wc -l {input} | cut -f1 -d' ') 4 / p" > {output}'''
SnakeMake From line 286 of master/Snakefile
291
shell: '''dc -e "$(wc -l {input} | cut -f1 -d' ') 4 / p" > {output}'''
SnakeMake From line 291 of master/Snakefile
296
297
298
299
300
301
302
    shell: '''dc -e "$(wc -l {input} | cut -f1 -d' ') 4 / p" > {output}'''

"""
Rule to get software versions of used programs in workflow. Rule either calls program with --version flag if possible, or runs
it without parameters displaying output row containing version information with unix tail command.
It then redirects it to Summary/software_versions.txt 
"""
SnakeMake From line 296 of master/Snakefile
307
308
309
310
311
312
313
314
run:
    shell("anaconda --version > Summary/software_versions.txt")
    shell("conda --version >> Summary/software_versions.txt")
    shell("fastqc --version >> Summary/software_versions.txt")
    shell("htseq-count -h | tail -1 >> Summary/software_versions.txt")
    shell("bowtie2 --version | head -1 >> Summary/software_versions.txt")
    shell("tophat2 --version >> Summary/software_versions.txt")
    shell("samtools --version | head -2 >> Summary/software_versions.txt")
ShowHide 19 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/qbicsoftware-archive/rnaseq
Name: rnaseq
Version: 7
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...