Transcriptomics Data Quality Control and Adapter Trimming Workflow


Implementation homework for transcriptomics

As described in the task:

  1. ran FastQC on the raw reads

  2. aggregated the reports with MultiQC

  3. trimmed adapters with bbduk

  4. ran FastQC on the trimmed reads

  5. ran MultiQC on the trimmed reads

The clearest indicator of successful adapter removal is the Adapter Content plot:

Before trimming: [image: notTrimmed]

After trimming: [image: trimmed]
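As a rough illustration of what the Adapter Content plot measures, here is a toy sketch (hypothetical helper, not part of the workflow) that reports the fraction of reads still containing the common Illumina adapter prefix before and after a crude trim:

```python
# Toy sketch: the Adapter Content plot tracks how many reads contain a
# known adapter sequence. ADAPTER below is the standard Illumina/TruSeq
# adapter prefix; the reads are synthetic examples, not workflow data.
ADAPTER = "AGATCGGAAGAGC"

def adapter_fraction(reads: list[str]) -> float:
    """Fraction of reads that contain the adapter anywhere."""
    hits = sum(ADAPTER in r for r in reads)
    return hits / len(reads)

before = ["ACGT" * 5 + ADAPTER, "TTTT" * 8, "GG" + ADAPTER + "A" * 5]
after = [r.split(ADAPTER)[0] for r in before]  # crude trim at the adapter

print(adapter_fraction(before))  # two of three reads contain the adapter
print(adapter_fraction(after))   # none after trimming
```

After a successful bbduk run, the trimmed FastQC reports should show this fraction drop to near zero, which is exactly what the before/after plots above demonstrate.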
  6. aligned the reads with STAR

  7. indexed the BAM with samtools index, and counted reads with featureCounts

Here we specify the featureCounts strandedness (the -s option):

0 = not stranded

1 = stranded

2 = reversely stranded

Collibri is a stranded prep, therefore -s 1.

KAPA is a reverse-stranded prep, therefore -s 2.
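The strandedness choice can be sketched as a small helper (hypothetical function name, not part of the workflow; it mirrors the shell if/else used in the featureCounts rule below):

```python
# Sketch: map the library-prep kit named in a BAM path to the
# featureCounts -s value. Collibri kits are stranded (-s 1);
# KAPA kits are reverse-stranded (-s 2).
def strand_for(sample_path: str) -> int:
    if "Collibri" in sample_path:
        return 1
    if "KAPA" in sample_path:
        return 2
    return 0  # fall back to unstranded

print(strand_for("star/Collibri_UHRR_1.bam"))  # 1
print(strand_for("star/KAPA_HBR_2.bam"))       # 2
```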

  8. used DESeq2 to perform differential expression (DE) analysis comparing UHRR vs HBR:

As a result, obtained DE genes with adjusted p-values (padj):

For Collibri: [image: collibri]

and for KAPA: [image: KAPA]
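DESeq2 computes padj with the Benjamini-Hochberg procedure by default; a minimal sketch of that adjustment on toy p-values (not workflow data):

```python
# Benjamini-Hochberg step-up adjustment, the default behind DESeq2's
# padj column. Toy implementation for illustration only.
def bh_adjust(pvals):
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices by ascending p
    adj = [0.0] * n
    prev = 1.0
    for rank in range(n, 0, -1):        # walk from largest to smallest p
        i = order[rank - 1]
        prev = min(prev, pvals[i] * n / rank)  # enforce monotonicity
        adj[i] = prev
    return adj

print(bh_adjust([0.01, 0.04, 0.03, 0.20]))
```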

Also volcano plots:

For Collibri: [image: collibri-volkano]

and KAPA: [image: kapa-volkano]

The repository includes a Snakemake file that creates these plots.
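A volcano plot places log2 fold change on the x-axis and -log10(padj) on the y-axis; a sketch of that transform (hypothetical helper and toy results, not the workflow's Snakefile):

```python
import math

# Sketch of the data prep behind a volcano plot: each gene becomes a
# point (log2FC, -log10(padj)), flagged significant if it passes both
# the fold-change and padj cutoffs. Toy results, not workflow output.
def volcano_points(results, lfc_cut=1.0, padj_cut=0.05):
    pts = []
    for gene, log2fc, padj in results:
        y = -math.log10(padj)
        sig = abs(log2fc) >= lfc_cut and padj <= padj_cut
        pts.append((gene, log2fc, y, sig))
    return pts

toy = [("A", 2.5, 1e-8), ("B", -0.2, 0.9), ("C", -1.8, 0.01)]
for p in volcano_points(toy):
    print(p)
```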

  9. performed PCA using the DE genes:

Plot for Collibri: [image: collibri-pca]

and KAPA: [image: kapa-pca]
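A minimal PCA sketch (numpy SVD on a toy sample-by-gene matrix, not the workflow's count data) of the projection behind these plots:

```python
import numpy as np

# PCA via SVD: center each gene (column), take the singular value
# decomposition, and project samples onto the top k components.
def pca(X, k=2):
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T               # sample coordinates on first k PCs
    explained = (S ** 2) / (S ** 2).sum()  # variance explained per component
    return scores, explained[:k]

# Toy matrix: 4 samples x 3 genes (two clear groups of samples).
X = np.array([[10., 0., 5.], [12., 1., 6.], [0., 9., 1.], [1., 10., 2.]])
scores, var = pca(X)
print(scores.shape)
```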

Code Snippets

shell:
    "tar -xf {input} -C genome"
wrapper:
    "v1.28.0/bio/multiqc"
wrapper:
    "v1.28.0/bio/bbtools/bbduk"
wrapper:
    "v1.28.0/bio/multiqc"
shell:
    """
    echo {params.sjdbOverhang}
    STAR --runThreadN 4 --runMode genomeGenerate --genomeDir {output} \
    --genomeFastaFiles {input}
    """
shell:
    """
    STAR --genomeDir {input.index} --readFilesIn {input.fq1} --outFileNamePrefix star/ --outSAMtype BAM SortedByCoordinate --readFilesCommand gunzip -c --outStd BAM_SortedByCoordinate > {output.aln}
    """
shell:
    """
    echo {input.bam}
    samtools index {input.bam} {output.bai}
    """
shell:
    """
    export test={input.samples}
    if [[ "$test" == *"Collibri"* ]]; then
        strand="1"
    else
        strand="2"
    fi
    echo "strand: $strand"

    samtools sort -n {input.samples} -o sorted.bam
    featureCounts -p -t exon -g gene_id -a {input.annotation} -o {output[0]} -s $strand sorted.bam
    """
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2021, Filipe G. Vieira"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

java_opts = get_java_opts(snakemake)
extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)


n = len(snakemake.input.sample)
assert (
    n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."

if n == 1:
    reads = "in={}".format(snakemake.input.sample)
    trimmed = "out={}".format(snakemake.output.trimmed)
else:
    reads = "in={} in2={}".format(*snakemake.input.sample)
    trimmed = "out={} out2={}".format(*snakemake.output.trimmed)


singleton = snakemake.output.get("singleton", "")
if singleton:
    singleton = f"outs={singleton}"


discarded = snakemake.output.get("discarded", "")
if discarded:
    discarded = f"outm={discarded}"


stats = snakemake.output.get("stats", "")
if stats:
    stats = f"stats={stats}"


shell(
    "bbduk.sh {java_opts} t={snakemake.threads} "
    "{reads} "
    "{adapters} "
    "{extra} "
    "{trimmed} {singleton} {discarded} "
    "{stats} "
    "{log}"
)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
# Set this to False if multiqc should use the actual input directly
# instead of parsing the folders where the provided files are located
use_input_files_only = snakemake.params.get("use_input_files_only", False)

if not use_input_files_only:
    input_data = set(path.dirname(fp) for fp in snakemake.input)
else:
    input_data = set(snakemake.input)

output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {extra}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_data}"
    " {log}"
)


Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/jarekrzdbk/snakemake-homework
Name: snakemake-homework
Version: 1
Copyright: Public Domain
License: None