riboCleaner: rRNA Contamination Detection and Quantification Workflow for RNA-Seq Data

riboCleaner is an open-source workflow for detecting and quantifying rRNA contamination in RNA-Seq data. It has two main workflows: identify and quantify.

The identify workflow finds putative rDNA genes in a genome of interest. First, it uses barrnap to locate rDNA elements in the genome (FASTA) and bundles these elements into non-overlapping operons. It then compares these regions to the predicted gene models (GFF) and flags any gene whose coordinates overlap a predicted rDNA operon as a putative rDNA contaminant. Finally, the remaining genes are compared to the flagged rDNA genes with blastn, and any that pass a sequence identity and coverage threshold are also flagged as rDNA.
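
As a rough illustration of the homology step, here is a minimal Python sketch of the flagging criterion. The threshold values and the flag_rdna helper are hypothetical (the workflow reads its thresholds from the config), and this sketch is simplified: the actual implementation aggregates multiple HSPs per query/subject pair with a weighted percent identity, as shown in the Code Snippets section below.

PIDENT_MIN = 90.0  # hypothetical percent-identity threshold
QCOVS_MIN = 80.0   # hypothetical query-coverage threshold

def flag_rdna(blast_hits):
    """Flag queries whose hit to an rDNA model passes both thresholds.

    blast_hits: iterable of (query, pident, qcovs) tuples, e.g. parsed
    from tabular (-outfmt 6) blastn output against the rDNA models.
    """
    flagged = set()
    for query, pident, qcovs in blast_hits:
        if pident >= PIDENT_MIN and qcovs >= QCOVS_MIN:
            flagged.add(query)
    return flagged

# flag_rdna([("geneA", 98.2, 95.0), ("geneB", 75.0, 40.0)]) -> {"geneA"}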

The quantify workflow quantifies the abundance of rRNA transcripts in RNA-Seq samples. For this step, riboCleaner uses salmon to quantify transcript abundance in each sample (including the putative rDNA gene models). This lets reads derived from rRNA map to the rRNA transcripts and avoids biasing the counts of mRNA transcripts that share homology with rRNA reads. riboCleaner then quantifies the abundance of the rRNA transcripts to estimate rRNA prevalence in the RNA-Seq data and visualizes the results. As a final step, riboCleaner removes the rRNA transcripts from the salmon counts table to yield a table of mRNA abundances that can be normalized and used for downstream analysis.
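
To make the final renormalization concrete, here is a small worked example with made-up numbers showing how removing rDNA transcripts and rescaling recovers TPM values as if the rDNA had never been there (the same arithmetic as in the quantify snippets below):

import pandas as pd

# hypothetical TPM values for one sample; "rdna1" is a flagged rDNA model
counts = pd.Series({"geneA": 200000.0, "geneB": 300000.0, "rdna1": 500000.0})
rdna = {"rdna1"}

total = counts.sum()                      # TPM always sums to 1,000,000
counts = counts.drop(labels=list(rdna))   # remove the rDNA transcripts
counts = counts * (total / counts.sum())  # rescale: geneA -> 400000, geneB -> 600000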

This workflow diagram shows how the different parts of the riboCleaner standard workflow fit together. You can also read a more detailed description of the methods or see the Snakemake DAG.

[workflow diagram]

Quick Start

If you want to see the output riboCleaner produces, we have provided the report from running riboCleaner on a test dataset.

To jump right in and run riboCleaner as fast as possible:

# clone this repo
git clone https://github.com/basf/riboCleaner.git
cd riboCleaner/test_data

Then follow the instructions in the test data README to make sure everything is set up properly.

Once the workflow runs on the test data, replace the test data inputs with your own, delete the contents of the results directory, and rerun.

Otherwise, see our detailed documentation for installing and running riboCleaner.

Code Snippets

Snippet (lines 13-16): predict rDNA features in the reference genome with barrnap.
shell:
    """
    barrnap --kingdom {params.kingdom} --threads {threads} {input.reference} > {output.gff}
    """

Snippet (lines 32-35): merge nearby rDNA features into rDNA-rich regions with bedtools merge.
shell:
    """
    bedtools merge -c 1 -o count -d {params.nor_distance} -i {input.gff} | grep -v ^unmapped | awk -v c={params.nor_counts} '{{if ($4>c) print}}' > {output.nor}
    """

Snippet (lines 53-68): intersect gene models with the rDNA features and regions, keeping the configured GFF feature type.
shell:
    """
    bedtools intersect -a {input.gene_models} -b {input.barrnap} {input.nor} > {output.tmp}

    # grep will exit with a 1 if it finds no matches
    set +e
    grep {params.gff_feature} {output.tmp} > {output.rdna_models}
    exitcode=$?
    if [ $exitcode -eq 1 ]
    # no matches were found
    then
        exit 0
    else
        exit $exitcode
    fi
    """

Snippet (lines 82-110): convert the flagged gene models from GFF to BED.
run:
    # parse through the gff and extract the fields required for the BED
    # [chromosome, start, end, name]
    with open(input.gff, 'r') as IN, open(output.bed, 'w') as OUT:
        for line in IN:
            if line.startswith("#"):
                continue

            columns = line.strip().split("\t")

            chrom = columns[0]
            feature_type = columns[2]
            # GFF coordinates are 1-based inclusive; BED is 0-based half-open
            start = str(int(columns[3]) - 1)
            end = columns[4]

            # only grab the features we want
            if feature_type == params.feature:
                # get a string like "ID=Glyma.13G020400.1.Wm82.a2.v1;Name=Glyma.13G020400.1;pacid=30501300;longest=1;Parent=Glyma.13G020400.Wm82.a2.v1"
                attrs = columns[8]

                # tokenize attrs
                tokens = {attr.split("=")[0]: attr.split("=")[1] for attr in attrs.split(";")}
                try:
                    name = tokens[params.name]
                except KeyError:
                    raise ValueError("'{name}' missing in '{line}'".format(name=params.name, line=line))

                # write the line in the BED file
                OUT.write("\t".join((chrom, start, end, name)) + "\n")

Snippet (lines 120-121): extract the flagged model sequences with bedtools getfasta.
shell:
    "bedtools getfasta -name -fi {input.reference} -bed {input.models} -s > {output.fasta}"

Snippet (lines 136-137): collapse near-identical sequences with BBMap's dedupe.sh.
shell:
    "dedupe.sh minidentity={params.identity} in={input.fasta} out={output.deduped}"

Snippet (lines 149-194): extract gene names from the GFF/GTF attributes.
run:
    import re

    # gff/gtf have different ways to specify attributes
    gff_regex = re.compile(r'(?P<key>[^=]+) *= *(?P<value>[^;]+);? *')
    gtf_regex = re.compile(r'(?P<key>[^ ]+) *"(?P<value>[^;"]+)"?;? *')

    filetype = None
    names = set()
    with open(input.models, 'r') as IN:
        # read the first line to determine if we are using gff/gtf
        first_line = IN.readline()
        if first_line:
            attrs = first_line.split("\t")[8]
            if gff_regex.match(attrs):
                attr_regex = gff_regex
                name_key = "ID"
            elif gtf_regex.match(attrs):
                attr_regex = gtf_regex
                name_key = "gene_id"
            else:
                raise ValueError("Couldn't determine the filetype as either GFF or GTF")

        # reset to the beginning of the file
        IN.seek(0)

        for line in IN:
            # get a string like "ID=Glyma.13G020400.1.Wm82.a2.v1;Name=Glyma.13G020400.1;pacid=30501300;longest=1;Parent=Glyma.13G020400.Wm82.a2.v1"
            # or 'gene_id "FBgn0267517"; gene_symbol "5.8SrRNA-Psi:CR45857"';
            # see: http://uswest.ensembl.org/info/website/upload/gff.html
            attrs = line.split("\t")[8]

            for match in attr_regex.finditer(attrs):
                if match.group("key") == name_key:
                    names.add(match.group("value"))
                    break
            else:
                raise ValueError("Couldn't detect gene name in '{line}'".format(line=line))

    # write the names 1/line to output
    with open(output.names, 'w') as OUT:
        for n in names:
            OUT.write(n + "\n")

Snippet (lines 210-214): extract full transcript sequences with gffread.
shell:
    """ 
    # -w gets the full transcript (+ UTR, etc.)
    gffread -w {output.features} -g {input.fasta} {input.gff}
    """

Snippet (lines 227-228): BLAST the transcripts against the putative rDNA models.
shell:
    "blastn -query {input.query} -subject {input.subject} -outfmt '6 qseqid sseqid qstart qend pident qcovs' -out {output.hits} || touch {output.hits}"

Snippet (lines 243-282): flag additional genes whose BLAST hits pass the coverage and weighted percent-identity thresholds.
run:
    rdna = set()
    with open(input.blast, 'r') as IN:
        current_query = ""
        current_subject = ""
        current_weight_id = 0
        current_weight_n = 0
        current_cov = 0
        for line in IN:
            query, subject, start, end, pident, qcovs = line.strip().split("\t")

            # gather the HSPs to each subject for each query and check them against our thresholds
            if query != current_query or subject != current_subject:
                # check the coverage and weighted perc id of the hits
                if current_query != "" and float(current_cov) >= params.qcovs and current_weight_id / current_weight_n >= params.pident:
                    rdna.add(current_query)

                # reset
                current_query = query
                current_subject = subject
                current_weight_id = 0
                current_weight_n = 0
                current_cov = qcovs

            # add this hit, keeping a running weighted % identity across the HSPs in this query/subject pairing
            current_weight_id += float(pident) * abs(int(end) - int(start))
            current_weight_n += abs(int(end) - int(start))

        # evaluate the final query/subject pair, which the loop above never re-checks
        if current_weight_n and float(current_cov) >= params.qcovs and current_weight_id / current_weight_n >= params.pident:
            rdna.add(current_query)

    # subtract the known names
    barrnap_names = set()
    with open(input.barrnap_names, 'r') as IN:
        for line in IN:
            barrnap_names.add(line.strip())

    with open(output.names, 'w') as OUT:
        for n in rdna.difference(barrnap_names):
            OUT.write(n + "\n")

Snippet (lines 291-292): combine the barrnap-derived and homology-derived rDNA name lists.
shell:
    "cat {input.barrnap} {input.homology} > {output.names}"

Snippet (lines 301-302): extract the rDNA transcript sequences with BBMap's filterbyname.sh.
shell:
    "/opt/bbmap/filterbyname.sh include=t substring=name in={input.features} names={input.names} out={output.fasta}"        

Snippet (lines 316-377): summarize the identify results and emit warnings.
run:
    from collections import defaultdict
    # count non-partial rDNA predicted by barrnap
    barrnap_results = defaultdict(lambda: defaultdict(int))
    with open(input.barrnap_predictions, 'r') as IN:
        for line in IN:
            if not line.startswith("#"):
                # detect if this is partial
                partial = "partial" in line

                # quantify each rDNA type
                if "5S_rRNA" in line:
                    barrnap_results["5S"][partial] += 1
                elif "18S_rRNA" in line:
                    barrnap_results["18S"][partial] += 1
                elif "28S_rRNA" in line:
                    barrnap_results["28S"][partial] += 1
                elif "5_8S_rRNA" in line:
                    barrnap_results["5.8S"][partial] += 1
                # needs to be expanded for prokaryotes 
                else:
                    barrnap_results["other"][partial] += 1

    # count the rDNA regions
    rDNA_regions = 0
    with open(input.rdna_regions, 'r') as IN:
        for line in IN:
            rDNA_regions += 1

    # count the models added via BLAST homology
    homology_models = 0
    with open(input.homology_models, 'r') as IN:
        for line in IN:
            homology_models += 1

    # count the rDNA genes
    rDNA_gene_models = 0
    with open(input.gene_models, 'r') as IN:
        for line in IN:
            rDNA_gene_models += 1

    # output the summary
    with open(output.summary, 'w') as OUT:
        OUT.write("riboCleaner ran barrnap to detect the following counts of rDNA features:\n")

        for elem in ["18S", "5.8S", "28S", "5S", "other"]:
            OUT.write("    {elem}: {complete} complete, {partial} partial\n".format(elem=elem, complete=barrnap_results[elem][False], partial=barrnap_results[elem][True]))

        OUT.write("\n")
        OUT.write("These rDNA features were merged into {} rDNA-rich regions.\n".format(rDNA_regions))
        OUT.write("\n")
        OUT.write("The rDNA regions were found to overlap with {overlap} features with the `{tag}` tag in the GFF file.\n".format(overlap=rDNA_gene_models - homology_models, tag=params.gff_feature))
        OUT.write("Additionally, {homology} models were identified as having significant homology to these features for a total of {total} models flagged as putative rDNA.\n".format(homology=homology_models, total=rDNA_gene_models))

        OUT.write("\n")

        # begin the warning section
        if not rDNA_regions:
            OUT.write("WARNING: There were no rDNA regions detected. This may be because there are no rDNA sequences in your reference genome (detectable by barrnap) or because they do not form regular rDNA repeats. riboCleaner works best when it is able to detect rDNA regions.\n")

        if not rDNA_gene_models:
            OUT.write("WARNING: There were no rDNA gene models detected. This is often good news but means riboCleaner will detect no rDNA for this genome. However, since there are no false gene models originating from rDNA, your counts table resulting from standard transcriptome analysis won't be biased by false rDNA gene models!\n")

Snippet (lines 11-14): build a decoy-aware salmon index.
shell:
    """
    salmon index -p {threads} -t <(cat {input.features} {input.genome}) -d <(grep "^>" {input.genome} | cut -d " " -f 1 | sed 's/>//g') -i {output}
    """

Snippet (lines 30-35): quantify paired-end samples with salmon.
shell:
    """
    salmon quant --threads {threads} -i {input.index} --libType A -1 {input.r1} -2 {input.r2} -o {output.tmp} 2>&1 | tee {log}
    cp -v {output.tmp}/quant.sf {output.sf}
    cp -v {output.tmp}/aux_info/meta_info.json {output.meta_info}
    """

Snippet (lines 48-53): quantify single-end samples with salmon.
shell:
    """
    salmon quant --threads {threads} -i {input.index} --libType A -r {input.r1} -o {output.tmp} 2>&1 | tee {log}
    cp -v {output.tmp}/quant.sf {output.sf}
    cp -v {output.tmp}/aux_info/meta_info.json {output.meta_info}
    """

Snippet (lines 68-114): summarize per-sample rDNA abundance.
run:
    import pandas as pd
    import json

    # get the rDNA names
    with open(input.rdna_names, 'r') as IN:
        names = [line.strip() for line in IN]

    sample_summary = {}

    # gather the counts for each file
    all_counts = []
    for sf, meta_info, sample in zip(input.sf, input.meta_info, params.samples):
        with open(meta_info) as IN:
            info = json.load(IN)

        d = pd.read_csv(sf, sep="\t", header=0, index_col=0)
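        # use TPM so abundances are comparable across samples regardless of depth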
        counts = d["TPM"]

        counts.name = sample
        all_counts.append(counts)

        # get the sample summary info
        sample_summary[sample] = {
            'percent_unmapped': 100 - float(info['percent_mapped']),
            # scale percent rDNA to the mapped portion
            'percent_rDNA': float(info['percent_mapped']) * counts.filter(items=names).sum() / counts.sum()
            }

        # the usable percent is whatever is left over
        sample_summary[sample]['percent_mRNA'] = 100 - sample_summary[sample]['percent_unmapped'] - sample_summary[sample]['percent_rDNA']

    # and combine into one df
    all_counts = pd.DataFrame(all_counts).fillna(0).transpose()

    # make the index be just the gene names (not the position)
    all_counts.index = [g.split(":")[0] for g in all_counts.index]

    # get just the rdna
    all_counts_rdna = all_counts.filter(items=names, axis=0)
    all_counts_rdna.to_csv(output.rdna_abundance, sep="\t", index_label="sample")

    # done as is above to write out more details in the future, for now, just the TPM
    summary = pd.DataFrame.from_dict(sample_summary, orient="index").round(2)

    # write sorted by sample name
    summary.sort_index().to_csv(output.summary, sep="\t", index_label="sample", float_format="%.2f")

Snippet (lines 128-162): remove rDNA transcripts and rescale the TPM table.
run:
    import pandas as pd
    import json

    # get the rDNA names
    with open(input.rdna_names, 'r') as IN:
        rdna = {line.strip() for line in IN}  # use a set for O(1) membership tests below

    sample_summary = {}

    # gather the counts for each file
    all_counts = []
    for sf, sample in zip(input.sf, params.samples):
        d = pd.read_csv(sf, sep="\t", header=0, index_col=0)
        counts = d["TPM"]

        # get the total (including rDNA)
        total = counts.sum()

        # remove rDNA
        counts = counts.filter(items=[n for n in counts.index if n not in rdna])

        # rescale counts by the rDNA ratio (to get the TPM as if the rDNA was never there)
        counts = counts * (total / counts.sum())

        counts.name = sample
        all_counts.append(counts)

    # and combine into one df
    all_counts = pd.DataFrame(all_counts).fillna(0).transpose()

    # make the index be just the gene names (not the position)
    all_counts.index = [g.split(":")[0] for g in all_counts.index]

    all_counts.to_csv(output.clean_tpm, sep="\t", index_label="gene")

Snippet (lines 13-40): plot experiment- and sample-level read composition.
run:
    import pandas as pd
    from matplotlib import pyplot as plt

    summary = pd.read_csv(input.summary, sep="\t", index_col=0)[["percent_mRNA", "percent_rDNA", "percent_unmapped"]]
    summary.columns = ["mRNA", "rDNA", "unmapped"]

    # make the experiment level plot
    # adapted from: https://www.dataforeverybody.com/matplotlib-seaborn-pie-charts/
    overall_summary = summary.mean(axis=0)
    pie, ax = plt.subplots(figsize=[10,6])
    labels = overall_summary.keys()
    plt.pie(x=overall_summary, autopct="%.2f%%", explode=[0.05]*3, labels=labels, pctdistance=0.5)
    plt.title("Experiment Overview", fontsize=14)
    pie.savefig(output.experiment_summary)
    pie.savefig(output.experiment_summary_svg)

    # make the sample level plot
    # limited to 5 samples per inch of plot (and 3 inches for overhead)
    width = 3 + int(len(summary) / 5)
    summary.plot(kind="bar", stacked=True, figsize=(width, 8))
    plt.legend(loc='lower left', bbox_to_anchor=(1.0, 0.5))
    plt.xticks(rotation=45, ha="right")
    plt.title("Sample Composition")
    plt.ylabel("% of Reads")
    plt.tight_layout()
    plt.gcf().savefig(output.sample_summary)
    plt.gcf().savefig(output.sample_summary_svg)

Snippet (lines 96-109): print the workflow docstring and rule documentation.
run:
    # print the workflow docstring
    print("Snakefile: {p}".format(p=workflow.snakefile))
    if workflow.globals["__doc__"]:
        print(workflow.globals["__doc__"])

    # print each (public) rule
    print("\n##### RULES #####\n")
    for rule in workflow.rules:
        if not rule.name.startswith("_"):
            if rule.docstring:
                print(rule.name)
                print(rule.docstring)
                print("----------\n")

Snippet (lines 143-146): downsample paired-end reads with bbduk.sh.
shell:
    """
    bbduk.sh in={input.r1} in2={input.r2} out={output.r1} out2={output.r2} reads={params.reads} threads={threads}
    """

Snippet (lines 157-160): downsample single-end reads with bbduk.sh.
shell:
    """
    bbduk.sh in={input.r1} out={output.r1} reads={params.reads} threads={threads}
    """