Snakemake workflow: RNA-seq


This Snakemake workflow performs RNA-seq differential gene expression analysis: reads are concatenated and trimmed (BBDuk), quality-checked (FastQC), mapped to one or more reference genomes (bowtie2 or bwa), counted per feature (htseq-count), and tested for differential expression with DESeq2. Code lives in the respective folders, i.e. scripts, rules, and envs. The entry point of the workflow is defined in the Snakefile and the main configuration in the config.yaml file.

Authors

  • Konnor von Emster (@kve)

Usage

If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).

Step 1: Obtain a copy of this workflow

  1. Create a new GitHub repository using this workflow as a template.

  2. Clone the newly created repository to your local system, into the place where you want to perform the data analysis.

Step 2: Configure workflow

Configure the workflow according to your needs by editing the files in the config/ folder. Adjust config.yaml to configure the workflow execution, and samples.tsv to specify your sample setup.
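The scripts in this workflow read a handful of top-level keys from config.yaml. The sketch below is a rough, hypothetical illustration: the key names are taken from the scripts in this repository, but all values are placeholders, and the shipped example config is authoritative for the exact structure.

input:
  sample table: config/samples.tsv
  reference genomes:
    organismA:                              # one entry per reference genome
      annotation: resources/organismA.gff   # GFF3 annotation (eggNOG-style attributes)
      # plus the reference FASTA path (see the example config for the exact key)

feature_types_to_count:
  - gene

Comparisons:
  organismA:
    experiment1:
      Design: "~ treatment"                 # DESeq2 design formula
      # Optional: 'Included Samples' or 'Excluded Samples' lists to subset samples
      Comparisons:
        treated_vs_control:
          Factor: treatment
          Treatment: treated
          Control: control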

Sample table specification

Required Columns:

  • "treatment" –– specifies the treatment of the sample. This is the same within biological replicates.

  • "sample_name" –– specifies a sample identifier. This must be unique for each sample.

For Paired End Reads:

  • "fwd_read" –– specifies the path of the forward read.

  • "rev_read" –– specifies the path of the reverse read.

For Single End Reads:

  • "read_path" –– specifies the path of the single read.

Step 3: Install Snakemake

Install Snakemake using conda:

conda create -c bioconda -c conda-forge -n snakemake snakemake

For installation details, see the instructions in the Snakemake documentation.

Step 4: Execute workflow

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

or

snakemake --use-conda --drmaa --jobs 100

If you want to pin not only the software stack but also the underlying OS, use

snakemake --use-conda --use-singularity

in combination with any of the modes above. See the Snakemake documentation for further details.

Step 5: Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here.

Step 6: Commit changes

Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:

git commit -a
git push

Step 7: Obtain updates from upstream

Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.

  1. Once, register the upstream repository in your local copy: git remote add -f upstream git@github.com:snakemake-workflows/DGE-co-culture-workflow.git, or git remote add -f upstream https://github.com/snakemake-workflows/DGE-co-culture-workflow.git if you have not set up SSH keys.

  2. Update the upstream version: git fetch upstream.

  3. Create a diff with the current version: git diff HEAD upstream/master workflow > upstream-changes.diff.

  4. Investigate the changes: vim upstream-changes.diff.

  5. Apply the modified diff via: git apply upstream-changes.diff.

  6. Carefully check whether you need to update the config files: git diff HEAD upstream/master config. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.

Step 8: Contribute back

In case you have also changed or added steps, please consider contributing them back to the original repository:

  1. Fork the original repo to a personal or lab account.

  2. Clone the fork to your local system, to a different place than where you ran your analysis.

  3. Copy the modified files from your analysis to the clone of your fork, e.g., cp -r workflow path/to/fork. Make sure not to accidentally copy config file contents or sample sheets; instead, manually update the example config files if necessary.

  4. Commit and push your changes to your fork.

  5. Create a pull request against the original repository.

Testing

Test cases are in the subfolder .test. They are automatically executed via continuous integration with GitHub Actions, and can also be run locally.
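Assuming the .test directory is self-contained (its own config and data, as in the standard Snakemake workflow template), one hypothetical local invocation is:

snakemake --use-conda --cores 2 --directory .test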

Code Snippets

The following excerpts come from the workflow's rule definitions and scripts.

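# Count reads per feature with htseq-count: 'exon' features keyed by ID,
# reverse-stranded library, position-sorted alignments; alignments below the
# configured MAPQ are skipped and multi-mapping reads count toward all features.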
shell:
    "htseq-count --idattr=ID -a {params.mapping_quality_thresh} -s reverse -t exon -r pos --nonunique all "
    "{input.map_file} {input.gff} > {output}"
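# Run FastQC quality control on the input reads.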
shell:
    "fastqc {input}"
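# Move the FastQC HTML report and zip archive to the rule's declared outputs.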
shell:
    """
    mv {input.html} {output.html}
    mv {input.zipp} {output.zipp}
    """
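# Concatenate input files into a single output (the sleep likely lets
# filesystem timestamps settle).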
shell:
    "cat {input} > {output}; sleep 1"
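# The same concatenation, applied to a second set of inputs (e.g., mate reads).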
shell:
    "cat {input} > {output}; sleep 1"
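# Modify the reference GFF for downstream counting (scripts/gff_mod.py).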
script:
    "../scripts/gff_mod.py"
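# Align paired-end reads to the reference with bowtie2.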
shell:
    "bowtie2 -x {input.ref} -1 {input.r1} -2 {input.r2} -S {output} -p {resources.tasks}"
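# Build the bowtie2 index alongside the reference FASTA.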
shell:
    "bowtie2-build --threads {resources.tasks} {input.ref} {input.ref}"
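# Align paired-end reads with bwa mem, writing SAM to the output.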
shell:
    "bwa mem {input.ref} {input.r1} {input.r2} > {output}"
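# Build the BWA index for the reference FASTA.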
shell:
    "bwa index {input.ref}"
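# Resolve KEGG identifiers found in the annotations to human-readable names.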
script:
    "../scripts/generate_bio_db_ref_table.py"
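# Flatten GFF attributes from all reference genomes into one annotation table.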
script:
    "../scripts/generate_annotation_df.py"
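# Merge per-sample htseq-count outputs into count, metadata, and summary tables.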
script:
    "../scripts/generate_raw_counts_metadata_dfs.py"
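# Run DESeq2 per organism/experiment/comparison (VST, rlog, and results tables).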
script:
    "../scripts/generate_comparison_results.R"
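# Bundle counts, annotations, and DESeq2 results into one gzipped JSON.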
script:
    "../scripts/generate_data_json.py"
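# Adapter- and quality-trim paired reads with BBDuk: quality-trim both ends
# to Q10, drop reads shorter than 25 bp, and right-trim adapter k-mer matches
# (k=23, mink=11, hdist=1).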
shell:
    "bbduk.sh threads={resources.tasks} "
    "in1={input.r1} in2={input.r2} "
    "out1={output.o1} out2={output.o2} "
    "minlen=25 qtrim=rl trimq=10 "
    "ref={input.ref} ktrim=r k=23 mink=11 hdist=1"
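# Convert SAM to BAM.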
shell:
    "samtools view -@ {resources.tasks} -S -b {input} > {output}"
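# Coordinate-sort the BAM.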
shell:
    "samtools sort -@ {resources.tasks} {input} -o {output}"
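# Index the sorted BAM (BAI).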
shell:
    "samtools index -@ {resources.tasks} -b {input}"
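# Decompress the gzipped input in place.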
shell:
    "gzip -d {input}"
# Snakemake script: flatten GFF attributes from each reference genome into a
# single annotation table indexed by (organism, ID).
import pandas as pd
import logging as log

log.basicConfig(format='%(levelname)s:%(message)s', level=log.DEBUG)

def create_attr_dict(attr_row):
    """Parse a GFF3 attribute string ("key=value;key=value") into a dict."""
    log.debug(attr_row)
    d = {}
    for key_value_pair in attr_row.split(";"):
        k_v_list = key_value_pair.split(sep="=", maxsplit=1)
        if len(k_v_list) == 2:
            k, v = k_v_list
            d[k] = v
    return d

def generate_gff_df(gff_file):
    ## This method is heavily influenced by GFFpandas
    df = pd.read_csv(gff_file, sep="\t", comment="#",
            names=[
                "seq_id",
                "source",
                "type",
                "start",
                "end",
                "score",
                "strand",
                "phase",
                "attributes",
            ],
    )
    # Add one column per attribute key observed anywhere in the file.
    attr_dict_series = df.attributes.apply(create_attr_dict)
    key_set = set()
    attr_dict_series.apply(lambda at_dic: key_set.update(at_dic.keys()))
    for attr in sorted(key_set):
        df[attr] = attr_dict_series.apply(
            lambda attr_dict: attr_dict.get(attr)
        )
    return df

gffs = []
for genome, info in snakemake.config['input']['reference genomes'].items():
    attributes_df = generate_gff_df(info['annotation'])
    attributes_df['organism'] = genome
    gffs.append(attributes_df)
attributes_df = pd.concat(gffs)

# Keep only the feature types that are actually counted.
feature_types_to_keep = snakemake.config["feature_types_to_count"]
attributes_df = attributes_df[attributes_df['type'].isin(feature_types_to_keep)]

attributes_df = attributes_df.set_index(['organism', 'ID'])
attributes_df.to_csv(snakemake.output[0], sep='\t')
# Snakemake script: collect KEGG identifiers (BRITE, module, pathway, reaction,
# rclass) from eggNOG-style GFF attributes and resolve them to names via the
# KEGG REST API.
import pandas as pd
import logging as log
import urllib.error
import Bio.KEGG.REST as kegg_rest

log.basicConfig(format='%(levelname)s:%(message)s', level=log.INFO)

# Maps eggNOG attribute keys to the KEGG database they reference.
kegg_pathway_columns = {
    'em_BRITE': 'brite',
    'em_KEGG_Module': 'module',
    'em_KEGG_Pathway': 'pathway',
    'em_KEGG_Reaction': 'reaction',
    'em_KEGG_rclass': 'rclass',
}

kegg_ids = {db: set() for db in kegg_pathway_columns.values()}

# Harvest every KEGG identifier mentioned in any reference annotation.
for genome, info in snakemake.config['input']['reference genomes'].items():
    attributes = pd.read_table(info['annotation'], header=None, comment="#")[8].to_list()
    for gene in attributes:
        for annot in gene.split(";"):
            if "=" in annot:
                k, v = annot.split("=", maxsplit=1)
                if k in kegg_pathway_columns:
                    kegg_ids[kegg_pathway_columns[k]].update(v.split(","))

def ident_kegg_term(db, kegg_id):
    """Look up the human-readable name of a KEGG identifier; empty string if absent."""
    try:
        line = kegg_rest.kegg_find(db, kegg_id).readline()
        return line.split('\n')[0].split('\t')[1]
    except IndexError:
        log.error(f'problematic term {kegg_id}:\n{kegg_rest.kegg_find(db, kegg_id).readline()}')
        return ""
    except urllib.error.HTTPError as e:  # unlike URLError, HTTPError carries a readable response body
        return e.read().decode("utf8", 'ignore')

ref_table = []
for col, db in kegg_pathway_columns.items():
    s = kegg_ids[db] - {'-', 'nan'}  # drop placeholder values
    log.info(f"{col} with {len(s)} entries")
    for i in s:
        log.info(i)
        ref_table.append([col, db, i, ident_kegg_term(db, i)])

pd.DataFrame(ref_table, columns=['eggnog column', 'kegg hierarchy id', 'kegg id', 'kegg name']).to_csv(snakemake.output[0], sep='\t', index=False)
# Snakemake script: run DESeq2 for one organism/experiment/comparison and write
# VST counts, rlog counts, and annotated results tables.
library("DESeq2")

counts = read.csv(file=snakemake@input$counts, sep='\t')
annots = read.csv(file=snakemake@input$annotations, sep='\t')

organism = snakemake@wildcards$organism
experiment = snakemake@wildcards$experiment
comparison = snakemake@wildcards$comparison

exp_design = as.formula(snakemake@config[["Comparisons"]][[organism]][[experiment]]$Design)

sample_use_df = read.csv(file=snakemake@config$input[['sample table']], sep='\t')

if ('Included Samples' %in% names(snakemake@config[["Comparisons"]][[organism]][[experiment]])) {
    included_samples = snakemake@config[["Comparisons"]][[organism]][[experiment]][['Included Samples']]
    sample_use_df = subset(sample_use_df, sample %in% included_samples)
} else if ('Excluded Samples' %in% names(snakemake@config[["Comparisons"]][[organism]][[experiment]])) {
    excluded_samples = snakemake@config[["Comparisons"]][[organism]][[experiment]][['Excluded Samples']]
    sample_use_df = subset(sample_use_df, !sample %in% excluded_samples)
}

row.names(sample_use_df) <- sample_use_df$sample

counts_subset = subset(counts, organism==snakemake@wildcards$organism)
rownames(counts_subset) = counts_subset$ID
counts_subset = subset(counts_subset, select=rownames(sample_use_df))

annots_subset = subset(annots, organism==snakemake@wildcards$organism)
rownames(annots_subset) = annots_subset$ID

# Every organism must be detected in every sample; require a mean count of at
# least 1 per sample column as an empirically useful sanity check.
if (min(colMeans(counts_subset[sapply(counts_subset, is.numeric)], na.rm=TRUE)) < 1){
    stop('organism must be present in all samples')
}

dds = DESeqDataSetFromMatrix(countData = counts_subset, colData = sample_use_df, design = exp_design)
dds_processed = DESeq(dds)

vst_counts = assay(varianceStabilizingTransformation(dds_processed))
vst_counts = cbind(ID=rownames(vst_counts), vst_counts)
write.table(vst_counts, file=snakemake@output$vst, quote=FALSE, sep='\t', row.names=FALSE)
print("vst file exists:")
print(snakemake@output$vst)
print(file.exists(snakemake@output$vst))

rlog_counts = assay(rlog(dds_processed))
rlog_counts = cbind(ID=rownames(rlog_counts), rlog_counts)
write.table(rlog_counts, file=snakemake@output$rlog, quote=FALSE, sep='\t', row.names=FALSE)
print("rlog file exists:")
print(snakemake@output$rlog)
print(file.exists(snakemake@output$rlog))

comparison_factor = snakemake@config[["Comparisons"]][[organism]][[experiment]][['Comparisons']][[comparison]][['Factor']]
comparison_treatment = snakemake@config[["Comparisons"]][[organism]][[experiment]][['Comparisons']][[comparison]][['Treatment']]
comparison_control = snakemake@config[["Comparisons"]][[organism]][[experiment]][['Comparisons']][[comparison]][['Control']]
comparison_results = as.data.frame(results(dds_processed, contrast=c(comparison_factor, comparison_treatment, comparison_control)))
comparison_results = cbind(ID=rownames(comparison_results), symlog10baseMean = with(comparison_results, log10(baseMean + 1)), comparison_results)
write.table(comparison_results, file=snakemake@output$results, quote=FALSE, sep='\t', row.names=FALSE)
print("results file exists:")
print(snakemake@output$results)
print(file.exists(snakemake@output$results))

results_w_annot = merge(comparison_results, annots_subset, by=0)
write.table(results_w_annot, file=snakemake@output$results_w_annot, quote=FALSE, sep='\t', row.names=FALSE)

print("results_w_annot file exists:")
print(snakemake@output$results_w_annot)
print(file.exists(snakemake@output$results_w_annot))
# Snakemake script: bundle counts, annotations, mapping metadata, and DESeq2
# results into a single gzipped JSON document.
from pathlib import Path
import pandas as pd
import logging as log
import json
import gzip

log.basicConfig(format='%(levelname)s:%(message)s', level=log.INFO)

# Create (or truncate) the output file up front.
open(snakemake.output[0], 'w').close()

mapping_metadata = pd.read_table(snakemake.input['mapping_metadata'], index_col=0)
organism_occurance = pd.read_table(snakemake.input['organism_occurance'], index_col='organism')
gene_sparsity = pd.read_table(snakemake.input['gene_sparsity'], index_col='organism')
samples = pd.read_table(snakemake.config["input"]["sample table"], index_col="sample")
annotations = pd.read_table(snakemake.input['annotations'], index_col=['organism', 'ID'])
raw_counts = pd.read_table(snakemake.input['counts'], index_col=['organism', 'ID'])
bio_ref_df = pd.read_table(snakemake.input['bio_df_ref'])

data = {
    'metadata': {
        'mapping quality': mapping_metadata.to_dict(),
        'organism occurance': organism_occurance.to_dict(),
        'gene sparsity': gene_sparsity.to_dict(),
    },
    'samples' : samples.to_dict(),
    'config' : snakemake.config,
    'annotations': {
        org: annotations.loc[org].to_dict()
        for org in annotations.index.get_level_values('organism').unique()
    },
    'raw_counts': {
        org: raw_counts.loc[org].to_dict()
        for org in raw_counts.index.get_level_values('organism').unique()
    },
    'id ref table': bio_ref_df.to_dict(),
    'deseq2' : {}
}

# DESeq2 outputs live at .../{organism}/{experiment}/{comparison}/<file>, so
# recover the grouping keys from the path.
for rlog, vst, result in zip(snakemake.input['rlogs'], snakemake.input['vsts'], snakemake.input['deseq2_results']):
    assert Path(rlog).parent == Path(vst).parent == Path(result).parent
    org = Path(rlog).parent.parent.parent.name
    exp = Path(rlog).parent.parent.name
    comp = Path(rlog).parent.name
    if org not in data['deseq2'].keys():
        data['deseq2'][org] = {}
    if exp not in data['deseq2'][org].keys():
        data['deseq2'][org][exp] = {}
    data['deseq2'][org][exp][comp] = {
        'rlog' : pd.read_table(rlog, index_col='ID').to_dict(),
        'vst' : pd.read_table(vst, index_col='ID').to_dict(),
        'results' : pd.read_table(result, index_col='ID').to_dict()
    }

with gzip.open(snakemake.output[0], 'w') as fout:
    fout.write(json.dumps(data).encode('utf-8'))  
# Snakemake script: merge per-sample htseq-count outputs into count and
# metadata tables, plus per-organism occurrence and gene-sparsity summaries.
from pathlib import Path
import pandas as pd
import logging as log

log.basicConfig(format='%(levelname)s:%(message)s', level=log.DEBUG)

attributes_df = pd.read_table(snakemake.input['annotations'], index_col=[0,1])

raw_counts_df_list = []
raw_metadata_df_list = []
reference_index = None

for path in [Path(p) for p in snakemake.input['counts']]:
    sample_name = path.stem
    temp_df = pd.read_table(path, names=['ID', sample_name])
    temp_df = temp_df.set_index('ID')

    # Every htseq-count output must share the same feature index.
    if isinstance(reference_index, pd.Index):
        assert reference_index.equals(temp_df.index)
    else:
        reference_index = temp_df.index

    # htseq-count appends summary rows (e.g. __no_feature, __ambiguous) whose
    # names contain '__'; split those out as mapping metadata.
    temp_metadata_df = temp_df[temp_df.index.str.contains('__')]
    temp_counts_df = temp_df[~temp_df.index.str.contains('__')]

    raw_counts_df_list.append(temp_counts_df)
    raw_metadata_df_list.append(temp_metadata_df)

counts_df = pd.concat(raw_counts_df_list, axis=1)
metadata_df = pd.concat(raw_metadata_df_list, axis=1)

metadata_df.index = metadata_df.index.str.replace('__', '')

counts_df = counts_df.loc[attributes_df.index.get_level_values('ID')]
counts_df.index = attributes_df.index

# Per-feature-type count totals join the htseq summary rows as mapping metadata.
feature_df = counts_df.join(attributes_df['type']).groupby(['type']).sum()
metadata_df = pd.concat([feature_df, metadata_df])

counts_df.to_csv(Path(snakemake.output["counts"]), sep="\t")
metadata_df.to_csv(Path(snakemake.output["mapping_metadata"]), sep='\t')
counts_df.join(attributes_df).to_csv(Path(snakemake.output["counts_w_annotations"]), sep="\t")

# Fraction of all counted reads assigned to each organism, per sample.
organism_occurance = counts_df.groupby(level='organism').sum().div(counts_df.sum())
organism_occurance.to_csv(Path(snakemake.output["organism_occurance"]), sep='\t')

# Fraction of each organism's genes with at least one read, per sample.
sparsity_df = (counts_df >= 1).groupby(level='organism').sum().div(counts_df.index.get_level_values('organism').value_counts(), axis='index')
sparsity_df.index.name = 'organism'
sparsity_df.to_csv(Path(snakemake.output["gene_sparsity"]), sep="\t")
# Snakemake script: keep only the configured feature types in the GFF and
# relabel them as 'exon' so htseq-count (run with -t exon) counts all of them.
import pandas as pd

feature_types_to_keep = snakemake.config["feature_types_to_count"]

annotation = pd.read_table(snakemake.input[0], header=None, comment='#')

# Column 2 of a GFF is the feature type.
annotation = annotation[annotation[2].isin(feature_types_to_keep)]
annotation[2] = 'exon'

annotation.to_csv(snakemake.output[0], sep='\t', index=False, header=False)


Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/konnorve/DGE-snakemake
Name: dge-snakemake
Version: 1

Copyright: Public Domain
License: MIT License
