RADSeq tool with Snakemake workflow integration for analysis of RAD sequencing data.


NodeRAD is a Snakemake workflow for the analysis of single-end reads from RAD sequencing without a reference genome. It detects loci and genomic variants using sequencing error and heterozygosity rates. For more information, please have a look at the associated bachelor thesis.

Note: Currently the workflow is limited to diploid species.

Authors

  • Antonie Vietor (@AntonieV)

This workflow is part of a bachelor thesis. The topic and the underlying model were developed by:

  • Johannes Köster (@johanneskoester), https://koesterlab.github.io

Usage

If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).

Step 1: Obtain a copy of this workflow

  1. Create a new GitHub repository using this workflow as a template.

  2. Clone the newly created repository to your local system, into the place where you want to perform the data analysis.

Step 2: Configure workflow

Configure the workflow according to your needs by editing the files in the config/ folder. Adjust config.yaml to configure the workflow execution, and samples.tsv to specify your sample setup.

Step 3: Install Snakemake

  1. Install conda.

  2. Install mamba in conda base and create a snakemake conda environment:

 conda install -n base -c conda-forge mamba
 conda create -c bioconda -c conda-forge -n snakemake snakemake

For installation details, see the instructions in the Snakemake documentation.

Step 4: Execute workflow

You can run the workflow on some example data through the script start.sh. To use your own data, change the paths in config/config.yaml for samples, fastq-data and eval-data (the latter only if there is data from a ddRAGE simulation).
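
The three keys named above come straight from this README; how their values are laid out in config/config.yaml is an assumption here. A minimal sketch (not part of the workflow) to inspect the configured paths before a run:

# Hedged sketch: print the paths referenced in config/config.yaml.
# The key names ("samples", "fastq-data", "eval-data") are taken from this
# README; they may be nested differently in the actual config file.
import yaml                      # pyyaml
from pathlib import Path

with open("config/config.yaml") as f:
    config = yaml.safe_load(f)

for key in ("samples", "fastq-data", "eval-data"):
    value = config.get(key, "<missing or nested under another key>")
    print(f"{key}: {value}")
    if isinstance(value, str) and not Path(value).exists():
        print(f"  warning: path {value!r} does not exist")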

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake --use-conda --conda-frontend mamba -n

Execute the workflow locally via

snakemake --use-conda --conda-frontend mamba --cores $N

using $N cores or run it in a cluster environment via

snakemake --use-conda --conda-frontend mamba --cluster qsub --jobs 100

or

snakemake --use-conda --conda-frontend mamba --drmaa --jobs 100

If you want to fix not only the software stack but also the underlying OS, use

snakemake --use-conda --conda-frontend mamba --use-singularity

in combination with any of the modes above. See the Snakemake documentation for further details.

Step 5: Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here.
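
Besides the report, the workflow writes a simple tab-separated, per-sample VCF (see the noderad_main.py and vcf_to_fasta.py snippets below). A minimal sketch for loading one of these files, with a hypothetical output path:

# Hedged sketch: load a per-sample VCF produced by the workflow into pandas,
# the same way scripts/vcf_to_fasta.py reads it. The path below is an assumption.
import pandas as pd

vcf = pd.read_csv("results/vcf/sample1.vcf", sep="\t")
print(vcf[["#CHROM", "POS", "REF", "ALT"]].head())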

Step 6: Commit changes

Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:

git commit -a
git push

Step 7: Obtain updates from upstream

Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.

  1. Once, register the upstream repository in your local copy: git remote add -f upstream git@github.com:TharjaX/NodeRAD.git or, if you have not set up SSH keys, git remote add -f upstream https://github.com/TharjaX/NodeRAD.git.

  2. Update the upstream version: git fetch upstream.

  3. Create a diff with the current version: git diff HEAD upstream/master workflow > upstream-changes.diff.

  4. Investigate the changes: vim upstream-changes.diff.

  5. Apply the modified diff via: git apply upstream-changes.diff.

  6. Carefully check whether you need to update the config files: git diff HEAD upstream/master config. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.

Step 8: Contribute back

In case you have also changed or added steps, please consider contributing them back to the original repository:

  1. Fork the original repo to a personal or lab account.

  2. Clone the fork to your local system, to a different place than where you ran your analysis.

  3. Copy the modified files from your analysis to the clone of your fork, e.g., cp -r workflow path/to/fork. Make sure not to accidentally copy config file contents or sample sheets. Instead, manually update the example config files if necessary.

  4. Commit and push your changes to your fork.

  5. Create a pull request against the original repository.

Testing

More test cases are in the subfolder .test. To run the workflow with one of the test data sets, adjust the paths for samples, fastq-data and eval-data in .test/config/config.yaml to the desired test data set and run the script start_test.sh. Note that for the different data sets you may have to adjust the threshold values in the configuration file. Test cases are also executed automatically via continuous integration with GitHub Actions.

Code Snippets
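
Each snippet below is the tail of a Snakemake rule (a script, shell or wrapper directive), followed by the full scripts and wrapper sources. As a hedged illustration of how such a directive is embedded in a rule (the rule name, file paths and log are assumptions, not copied from the repository):

# Illustrative sketch only: how a wrapper directive from the list below sits
# inside a Snakemake rule. Rule name, paths and log file are assumptions.
rule samtools_faidx:
    input:
        "results/simulated/{sample}.fasta",
    output:
        "results/simulated/{sample}.fasta.fai",
    log:
        "logs/samtools_faidx/{sample}.log",
    wrapper:
        "v1.25.0/bio/samtools/faidx"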

The directives used by the workflow's rules are:

  • script: "../scripts/sim_to_fasta.py"
  • script: "../scripts/vcf_to_fasta.py"
  • wrapper: "v1.25.0/bio/samtools/faidx"
  • shell: "makeblastdb -in {input.fasta_sim} -input_type fasta -dbtype nucl -parse_seqids"
  • shell: "blastn -query {input.res} -outfmt 6 -db {params.dbname} -out {output.blast}"
  • shell: "echo -ne \"{params.header}\" | cat - {input} > {output}"
  • script: "../scripts/plots_blast.R"
  • wrapper: "v1.25.0/bio/minimap2/index"
  • wrapper: "v1.25.0/bio/minimap2/aligner"
  • wrapper: "v1.25.0/bio/samtools/view"
  • script: "../scripts/noderad_main.py"
  • wrapper: "v1.25.0/bio/fastqc"
  • wrapper: "v1.25.0/bio/multiqc"

scripts/noderad_main.py:

import sys
import gzip
import pysam
from graph_tool.all import *
import numpy as np
import graph_operations
import likelihood_operations

sys.stderr = open(snakemake.log[0], "w")

sample = snakemake.wildcards.get('sample')

# input
bam = snakemake.input.get("bam")
reads = snakemake.input.get("fastq")

# required output
vcf = open(snakemake.output.get("vcf", ""), "w")
vcf_header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t{}\n".format(sample)
vcf.write(vcf_header)

# optional output
graph_xml = snakemake.output.get("graph_xml", "")
output_figure = snakemake.output.get("graph_figure", "")
connected_components_xml = snakemake.output.get("connected_components_xml", "")
connected_components_figure = snakemake.output.get("connected_components_figure", "")
dir_subgraphs = snakemake.output.get("components_subgraphs", "")

# params for ploidy, threshold, noise and cluster-size
threshold = snakemake.params.get("threshold_max_edit_distance", "")
ploidy = snakemake.params.get("ploidy", "")  # a diploid chromosome set is determined for the prototype
noise_small = snakemake.params.get("treshold_seq_noise_small", "")
noise_large = snakemake.params.get("treshold_seq_noise_large", "")
cluster_size = snakemake.params.get("treshold_cluster_size", "")

# get reads format
format = ""  # format of reads
if reads.endswith((".fastq", ".fq", ".fastq.gz", ".fq.gz")):
    format = "fastq"
if reads.endswith((".fasta", ".fa", ".fasta.gz", ".fa.gz")):
    format = "fasta"

# default value of threshold for maximum distance value
if threshold == "":
    threshold = 23
else:
    threshold = int(threshold)

# default value of noise
if not noise_small:
    noise_small = 1
if not noise_large:
    noise_large = 1

# init graph
graph = Graph(directed=True)

# set graph properties
for (name, prop, prop_type) in graph_operations.set_properties():
    if name.startswith("g_"):
        vars()[name] = graph.new_graph_property(prop_type)
        graph.graph_properties[prop] = vars()[name]
    if name.startswith("v_"):
        vars()[name] = graph.new_vertex_property(prop_type)
        graph.vertex_properties[prop] = vars()[name]
    if name.startswith("e_"):
        vars()[name] = graph.new_edge_property(prop_type)
        graph.edge_properties[prop] = vars()[name]

# set ploidy, sequencing error and heterozygosity
graph.graph_properties["ploidy"] = ploidy

graph.graph_properties["ins-error-rates"] = snakemake.params.get("err_ins", "")
graph.graph_properties["del-error-rates"] = snakemake.params.get("err_del", "")

graph.graph_properties["subst-heterozygosity"] = snakemake.params.get("heterozyg_subst", "")
graph.graph_properties["ins-heterozygosity"] = snakemake.params.get("heterozyg_ins", "")
graph.graph_properties["del-heterozygosity"] = snakemake.params.get("heterozyg_del", "")

# create first empty node for graph
node = graph.add_vertex()
v_id[node] = "{idx}_{sample}".format(idx=0, sample=sample)
v_name[node] = ""
v_seq[node] = ""
v_q_qual[node] = ""

# add reads as vertices of the graph
if reads.endswith(".gz"):
    with gzip.open(reads, "rt") as _reads:
        graph = graph_operations.set_nodes(graph, _reads, format, sample)
else:
    with open(reads, "r") as _reads:
        graph = graph_operations.set_nodes(graph, _reads, format, sample)

# add edges from all-vs-all alignment of reads (please see rule minimap2)
verbose = pysam.set_verbosity(0)  # https://github.com/pysam-developers/pysam/issues/939
bam = pysam.AlignmentFile(bam, "rb")
pysam.set_verbosity(verbose)
for read in bam.fetch(until_eof=True):
    graph = graph_operations.set_edges(graph, read, threshold)
bam.close()
graph.remove_vertex(0)


# write log files
sys.stderr.write(
    "graph construction summary for sample {}:"
    "\n nodes:\t{}\n edges:\t{}\n".format(sample, graph.num_vertices(), graph.num_edges()))

graph_operations.save_and_draw_graph(graph, xml_out=graph_xml,
                                     figure_out=output_figure)

# In case the subgraph directory is expected as optional output but some samples do not produce
# subgraphs, this preserves the empty directories for those samples so they are not removed by snakemake.
graph_operations.set_dir(dir_subgraphs)

# step 1: extract connected components
message = "CONNECTED COMPONENTS based on the graph construction from the edit distances (minimap2)"
connected_components = graph_operations.get_components(graph, message, snakemake.wildcards.get('sample'),
                                                       dir_subgraphs, connected_components_xml,
                                                       connected_components_figure, v_color="component-label")

graph = None
loc_nr = 0
for (comp, comp_nr) in connected_components:
    # step 2: likelihood of allele fractions
    alleles = likelihood_operations.get_candidate_alleles(comp, comp.vertices(), noise_small, noise_large, cluster_size)
    n = len(alleles)
    vafs_candidates = list(likelihood_operations.get_candidate_vafs(n, ploidy))
    nodes = list([comp.vertex_index[node] for node in comp.vertices()])
    read_allele_likelihoods = {}

    # calculate the likelihood over ALL reads
    vafs_likelihoods = [likelihood_operations.calc_vafs_likelihood(comp, vafs, nodes, alleles, read_allele_likelihoods) for vafs in
                        vafs_candidates]
    if not vafs_likelihoods:  # case empty list, e.g. if the treshold-seq-noise value is set too large
        continue
    max_likelihood_idx = np.argmax(vafs_likelihoods)

    # obtain ML solution for step 2
    max_likelihood_vafs = vafs_candidates[max_likelihood_idx]

    # write to log file
    sys.stderr.write(
        "\n\nStats for component {} in sample {} with {} alleles and ploidy = {}:\n".format(comp_nr, sample,
                                                                                            n, ploidy))
    sys.stderr.write("\n\tMaximum vafs likelihood:\n")
    for vaf, allele in zip(max_likelihood_vafs, alleles):
        sys.stderr.write("\t\t{} for allele {}\n".format(vaf, allele))

    # step 3: likelihood of loci given alleles and fractions
    loci_candidates = list(likelihood_operations.get_candidate_loci(n, ploidy, max_likelihood_vafs))
    loci_likelihoods = {}

    loci_likelihoods = [
        likelihood_operations.calc_loci_likelihoods(comp, max_likelihood_vafs, alleles, loci, loci_likelihoods)
        for loci in loci_candidates
    ]
    max_likelihood_idx = np.argmax(loci_likelihoods)
    max_likelihood_loci = loci_candidates[max_likelihood_idx]

    # write to log file
    sys.stderr.write("\n\tMaximum loci likelihood calculations:\n")
    sys.stderr.write("\t\tloci_likelihoods:\n\t\t\t{}\n\t\tmax_likelihood_idx:\n\t\t\t{}"
                     "\n\t\tmax_likelihood_loci:\n\t\t\t{}\n".format(loci_likelihoods, max_likelihood_idx,
                                                                     max_likelihood_loci))

    # step 4: results output to VCF file
    loci_alleles = likelihood_operations.get_sorted_loci_alleles(alleles, max_likelihood_loci)
    gt_indices = likelihood_operations.get_gt_indices(alleles, max_likelihood_loci, loci_alleles)

    for gt_idx_locus in list(set(gt_indices)):
        vcf.write("{chrom}\t{pos}\t.\t{ref}\t{alt}\t.\t.\t.\tGT\t{gt}\n".format(chrom="LOC{}".format(loc_nr),
                                                                             pos="1", ref=loci_alleles[0],
                                                                             alt=', '.join(loci_alleles[1:]) if len(loci_alleles) > 1 else ".",
                                                                             gt=likelihood_operations.get_genotype(gt_idx_locus)))
        loc_nr += 1

scripts/plots_blast.R:

library("tidyverse")
library("stringr")

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

blast <- snakemake@input[[1]]
blast_data <- read_tsv(blast, col_names = TRUE, trim_ws = TRUE)
names(blast_data) <- c('res_id','sim_id','identity', 'len_alignment', 'mismatches', 'gap_opens', 'q_start', 'q_end', 's_start', 's_end', 'evalue', 'bit_score')

blast_data <- blast_data %>%
  mutate(res_id=as.character(res_id)) %>%
  separate(res_id, c("sample", "individual", "locus"), sep = "\\|", extra="merge", convert=TRUE) #%>%

for(i in blast_data$locus) {
  if(nchar(strsplit(i, "-")[[1]][1])<5){
    blast_data$locus[blast_data$locus==i] <- str_replace(blast_data$locus[blast_data$locus==i], 'LOC', 'LOC0')
  }
}

plot <- ggplot(data = blast_data, aes(y=locus, x=identity)) +
  geom_bar(width = 1.0, position = "dodge", stat="identity", aes(fill = identity), colour="Black") +
  scale_fill_gradient(low="mediumpurple3",high="green3", name = "Identity") +
  ggtitle("Identity [%] of loci identified by NodeRAD vs. simulated loci") +
  xlab("Identity [%]") +
  ylab("Locus") +
  theme_minimal() +
  theme(aspect.ratio = 2.5/1.5, plot.title = element_text(hjust = 0.5), legend.position = "right", legend.key.size = unit(0.8, "cm"), axis.text.y = element_text(hjust = 0))
plot
ggsave(snakemake@output[["ident"]], width = 7, height = 7)

blast_data$intervals <- cut(blast_data$identity, seq(0,100,by=1))
plot <- ggplot(blast_data, aes(intervals)) +
    geom_histogram(stat= "count", aes(fill = ..count..)) +
    xlab("Identity [%]") +
    ylab("Counts") +
    scale_fill_gradient(name = "Counts")+
    theme(aspect.ratio = 2.5/1.5, plot.title = element_text(hjust = 0.5), legend.position = "right") +
    ggtitle("Histogram of number of alleles identified by NodeRAD\nwith respect to their similarity to simulated data.")
plot
ggsave(snakemake@output[["ident_hist"]], width = 7, height = 7)

plot <- ggplot(data = blast_data, aes(x=locus, y=bit_score, group = individual)) +
  geom_line(color = "gray70") +
  geom_point(aes(color = bit_score), size =3) +
  geom_point(shape = 1,size = 3, colour = "black") +
  scale_color_gradient(low="red", high="green", name = "Bit score") +
  ggtitle("Bitscores of loci identified by NodeRAD vs. simulated loci") +
  xlab("Locus") +
  ylab("Bit score") +
  theme_minimal() +
  theme(axis.text.x = element_text(color = "black", size = 7, angle = 90, hjust = 0, face = "plain"), plot.title = element_text(hjust = 0.5), legend.position = "right", legend.key.size = unit(0.4, "cm"), axis.text.y = element_text(hjust = 0))
plot
ggsave(snakemake@output[["bit_scores"]], width = 7, height = 7)

plot <- ggplot(data = blast_data, aes(x=locus, y=evalue, group = individual)) +
  geom_line(color = "gray70") +
  geom_point(aes(color = evalue), size =3) +
  geom_point(shape = 1,size = 3, colour = "black") +
  scale_color_gradient(low="green", high="red", name = "E-value") +
  ggtitle("E-Values of loci identified by NodeRAD vs. simulated loci") +
  xlab("Locus") +
  ylab("E-Value") +
  theme_minimal() +
  theme(axis.text.x = element_text(color = "black", size = 7, angle = 90, hjust = 0, face = "plain"), plot.title = element_text(hjust = 0.5), legend.position = "right", legend.key.size = unit(0.4, "cm"), axis.text.y = element_text(hjust = 0))
plot
ggsave(snakemake@output[["evalues"]], width = 7, height = 7)

scripts/sim_to_fasta.py:

import sys
import yaml
from itertools import chain
import re

sys.stderr = open(snakemake.log[0], "w")

sample = snakemake.wildcards.get('sample')

# input
sim_in = snakemake.input.get("sim_data_stats")
individual_name=snakemake.params.get("individual")
individual = snakemake.params.get("individual").replace("_", " ")

fasta = ""

# read yaml file
yaml_in = open(sim_in, 'r')
yaml_in_load = list(yaml.load_all(yaml_in, Loader=yaml.FullLoader))
sim_data_file = yaml_in_load[1]

# parse fasta data from yaml
for key_loc in sim_data_file.keys():
    # loci sequences
    seq = sim_data_file[key_loc]["p5 seq"]

    # loci mutations of given individual
    indiv_data = sim_data_file[key_loc]["individuals"][individual]
    mutations = list([indiv_data[item]["mutations"] for item in indiv_data])
    # modifying the sequence when mutations are present for given individual and locus
    index = 0
    if mutations:
        for mutation in mutations:
            if mutation:
                mut_list = [re.split("@", item)[1] for item in mutation]
                mut_tuples = [tuple(re.split(":", mut)) for mut in mut_list]
                for (pos, mut) in mut_tuples:
                    if '(' in pos or ')' in pos:
                        pos = re.split("\(", re.split("\)", pos)[0])[-1]
                    idx = int(pos)  # nucleotide position indexing in ddRAGE starts at 0,
                    # but it should start at position 1 https://varnomen.hgvs.org/bg-material/numbering/
                    if ">" in mut:
                        mut_subs = re.split(">", mut)
                        if seq[idx] == mut_subs[0]:
                            seq = seq[:idx] + mut_subs[-1] + seq[idx + 1:]
                    if "+" in mut:
                        mut_subs = re.split('\+', mut)[-1]
                        seq = seq[:idx] + mut_subs + seq[idx:]
                    if "-" in mut:
                        mut_subs = re.split('\-', mut)[-1]
                        if all([seq[idx] == mut_subs[i] for i in range(len(mut_subs))]):
                            seq = seq[:idx] + seq[idx + len(mut_subs):]
            # parsing to fasta format
            fasta_id = ">{}|{}-{}|simulated".format(individual_name, key_loc.replace(" ", "_"), index)
            fasta += "{}\n{}\n".format(fasta_id, seq)
            index += 1
    else:
        # parsing to fasta format
        fasta_id = ">{}|{}|simulated".format(individual_name, key_loc.replace(" ", "_"))
        fasta += "{}\n{}\n".format(fasta_id, seq)
# write fasta file
fasta_file = open(snakemake.output.get("fasta_sim", ""), "w")
fasta_file.write(fasta)
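
For orientation, the edits handled above come in three forms inferred from this parsing code: substitutions ("ref>alt"), insertions ("+seq") and deletions ("-seq"), each attached to a 0-based position. A small self-contained sketch (the toy sequence and mutation strings are made up for illustration):

# Illustrative sketch only: apply the three ddRAGE-style edit types handled in
# the script above to a toy sequence. Example strings are assumptions.
def apply_edit(seq, pos, mut):
    if ">" in mut:                           # substitution, e.g. "C>G"
        ref, alt = mut.split(">")
        return seq[:pos] + alt + seq[pos + 1:] if seq[pos] == ref else seq
    if "+" in mut:                           # insertion, e.g. "+TT"
        return seq[:pos] + mut.split("+")[-1] + seq[pos:]
    if "-" in mut:                           # deletion, e.g. "-TA"
        removed = mut.split("-")[-1]
        return seq[:pos] + seq[pos + len(removed):]
    return seq

seq = "ACGTACGT"
print(apply_edit(seq, 1, "C>G"))   # substitution -> AGGTACGT
print(apply_edit(seq, 2, "+TT"))   # insertion    -> ACTTGTACGT
print(apply_edit(seq, 3, "-TA"))   # deletion     -> ACGCGT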

scripts/vcf_to_fasta.py:

import sys
import pandas as pd

sys.stderr = open(snakemake.log[0], "w")

sample = snakemake.wildcards.get('sample')

# input
vcf_in = pd.read_csv(snakemake.input.get("vcf_data"), sep="\t")[["#CHROM", "REF", "ALT"]]

individual = snakemake.params.get("individual").replace(" ", "_")

prefix = "{sample}|{indiv}|".format(sample=sample,indiv=individual)
fasta = ""

for i in range(len(vcf_in["#CHROM"])):
    name = vcf_in["#CHROM"][i]
    fasta += ">{prefix}{loc}-REF\n{ref}\n".format(prefix=prefix, loc=name, ref=vcf_in["REF"][i])
    variants = str(vcf_in["ALT"][i]).split(",")
    if not variants[0] == ".":
        idx = 1
        for var in variants:
            fasta += ">{prefix}{loc}-ALT_{idx}\n{alt}\n".format(prefix=prefix, loc=name, idx=idx, alt=var)
            idx += 1

# write fasta file
fasta_file = open(snakemake.output.get("vcf_fasta", ""), "w")
fasta_file.write(fasta)

wrapper v1.25.0/bio/fastqc:

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)

    return base


# Run fastqc. Since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")

wrapper v1.25.0/bio/minimap2/aligner:

__author__ = "Tom Poorten"
__copyright__ = "Copyright 2017, Tom Poorten"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import infer_out_format
from snakemake_wrapper_utils.samtools import get_samtools_opts


samtools_opts = get_samtools_opts(snakemake, parse_output=False)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
sort = snakemake.params.get("sorting", "none")
sort_extra = snakemake.params.get("sort_extra", "")

out_ext = infer_out_format(snakemake.output[0])

pipe_cmd = ""
if out_ext != "PAF":
    # Add option for SAM output
    extra += " -a"

    # Determine which pipe command to use for converting to bam or sorting.
    if sort == "none":
        if out_ext != "SAM":
            # Simply convert to output format using samtools view.
            pipe_cmd = f"| samtools view -h {samtools_opts}"

    elif sort in ["coordinate", "queryname"]:
        # Add name flag if needed.
        if sort == "queryname":
            sort_extra += " -n"

        # Sort alignments.
        pipe_cmd = f"| samtools sort {sort_extra} {samtools_opts}"

    else:
        raise ValueError(f"Unexpected value for params.sort: {sort}")


shell(
    "(minimap2"
    " -t {snakemake.threads}"
    " {extra} "
    " {snakemake.input.target}"
    " {snakemake.input.query}"
    " {pipe_cmd}"
    " > {snakemake.output[0]}"
    ") {log}"
)

wrapper v1.25.0/bio/minimap2/index:

__author__ = "Tom Poorten"
__copyright__ = "Copyright 2017, Tom Poorten"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "(minimap2 -t {snakemake.threads} {extra} "
    "-d {snakemake.output[0]} {snakemake.input.target}) {log}"
)

wrapper v1.25.0/bio/multiqc:

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
# Set this to False if multiqc should use the actual input directly
# instead of parsing the folders where the provided files are located
use_input_files_only = snakemake.params.get("use_input_files_only", False)

if not use_input_files_only:
    input_data = set(path.dirname(fp) for fp in snakemake.input)
else:
    input_data = set(snakemake.input)

output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {extra}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_data}"
    " {log}"
)

wrapper v1.25.0/bio/samtools/faidx:

__author__ = "Michael Chambers"
__copyright__ = "Copyright 2019, Michael Chambers"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import get_samtools_opts

samtools_opts = get_samtools_opts(
    snakemake, parse_threads=False, parse_write_index=False, parse_output_format=False
)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell("samtools faidx {samtools_opts} {extra} {snakemake.input[0]} {log}")

wrapper v1.25.0/bio/samtools/view:

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import get_samtools_opts

samtools_opts = get_samtools_opts(snakemake)
extra = snakemake.params.get("extra", "")
region = snakemake.params.get("region", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True)


shell("samtools view {samtools_opts} {extra} {snakemake.input[0]} {region} {log}")

Repository: https://github.com/AntonieV/NodeRAD
License: MIT License