Pipeline to pull microbial reads from WGS data and perform metagenomic analysis


DOI for the manuscript: https://doi.org/10.12688/wellcomeopenres.19155.1

Please follow the tutorial in my Jupyter Book, available here: https://aidanfoo96.github.io/MINUUR/, to reproduce my analysis or to apply the pipeline to your host of interest :)

MINUUR is a Snakemake pipeline I developed to extract non-host sequencing reads from mosquito whole genome sequencing (WGS) data and apply a range of metagenomic analyses to characterise potential host-associated microbes. Although developed for mosquitoes, it can be applied to WGS data from other hosts. MINUUR aims to leverage pre-existing WGS data to recover microbial information pertaining to host-associated microbiomes.

MINUUR utilises the tools below; a conceptual sketch of these steps follows the list:

  • KRAKEN2: classify taxa from unmapped read sequences

  • KrakenTools: extract classified reads for downstream analysis

  • BRACKEN: re-estimate taxonomic abundance from KRAKEN2 reports

  • MetaPhlAn3: classify taxa using marker genes

  • MEGAHIT: assemble metagenomes from unmapped reads

  • QUAST: generate assembly statistics from MEGAHIT assemblies

  • MetaBAT2: bin contiguous sequences from MEGAHIT assemblies

  • CheckM: assess the quality of MetaBAT2 bins
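For orientation, the core steps MINUUR automates resemble the following standalone commands (a rough sketch only, not the pipeline's exact invocations; DB, the read files and the output paths are placeholders):

kraken2 --db DB --paired unmapped_1.fastq.gz unmapped_2.fastq.gz --report sample.k2report --output sample.kraken
bracken -d DB -i sample.k2report -o sample.bracken
megahit -1 unmapped_1.fastq.gz -2 unmapped_2.fastq.gz -o sample_assembly
metabat2 -i sample_assembly/final.contigs.fa -o bins/bin
checkm lineage_wf -x fa bins/ checkm_out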

Installation of Snakemake

MINUUR is run using the workflow manager Snakemake.

Snakemake is best installed using the package manager Mamba.

Once Mamba is installed, run:

mamba create -c bioconda -c conda-forge --name snakemake snakemake
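Then activate the environment and check the installation:

mamba activate snakemake
snakemake --version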

Installation of MINUUR

Use git clone https://github.com/aidanfoo96/MINUUR/ and cd MINUUR/workflow. This is the reference point from which the pipeline is run. See the Jupyter Book for a full tutorial on setting up the configuration to run this pipeline; a typical first invocation is sketched below.
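For example, cloning and then dry-running the pipeline checks that the rules resolve before committing compute (a sketch; adjust the core count to your machine, and -n asks Snakemake to plan without running):

git clone https://github.com/aidanfoo96/MINUUR/
cd MINUUR/workflow
snakemake --cores 4 --use-conda -n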

Update 09/05/2023:

  • Added GitHub Actions

  • A dummy dataset is now included in workflow/data; a tutorial for running it is included in the Jupyter Book. Use this to check that the pipeline works on your machine (see the example after this list).

  • Added the option to run BUSCO to help assess eukaryotic contamination in MAGs
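For example, a test run on the dummy dataset might look like this from within MINUUR/workflow (a sketch; the core count and the --use-conda flag depend on your setup):

snakemake --cores 2 --use-conda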

For any feedback or bug reports, please open an issue or contact: [email protected]

Code Snippets

script: 
    "../scripts/check_inputs.py"
wrapper: 
    "v0.80.1/bio/fastqc"
wrapper:
    "v1.21.6/bio/cutadapt/pe"
wrapper: 
    "v0.80.1/bio/fastqc"
wrapper: 
    "0.79.0/bio/samtools/stats"
# lines 8, 14 and 16 of the samtools stats output hold the raw total, mapped and unmapped read counts
shell:
    "sed -n '8p;14p;16p' {input.sample} > {output.txt}"
# -f 12 keeps read pairs where both mates are unmapped; -F 256 excludes secondary alignments
shell:
    "samtools view -b -f 12 -F 256 {input.sample} > {output.bam} 2> {log}"
# bedtools bamToFastq: convert the unmapped-read BAM back to paired FASTQ files
shell:
    "bamToFastq -i {input.sample} -fq {output.fastq1} -fq2 {output.fastq2}"
shell: 
    "sed -n '8p;14p;16p' {input.sample} > {output.txt}" 
# -f 4 keeps reads that are unmapped
shell:
    "samtools view -b -f 4 {input} > {output.bam} 2> {log}"
shell: 
    "bamToFastq -i {input.sample} -fq {output.fastq1} -fq2 {output.fastq2}"
shell: 
    """
    extract_kraken_reads.py -k {input.krak_file} \
    -s1 {input.r1} -s2 {input.r2} \
    --output {output.read1} \
    --output2 {output.read2} \
    --fastq-output \
    --taxid {params.taxa} \
    --include-children \
    --report {input.krak_reprt} 2> {log}
    """
shell: 
    "kreport2mpa.py -r {input.krak_file} -o {output.mpa_txt} --display-header"
shell: 
    "combine_mpa.py -i {input.mpa_files} -o {output.combined_report}"
script: 
    "../scripts/generate_kraken_summaries.R"
script: 
    "../scripts/combined_kraken_alignment_stat_ffq.R"
shell: 
    "sed -n '1p;2p' {input.krak_mpa_report} > {output.classified_summaries}"
script: 
    "../scripts/plot_classifiedVsunclassified_reads.R"
script: 
    "../scripts/plot_kraken_heatmaps.R"
script: 
    "../scripts/plot_kraken_spatial.R"
shell: 
    "merge_metaphlan_tables.py {input.cln_out} > {output.concat_tbl}" 
# keep species-level (s__) and UNKNOWN rows; fields 1 and 3 are the clade name and relative abundance
shell:
    "grep 's__\|UNKNOWN' {input.txt_out} | cut -f1,3 > {output.tbl_out}"
script:
    "../scripts/generate_metaphlan_summaries.R" 
script: 
    "../scripts/plot_bracken_results.R"
wrapper: 
    "0.80.2/bio/bwa/index"
wrapper:
   "v1.17.3/bio/bwa/mem"
script:
    "../scripts/plot_checkm_QA.R" 
wrapper:
    "v1.28.0/bio/busco"
import sys
import os
import pandas as pd
import numpy as np


sample_list = pd.read_csv(snakemake.params['sample_list'], sep = "\t")
automatic_input = snakemake.params['automatic_input']

# Check sample names given in the fastq_samples.tsv file match the given fastq files

if automatic_input is False: 
    sample_table = pd.read_csv(snakemake.params['sample_table'], sep = "\t")
    assert np.isin(sample_table['sampleID'], sample_list['sampleID']).all(), "samples in your sample table are not present in your sample list. Please ensure all samples match"
else:
    for sample in sample_list['sampleID']:
        for num in [1, 2]:
            fastq_sample = f"../resources/{sample}_{num}.fastq.gz"
            assert os.path.isfile(fastq_sample), "your sample names given in 'samples_fastq.tsv' do not match your fastq files. Please rename your files accordingly"

print("Inputs Passed :D")
library(tidyverse)
library(MetBrewer)

#### Load Data ####
concatenated_alignments <- read_tsv(snakemake@input[["alignment_stats"]], col_names = c("SN", "sequence", "num_sequences", "file_samples"))
concatenated_alignments$file_samples[concatenated_alignments$file_samples == "# excluding supplementary and secondary reads"] <- ""

#### Get List of Sample Names ####
sample_names <- concatenated_alignments %>%
  separate(file_samples, c("file", "sample"), sep = "stats_ffq/") %>%
  separate(sample, c("sample", "junk"), sep = "_align") %>%
  select(sequence, num_sequences, sample) %>%
  group_by(sample) %>%
  summarise(n()) %>%
  select(sample) %>%
  filter(sample != "NA")

sample_names <- as.vector(sample_names$sample)

#### Clean Data ####
concatenated_alignments_clean <- concatenated_alignments %>%
  select(!SN) %>%
  separate(file_samples, c("file", "sample"), sep = "stats_ffq/") %>%
  separate(sample, c("sample", "junk"), sep = "_align") %>%
  select(sequence, num_sequences, sample) %>%
  mutate(sample = replace(sample, is.na(sample), sample_names))

##### Generate Summary Tables ######
# Wider, 'human readable' format with proportions
concatenated_alignments_clean_wide <- concatenated_alignments_clean %>% 
  group_by(sample) %>%
  pivot_wider(names_from = sequence, values_from = num_sequences) %>%
  mutate(proportion_mapped = `reads mapped:` / `raw total sequences:` * 100) 

write.table(concatenated_alignments_clean_wide, file = snakemake@output[["human_read_tbl"]], sep = "\t", row.names = FALSE)
write.table(concatenated_alignments_clean, file = snakemake@output[["long_read_tbl"]], sep = "\t", row.names = FALSE)

#### Plot Summaries ####
alignment_statistics_plot <- concatenated_alignments_clean %>%
  ggplot() + 
  aes(x = sample, y = num_sequences, fill = sequence) + 
  geom_bar(stat = "identity", position = "dodge") + 
  theme_bw(base_size = 15) + 
  scale_fill_manual(values = met.brewer("Redon", n = 3, type = "discrete"), 
                    labels = c("Raw Read Number", "Reads Mapped", "Reads Unmapped")) + 
  xlab("Sample") + 
  ylab("Read Number") + 
  theme(legend.title = element_blank(), 
        axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1), 
        legend.position = "top") 

ggsave(snakemake@output[["alignment_stat_plot"]], alignment_statistics_plot)
library(tidyverse)

genus_delim <- function(){
  if(snakemake@params[["kraken_db_type"]]){
    return("g__")
  }else{
    return("g__g__")
  }
}


spp_delim <- function(){
  if(snakemake@params[["kraken_db_type"]]){
    return("s__")
  }else{
    return("s__s__")
  }
}

genus <- genus_delim()
spp <- spp_delim()

data <- read_tsv(snakemake@input[["combined_report"]])

##### Get Domain #####
kingdom_data <- data %>%
  filter(!grepl("p__|g__|s__|c__|o__|f__", `#Classification`)) %>%
  separate(`#Classification`, c("junk", "kingdom"), sep = "^k__") %>%
  select(!junk)

kingdom_data_long <- kingdom_data %>%
  pivot_longer(cols = !kingdom, names_to = "study", values_to = "read_number") 

##### Get genus  #####
genus_data <- data %>%
  filter(grepl("g__", `#Classification`)) %>%
  separate(`#Classification`, c("junk", "genus"), sep = genus) %>%
  select(!junk) %>%
  filter(!grepl("s__", genus)) 

genus_data_long <- genus_data %>%
  pivot_longer(cols = !genus, names_to = "study", values_to = "read_number") 

##### Get species  #####
species_data <- data %>% 
  filter(grepl("s__", `#Classification`)) %>%
  separate(`#Classification`, c("junk", "species"), sep = spp) %>%
  select(!junk)

species_data_long <- species_data %>% 
  pivot_longer(cols = !species, names_to = "study", values_to = "read_number") 

##### How many reads per study? #####
total_reads_per_study <- species_data_long %>% 
  group_by(study) %>%
  summarise(total_reads = sum(read_number)) 

classified_reads_plot <- total_reads_per_study %>% 
  ggplot() + 
  aes(x = study, y = total_reads) + 
  geom_col() + 
  theme_classic() + 
  coord_flip()

##### Species Classification Passed a specific threshold (unfinished) ##### 
species_data_filtered <- species_data_long %>%
  filter(read_number > 1000)

genus_data_filtered <- genus_data_long %>%
  filter(read_number > 1000)

#### Export Dataframes to TSV files ####
write.table(kingdom_data_long, file = snakemake@output[["kingdom_table"]], sep = "\t", row.names = FALSE)
write.table(genus_data_long, file = snakemake@output[["genus_table"]], sep = "\t", row.names = FALSE)
write.table(species_data_long, file = snakemake@output[["spp_table"]], sep = "\t", row.names = FALSE)
write.table(total_reads_per_study, file = snakemake@output[["classified_reads"]], sep = "\t", row.names = FALSE)

#### Export Classified Reads Plot ####
pdf(snakemake@output[["classified_reads_plot"]])
print(classified_reads_plot)
dev.off()
library(tidyverse)

metaphlan_tbl <- read_tsv(snakemake@input[["concat_tbl"]], skip = 1) %>%
  select(!NCBI_tax_id)

#### Get Kingdom ####
metaphlan_tbl_king <- metaphlan_tbl %>%
  filter(!grepl("p__|g__|s__|c__|o__|f__", clade_name)) %>%
  mutate_at("clade_name", str_replace, "k__", "")

metaphlan_tbl_king_long <- metaphlan_tbl_king %>%
  pivot_longer(cols = !clade_name, 
               names_to = "samples", 
               values_to = "rel_abund")

#### Get Genus ####
metaphlan_tbl_genus <- metaphlan_tbl %>%
  filter(grepl("g__", clade_name)) %>%
  separate(clade_name, c("junk", "genus"), sep = "g__") %>%
  select(!junk) %>%
  filter(!grepl("s__", genus))

metaphlan_tbl_genus_long <- metaphlan_tbl_genus %>% 
  pivot_longer(cols = !genus, 
               names_to = "samples", 
               values_to = "rel_abund")

#### Get Species ####
metaphlan_tbl_spp <- metaphlan_tbl %>%
  filter(grepl("s__", clade_name)) %>%
  separate(clade_name, c("junk", "species"), sep = "s__") %>%
  select(!junk)

metaphlan_tbl_spp_long <- metaphlan_tbl_spp %>% 
  pivot_longer(cols = !species,
               names_to = "samples", 
               values_to = "rel_abund")

#### Export Dataframes to TSV files ####
write.table(metaphlan_tbl_king_long, file = snakemake@output[["kingdom_table"]], sep = "\t", row.names = FALSE)
write.table(metaphlan_tbl_genus_long, file = snakemake@output[["genus_table"]], sep = "\t", row.names = FALSE)
write.table(metaphlan_tbl_spp_long, file = snakemake@output[["spp_table"]], sep = "\t", row.names = FALSE)
library(tidyverse)
# Get colour palette
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# Set Read Filter
filter <- as.character(snakemake@params[["bracken_threshold"]])
filter <- as.numeric(filter)

#### Load Data #####
conc_data <- read_tsv(snakemake@input[["conc_input"]], col_names = c("species",
                                                                       "taxa_id", 
                                                                       "taxa_lvl", 
                                                                       "kraken_assigned_reads", 
                                                                       "added_reads", 
                                                                       "new_est_reads", 
                                                                       "fraction_total", 
                                                                       "file_path")) %>%
  filter(species != "name")

conc_data_clean <- conc_data %>%
  separate(file_path, c("file", "sample"), sep = "out/") %>%
  separate(sample, c("sample", "junk"), sep = "_bracken") %>%
  select(sample, species, taxa_id, taxa_lvl, kraken_assigned_reads, added_reads, new_est_reads, fraction_total) %>%
  mutate_at(c("taxa_id", "kraken_assigned_reads", "added_reads", "new_est_reads", "fraction_total"), as.numeric)


#### Added Reads Across Samples ####

sample_summary <- conc_data_clean %>%
  group_by(sample) %>%
  summarise(kraken_reads = sum(kraken_assigned_reads), 
            added_reads = sum(added_reads), 
            total_new_reads = sum(new_est_reads))

sample_summary_long <- sample_summary %>%
  pivot_longer(cols = !sample, 
               names_to = "classification_type", 
               values_to = "num_reads") 

sample_summary_plot <- sample_summary %>%
  pivot_longer(cols = !sample, 
               names_to = "classification_type", 
               values_to = "num_reads") %>%
  filter(classification_type != "total_new_reads") %>%
  ggplot() + 
  aes(reorder(x = sample, desc(num_reads)), y = num_reads, fill = classification_type) + 
  geom_bar(stat = "identity", position = "stack") + 
  coord_flip() + 
  scale_fill_manual(values = cbPalette) + 
  theme_classic(base_size = 25) + 
  xlab("Sample") + 
  ylab("Number of Reads")

ggsave(snakemake@output[["added_reads_plot"]], sample_summary_plot, height = 15, width = 20)

#### Stratified Heatmap of Relative Abundance Per Sample ####

conc_data_clean$species <- sub(" ", "_", conc_data_clean$species)

conc_data_clean_sep <- conc_data_clean %>%
  separate(species, into = c("Genus", "Species"), sep = "_") %>%
  filter(kraken_assigned_reads > filter)

conc_data_clean_genus_group <- conc_data_clean_sep %>%
  group_by(Genus) %>%
  summarise(n())

plot_stratified_heatmap <- function(){
  for(i in conc_data_clean_genus_group$Genus) {
  print(
    conc_data_clean_sep %>%
      filter(Genus == i) %>% 
      ggplot() + 
      aes(x = sample, y = Species) + 
      geom_tile(aes(fill = new_est_reads), linejoin = "round") + 
      scale_fill_gradientn(colours = c("white", "#F0E442", "#E69F00")) +
      facet_grid(rows = "Genus") + 
      theme_classic(base_size = 25) + 
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) + 
      xlab("Sample"))
  }
}

pdf(snakemake@output[["bracken_stratified_heatmap"]], height = 20, width = 20)
plot_stratified_heatmap() 
dev.off()

##### Heatmap of Species #####
plot_heatmap_per_sample <- function(data, filter_thresh){
  data %>%
    filter(new_est_reads > filter_thresh) %>%
    select(sample, species, fraction_total) %>%
    ggplot() +
    aes(x = sample, y = species) +
    geom_tile(aes(fill = log(fraction_total))) + 
    scale_fill_gradientn(colours = c("white", "orange", "blue")) + 
    theme_classic() + 
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) + 
    ggtitle(paste("Log Relative Abundance of Microbial Species (filtered > ", filter_thresh, ")", sep = ""))
}

pdf(snakemake@output[["bracken_overall_heatmap"]], height = 20, width = 20)
plot_heatmap_per_sample(conc_data_clean, filter)
dev.off()
library(tidyverse)
library(MetBrewer)

# Load Appropriate Columns 
columns <- c("bin", "marker_lineage", "num_genomes", "num_markers", "num_marker_sets", "completeness", "contamination",
             "strain_heterogen", "genome_size_bp", "num_ambiguous_bases", "num_scaffolds", "num_contigs", "N50_scaffold", "N50_contig",
             "mean_scaffold_length", "mean_contig_length", "longest_scaffold", "longest_contig", 
             "GC", "GC_std", "coding_density", "translation_table", "num_predicted_genes", "0", "1", "2", "3", "4", "5+", "sample")

# Read Data, skip first row and remove the redundant header columns from previous step
bin_stats <- read_tsv(snakemake@input[["checkm_conc_result"]], col_names = columns, skip = 1) %>%
  filter(bin != "Bin Id")

# Convert to numeric columns
bin_stats$completeness <- as.numeric(bin_stats$completeness)
bin_stats$contamination <- as.numeric(bin_stats$contamination)
bin_stats$genome_size_bp <- as.numeric(bin_stats$genome_size_bp)
bin_stats$num_contigs <- as.numeric(bin_stats$num_contigs)
bin_stats$mean_contig_length <- as.numeric(bin_stats$mean_contig_length)
bin_stats$N50_contig <- as.numeric(bin_stats$N50_contig)

# Get breaks for plotting 
bin_stats$col <- cut(bin_stats$completeness, 
                        breaks = c(-Inf, 80, Inf), 
                        labels = c("<80", ">=80"))

bin_stats$contig_cut <- cut(bin_stats$num_contigs, 
                     breaks = c(-Inf, 250, Inf), 
                     labels = c("<=250", ">250"))

# Plot Completeness Vs Contamination Per Sample
CompletenessVsContam <- bin_stats %>%
  ggplot() + 
  aes(x = completeness, y = contamination, size = genome_size_bp, col = col) + 
  geom_point(alpha = 0.5) + 
  scale_x_continuous(limits = c(0, 100)) +
  theme_minimal() + 
  scale_color_manual(values = c("blue", "orange"))

pdf(snakemake@output[["CompletenessVsContam"]])
print(CompletenessVsContam)
dev.off()

#### Plot Completeness Vs Contamination Per Sample
bin_stats_sample_split <- bin_stats %>%
  separate(bin, into = c("sample", "bin_num"), sep = "\\.")

compvscontam_samp <- bin_stats_sample_split %>%
  ggplot() + 
  aes(x = completeness, y = contamination, col = sample) + 
  geom_point(alpha = 0.8, size = 4) + 
  scale_x_continuous(limits = c(0, 100 )) +
  geom_vline( xintercept = 80, linetype = "dotted", 
              color = "green", size = 1) + 
  geom_vline(xintercept = 50, linetype = "dotted", 
             color = "orange", size = 1) + 
  theme_bw(base_size = 18) + 
  scale_color_manual(values = met.brewer("Redon", 10, type = "discrete")) + 
  xlab("Completeness") + 
  ylab("Contamination") + 
  labs(colour = "Sample") + 
  theme(legend.position = "bottom", 
        legend.direction = "horizontal", 
        legend.text = element_text(size = 10))

pdf(snakemake@output[["CompletenessVsContam_Per_Sample"]])
print(compvscontam_samp)
dev.off()

#### Plot Bar Chart with Completeness vs Contamination Per Sample
bin_stats_bar <- bin_stats_sample_split %>%
  unite("sample_bin", sample:bin_num) %>%
  select(completeness, contamination, sample_bin) %>%
  pivot_longer(c(completeness, contamination), 
              names_to = "bin_rank", 
              values_to = "value") %>%
  ggplot() + 
  aes(reorder(x = sample_bin, desc(value)), y = value, fill = bin_rank) + 
  geom_bar(stat = "identity", position = "dodge") + 
  theme_bw(base_size = 15) + 
  coord_flip() + 
  scale_fill_manual(values = met.brewer("Redon", 2, type = "discrete"), labels = c("Completeness", "Contamination")) + 
  geom_hline(yintercept = 80, linetype = "dotted", 
              color = "green", size = 1) + 
  geom_hline(yintercept = 50, linetype = "dotted", 
             color = "orange", size = 1) + 
  xlab("Sample") + 
  ylab("Contamination and Completion Score (%)") + 
  labs(fill = "CheckM Completion vs Contamination Score") + 
  theme(legend.position = "bottom", 
        legend.direction = "vertical")

pdf(snakemake@output[["BarChartCompletenessContamination"]])
print(bin_stats_bar)
dev.off()

#### Plot the number of contigs vs completeness
NumContigsVsCompleteness <- bin_stats %>%
  ggplot() + 
  aes(x = num_contigs, y = completeness, size = mean_contig_length, col = contig_cut) + 
  geom_point(alpha = 0.5) + 
  theme_minimal() + 
  scale_color_manual(values = c("blue", "orange"))

pdf(snakemake@output[["NumContigsVsCompleteness"]])
print(NumContigsVsCompleteness)
dev.off()
library(tidyverse)
library(MetBrewer)

#### Import Kraken Data
concatenated_kraken_summaries <- read_tsv(snakemake@input[["concatenated_kraken_summary"]], 
                                            col_names = c("proportion_of_reads", 
                                                        "number_of_reads", 
                                                        "number_of_reads2",
                                                        "read_code", 
                                                        "read_code2", 
                                                        "classification_status", 
                                                        "file_path"))

concatenated_kraken_summaries_clean <- concatenated_kraken_summaries %>%
  separate(file_path, into = c("junk", "filename"), sep = "classified_summary/") %>%
  separate(filename, into = c("sample", "junk2"), sep = "_classified_summary.txt") %>%
  select(sample, proportion_of_reads, number_of_reads, classification_status) 


#### Plot Proportion of Classified vs Unclassified Reads from Kraken
plot_kraken_proportion <- function(kraken_table){

  concatenated_kraken_summaries_clean_plot <- kraken_table %>%
    ggplot() + 
    aes(x = sample, y = number_of_reads, fill = classification_status) + 
    geom_bar(stat = "identity", position = "fill") + 
    theme_bw(base_size = 15) + 
    scale_fill_manual(values = met.brewer("Redon", n = 2, type = "continuous"), 
                      labels = c("Classified Reads", "Unclassified Reads")) + 
    xlab("Sample") + 
    ylab("Proportion of Reads") + 
    theme(legend.title = element_blank(), 
          axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1), 
          legend.position = "top")

  return(concatenated_kraken_summaries_clean_plot)

}

kraken_classification_proportions <- plot_kraken_proportion(concatenated_kraken_summaries_clean)

#### Save file
ggsave(snakemake@output[["classified_proportions"]], kraken_classification_proportions)
library(tidyverse)
library(MetBrewer)

##### Import Data #####

spp_tbl <- read_tsv(snakemake@input[["spp_table"]])
genus_tbl <- read_tsv(snakemake@input[["genus_table"]]) 

##### Define User Read Threshold #####
filter_threshold <- as.character(snakemake@params[["strat_thresh"]])
filter_threshold <- as.numeric(filter_threshold)

genus_thres <- as.character(snakemake@params[["genus_thresh"]])
genus_thres <- as.numeric(genus_thres)

spp_thres <- as.character(snakemake@params[["species_thresh"]])
spp_thres <- as.numeric(spp_thres)

##### Clean and Prep Data #####
genus_tbl_clean <- genus_tbl %>%
  separate(study, into = c("sample", "junk"), sep = "_") %>%
  select(!junk) %>%
  rename("phylo_level" = genus)

spp_tbl_clean <- spp_tbl %>%
  separate(study, into = c("sample", "junk"), sep = "_") %>%
  select(!junk) %>%
  rename("phylo_level" = species)

##### Plot Function #####

kraken_heatmap <- function(clean_kraken_table, read_thresh, type){
  heatmap <- clean_kraken_table %>%
    filter(read_number > read_thresh) %>%
    filter(phylo_level != "Homo_sapiens") %>%
    ggplot() + 
    aes(x = sample, y = phylo_level) + 
    geom_tile(aes(fill = log(read_number)), linejoin = "round") + 
    scale_fill_gradientn(colours = met.brewer("Hiroshige", 4, type = "discrete")) +
    theme_bw(base_size = 20) +
    theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1), 
          legend.position = "top", 
          legend.text = element_text(size = 15), 
          legend.title = element_text(size = 15)) + 
    labs(fill = "Read Number (log)") + 
    xlab("Sample") + 
    ylab(type)

  return(heatmap)

}

species_heatmap <- kraken_heatmap(spp_tbl_clean, spp_thres, "Species")
genus_heatmap <- kraken_heatmap(genus_tbl_clean, genus_thres, "Genus")

ggsave(snakemake@output[["genus_heatmap"]], genus_heatmap, height = 20, width = 15)
ggsave(snakemake@output[["species_heatmap"]], species_heatmap, height = 20, width = 15)

##### Faceted Plot By Genus #####
spp_tbl_split <- spp_tbl_clean %>% 
  separate(phylo_level, into = c("Genus", "Species1", "Species2"), sep = "_") %>%
  unite("Species", Species1:Species2, na.rm = TRUE) 

genus_list <- spp_tbl_split %>%
  group_by(Genus) %>%
  summarise(n()) 

plot_species_facets <- function() {
  for(i in genus_list$Genus) {
    print(
      spp_tbl_split %>%
        filter(Genus == i) %>% 
        ggplot() + 
        aes(x = sample, y = Species) + 
        geom_tile(aes(fill = read_number), linejoin = "round") + 
        scale_fill_gradientn(colours = c("white", "#F0E442", "#E69F00")) +
        facet_grid(rows = "Genus") + 
        theme_classic(base_size = 25) + 
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
        )
    }
}

pdf(snakemake@output[["stratified_heatmap"]], height = 20, width = 20)
plot_species_facets()
dev.off()
library(tidyverse)
library(MetBrewer)
library(treemapify)

##### Import Data #####

spp_tbl <- read_tsv(snakemake@input[["spp_table"]])
genus_tbl <- read_tsv(snakemake@input[["genus_table"]]) 

##### Define User Read Threshold #####
filter_threshold <- as.character(snakemake@params[["strat_thresh"]])
filter_threshold <- as.numeric(filter_threshold)

genus_thres <- as.character(snakemake@params[["genus_thresh"]])
genus_thres <- as.numeric(genus_thres)

spp_thres <- as.character(snakemake@params[["species_thresh"]])
spp_thres <- as.numeric(spp_thres)

##### Clean and Prep Data #####
genus_tbl_clean <- genus_tbl %>%
  separate(study, into = c("sample", "junk"), sep = "_") %>%
  select(!junk) %>%
  rename("phylo_level" = genus)

spp_tbl_clean <- spp_tbl %>%
  separate(study, into = c("sample", "junk"), sep = "_") %>%
  select(!junk) %>%
  rename("phylo_level" = species)

##### Plot Spatial Plots #####
plot_spatial_classification <- function(classification_tbl, read_thresh){
  tbl_clean_sum <- classification_tbl %>%
    group_by(phylo_level) %>%
    summarise(summed_read_num = sum(read_number)) %>%
    filter(summed_read_num > read_thresh)

  spatial_plot <- tbl_clean_sum %>%
    filter(phylo_level != "Homo_sapiens") %>%
    filter(summed_read_num > read_thresh) %>%
    ggplot(aes(area = summed_read_num, fill = summed_read_num, label = phylo_level, subgroup = phylo_level)) +
    geom_treemap() + 
    geom_treemap_subgroup_border(colour = "white", size = 1) + 
    geom_treemap_text(colour = "white", place = "center", reflow = T) + 
    scale_fill_gradientn(colours = met.brewer("Hiroshige", 10, type = "discrete"), name = "Total Read Number") +
    theme_minimal(base_size = 15) + 
    theme(legend.position = "right")

  return(spatial_plot)
}

species_spatial_plot <- plot_spatial_classification(spp_tbl_clean, spp_thres)
genus_spatial_plot <- plot_spatial_classification(genus_tbl_clean, genus_thres)

ggsave(snakemake@output[["genus_spatial_plot"]], genus_spatial_plot, height = 20, width = 15)
ggsave(snakemake@output[["species_spatial_plot"]], species_spatial_plot, height = 20, width = 15)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
region = snakemake.params.get("region", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


shell("samtools stats {extra} {snakemake.input} {region} > {snakemake.output} {log}")
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2016, Patrik Smeds"
__email__ = "[email protected]"
__license__ = "MIT"

from os import path

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided!")
elif len(snakemake.input) > 1:
    raise ValueError("Only one reference genome can be inputed!")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", "")

if len(prefix) > 0:
    prefix = "-p " + prefix

# Construction algorithm that will be used to build the database, default is bwtsw
construction_algorithm = snakemake.params.get("algorithm", "")

if len(construction_algorithm) != 0:
    construction_algorithm = "-a " + construction_algorithm

shell(
    "bwa index" " {prefix}" " {construction_algorithm}" " {snakemake.input[0]}" " {log}"
)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)

    return base


# Run fastqc; since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "[email protected], [email protected]"
__license__ = "MIT"


import tempfile
from os import path
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
from snakemake_wrapper_utils.samtools import get_samtools_opts


# Extract arguments.
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
sort = snakemake.params.get("sorting", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")
samtools_opts = get_samtools_opts(snakemake)
java_opts = get_java_opts(snakemake)


index = snakemake.input.idx
if isinstance(index, str):
    index = path.splitext(snakemake.input.idx)[0]
else:
    index = path.splitext(snakemake.input.idx[0])[0]


# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")


if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))


# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view {samtools_opts}"

elif sort == "samtools":
    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"

    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {samtools_opts} {sort_extra} -T {tmpdir}"

elif sort == "picard":
    # Sort alignments using picard SortSam.
    pipe_cmd = "picard SortSam {java_opts} {sort_extra} --INPUT /dev/stdin --TMP_DIR {tmpdir} --SORT_ORDER {sort_order} --OUTPUT {snakemake.output[0]}"

else:
    raise ValueError(f"Unexpected value for params.sort ({sort})")

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(bwa mem"
        " -t {snakemake.threads}"
        " {extra}"
        " {index}"
        " {snakemake.input.reads}"
        " | " + pipe_cmd + ") {log}"
    )
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " --cores {snakemake.threads}"
    " {adapters}"
    " {extra}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}"
)
__author__ = "Tessa Pierce"
__copyright__ = "Copyright 2018, Tessa Pierce"
__email__ = "[email protected]"
__license__ = "MIT"


import tempfile
from snakemake.shell import shell


log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")


mode = snakemake.params.get("mode")
assert mode in [
    "genome",
    "transcriptome",
    "proteins",
], "invalid run mode: only 'genome', 'transcriptome' or 'proteins' allowed"


lineage = lineage_opt = snakemake.params.get("lineage", "")
if lineage_opt:
    lineage_opt = f"--lineage {lineage_opt}"


with tempfile.TemporaryDirectory() as tmpdir:
    dataset_dir = snakemake.input.get("dataset_dir", "")
    if not dataset_dir:
        dataset_dir = f"{tmpdir}/dataset"

    shell(
        "busco"
        " --cpu {snakemake.threads}"
        " --in {snakemake.input}"
        " --mode {mode}"
        " {lineage_opt}"
        " {extra}"
        " --download_path {dataset_dir}"
        " --out_path {tmpdir}"
        " --out output"
        " {log}"
    )

    if snakemake.output.get("short_txt"):
        assert lineage, "parameter 'lineage' is required to output 'short_tsv'"
        shell(
            "cat {tmpdir}/output/short_summary.specific.{lineage}.output.txt > {snakemake.output.short_txt:q}"
        )
    if snakemake.output.get("short_json"):
        assert lineage, "parameter 'lineage' is required to output 'short_json'"
        shell(
            "cat {tmpdir}/output/short_summary.specific.{lineage}.output.json > {snakemake.output.short_json:q}"
        )
    if snakemake.output.get("full_table"):
        assert lineage, "parameter 'lineage' is required to output 'full_table'"
        shell(
            "cat {tmpdir}/output/run_{lineage}/full_table.tsv > {snakemake.output.full_table:q}"
        )
    if snakemake.output.get("miss_list"):
        assert lineage, "parameter 'lineage' is required to output 'miss_list'"
        shell(
            "cat {tmpdir}/output/run_{lineage}/missing_busco_list.tsv > {snakemake.output.miss_list:q}"
        )
    if snakemake.output.get("out_dir"):
        shell("mv {tmpdir}/output {snakemake.output.out_dir:q}")
    if snakemake.output.get("dataset_dir"):
        shell("mv {dataset_dir} {snakemake.output.dataset_dir:q}")

URL: https://github.com/aidanfoo96/MINUUR
Name: minuur
Version: v1.1
License: GNU General Public License v3.0
