Behavioral Genetics in Drosophila: RNA-Seq analysis pipeline

public public 1yr ago 0 bookmarks

Software used to analyze data for:

Deanhardt, B., Duan, Q., Du, C., Soeder, C., Morlote, A., Garg, D., Saha, A., Jones, C. D., & Volkan, P. C. (2023). Social experience and pheromone receptor activity reprogram gene expression in sensory neurons. G3 Genes|Genomes|Genetics , 00 (March), 2006–2021. https://doi.org/10.1093/g3journal/jkad072

Social experience and pheromone receptor activity reprogram behavioral switch and neuromodulatory gene expression in sensory neurons Bryson Deanhardt, Qichen Duan, Chengcheng Du, Charles Soeder, Alec Morlote, Deeya Garg, Corbin D. Jones, Pelin Cayirlioglu Volkan bioRxiv 2021.06.18.449021; doi: https://doi.org/10.1101/2021.06.18.449021 latest version: https://www.biorxiv.org/content/10.1101/2021.06.18.449021v3

Changes in splicing and neuromodulatory gene expression programs in sensory neurons with pheromone signaling and social experience Deanhardt Bryson, Duan Qichen, Du Chengcheng, Soeder Charles, Jones Corbin, C. Volkan Pelin bioRxiv 2021.06.18.449021; doi: https://doi.org/10.1101/2021.06.18.449021 version 1: https://www.biorxiv.org/content/10.1101/2021.06.18.449021v1 version 2: https://www.biorxiv.org/content/10.1101/2021.06.18.449021v2

Analysis downstream of this workflow was performed by Quichen Duan.

Dependencies

The following were required to run the full workflow:

module load python/3.6.6 bedtools bedops samtools r/3.6.0 rstudio/1.1.453 bowtie sratoolkit subread

as well, the Snakefile includes rules with hard paths that will need changing:

fastp_clean_sample_pe	:	fastp
write_report	:	pandoc_path

Data

Not all the data in the config.yaml have been published; so full results summary won't run :( Intermediate results can be generated though.

Use

Deanhardt et al. 2021

The results summary from Deanhardt et al. 2021 can be generated: (untested!)

nohup time snakemake --restart-times 6 --latency-wait 60 --jobs 24 -p --cluster "sbatch --time={params.runtime} -n {params.cores} --mem={params.runmem_gb}G " --snakefile Snakefile.Deanhardt2021 --configfile config.Deanhardt2021.yaml &

This recursively generates the differential expression analysis from raw data, summarizes/vizualizes it in a PDF, and bundles/timestamps the results summary (".pdf")

The underlying differential expression and differential exon use data from Deanhardt et al. 2021 can be generated:

snakemake diff_expr/grpWtVs47b/grpWtVs47b.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.itemized.de diff_expr/grpWtVs67d/grpWtVs67d.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.itemized.de diff_expr/grpWtVsFru_smolFru/grpWtVsFru_smolFru.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.itemized.de diff_expr/grpWtVsMut/grpWtVsMut.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.itemized.de 
snakemake diff_exon_use/dex_grpWtVs47b/grpWtVs47b.vs_dm6main.dm6_genes.mapspliceMulti.M.de diff_exon_use/dex_grpWtVs67d/grpWtVs67d.vs_dm6main.dm6_genes.mapspliceMulti.M.de diff_exon_use/dex_grpWtVsFru_smolFru/grpWtVsFru_smolFru.vs_dm6main.dm6_genes.mapspliceMulti.M.de diff_exon_use/dex_grpWtVsMut/grpWtVsMut.vs_dm6main.dm6_genes.mapspliceMulti.M.de

Full Results Summary

The full results summary ("VolkanLab_BehaviorGenetics.05_Apr_2021.pdf") was generated simply by running

snakemake --cluster "sbatch --time={params.runtime} -n {params.cores} --mem={params.runmem_gb}G "

Bypassing the Hard Parts

Users may wish to avoid the time and memory consuming steps of read downloading, cleaning, mapping, and counting. The raw counts underlying the results of Deanhardt et al. 2021 are stored in the NCBI GEO ( https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE179213 ); a shortcut script is included to download this data and prepare it such that it's where the Snakemake workflow anticipates. Currently only the "multi" alignment is available.

The shortcut script is run thus:

bash utils/shortcut.bash

Fruitless replicate #1 was problematic and not used in the final analysis; it is not included in the download by default. To include it, run:

bash utils/shortcut.bash -f

The main time bottleneck is the "rando" alignment, which downsamples but does not entirely remove multimapping reads. Alignement strategy ultimately had little impact on results, so users may wish to omit them by generating the "noRando" PDF (untested!)

nohup time snakemake --restart-times 6 --latency-wait 60 --jobs 24 -p --cluster "sbatch --time={params.runtime} -n {params.cores} --mem={params.runmem_gb}G " --snakefile Snakefile.Deanhardt2021 --configfile config.Deanhardt2021.yaml results/VolkanLabBehaviorGenetics.Deanhardt2021.noRando.pdf &

Deanhardt et al. 2023

In order to respond to reviewer comments, the population genetics of the experimental flies were compared to those of published DGRP lines sequenced in Huang et al. (2014). Calling the variants in parallel required some platform-dependent tweaks which will need to be modified; unless these population genetics are of specific interest it is probably best to use the 2021 workflow. However, the 5 May 2023 results summary can be rebuilt with the updated workflow:

snakemake --restart-times 6 --latency-wait 60 --jobs 24 -p --cluster "sbatch --time={params.runtime} -n {params.cores} --mem={params.runmem_gb}G " --snakefile Snakefile.Deanhardt2023 &

In particular, scripts/freebayes-parallel requires a hard path to the GNU parallel utility and to the vcfstreamsort utility from vcflib.

Code of Note

├── config.yaml
│ └── (configuration file for the pipeline)
├── dev
│ ├── ...
│ └── (files from earlier stages in project development, including an older FAIRE-seq experiment)
├── misc
│ ├── ...
│ └── (mostly metadata re: sequencing)
├── README.md
├── quichen_duan
│ ├── ...
│ └── (scripts for followup analyses from Quichen Duan)
├── scripts
│ ├── bam_summarizer.mapspliced.py
│ ├── collapsedSignalMerger.py
│ ├── correlate.R
│ ├── deSeqer.R
│ │ └── (differential expression testing script)
│ ├── dexSeqe.R
│ │ └── (differential exon use testing script)
│ ├── edgeHog.py
│ ├── expressionOmete.R
│ ├── faire_results.Rmd
│ ├── fastp_reporter.py
│ ├── geneOntologe.R
│ ├── multimap_spreader.py
│ ├── overlapSignalCollapser.py
│ ├── read_Hollower.py
│ ├── references.bib
│ ├── replicateCorrelations.png
│ └── RNAseq_results.Rmd
│ └── (summarize results)
├── Snakefile
│ └── (rules file for the pipeline)
├── utils
│ └── annotations
│ ├── ...
│ └── (includes lists of genes of interest)
└── VolkanLab_BehaviorGenetics.05_Apr_2021.pdf └── (pipeline walkthrough and results summary)

Code Snippets

 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
args = commandArgs(trailingOnly=TRUE)
# 1: counts file in
# 2: prefix for deseq out
# 3: name of contrast


library("DGCA")
library("dplyr")
library("yaml")
library("readr")
library("tidyr")
library("stringr")
library("magrittr")

xpr_in <- args[1]
nombre <- args[2]
cor_pre <- args[3]




#	build the sample DF
trammel <- read_yaml("config.yaml")
data_sets.df <- plyr::ldply(trammel$data_sets, data.frame)
data_sets.df$name <- as.factor(data_sets.df$name)
data_sets.df$day<- as.factor(data_sets.df$day)
data_sets.df$subgroups<- as.factor(data_sets.df$subgroups)
data_sets.df$rep<- as.factor(data_sets.df$rep)
data_sets.df$housing<- as.factor(data_sets.df$housing)
data_sets.df$genotype<- as.factor(data_sets.df$genotype)
data_sets.df$tissue<- as.factor(data_sets.df$tissue)

#	build the correlation experiment
corr.df <- plyr::ldply(trammel$correlations, data.frame) %>% filter(name == nombre )

# prep the sample metadata
sbgrp <- (corr.df %>% select(filt))[1,] %>% as.character
#coldata <- data_sets.df %>% as.data.frame() %>% ungroup()%>% filter(subgroups == sbgrp)
#coldata <- data_sets.df %>% ungroup() %>% group_by(name) %>%  mutate(genotype = paste0(as.character(genotype), collapse = "," ))  %>% filter(subgroups == sbgrp)
coldata <- data_sets.df%>% group_by(name) %>%  mutate(genotype = paste0(unique(as.character(genotype)), collapse = "." )) %>% ungroup()  %>% filter(subgroups == sbgrp)  %>% as.data.frame() 

rownames(coldata) <- coldata$name
dsgn.df <- coldata %>% select(name, corr.df$vars %>% as.character() )%>% mutate(present =1) %>% spread(key=genotype, value=present, fill=0) 
rownames(dsgn.df) <- dsgn.df$name
dsgn.df %<>% select(-c("name"))
dsgn.mat <- dsgn.df %>% as.matrix()




#xpr_in <- "expression/rotund.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.rpkm"

#xpr_in <- "expression/rotund.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.rpkm"


#load and format the count data
xpr.df <- read_delim(xpr_in, "\t", escape_double = FALSE, trim_ws = TRUE)
xpr.df %<>% as.data.frame()
rownames(xpr.df) <- xpr.df$Geneid
xpr.df %<>% select(rownames(dsgn.df)) 

ddcor_res <- ddcorAll(inputMat = xpr.df, design = dsgn.mat, compare = c("wt", "rn"),adjust = "none", heatmapPlot = FALSE, nPerm = 0)

#Write to TSV
write_delim(ddcor_res, paste(cor_pre, "corr","tsv", sep="."), delim = "\t")
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
args = commandArgs(trailingOnly=TRUE)
# 1: counts file in
# 2: prefix for deseq out
# 3: name of contrast


library("DESeq2")
library("dplyr")
library("yaml")
library("readr")
library("tidyr")
library("stringr")

counts_in <- args[1]
dsq_pre <- args[2]
nombre <- args[3]
config <- args[4]

#	build the sample DF
#trammel <- read_yaml("config.yaml")
trammel <- read_yaml(config)
data_sets.df <- plyr::ldply(trammel$data_sets, data.frame)
data_sets.df$name <- as.factor(data_sets.df$name)
data_sets.df$day<- as.factor(data_sets.df$day)
data_sets.df$subgroups<- as.factor(data_sets.df$subgroups)
data_sets.df$rep<- as.factor(data_sets.df$rep)
data_sets.df$housing<- as.factor(data_sets.df$housing)
data_sets.df$genotype<- as.factor(data_sets.df$genotype)
data_sets.df$tissue<- as.factor(data_sets.df$tissue)

#	build the contrast experiment
contrasts.df <- plyr::ldply(trammel$contrasts, data.frame) %>% filter(name == nombre )

# prep the sample metadata
sbgrp <- (contrasts.df %>% select(filt))[1,] %>% as.character
#coldata <- data_sets.df %>% as.data.frame() %>% ungroup()%>% filter(subgroups == sbgrp)
#coldata <- data_sets.df %>% ungroup() %>% group_by(name) %>%  mutate(genotype = paste0(as.character(genotype), collapse = "," ))  %>% filter(subgroups == sbgrp)
coldata <- data_sets.df%>% group_by(name) %>%  mutate(genotype = paste0(unique(as.character(genotype)), collapse = "." )) %>% ungroup()  %>% filter(subgroups == sbgrp) %>% unique() %>% as.data.frame() 

rownames(coldata) <- coldata$name
coldata <- coldata %>% select(contrasts.df$vars %>% as.character() )

#load and format the count data
counts.df <- read_delim(counts_in, "\t", escape_double = FALSE, trim_ws = TRUE)
counts.df.gath <- counts.df %>% select(-c(Chr, Start, End, Strand, Length)) %>% gather(key = "sample", value = "count", -one_of("Geneid"))
counts.df.sprud <- counts.df.gath %>% spread(key = sample, value = count) %>% as.data.frame() 
rownames(counts.df.sprud) <- counts.df.sprud$Geneid
counts.df.sprud <- counts.df.sprud %>%  select(rownames(coldata))

# convert counts, metadata, design into DESeq data object
#counts.dds <- DESeqDataSetFromMatrix(countData = counts.df.sprud, colData = coldata, design = ~ housing) 
desgn <- as.character(contrasts.df$design[1])
counts.dds <- DESeqDataSetFromMatrix(countData = counts.df.sprud, colData = coldata, design = eval(parse(text=desgn)))

#prefilter to remove low-count rows
keep <- rowSums(counts(counts.dds)) >= 10
counts.dds <- counts.dds[keep,]

#relevel 
#counts.dds$condition <- relevel(counts.dds$condition, ref = "untreated")
for (i in seq(nrow(contrasts.df))){
	var <- contrasts.df[i,]$vars %>% as.character()
	ruf <-contrasts.df[i,]$ref %>% as.character()
	counts.dds[[var]] <- relevel(counts.dds[[var]], ref = ruf)
}

#Run DESeq
counts.dds.dsq <- DESeq(counts.dds)

# run the vanilla DE
counts.dds.res <- results(counts.dds.dsq) %>% as.data.frame()
counts.dds.res$geneid <- rownames(counts.dds.res)

# Effect-size shrinkage
coff <- resultsNames(counts.dds.dsq)[2]
counts.dds.res.shrunk <- lfcShrink(counts.dds.dsq, coef=coff, type="apeglm") %>% as.data.frame()
#explore shrinkage types??
counts.dds.res.shrunk$geneid <- rownames(counts.dds.res.shrunk)

# scale by gene length
counts.dds.res<-inner_join(counts.dds.res, counts.df%>% select(Geneid, Length), by=c("geneid"="Geneid")) %>% mutate(expression = baseMean/Length) %>% select(-c(Length))
counts.dds.res.shrunk<-inner_join(counts.dds.res.shrunk, counts.df%>% select(Geneid, Length), by=c("geneid"="Geneid")) %>% mutate(expression = baseMean/Length) %>% select(-c(Length))

#Write to TSV
counts.dds.res.full <- inner_join(counts.dds.res %>%  as.data.frame(), counts.dds.res.shrunk %>%  as.data.frame(), by =c("geneid"="geneid"), suffix=c(".fresh",".shrunk")) %>% select(geneid, everything())
write_delim(counts.dds.res.full, paste(dsq_pre,"de",sep="."), delim = "\t")


###		Now run the itemized DE

#get the factors in the design
desgn.splt <- (desgn %>% str_replace_all("~","")%>% str_replace_all(" ","") %>% strsplit("+", fixed=TRUE))[[1]]

#first, collect the geneids and the basemeans: just run the first factor and its first nonref level
fact <- desgn.splt[1]
ref_lev <- (contrasts.df %>% filter(vars == fact))$ref
all_levs<-levels(counts.dds.dsq[[fact]])
alt_levs<- all_levs[all_levs != ref_lev]

basic.df <- results(counts.dds.dsq, contrast=c(fact, as.character(ref_lev), alt_levs[1])) %>% as.data.frame() %>% select(c("baseMean"))
basic.df$geneid <- rownames(basic.df)
basic.df <- select(basic.df, c("geneid", "baseMean"))

#for each factor in factors....

dds.dsq.byFactor.df <- select(basic.df, c("geneid") )


for (fact in desgn.splt){

	ref_lev <- as.character((contrasts.df %>% filter(vars == fact))$ref)
	all_levs<-levels(counts.dds.dsq[[fact]])
	alt_levs<- all_levs[all_levs != ref_lev]


#running basemean by level:
#https://support.bioconductor.org/p/63567/

	#	# gather baseMeans for each level in factor.levels:
	factor.baseMeanPerLvl <- sapply( levels(counts.dds.dsq[[fact]]), function(lvl) rowMeans( counts(counts.dds.dsq,normalized=TRUE)[,counts.dds.dsq[[fact]] == lvl, drop=F] ) ) %>% as.data.frame()

	names(factor.baseMeanPerLvl ) <- paste0(paste("baseMean.",fact,".", sep=""), names(factor.baseMeanPerLvl))
	factor.baseMeanPerLvl$geneid <- rownames(factor.baseMeanPerLvl)
	factor.baseMeanPerLvl <-inner_join(factor.baseMeanPerLvl, counts.df%>% select(Geneid, Length), by=c("geneid"="Geneid"))

	for (levi in all_levs){

		factor.baseMeanPerLvl$x <- factor.baseMeanPerLvl[[paste("baseMean",fact,levi,sep=".")]]
		factor.baseMeanPerLvl[[paste("expression",fact,levi,sep=".")]] <- factor.baseMeanPerLvl$x/factor.baseMeanPerLvl$Length

	}

	factor.baseMeanPerLvl <- factor.baseMeanPerLvl %>% select(-c("x", "Length"))

	dds.dsq.byFactor.df <- full_join(dds.dsq.byFactor.df, factor.baseMeanPerLvl, by = c("geneid"="geneid"))

#	#for each alt_level in factor.levels:
	for (aLev in alt_levs){

		#dds.res_factor_level <- results(counts.dds.dsq, contrast=c(fact, ref_lev, aLev))
		dds.res_factor_level.shrunk <- lfcShrink(counts.dds.dsq, coef=paste(fact, aLev, "vs", ref_lev, sep = "_"), type="apeglm") %>% as.data.frame()

		dds.res_factor_level.shrunk$geneid <- rownames(dds.res_factor_level.shrunk)
		dds.res_factor_level.shrunk<-inner_join(dds.res_factor_level.shrunk, counts.df%>% select(Geneid, Length), by=c("geneid"="Geneid")) %>% mutate(expression = baseMean/Length) %>% select(-c(Length))
		rownames(dds.res_factor_level.shrunk) <- dds.res_factor_level.shrunk$geneid
		dds.res_factor_level.shrunk <- dds.res_factor_level.shrunk %>% select(-c(geneid))
		names(dds.res_factor_level.shrunk ) <- paste0(names(dds.res_factor_level.shrunk),paste(".",fact,".vs_",aLev,".apeglm", sep=""))
		dds.res_factor_level.shrunk$geneid <- rownames(dds.res_factor_level.shrunk)

		dds.dsq.byFactor.df <- full_join(dds.dsq.byFactor.df, dds.res_factor_level.shrunk, by = c("geneid"="geneid"))
#		counts.dds.res_factor_level <- results(counts.dds.dsq, contrast=c(factor1, ref_level, alt_level1)) 
#		counts.dds.res_factor_level$geneid <- rownames(counts.dds.res_factor_level)

	}

}

#Write to TSV
dds.dsq.byFactor.df <- dds.dsq.byFactor.df %>% select(geneid, everything())
write_delim(dds.dsq.byFactor.df, paste(dsq_pre, "itemized.de", sep="."), delim = "\t")
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
args = commandArgs(trailingOnly=TRUE)
# 1: counts file in
# 2: prefix for deseq out
# 3: name of contrast

source("utils/DEXSeq/Subread_to_DEXSeq/load_SubreadOutput.R")


library("dplyr")
library("yaml")
library("readr")
library("tidyr")
library("magrittr")
library("stringr")

counts_in <- args[1] #dxCounts/wtHousing.vs_dm6main.dm6_genes.mapspliceMulti.M.dxsq.counts
dsq_pre <- args[2] #diff_exon_use/dex_wtHousing/wtHousing.vs_dm6main.dm6_genes.mapspliceMulti.M
dsqx_gtf <- args[3] #utils/annotations/DEXSeq/dm6_genes.dxsqReady.gtf
nombre <- args[4] #dex_wtHousing
mPhatic <- args[5] #"chr3R_FBgn0004652-"
#mPhatic <- "chr3R_FBgn0004652-"
#use example:
#	Rscript scripts/deSeqer.R {input.fc_in} diff_exon_use/{wildcards.contrast}/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flag} {wildcards.contrast} {gene_id}
#	Rscript scripts/deSeqer.R dxCounts/wtHousing.vs_dm6main.dm6_genes.mapspliceMulti.M.dxsq.counts diff_exon_use/dex_wtHousing/wtHousing.vs_dm6main.dm6_genes.mapspliceMulti.M utils/annotations/DEXSeq/dm6_genes.dxsqReady.gtf dex_wtHousing "_FBgn0004652-"


#counts_in <- "counts/DEXSeq/wtHousing.vs_dm6main.dm6_genes.mapspliceMulti.M.dxsq.counts"
#dsqx_gtf <- "utils/annotations/DEXSeq/dm6_genes.dxsqReady.gtf"

#	build the sample DF
trammel <- read_yaml("config.yaml")
data_sets.df <- plyr::ldply(trammel$data_sets, data.frame)
data_sets.df$name <- as.factor(data_sets.df$name)
data_sets.df$day<- as.factor(data_sets.df$day)
data_sets.df$subgroups<- as.factor(data_sets.df$subgroups)
data_sets.df$rep<- as.factor(data_sets.df$rep)
data_sets.df$housing<- as.factor(data_sets.df$housing)
data_sets.df$genotype<- as.factor(data_sets.df$genotype)
data_sets.df$tissue<- as.factor(data_sets.df$tissue)
print("metadata frame built")

#	build the contrast experiment
contrasts.df <- plyr::ldply(trammel$xcontrasts, data.frame) %>% filter(name == nombre )
print("contrast defined")

# prep the sample metadata
sbgrp <- (contrasts.df %>% select(filt))[1,] %>% as.character
fact <- (contrasts.df %>% select(vars))[1,] %>% as.character
coldata.df <- data_sets.df%>% group_by(name) %>%  mutate(genotype = paste0(unique(as.character(genotype)), collapse = "." )) %>% ungroup()  %>% filter(subgroups == sbgrp) %>% as.data.frame() 
row.names(coldata.df) <- coldata.df$name
coldata.df %<>% dplyr::select(fact)
names(coldata.df) <- "condition"
print("experimental metadata selecteted")


coldata.df$condition <- factor(coldata.df$condition, ordered = FALSE)

coldata.df$condition = relevel(coldata.df$condition, ref = as.character(contrasts.df$ref))



#relevel 
#dxd.fc$condition <- factor(dxd.fc$condition, ordered=FALSE)
#eg, https://support.bioconductor.org/p/74520/
#dxd.fc$condition <- relevel(dxd.fc$condition, ref = as.factor(contrasts.df$ref))



# #counts.dds$condition <- relevel(counts.dds$condition, ref = "untreated")
# for (i in seq(nrow(contrasts.df))){
# 	var <- contrasts.df[i,]$vars %>% as.character()
# 	ruf <-contrasts.df[i,]$ref %>% as.character()
# 	counts.dds[[var]] <- relevel(counts.dds[[var]], ref = ruf)
# }















# Run DEXSeq
#desgn <- "~ sample + exon + condition:exon"
desgn <- as.character(contrasts.df$design[1])


dxd.fc = DEXSeqDataSetFromFeatureCounts(counts_in,flattenedfile = dsqx_gtf, sampleData =coldata.df, design= eval(parse(text=desgn)))
print("DEXSeq Dataset Built")













dxd.fc = estimateSizeFactors( dxd.fc )
print("size factors estimated")

dxd.fc = estimateDispersions( dxd.fc )
print("dispersion estimated")


png(filename = paste(dsq_pre,".dispersionPlot.png", sep=""), height=500, width=800)
plotDispEsts( dxd.fc , main = paste("DEXSeq Dispersion Plot: ",nombre," Contrast", sep=""))
dev.off()

dxd.fc = testForDEU( dxd.fc )
print("DEU tested for")


dxd.fc = estimateExonFoldChanges( dxd.fc, fitExpToVar="condition")
print("Exon fold changes estimated")


dxd.fc.results = DEXSeqResults( dxd.fc )
print("Results gathered")


png(filename = paste(dsq_pre,".",gsub("-","",gsub("\\+","",gsub("_","",mPhatic))),".differentialExonUsePlot.png", sep=""), height=500, width=800)
plotDEXSeq( dxd.fc.results, mPhatic, expression=FALSE, splicing=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2, FDR=0.01 )
dev.off()

png(filename = paste(dsq_pre,".",gsub("-","",gsub("\\+","",gsub("_","",mPhatic))),".differentialExonExpressionPlot.png", sep=""), height=500, width=800)
plotDEXSeq( dxd.fc.results, mPhatic, expression=TRUE, splicing=FALSE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 , FDR=0.01)
dev.off()


dxd.fc.results.df <- dxd.fc.results %>% as_tibble()


#Write to TSV
#dxd.fc.results.df <- dxd.fc.results.df %>% select(geneid, everything())
write_delim(dxd.fc.results.df %>% mutate(transcripts=as.character(transcripts)), paste(dsq_pre, "de", sep="."), delim = "\t")
print("Results written")

print("................... DONE!!!")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
import argparse
import gffutils


parser = argparse.ArgumentParser()
parser.add_argument("-i", "--gtf_in", help="gene model to cronch")
parser.add_argument("-o", "--output_prefix", help="file to write ORFs to")
parser.add_argument("-n", "--gene_name", help="name of gene to cronch ")
#parser.add_argument("-c", "--length_cutoff", help="ignore ORFs with fewer internal (ie, not start or stops) codons", type=int, default=75)
#parser.add_argument("-x", "--extend_across_gaps", help="extend the ORF even if it encounters a gap/N", action="store_true", default=False)
#parser.add_argument("-a", "--include_all_starts", help="report all ORFs when multiple start codons present (else choose longest)", action="store_true", default=False)

args = parser.parse_args()


file_out = args.output_prefix

#path2file="/proj/cdjones_lab/csoeder/VolkanLab_BehaviorGenetics/utils/annotations/fru.gtf"
path2file="/Users/csoeder/Research/VolkanLab_BehaviorGenetics/utils/annotations/fru.gtf"
path2file=args.gtf_in


fn = gffutils.example_filename(path2file)
print(open(fn).read()) 

#db = gffutils.create_db(fn, dbfn='/proj/cdjones_lab/csoeder/VolkanLab_BehaviorGenetics/test.db', force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True)
#db = gffutils.create_db(fn, dbfn='/proj/cdjones_lab/csoeder/VolkanLab_BehaviorGenetics/test.db', force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True, id_spec=['ID', 'Name'] ,disable_infer_genes=True)

path2gtf = "%s.db" % tuple([path2file])
#db = gffutils.create_db(fn, dbfn='/Users/csoeder/Research/VolkanLab_BehaviorGenetics/test.db', force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True, disable_infer_genes=True)
db = gffutils.create_db(fn, dbfn=path2gtf, force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True, disable_infer_genes=True)

gene_name="FBgn0004652"
gene_name=args.gene_name

gene = db[gene_name]


transcription_edges = []

exon_to_transcript = {}


for transcript in db.children(gene, featuretype='transcript'): 
	transcription_edges.append(transcript.start)
	transcription_edges.append(transcript.stop)
	for exon in db.children(transcript, featuretype='exon'): 
		if exon.start in exon_to_transcript.keys():
			exon_to_transcript[exon.start].append(transcript.id)
		else:
			exon_to_transcript[exon.start] = [transcript.id]
		if exon.stop in exon_to_transcript.keys():
			exon_to_transcript[exon.stop].append(transcript.id)
		else:
			exon_to_transcript[exon.stop] = [transcript.id]

# currently ignoring the *transcript* start/stop sites .... kick all of them out


startStop = []
for transcript in db.children(gene, featuretype='transcript'): 
	startStop.extend([transcript.start, transcript.stop])
startStop = list(set(startStop))

for term in startStop:
	exon_to_transcript.pop(term)



isoid_definitions = {}
isoid_num=0
for iso in exon_to_transcript.values():
	if iso in isoid_definitions.values():
		pass
	else:
		isoid_definitions["isoid_%s"%tuple([isoid_num])] = iso
		isoid_num +=1


isoids = {}
for key in isoid_definitions.keys():
	isoids[key]={"transcripts":isoid_definitions[key],"junctions":[]}
	for junk in exon_to_transcript.keys():
		if exon_to_transcript[junk] == isoid_definitions[key]:
			#print(junk, key, exon_to_transcript[junk])
			isoids[key]["junctions"].append(junk)




phial_name = "%s.isoids.gtf" % tuple([file_out])
phial_out = open(phial_name, "w")


for isoid_name in isoids.keys():
#isoid_name='isoid_0'
	isoid = isoids[isoid_name]
	j_list =isoid['junctions']

	j_list.sort()


	gene_gtf_list = ["edgeHog","gene",min(j_list), max(j_list), 0, gene.strand, ".", "gene_id '%s'; transcript_id '%s';\n" %tuple([isoid_name, isoid_name]) ]
	gene_gtf_str="\t%s"*len(gene_gtf_list) % tuple(gene_gtf_list)
	gene_gtf_str = "%s%s" % tuple([gene.chrom , gene_gtf_str])


	junk_strungs = []
	j_count=0
	for jay in j_list:
		junct_gtf_list = ["edgeHog","exon",jay, jay, 0, gene.strand, ".", "gene_id '%s'; transcript_id '%s'; junction_id j_%s;\n" %tuple([isoid_name, isoid_name, j_count]) ]
		junct_gtf_str="\t%s"*len(junct_gtf_list) % tuple(junct_gtf_list)
		junct_gtf_str = "%s%s" % tuple([gene.chrom , junct_gtf_str])
		junk_strungs.append(junct_gtf_str)
		j_count += 1


	phial_out.write(gene_gtf_str)
	for junk_line in junk_strungs:
		phial_out.write(junk_line)

phial_out.close()


phial_name = "isoids.list"
phial_name = "%s.isoids.list" % tuple([file_out])

phial_out = open(phial_name, "w")

for isoid_name in isoids.keys():
	for trans in isoids[isoid_name]["transcripts"]:
		phial_out.write("'%s'\t%s\n" %tuple([isoid_name, trans]))


phial_out.close()
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
args = commandArgs(trailingOnly=TRUE)
# 1: counts file in
# 2: prefix for deseq out
# 3: name of contrast


library("dplyr")
#library("yaml")
library("readr")
library("tidyr")
library("magrittr")

#library("stringr")

counts_in <- args[1] # counts from featureCount
#counts_in <- "counts/all.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.counts"
aln_meta <- args[2] # metadata for alignments (to get million mapped reads)
#aln_meta <- "summaries/alignments.vs_dm6main.mapspliceMulti.summary"
exp_out <- args[3] # output file prefix to write expression
#exp_out <- "expression.potato"

counts.df <- read_delim(counts_in, "\t", escape_double = FALSE, trim_ws = TRUE)

RPK.df <- counts.df %>% select(-c(Chr, Start, End, Strand)) %>% mutate(Length = Length/1000) # convert bases to kb
#turn counts into per-kb rates
#https://stackoverflow.com/a/48693978
RPK.df[,-(1:2)] %<>% sapply(`/`, RPK.df[,2])
RPK.df.gath <- RPK.df  %>% gather(key = "sample", value = "count", -one_of("Geneid", "Length"))


#collect the million reads mapped (proper pairs, given the current featureCounts settings)
aln_meta.df <- read_delim(aln_meta, "\t", escape_double = FALSE, trim_ws = TRUE,col_names = F)
names(aln_meta.df)<- c("ref_genome","aligner","sample","measure","value")

aln_meta.df %<>% filter(measure == "properly_paired_count") %<>% select(-c("measure", "aligner", "ref_genome"))


RPKM.df.gath <- inner_join(RPK.df.gath, aln_meta.df, by =c("sample"="sample"))

RPKM.df.gath %<>% mutate(MMR = value/1000000, expression = count/MMR )  %>% select(c("Geneid", "sample", "expression"))

RPKM.df.sprud <- RPKM.df.gath %>% spread(key = "sample", value = "expression")

write_delim(RPKM.df.sprud, paste(exp_out,".rpkm", sep=""), delim = "\t", col_names = T)





RPK.df <- counts.df %>% select(-c(Chr, Start, End, Strand)) %>% mutate(Length = Length/1000) # convert bases to kb
# TPM definition https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
#				https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/
RPK.df[,-(1:2)] %<>% sapply(`/`, RPK.df[,2])

RPK.gath.df <- RPK.df %>% select(-c("Length")) %>% gather(key = "sample", value = "rpk", -one_of("Geneid"))
RPK.gath.grupt.sum.df <- RPK.gath.df  %>% group_by(sample) %>% summarise(total=sum(rpk))

TPM.df <- inner_join(RPK.gath.df, RPK.gath.grupt.sum.df, by=c("sample"="sample")) %>% mutate(TPM = rpk/(total/1000000)) %>% select(-c("rpk","total")) %>% spread(key=sample, value=TPM) 

write_delim(TPM.df, paste(exp_out,".tpm", sep=""), delim = "\t", col_names = T)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
import json
#	https://stackoverflow.com/questions/19483351/converting-json-string-to-dictionary-not-list
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("json_in", help="JSON file to be condensed")
parser.add_argument("flat_out", help="flatfile summary")
parser.add_argument("-t", "--tag", help="line-name for the processed JSON", default=None)
args = parser.parse_args()

def prune_jason(jsn):
	sparse_dict = {}
	sparse_dict = dict([(k,{}) for k in ['prefiltered', 'postfiltered', 'filtration', 'duplication']])

	for i in ['gc_content','q20_rate', 'q30_rate', 'read1_mean_length','total_reads']:
		try:
			sparse_dict['prefiltered'][i] =  jsn['summary']['before_filtering'][i] 
		except KeyError:
			sparse_dict['prefiltered'][i] = "NA"

		try:
			sparse_dict['postfiltered'][i] =  jsn['summary']['after_filtering'][i] 
		except KeyError:
			sparse_dict['postfiltered'][i] = "NA"

	for i in ['adaptor_trimmed_reads']:
		try: 
			sparse_dict['filtration'][i] = jsn['adapter_cutting']['adapter_trimmed_reads']
		except KeyError:
			sparse_dict['filtration'][i] = "NA"

	for i in ['corrected_reads','low_quality_reads','passed_filter_reads']:
		try:
			sparse_dict['filtration'][i] = jsn['filtering_result'][i]
		except KeyError:
			sparse_dict['filtration'][i] = "NA"

	for i in ['rate']:
		sparse_dict['duplication'][i] = jsn['duplication'][i]

	return sparse_dict

def print_pruned(prn):
	lines2write = [ [x, y, prn[x][y]]  for x in prn.keys() for y in prn[x]]
	if args.tag:
		[ ell.insert(0, args.tag) for ell in lines2write ]
	phial_out = open(args.flat_out, 'w')
	for preline in lines2write:
		field_count = len(preline)
		line = ("%s" + "\t%s"*(field_count-1) + "\n") % tuple(preline)
		phial_out.write(line)
	phial_out.close()

jason_file=open(args.json_in)
jason_str = jason_file.read()
jason = json.loads(jason_str)

jason_condensed = prune_jason(jason)
print_pruned(jason_condensed)
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
args = commandArgs(trailingOnly=TRUE)

library("readr")
library("biomaRt")
library("org.Dm.eg.db")
library("topGO")
library("dplyr")
library("yaml")
library("stringr")
library("magrittr")
data("geneList")

deSeq_file <- args[1]
nombre <- args[2]
go_out <- args[3]
config <- args[4]

#nombre <- "hausWtVsMut"


sig_thresh <- 0.01

#hausWtVsMut_vs_dm6main_dm6_genes_mapspliceMulti_MpBC_itemized <- read_delim("diff_expr/hausWtVsMut/hausWtVsMut.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.itemized.de", "\t", escape_double = FALSE, trim_ws = TRUE)
DESeq.itemized<- read_delim(deSeq_file, "\t", escape_double = FALSE, trim_ws = TRUE)

trammel <- read_yaml(config)
#trammel <- read_yaml("config.yaml")
#	build the contrast experiment
contrasts.df <- plyr::ldply(trammel$contrasts, data.frame) %>% filter(name == nombre )
#get the factors in the design
desgn <- as.character(contrasts.df$design[1])
desgn.splt <- (desgn %>% str_replace_all("~","")%>% str_replace_all(" ","") %>% strsplit("+", fixed=TRUE))[[1]]

query_genes.list <- DESeq.itemized$geneid %>% unique() %>% as.list()

#ensembl_dm = useMart("ensembl",dataset="dmelanogaster_gene_ensembl")
marty <- useDataset("dmelanogaster_gene_ensembl",  useMart("ensembl",  host = "useast.ensembl.org") )
G_list <- getBM(attributes= c("flybase_gene_id", "ensembl_gene_id", "external_gene_name", "go_id", "name_1006"), mart= marty, useCache= FALSE) %>% mutate(go_name = name_1006) %>% select(-c("name_1006"))
#G_list <- getBM(attributes= c("flybase_gene_id", "ensembl_gene_id", "external_gene_name", "go_id", "name_1006"), mart= marty) %>% mutate(go_name = name_1006) %>% select(-c("name_1006"))


### ### ### ### ### ### ### ### 
### plz fix this
go_bad <- c("GO:0110109","GO:0120176","GO:0120177","GO:0120170","GO:0062023")
bad_boiz <- G_list %>% filter(go_id %in% go_bad) %>% dplyr::select("flybase_gene_id")
bad_boiz <- rbind(bad_boiz, "FBgn0050046", "FBgn0033710",  "FBgn0050203", "FBgn0026721")
#mask the baddies
G_list.unq <- G_list %>% filter(!flybase_gene_id %in% bad_boiz$flybase_gene_id ) %>% dplyr::select("flybase_gene_id") %>% unique()
DESeq.itemized %<>% filter(!(geneid %in% bad_boiz$flybase_gene_id))
### ### ### ### ### ### ### ### 

#present_gene_list <- G_list.unq$flybase_gene_id %in% DESeq.itemized$geneid %>% as.numeric()
#names(present_gene_list) <- G_list.unq$flybase_gene_id

jeanOnt.mega.df <- as.data.frame(c())

#for each factor in factors....

for (fact in desgn.splt){

	ref_lev <- (contrasts.df %>% filter(vars == fact))$ref
	#all_levs<-levels(counts.dds.dsq[[fact]])

	#subset the DE data to the factor
#	DESeq.itemized.factor <- DESeq.itemized[,!is.na(str_locate(names(DESeq.itemized), paste(fact,".vs_" ,sep=""))[,1])]
	DESeq.itemized.factor <- DESeq.itemized[,!is.na(str_match(names(DESeq.itemized), paste(fact,".vs_(.*?).apeglm" ,sep=""))[,1])]

	#pull out the alt levels from the data
#	alt_levs<- (str_split_fixed(names(DESeq.itemized.factor), paste(fact,".vs_", sep=""), n=2)[,2] %>% str_split_fixed("\\.", n=2))[,1] %>% unique()

	noms <- str_match(names(DESeq.itemized), paste(fact,".vs_(.*?).apeglm" ,sep=""))[,2]
	alt_levs <- unique(noms[!is.na(noms)])
	DESeq.itemized.factor$geneid <- DESeq.itemized$geneid

	#	#for each alt_level in factor.levels:

	for (alt in alt_levs) {
		print(alt)
		#subset the DE data to that level
		DESeq.itemized.level <- DESeq.itemized.factor[,!is.na(str_locate(names(DESeq.itemized.factor), paste(fact,"\\.vs_",alt,"\\." ,sep=""))[,1])]
		DESeq.itemized.level$geneid <- DESeq.itemized.factor$geneid
		#pull out the genes & significance values
		sig_gene_list <- DESeq.itemized.level[paste("padj.",fact,".vs_",alt,".apeglm",sep="")][[1]]# < sig_thresh)[,1] #%>% as.numeric()
		sig_gene_list[is.na(sig_gene_list)] <- 1
		names(sig_gene_list) <- DESeq.itemized.level$geneid

		jeanOnt.df <- as.data.frame(c())

		for (ont in c("MF","BP","CC") ){

			#topDiffGenes uses p cutoff of 0.01
			GOdata <- new("topGOdata", ontology = ont, allGenes = sig_gene_list, geneSel = topDiffGenes, annot = annFUN.org, mapping = "org.Dm.eg.db", ID="Ensembl")
			fishy <- runTest(GOdata, algorithm = "classic", statistic = "fisher")
			kissy <- runTest(GOdata, algorithm = "classic", statistic = "ks")

			goRes.tmp <- GenTable(GOdata, classicFisher =fishy, classicKS = kissy,  topNodes = 50) %>% mutate(ontology = as.factor(ont), classicFisher = as.numeric(classicFisher), classicKS = as.numeric(classicKS)) %>% select(-c(Term), Term) %>% as_tibble() 
			goRes.tmp %<>% left_join(G_list%>% select(c("go_id", "go_name")) %>% unique(), by =c("GO.ID" = "go_id")) %>% select(-c("Term"))
			jeanOnt.df <- rbind(jeanOnt.df, goRes.tmp)
			#store & bind the topGO results. 

		}

		jeanOnt.df$variable <- as.factor(fact)
		jeanOnt.df$alt_level <- as.factor(alt)

		jeanOnt.mega.df <- rbind(jeanOnt.mega.df,jeanOnt.df)

	}

}

write.table(jeanOnt.mega.df, file=args[3], row.names=FALSE, col.names = TRUE, sep = "\t", quote=F)
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
import pysam
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-i", "--samfile_in", help="file in SAM format to ")
# parser.add_argument("-i", "--idxstat_in", help="samtools idxstat report")
# parser.add_argument("-g", "--split_coverage", help="bedtools genomecov report, with split reads")
# parser.add_argument("-G", "--spanning_coverage", help="bedtools genomecov report, with spanning reads")
# parser.add_argument("-d", "--depthstats_in", help="samtools depth report")
# parser.add_argument("-D", "--dupestats_in", help="samtools markdup stats report")
# parser.add_argument( "-m", "--mapping_multiplicity", help="histogram of mapping multiplicity scraped from the IH:: tags")
# parser.add_argument( "-c", "--mapped_count", help="count of mapped reads")
parser.add_argument("-o", "--samfile_out", help="flatfile summary")
# parser.add_argument("-t", "--tag", help="line-name for the flatfile", default=None)
args = parser.parse_args()

sam_in = args.samfile_in
sammy = pysam.AlignmentFile(sam_in,"rb")  


sam_out = args.samfile_out
samuel = pysam.AlignmentFile(sam_out, "w", template=sammy)


# cigartuples encoding:
# M = 0
# I = 1
# D = 2
# N = 3
# S = 4


for read in sammy.fetch():
	cig = read.cigar
	nu_cig = []
	suffix = []
	if cig[0][0] == 4:
		cig = cig[1:]
	if cig[-1][0] == 4:	#discard the softclips
		cig = cig[:-1]
	front_pad_tally = 0
	while cig[0][0] in [0,1,2]:
		front_pad_tally += cig[0][1]
		cig.pop(0)
	back_pad_tally = 0
	while cig[-1][0] in [0,1,2]:
		back_pad_tally += cig[-1][1]
		cig.pop(-1)

	nu_cig.extend([(3,front_pad_tally-1),(0,1)]) # new cigar starts with 5' end clipped down to the actual splice

	# if cig[0][0] == 0: #take the terminal match blocks
	# 	num_match = cig[0][-1]
	# 	prefix = [(3,num_match-1),(0,1)]
	# 	nu_cig.extend(prefix) #add a bunch of Ns to pad ; ignore 5' and 3' read ends. 
	# 	cig = cig[1:]
	# if cig[-1][0] == 0:
	# 	num_match = cig[-1][1]
	# 	suffix = 
	# 	cig = cig[:-1]

	rill_tally = 0
	mids = []
	for rillo in cig:
		if rillo[0] in [0,1,2]:
			rill_tally += rillo[1] #consolidate exons
		else:
			if rill_tally > 0:
				mids.extend([(0,1),(3,rill_tally-2),(0,1)]) #write hollowed exon
			mids.extend([rillo]) # write intron
			rill_tally=0 # reset tally
	nu_cig.extend(mids)
	nu_cig.extend([(0,1),(3,back_pad_tally-1),]) # end cigar with splice and Ns to 3' end

	read.cigar = nu_cig
	read.seq = "*"
	samuel.write(read)

samuel.close()
60
61
62
63
64
65
66
67
68
run:
	shell(""" mkdir -p results/figures/; touch results/figures/null.png; for fig in results/figures/*png; do mv $fig $(echo $fig| rev | cut -f 2- -d . | rev ).$(date +%d_%b_%Y).png; done;  rm results/figures/null.*.png; """)
	shell(""" mkdir -p results/figures/supp/ ; touch results/figures/supp/null.png; for fig in results/figures/supp/*png; do mv $fig $(echo $fig| rev | cut -f 2- -d . | rev ).$(date +%d_%b_%Y).png; done; rm results/figures/supp/null.*.png; """)

	shell(""" mkdir -p results/tables/ ; touch results/tables/null.tmp ; for phial in $(ls -p results/tables/ | grep -v /); do pre=$(echo $phial | rev | cut -f 2- -d . | rev ); suff=$(echo $phial | rev | cut -f 1 -d . | rev ); mv results/tables/$phial results/tables/$pre.$(date +%d_%b_%Y).$suff; done ; rm results/tables/null.*.tmp; """)
	shell(""" mkdir -p results/tables/supp/ ; touch results/tables/supp/null.tmp ; for phial in $(ls -p results/tables/supp/ | grep -v /); do pre=$(echo $phial | rev | cut -f 2- -d . | rev ); suff=$(echo $phial | rev | cut -f 1 -d . | rev ); mv results/tables/supp/$phial results/tables/supp/$pre.$(date +%d_%b_%Y).$suff; done ; rm results/tables/supp/null.*.tmp; """)

	shell(""" mv results/VolkanLabBehaviorGenetics.full.pdf results/VolkanLab_BehaviorGenetics.full.$(date +%d_%b_%Y).pdf """)
	shell(""" tar cf results.full.$(date +%d_%b_%Y).tar results/ """)
79
80
81
82
83
84
85
86
87
88
run:
	print("%s%s" % tuple(["FASTQ/", wildcards.path]))
	try:
		sra = sra_by_fqpath["%s%s/" % tuple(["FASTQ/", wildcards.path])]
		shell(""" mkdir -p FASTQ/{wildcards.path}/ """)
		shell("""
			fasterq-dump  --split-3 --outdir FASTQ/{wildcards.path}/ {sra}
		""")
	except KeyError:
		raise KeyError("Sample is listed as empirical but no reads found and no SRA to download!" )
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
run:
	shell("mkdir -p utils/")
	shell(
		""" 
		cd utils;
		wget http://protocols.netlab.uky.edu/~zeng/MapSplice-v2.1.8.zip ;
		unzip MapSplice-v2.1.8.zip;
		2to3 -w MapSplice-v2.1.8/;
		cd MapSplice-v2.1.8/;
		make;
		cd ..;
		rm MapSplice-v2.1.8.zip;
		"""
	)		
121
122
123
124
125
126
run:
	fai_path = ref_genome_by_name[wildcards.ref_genome]['fai'],
	shell("mkdir -p utils")
	shell(
		'bedtools makewindows -w {wildcards.window_size} -s {wildcards.slide_rate} -g {fai_path} -i winnum | bedtools sort -i - > {output.windowed}'
	)
135
136
137
138
139
run:
	shell("mkdir -p utils/annotations")
	base_path = annotation_by_name[genelist_by_name[wildcards.subname]["base"]]["bed_path"]
	list_file = genelist_by_name[wildcards.subname]["list_path"]
	shell(""" grep -wFf <( tail -n +2 {list_file} | cut -f 2 ) {base_path} > {output.annot_out} """)
SnakeMake From line 135 of master/Snakefile
150
151
152
153
154
shell:
	"""
	mkdir -p summaries/reference_genomes/
	cat {input.fai_in} | awk '{{sum+=$2}} END {{ print "number_contigs\t",NR; print "number_bases\t",sum}}' | sed -e 's/^/{wildcards.ref_gen}\t/g' > {output.report_out};
	"""
SnakeMake From line 150 of master/Snakefile
165
166
shell:
	"cat {input.refgen_reports} > {output.refgen_summary}"
SnakeMake From line 165 of master/Snakefile
178
179
180
181
182
183
run:
	shell(""" mkdir -p summaries/annotations/ """)
	shell(""" rm -f {output.report_out} """)
	shell(""" cat {input.annot} | cut -f 1 | grep -v "chr2110000222\|chrmitochondrion\|Un\|rand\|chrrDNA" | sort | uniq -c | tr -s " " | tr " " "\t" | awk '{{print"count\t"$2"\t"$1}}' >> {output.report_out} """)
	shell(""" cat {input.annot} | wc -l | awk '{{print"count\ttotal\t"$0}}' >> {output.report_out} """)
	shell(""" cat {input.annot} | awk '{{print$3-$2;}}' | awk '{{sum+=$1; sumsq+=$1*$1}} END {{ print "size\ttotal\t",sum; print "size\tavg\t",sum/NR; print "size\tstd\t",sqrt(sumsq/NR - (sum/NR)**2)}}'  >> {output.report_out} """)
SnakeMake From line 178 of master/Snakefile
195
196
197
198
199
run:
	print([a["name"] for a in config['annotations'] if not a["derived"] ])
	shell(""" rm -f {output.refann_summary} """)
	for anne in [a["name"] for a in config['annotations'] if not a["derived"] ]:
		shell(""" cat summaries/annotations/{anne}.stats | awk '{{print"{anne}\t"$0}}' >> {output.refann_summary}""")
SnakeMake From line 195 of master/Snakefile
213
214
run:
	shell(""" python scripts/edgeHog.py --gtf_in $(pwd)/utils/annotations/fru.gtf -o utils/annotations/fru -n FBgn0004652 """)
SnakeMake From line 213 of master/Snakefile
227
228
229
230
231
232
233
run:
	shell(""" rm -f {output.statsOut} """)
	shell(""" echo "count	total	$( tail -n +2 {input.list_in} | wc -l  )" >> {output.statsOut} """)
	annot_path = annotation_by_name[genelist_by_name[wildcards.listicle]["base"]]["bed_path"]
	annot_name = annotation_by_name[genelist_by_name[wildcards.listicle]["base"]]["name"]
	shell(""" echo "count\tannotated\t$(cat {input.bed_in} | wc -l )">> {output.statsOut} """)
	shell(""" cat {input.bed_in} | awk '{{print$3-$2;}}' | awk '{{sum+=$1; sumsq+=$1*$1}} END {{ print "size\ttotal\t",sum; print "size\tavg\t",sum/NR; print "size\tstd\t",sqrt(sumsq/NR - (sum/NR)**2)}}'  >> {output.statsOut} """)
SnakeMake From line 227 of master/Snakefile
245
246
247
248
run:
	for listicle in genelist_by_name.keys():
		annot = annotation_by_name[genelist_by_name[listicle]["base"]]["name"]
		shell(""" cat summaries/geneLists/{listicle}.stats | awk '{{print"{listicle}\t{annot}\t"$0}}' >> {output.allStats} """)
SnakeMake From line 245 of master/Snakefile
268
269
shell:
	"/nas/longleaf/home/csoeder/modules/fastp/fastp {params.common_params} {params.pe_params} --in1 {input.fileIn[0]} --out1 {output.fileOut[0]} --in2 {input.fileIn[1]} --out2 {output.fileOut[1]}"
284
285
286
287
288
shell:
	"""
	cp {input.jason} summaries/FASTP/{wildcards.samplename}.json
	python3 scripts/fastp_reporter.py {input.jason} {output.jason_pruned} -t {wildcards.samplename}
	"""
SnakeMake From line 284 of master/Snakefile
301
302
shell:
	"cat {input.jasons_in} > {output.summary}"
SnakeMake From line 301 of master/Snakefile
318
319
320
321
322
323
324
325
326
327
328
329
run:
	ref_genome_split = ref_genome_by_name[wildcards.ref_genome]['split'],
	ref_genome_bwt = ref_genome_by_name[wildcards.ref_genome]['bowtie'],
	shell(""" mkdir -p mapped_reads/mapspliceRaw/{wildcards.sample}/ summaries/BAMs/""")
	shell(""" 
		python utils/MapSplice-v2.1.8/mapsplice.py -c {ref_genome_split} -x {ref_genome_bwt} -1 {input.reads_in[0]} -2 {input.reads_in[1]} -o mapped_reads/mapspliceRaw/{wildcards.sample} 
		""")
	shell(""" 
		samtools view -Sbh mapped_reads/mapspliceRaw/{wildcards.sample}/alignments.sam | samtools sort - > mapped_reads/mapspliceRaw/{wildcards.sample}/{wildcards.sample}.vs_{wildcards.ref_genome}.mapspliceRaw.sort.bam;
		samtools index {output.raw_bam};
		rm mapped_reads/mapspliceRaw/{wildcards.sample}/alignments.sam;
		""")
346
347
348
349
350
351
run:
	shell(""" mkdir -p mapped_reads/mapspliceMulti/{wildcards.sample}/ summaries/BAMs/""")
	shell(""" 
		samtools sort -n {input.raw_bam} | samtools fixmate -m - - | samtools sort - | samtools markdup {params.dup_flg} - - | samtools view -bh {params.quality} > {output.multi_bam} ;
		samtools index {output.multi_bam} 
		""")
366
367
368
369
370
371
run:
	shell(""" mkdir -p mapped_reads/mapspliceUniq/{wildcards.sample}/ summaries/BAMs/""")
	shell(""" 
		samtools view {input.multi_bam} | grep -w "IH:i:1" | cat <( samtools view -H {input.multi_bam} ) - | samtools view -Sbh > {output.uniq_bam};
		samtools index {output.uniq_bam}
		""")
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
run:
	shell(""" mkdir -p mapped_reads/mapspliceRando/{wildcards.sample}/ summaries/BAMs/""")
	shell(""" 

		samtools sort -n {input.multi_bam} | samtools view | grep -v "IH:i:[01]" > {output.rando_bam}.multi.unsampled

		rm -f {output.rando_bam}.multi.sampled
		for idx in $(cut -f 1 {output.rando_bam}.multi.unsampled | sort | uniq); do 
			grep -w $idx {output.rando_bam}.multi.unsampled | sort  --random-sort | head -n 1 >> {output.rando_bam}.multi.sampled
		done;

		cat <( samtools view -h {input.uniq_bam} ) {output.rando_bam}.multi.sampled | samtools view -Sbh | samtools sort - > {output.rando_bam};
		samtools index {output.rando_bam}
		rm {output.rando_bam}.multi.unsampled {output.rando_bam}.multi.sampled
		""")
416
417
418
419
420
421
422
423
424
425
run:
	shell(""" mkdir -p mapped_reads/{wildcards.aligner}_SpliceOnly/{wildcards.sample}/ summaries/BAMs/""")
	shell(""" 
		cat <(samtools view -H {input.in_bam}) <(samtools view {input.in_bam}| awk '($6 ~ /N/)' | less) | samtools view -hbS > {output.out_bam}.tmp;
		samtools index {output.out_bam}.tmp;
		python scripts/read_Hollower.py -i {output.out_bam}.tmp -o {output.out_bam}.tmp.sam;
		cat {output.out_bam}.tmp.sam |  sed -e 's/\tN\t/\t*\t/g' | samtools view -hbS > {output.out_bam};
		samtools index {output.out_bam};
		rm {output.out_bam}.tmp {output.out_bam}.tmp.sam;
		""")
440
441
442
443
	run:
 		ref_genome_idx=ref_genome_by_name[wildcards.ref_genome]['fai']
 		shell(""" which samtools """)
 		shell("samtools idxstats {input.bam_in} > {input.bam_in}.idxstats")
475
476
shell:
	"""cat {input.bam_reports} | awk '{{print "{wildcards.ref_genome}\t{wildcards.aligner}\t"$0}}'> {output.full_report}"""
SnakeMake From line 475 of master/Snakefile
491
492
493
494
run:
	fai = ref_genome_by_name[wildcards.ref_genome]['fai']
	shell(""" mkdir -p summaries/coverage/ """)
	shell(""" samtools view -hb {input.bam_in} {params.location} | bedtools genomecov -split -bga -ibam - -g {fai} > {output.bg_out} """)
510
511
run:
	shell(""" touch {output} """)
SnakeMake From line 510 of master/Snakefile
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
run:
	flug_str = wildcards.flags
	flug_lst = [ s for s in flug_str if s != "_"]
	flug_str = " -%s "*len(flug_lst) % tuple(flug_lst)


	sed_suff = ".vs_%s.%s.sort.bam" % tuple([wildcards.ref_genome, wildcards.aligner])		
	sed_cmd =  " sed -e 's/mapped_reads\/[a-zA-Z0-9\/_]*\///g' | sed -e 's/%s//g' " % tuple([sed_suff])

	annot_gtf = annotation_by_name[wildcards.annot]["gtf_path"]
	shell(""" mkdir -p counts summaries/counts/""")
	shell("""
		featureCounts {flug_str} {params.fc_params} -t exon -g gene_id -F GTF -a <(cat {annot_gtf} | awk '{{print"chr"$0}}') -o {output.counted_features}.tmp {input.alignments_in}
		""")
	shell("""
		cat counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.tmp | tail -n +2 | {sed_cmd} > {output.counted_features}
		cat counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.tmp.jcounts | {sed_cmd} > counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.jcounts
		cat counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.tmp.summary | {sed_cmd} > counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.summary
		rm counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.tmp* 
		""")
563
564
565
566
567
run:
	shell("""
		head -n 1 {input.sum_in} > {output.count_sum};
		cat {input.sum_in} | grep -w "Assigned\|Unassigned_NoFeatures\|Unassigned_Ambiguity" >> {output.count_sum};
		""")
SnakeMake From line 563 of master/Snakefile
583
584
585
run:
	shell(""" mkdir -p expression/ """)
	shell(""" Rscript  scripts/expressionOmete.R {input.counted_features} {input.aln_rprt} expression/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags} """)
SnakeMake From line 583 of master/Snakefile
599
600
601
run:
	shell(""" mkdir -p utils/ """)
	shell(""" touch {output.expFlg} """)
SnakeMake From line 599 of master/Snakefile
617
618
619
620
621
622
run:
	#cont_sbgrp = contrasts_by_name[wildcards.contrast]["filt"]
	shell(""" mkdir -p diff_expr/{wildcards.contrast} """)
	shell(""" Rscript scripts/deSeqer.R {input.fc_in} diff_expr/{wildcards.contrast}/{wildcards.contrast}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flag} {wildcards.contrast} config.yaml """)
	#shell(""" mkdir -p meta/DESeq2_methods/ """)
	#shell(""" mv diff_expr/{wildcards.contrast}/{wildcards.contrast}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flag}.method meta/DESeq2_methods/ """)
SnakeMake From line 617 of master/Snakefile
637
638
639
640
run:
	#cont_sbgrp = contrasts_by_name[wildcards.contrast]["filt"]
	shell(""" mkdir -p gene_ont/{wildcards.contrast}/ """)
	shell(""" Rscript scripts/geneOntologe.R {input.deSq_itemized} {wildcards.contrast} {output.deSq_go} config.yaml """)
SnakeMake From line 637 of master/Snakefile
656
657
658
659
run:
	#cont_sbgrp = contrasts_by_name[wildcards.contrast]["filt"]
	shell(""" mkdir -p correlations/{wildcards.correlation}/ """)
	shell(""" Rscript scripts/correlate.R {input.xpr_in} {wildcards.correlation} correlations/{wildcards.correlation}/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flag} """)
SnakeMake From line 656 of master/Snakefile
673
674
675
676
677
run:
	shell(
		"""
		mkdir -p utils/DEXSeq/;
		""")
SnakeMake From line 673 of master/Snakefile
702
703
704
705
706
707
run:
	shell(
		"""	
			mkdir -p utils/annotations/DEXSeq/;
			python utils/DEXSeq/Subread_to_DEXSeq/dexseq_prepare_annotation2.py {params.prep_params} <( cat {input.anne}| awk '{{print"chr"$0}}' ) -f {output.fc_ready} {output.nu_gff}
		""")
SnakeMake From line 702 of master/Snakefile
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
run:
	flug_str = wildcards.flags
	flug_lst = [ s for s in flug_str if s != "_"]
	flug_str = " -%s "*len(flug_lst) % tuple(flug_lst)

	sed_suff = ".vs_%s.%s.sort.bam" % tuple([wildcards.ref_genome, wildcards.aligner])		
	sed_cmd =  " sed -e 's/mapped_reads\/[a-zA-Z0-9\/_]*\///g' | sed -e 's/%s//g' " % tuple([sed_suff])

	shell(""" mkdir -p dxCounts summaries/dxCounts/""")
	shell("""
		featureCounts {flug_str} {params.fc_params} -t exon -g gene_id -F GTF -a {input.dxq_ann} -o {output.counted_features}.tmp {input.alignments_in}
		""")
	shell("""
		cat {output.counted_features}.tmp | tail -n +2 | {sed_cmd} > {output.counted_features}
		cat {output.counted_features}.tmp.summary | {sed_cmd} > {output.count_stats}
		rm {output.counted_features}.tmp* 
		""")
	##cat {output.counted_features}.tmp.jcounts | {sed_cmd} > counts/DEXSeq/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.anne}.{wildcards.aligner}.{wildcards.flags}.counts.jcounts
756
757
758
759
run:
	#cont_sbgrp = contrasts_by_name[wildcards.contrast]["filt"]
	shell(""" mkdir -p diff_exon_use/{wildcards.contrast}/ """)
	shell("""  Rscript scripts/dexSeqe.R {input.counted_features} diff_exon_use/{wildcards.contrast}/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.anne}.{wildcards.aligner}.{wildcards.flags} {input.dxq_ann} {wildcards.contrast} 'chr3R_FBgn0004652-'  """)
SnakeMake From line 756 of master/Snakefile
790
791
792
793
794
run:
	pandoc_path="/nas/longleaf/apps/rstudio/1.0.136/bin/pandoc"
	pwd = subprocess.check_output("pwd",shell=True).decode().rstrip()+"/"
	shell("""mkdir -p results/figures/supp/ results/tables/supp/""")
	shell(""" R -e "setwd('{pwd}');Sys.setenv(RSTUDIO_PANDOC='{pandoc_path}')" -e  "peaDubDee='{pwd}'; rmarkdown::render('scripts/RNAseq_results.Rmd',output_file='{pwd}{output.pdf_out}')"  """)
SnakeMake From line 790 of master/Snakefile
ShowHide 35 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/csoeder/VolkanLab_BehaviorGenetics
Name: volkanlab_behaviorgenetics
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...