ATAC-Seq and ChIP-Seq Analysis Pipeline for Chromatin Accessibility

public public 1yr ago Version: 2 0 bookmarks

ATAC-Seq-Pipeline

Project Overview

This pipeline is designed to take a set of input BAM files and perform ATAC-seq/chip-seq analysis on the data. The pipeline will determine accessible regions of chromatin, motif sequences within those regions, indicative of transcription factor binding sites (TFBS), compare those TFBS to known TFBS from the JASPAR database, align those sequences to the genome, create tracks to dynamically visualize the data, and perform statistical analysis to verify our results.

The entire pipeline is run using the Python tool "Snakemake". Snakemake automatically determines missing input files, conda environment requirements/dependencies, and more. Below is a diagram of the files in this directory and there importance. Most notably, " example_files " contains a subset of the formated data you should expect from the inputs and outputs of this pipeline. "Config" contains a series of parameters and the species reference genome, allowing the user to quickly test how different parameters modify the final results. "Results" contains all of the final outputs in folders separated for each BAM file input into the pipeline. "Workflow" contains the scripts and functions used to run the pipeline in a streamline manner. We do not advise modifying the "Workflow" folder. Instead, rely on the configuration files and Snakemake parameters to modify the pipeline to your needs, if possible.

ImageOfProjectDirectory

Requirements

Snakemake is the only required installation to run this pipeline. Configuration files within this project will automatically be installed using Snakemake based on the dependencies that existed during the creation of this project. In short, we automatically install all of the requirements for you, exactly when each script calls on a dependency. The following commands will install Snakemake to a separate environment for you:

Option 1: Standard Conda Installation

  • conda activate base

  • conda create -c conda-forge -c bioconda -n snakemake snakemake

  • conda activate snakemake

  • snakemake --help

Option 2: Mamba Installation

Mamba is a faster version of conda that snakemake recommends. Both Conda and Mamba work equally well in our experience.

  • conda install -n base -c conda-forge mamba

  • conda activate base

  • mamba create -c conda-forge -c bioconda -n snakemake snakemake

  • conda activate snakemake

  • snakemake --help

If you have any issues with installation, see more info at: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html

Workflow Overview

ImageOfWorkflow

This repository assumes you have already processed your raw fastq files, and your data is formatted as a series of .bam files, one per sample. Afterwards we use MACS2 to perform Peak Calling , a technique that identifies accessible regions of chromatin in DNA. From there we perform Motif Identification to determine putative transcription factor binding sites with these peaks. JASPAR is used as a database of known transcription factors for Motif Comparison . Site Matching is performed to align these sequences against our FASTA sequences to determine which transcription factors bind upstream to which genes. Finally, all of these files can be incorperated together into a Track Based Visualization like the example seen below:

Track

The entire pipeline can be run using Snakemake. If you encounter difficulties, documentation for each command can be seen in the sections below.

How to Use

To run the entire pipeline, use the following commands:

  • cd workflow

  • sh scripts/createCondaConfigurations.sh

  • snakemake ../results/{"A","B","C","D","E"}.results/matchingSiteTable.csv --use-conda --cores N

Where A, B, C, D, and E are any number of BAM files you wish to analyze, delimited by commas. --cores N is used to indicate the number of cores you wish to use. Snakemake will automatically determine which steps of the pipeline can be performed based on the files that have been created, and start a new processes using new cores to optimize the speed of the workflow.

The pipeline must be run from within the "workflow" directory (where Snakefile is located). The shell scripts (scripts/createCondaConfigurations) will download all necessary dependencies in a series of conda environments for you. We've precreated several conda configuration files that perfectly replicate the original conditions of the experiment on macOS. If you would rather try those dependencies, go into "config/config.yaml" and change the "perfectCondaReplication" parameter to "TRUE". Additional parameters for the pipeline can be modified in the "config/config.yaml" file, including file types, p-value thresholds, gtf files, and more. The location of sample BAM files {A,B,C,D,E}, should also be modified in this location before running the pipeline.

Once finished, your tracks.ini file can be used to visualize open chromatin regions side-by-side with bp gene locations. When you want to plot a region of the genome, use: pyGenomeTracks --tracks tracks.ini --region chr15:2,500,000-2,800,000 - bigwig.png , with variables adjusted to visualize the correct chromosome region. I recommend using a base pair range of ~300,000 so that the gene names are easily readible.

The tracks.ini file has many visual customization options. Here is some example code to give you an idea of what objects you might want to include: https://pygenometracks.readthedocs.io/en/latest/content/examples.html

ImageOfWorkflow

Each of the programs, inputs, and outputs for this program can be visualized in the diagrams seen above. Each step of the pipeline has been color coded based on the goal we are trying to achieve (See matching color diagram in "Workflow Overview"). If you choose to perform the pipeline manually, there is no set order when executing each script. Instead, it is easier to follow the arrows on the diagram, and ensure each required input exists in your directory before executing the next script ( Note that Snakemake does this for you! ). In the section below, you will find two sections for Scripts and Input/Output Files , each containing explanations and requirements for each object in the diagram. The diagram below is a copy of the diagram above, except Input/output files have been colored yellow, scripts are colored blue, and databases are colored red.

Scripts

ImageOfWorkflow

MACS2

Command: macs2 callpeak -t {sample} -g {params.genomeType} -f {params.peakFileType} -p {params.peakPVal} --seed {params.seed} --outdir {params.dirName}

For the command above, {sample} refers to the location of the BAM files you're processing.

MACS2 is our peakcalling method. Open chromatin regions can be identified by regions where reads have piled up more than the background coverage.

--'t' is the parameter used for our input file.

--'g hs' identifies that we're using a human sequence.

--'f' identifies our input file type, in this case bam.

--'p' is our pvalue for identifying if the piled reads are significantly more accessible than the background coverage. These results are tabulated in a narrowPeaks file for future usage.

--'seed' is a randomly generated started value which can be used to create reproducible results.

--'bdg' indicates that we'd like to also create a bedgraph file, which we later convert to a bigwig file, a common file type for visualizing open chromatin regions. The existing pipeline uses a slightly different approach, and we will instead create a bigwig file using the BamCoverage command from DeepTools.

--'outdir' is our ouput directory. By including a dot, '.', we are telling the computer to send all output files to the current directory.

The final output is a narrowPeak file listing a series of peaks and some information about the significance/relative fold change of each peak. An example of the output can be found here: https://github.com/CalebPecka/ATAC-Seq-Pipeline/blob/master/ example_files /NA_peaks%5BEXAMPLE%5D.narrowPeak

NarrowPeakSummitTracker.R

Command: Rscript scripts/NarrowPeakSummitTracker.R {input.gtf} {input.NApeak} {output}

The R scripts in this pipeline function off of an ordered series of arguments that is passed to the script.

--'{input.gtf}' is a precreated file in the GitHub repository that lists the locations of genes relative to the Hg38 Homo sapiens reference genome.

--'{input.NApeak}' identifies that we're using a human sequence.

--'{output}' is the location where the output file will be saved..

The goal of this function is to locate every instance of an accessible chromatin region that is within a 1000 base upstream region of genes in the human genome (likely locations to include a transcription factor binding site). The GFFgenes.bed file is used to find start locations of genes, and find summits from the MACS2 output (NA_peaks.narrowPeak) that are within the desired range of the gene. The output file upstreamPeaks.tsv stores information about the accessible chromatin regions location, including chromosome number, and start/stop base-pair (bp) numbers. upstreamPeaks.tsv also stored valuable information such as the name of genes that were discovered downstream, and the q-score of the peak region where it was identified. The q-score helps us estimate how confident the program was that it identified a chromatin accessible peak region.

An example of the output can be found here: https://github.com/CalebPecka/ATAC-Seq-Pipeline/blob/master/ example_files /upstreamPeaks%5BEXAMPLE%5D.tsv

SummitFASTA.R

Command: curl -LJO ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz > {output} Rscript scripts/SummitFASTA.R {input.refGenome} {input.upstreamPeaks} {params.motifSize} {output}

This command requires 2 steps. First, we check if you currently have the Hg38 reference genome build installed. If not, it is downloaded to a separate directory in the Config folder of this repository. Only then do we call the R script for the next step.

--'{input.refGenome}' is the location of the reference genome we just installed.

--'{input.upstreamPeaks}' is the output of NarrowPeakSummitTracker, detailing the locations of peaks that are likely to regulate downstream genes.

--'{params.motifSize}' is the size of the FASTA sequences surrounding the summit location of the peaks from the input.upstreamPeaks file. For example, if we were to pick a value of 100, the script would output a FASTA sequence 100 base-pairs long surrounding the highest point of chromatin accessible from every row in the input.upstreamPeaks file.

--'{output}' is the location where the output file will be saved..

Once we've collected the LOCATION of potential transcription factor binding sites (TFBS) from NarrowPeakSummitTracker.R , we need to determine the sequence of the TFBS. Problematically, the real TFBS can be located anywhere within the accessible chromatin region, which can sometimes be thousands of bases long. At the peak location of each accessible chromatin region (defined by the highest density of sequence fragment overlap), the surrounding nucleotide sequence is stored in a new FASTA file called upstreamPeak_sequences.tsv . The size of the nucleotide search space can be modified by the user. By default, the script searches for a 100 bp region surrounding each summit. A value of 100 is consistent with the default requirements of other motif-searching tools like the tools used by MEME Suite. This value can be modified by changing the MotifSize variable within the SummitFASTA.R script.

An example of the output can be found here: https://github.com/CalebPecka/ATAC-Seq-Pipeline/blob/master/ example_files /upstreamPeak_Sequences%5BEXAMPLE%5D.fasta

BCrank.R

Command: Rscript scripts/BCrank.R {params.seedSize} {params.restartSize} {input.seq} {output}

--'{params.seedSize}' is used to determine the number of motifs generated for each sample. The total number of motifs equals {params.seedSize} multiplied by {params.restartSize}. This value has a maximum of 20 .

--'{params.restartSize}' is used to determine the number of motifs generated for each sample. The total number of motifs equals {params.seedSize} multiplied by {params.restartSize}. For example, a seedSize of 3 and a restartSize of 5 will output 3*5=15 seeds .

--'{input.seq}' is a FASTA file output from SummitFASTA.R.

--'{output}' is the location where the output file will be saved..

BCRANK is tool that, when provided a list of FASTA sequences, determines frequently repeated motif sequences. In our case, repeat motif sequences are likely to be important structural components to a TFBS. BCRank requires the input to be ordered according to confidence level. Our script automatically sorts the data according to q-score, the confidence level provided by MACS2. By default, the BCrank program is repeated 12 times with different randomly generated seeds, each producing 100 motifs. The motifs are stored in a new directory BCrankOutput/ .

An example of the output can be found here: https://github.com/CalebPecka/ATAC-Seq-Pipeline/tree/master/ example_files /BCrankOutput%5BEXAMPLE%5D

BCrankProcessing.R

Command: Rscript scripts/BCrankProcessing.R {params.seedSize} {input.directory} {input.seq} {input.NApeak} {output}

--'{params.seedSize}' is used to determine the number of motifs generated for each sample. The total number of motifs equals {params.seedSize} multiplied by {params.restartSize}. This value should be the same as used in BCrank.R.

--'{input.directory}' is the directory of BCrank motifs output from BCrank.R.

--'{input.seq}' is a FASTA file output from SummitFASTA.R.

--'{input.seq}' is the NarrowPeak data output from MACS2.

--'{output}' is the location where the output file will be saved..

The results from the BCrankOutput/ directory are useless by themselves. The BCrankProcessing.R script concatenates and subsets the files within the directory to obtain a list of all 1200 motifs that BCrank.R previously identified. If you modify the BCrankOutput/ directory in any way, it is possible you can interfere with the programming in BCrankProcessing.R. Once the motifs have been extracted, the script searches for instances of every motif within the upstreamPeak_sequences.tsv file. upstreamPeak_sequences.tsv is a necessary file to run BCrankProcessing.R, but the file has been removed from the GitHub repository diagram to decrease clutter. BCrankProcessing.R also verifies that each matching site is location within the region of a narrow peak. This calculation is performed by comparing the genomic location data with the previously created file, NA_peaks.narrowPeak . The output file matchingSites.csv stores any information necessary for remaining scripts.

An example of the outpu can be found here: https://github.com/CalebPecka/ATAC-Seq-Pipeline/blob/master/ example_files /matchingSiteTable%5BEXAMPLE%5D.csv

Analysis.R

Command: ...

Finally, the results can be analyzed. Analysis should be performed uniquely according to the design of each experiment. In our case, researchers were interested in verifying the discovery of TFBS by comparing the results to known differential gene expression patterns. Example files have been included in this project to illustrate the creation of a chi-squared analysis, as well as a normal curve distribution for the occupancy of TFBS at each gene.

BamCoverage

Command: bamCoverage -b {wildcards.sample} -o {output}

For the command above, {wildcards.sample} refers to the location of the BAM files you're processing.

BamCoverage is a method of converting a bedgraph file to a bigwig file. Our visualization tool, PyGenomeTracks, requires an input file BigWig type. If you don't wish to visualize your results, this step can be ignored.

--'b' is the parameter used for our input file.

--'o' is your output directory. In the example command above, we are naming our file to "bigWig_coverage.bw" using the "-o" output parameter.

Make Tracks

Command: make_tracks_file --trackFiles bigWig_coverage.bw NA_peaks.narrowPeak GFFgenes.bed -o tracks.ini

"Make Tracks" layers a series of rows together for the PyGenomeTracks visualization. Using chromosomal positioning, the "tracks.ini" output stores the position of relevant information relative to the genome. The resulting "tracks.ini" file is easily editable, see documentation here: https://pygenometracks.readthedocs.io/en/latest/content/examples.html. Only one parameter is used, 'trackFiles'. Each file you include AFTER the parameter is layered on top of the data from the previous file. In this case, our visualization includes the bigWig coverage (peak visualization), narrowPeaks (bp location of peak visualized as a box plot), and a bed file for the start/stop regions of human genes. The command can be modified to include other results throughout the pipeline, and only requires 1 or more file to be included as a parameter.

We've already created a version of the tracks.ini output for you that's compatible with all samples from the pipeline. This command is not rerun during the pipeline.

PyGenomeTracks

Command: pyGenomeTracks --tracks tracks.ini --region chr1:1000000-4000000 -o image.png

PyGenomeTracks is the visualization command for the pipeline. It will plot a layered track of the data from the input "track.ini".

--'tracks' is the parameter to indicate your input file. Must be type .ini.

--'region' indicates the chromosomal location you want to visualize. It is formated where the abbreviated chromosome name is included first, followed by the base-pair region, separating start and stop position by a hyphen. The chromosomal range shown above starts at 1 million and ends at 4 million.

--'o' is your output directory/file name.

TomTom (Part 1)

Command: meme upstreamPeak_Sequences.fasta -dna -oc . -nostatus -time 18000 -mod zoops -nmotifs 100 -minw 6 -maxw 25 -objfun classic -revcomp -markov_order 0

MEME is another powerful tool for identifying motif sequences. The tool is hosted under the website: https://meme-suite.org/meme/tools/meme. MEME can also be ran using a command line, see software versions here: https://meme-suite.org/meme/doc/download.html. After running the command above, you will receive a series MEME.txt file which can be converted using the MEMEtxtToPWMconverter.R file. Simply run the Rscript after modifying the input variable within the script. This will return a series of position weighted matrices of predicted motifs, similar to BCrank. In BCrank, we extracted the consensus sequences from the position weighted matrices.

TomTom (Part 2)

Command: tomtom -o dir BCRANK_motifs.meme jasperMemeFormatted.meme

TomTom is tool that compares a list of motifs against a database of known motifs. The tool is hosted under the website: https://meme-suite.org/meme/doc/tomtom.html?man_type=web, and can be installed from the same software versions in "TomTom (Part 1)". The steps from TomTom are not automatically ran by the pipeline, and are instead suggested as a useful tool for understand how predicted motifs compare to known transcription factor motifs.

Input/Output Files

An example of each input/output file has been included in the Example Inputs and Outputs directory. By removing the "[EXAMPLE]" text from each file name and moving the files into the main directory, you can follow along and run each script to see how they function.

Narrow Peaks

Real File Name: NA_peaks.narrowPeak

A file describing the genomic location of chromatin accessible regions. Chromatin accessible regions are discovered by using DNA fragment-based pile-ups, producing "peaks". You can read more about fragment pile-ups and peak-calling here: https://hbctraining.github.io/Intro-to-ChIPseq/lessons/05_peak_calling_macs.html. The MACS2 narrowPeak file type includes 10 columns. Each row designates a separate chromatin region. Column 1 describes the chromosomal location. Column 2 describes the start location of the peak. Column 3 describes the stop location of the peak. Column 10 is the summit location in terms of base pairs from the start location. In other words, the value of column 10 plus the value of column 2 is the summit location for the peak.

GTF Genes

Real File Name: GTFgenes.bed

A file describing the genomic location of genes. Genes are identified according to each transcript signature (ENST), one ENST for each row. The file includes a header, and this header must have the exact column names or the pipeline will fail. Column 1 describes the chromosomal location. Column 2 describes the start location of the ENST. Column 3 describes the stop location of the ENST. Column 4 is the name of each ENST. Later steps of the program will remove any data contained AFTER the period in each value of the GeneName column, (i.e. ENST00000450305.2 will be converted to ENST00000450305 ).

Upstream Peaks

Real File Name: upstreamPeaks.tsv

A file containing the identity of peak summits located within a 1000 base upstream region of genes. Regions 1000 bases upstream of genes are likely to contain transcription factor binding sites (TFBS), locations that regulate the nearby gene. We expect that the accessible chromatin regions defined by each row of the upstreamPeaks.tsv file contain TFBS. Column 1 describes the chromosomal location. Column 2 describes the ENST that is being regulated (found 1000 bases downstream of peak). Column 3 describes the base pair position of the summit. Column 4 describes the Q score of the summit, a representation of the quality of peak. The Q score is a product of MACS2.

Hg38 Chromosome Sequences

Real File Name: HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

A file containing the human genome in FASTA format. This is downloaded in the initialization.sh script so we can collect FASTA sequences from regions that are chromatin accessible.

Upstream Peak Sequences

Real File Name: upstreamPeak_sequences.tsv

A FASTA file of the DNA region surrounding the peaks identified from upstreamPeaks.tsv . The header for each sequence is colon delimited, separating the genomic location from the name of each ENST.

BCrankOutput/

Real File Name: BCrankOutput/

A directory containing motif sequences (repeat sequences of DNA that are speculated to be TFBS). The default pipeline runs BCrank with 12 different seeds, identifying 100 motifs in each seed. The outputs are processed and sent to the BcrankOutput. The number preceding each file name in the BCrankOutput directory is used to distinguish each seed trial. The directory includes two types of files, .txt and .meme. Each file type includes the same set of 100 motifs, but formatted for different programs. TomTom only accepts MEME formatted files for position weighted matrices, and many FASTA search tools only accept consensus sequences (.txt).

Matching Site Table

Real File Name: matchingSiteTable.csv

A file containing instances of each motif in the upstreamPeak_sequences.tsv file. Column 1 describes the consensus sequence. Column 2 describes the chromosomal location of the sequence. Column 3 describes the ENST that is found 1000 bases downstream of the peak. Column 4 describes the start location of the sequence. Column 5 describes the end location of the sequence. Column 6 describes whether or not the sequence is contained within the peak region of the narrowPeak results from MACS2. This is an error check, and no matching sites are included if they do not have a peak region associated with their chromosomal location.

Code Snippets

13
14
shell:
	"bamCoverage -b {wildcards.sample} -o {output}"
23
24
shell:
	"Rscript scripts/BCrankProcessing.R {params.seedSize} {input.directory} {input.seq} {input.NApeak} ../results/{wildcards.sample}.results/matchingSiteTable.csv"
22
23
24
shell:
	"mkdir -p {output} ;"
	"Rscript scripts/BCrank.R {params.seedSize} {params.restartSize} {input.seq} ../results/{wildcards.sample}.results/BCrankOutput/"
25
26
shell:
	"macs2 callpeak -t /{sample} -g {params.genomeType} -f {params.peakFileType} -p {params.peakPVal} --seed {params.seed} --outdir {params.dirName}"
20
21
shell:
	"Rscript scripts/NarrowPeakSummitTracker.R {input.gtf} {input.NApeak} ../results/{wildcards.sample}.results/upstreamPeaks.tsv"
15
16
shell:
	"curl -LJO ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz > {output}"
30
31
shell:
	"Rscript scripts/SummitFASTA.R {input.refGenome} {input.upstreamPeaks} {params.motifSize} ../results/{wildcards.sample}.results/upstreamPeak_Sequences.fasta"
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
library(BCRANK)

args <- commandArgs(trailingOnly = T) # End line command arguments.

# All of the seeds are separated in the output from BCrank.R. The goal of this 
# program is to merge the outputs from BCrank into a single datafile. The 
# consensus sequences also have a chance of being incompatible with a future 
# program. This program tests each BCrank consensus sequence to remove any 
# problematic data. This program accepts a maximum of 20 for seedSize.

seedSize <- as.numeric(args[1])
file <- args[2]
file <- paste0(file, "/")

BC_1 <- read.table(paste0(file, "1_BC_consensusSequences.txt"))
dfnames <- c("BC_1") # "dfnames" keeps track of the BCrank motif objects 
                      # in short term memory.

if (seedSize >= 2){
  BC_2 <- read.table(paste0(file, "2_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_2")
}
if (seedSize >= 3){
  BC_3 <- read.table(paste0(file, "3_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_3")
}
if (seedSize >= 4){
  BC_4 <- read.table(paste0(file, "4_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_4")
}
if (seedSize >= 5){
  BC_5 <- read.table(paste0(file, "5_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_5")
}
if (seedSize >= 6){
  BC_6 <- read.table(paste0(file, "6_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_6")
}
if (seedSize >= 7){
  BC_7 <- read.table(paste0(file, "7_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_7")
}
if (seedSize >= 8){
  BC_8 <- read.table(paste0(file, "8_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_8")
}
if (seedSize >= 9){
  BC_9 <- read.table(paste0(file, "9_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_9")
}
if (seedSize >= 10){
  BC_10 <- read.table(paste0(file, "10_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_10")
}
if (seedSize >= 11){
  BC_11 <- read.table(paste0(file, "11_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_11")
}
if (seedSize >= 12){
  BC_12 <- read.table(paste0(file, "12_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_12")
}
if (seedSize >= 13){
  BC_13 <- read.table(paste0(file, "13_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_13")
}
if (seedSize >= 14){
  BC_14 <- read.table(paste0(file, "14_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_14")
}
if (seedSize >= 15){
  BC_15 <- read.table(paste0(file, "15_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_15")
}
if (seedSize >= 16){
  BC_16 <- read.table(paste0(file, "16_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_16")
}
if (seedSize >= 17){
  BC_17 <- read.table(paste0(file, "17_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_17")
}
if (seedSize >= 18){
  BC_18 <- read.table(paste0(file, "18_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_18")
}
if (seedSize >= 19){
  BC_19 <- read.table(paste0(file, "19_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_19")
}
if (seedSize >= 20){
  BC_20 <- read.table(paste0(file, "20_BC_consensusSequences.txt"))
  dfnames <- c(dfnames, "BC_20")
}


# MAIN FUNCTION
dfENV <- mget(dfnames, .GlobalEnv)

# Creates a function that extracts consensus sequences from the BCrank files. 
# Every other line is a sequence score and is removed. 
dataExtraction <- function(df) {
  # w <- df[c(FALSE, TRUE),]
  # rowSize <- round(nrow(w) / 2) + 1
  # w <- w[c(1:(rowSize)),]
  # w <- unlist(w)
  w <- as.character(unlist(unlist(df)))
}

# Applies the previous function to all dataframes in the environment.
extracted <- lapply(dfENV, dataExtraction)

fullConsensusList <- c() # Instantiates a list of consensus sequences.
for (i in dfnames){

  # Goes through each dataframe in the environment and appends the data to the 
  # full consensus list.
  fullConsensusList <- append(fullConsensusList, extracted[[i]])
}

# When complete, fullConsensusList represents a complete list of the motifs 
# identified in all trials produced by BCrank.
fullConsensusList <- unique(fullConsensusList)


############################################
# FASTA Sequence Conversion
############################################
# This section of code tests and removes invalid consensus sequences. 
# For naming conventions, the "fullConsensusList" variable is renamed to "
# bindingSiteSearchList".
bindingSiteSearchList <- fullConsensusList

# Instantiate lists.
consensusSequence <- c()
chromosome <- c()
geneName <- c()
start <- c()
end <- c()
count <- 1
doNotUse <- c()

# Goes through every binding site consensus sequence.
while(count < length(bindingSiteSearchList) + 1){

  # Occaisionally the binding site fails in the "matchingSites" command. The 
  # exact reason can vary. The function below will find instances of each 
  # binding site throughout the original fasta sequence file, 
  # upstreamPeak_sequences.fasta. If an error is produced, the binding site 
  # is ignored.
  tryCatch(
    {
      # Finds locations in upstreamPeak_sequences.fasta where the binding 
      # site exists.
      bindingSiteSearch <- as.character(bindingSiteSearchList[count])
      matching <- matchingSites(args[3], 
                                bindingSiteSearch)

      # Here we split the metadata
      intermediate <- unlist(strsplit(as.character(matching$`Region header`), ':'))

      # Now we need to identify the genomic location of each matching site. This
      # data was stored in the upstreamPeak_sequences.fasta file header in the 
      # SummitFASTA.R script. Here, we extract chromosomal location.
      chrExtraction <- intermediate[seq(1, length(intermediate), 5)]
      geneExtraction <- intermediate[seq(3, length(intermediate), 5)]

      # Finds the start and end location of the motif in each sequence.
      locationExtraction <- intermediate[seq(2, length(intermediate), 5)]
      locationIntermediate <- unlist(strsplit(locationExtraction, '-'))
      startPosition <- as.numeric(locationIntermediate[seq(1, 
                                      length(locationIntermediate), 2)]) + 
                                      as.numeric(matching$Start)
      endPosition <- as.numeric(locationIntermediate[seq(1, 
                                      length(locationIntermediate), 2)]) + 
                                      as.numeric(matching$End)

      # The chromosomal location and FASTA sequence are appended into several 
      # vector lists.
      consensusSequence <- append(consensusSequence, 
                                  rep(bindingSiteSearch, 
                                      length(chrExtraction)))
      chromosome <- append(chromosome, chrExtraction)
      geneName <- append(geneName, geneExtraction)
      start <- append(start, startPosition)
      end <- append(end, endPosition)

      print(paste0("Matching site search complete for binding site ", count))
    },

    # If an error occurs during the previous process, a message tells the user 
    # which consensus sequence failed, and the reason why.
    error = function(cond) {
      print("")
      print(paste("WARNING: The consensus sequence", 
                  bindingSiteSearch, 
                  "produced the following error:"))
      print(cond)
      print(paste(bindingSiteSearch, 
                  "has been removed from the binding site search list."))

      # doNotUse keeps track of which consensus sequences failed.
      doNotUse <- append(doNotUse, count)
    },

    # Whether the program failed or not, the count meter increases by 1, 
    # meaning we move on to the next consensus sequence.
    finally = {
      count <- count + 1
    }
  )
}



############################################
# Dataframe Conversion
############################################
if (length(doNotUse) > 0){
  # If there are any invalid binding sites, removed them.
  bindingSiteSearchList <- bindingSiteSearchList[-c(doNotUse)]
}

# Converts collected vector lists into a single data frame.
out <- cbind(data.frame(consensusSequence), 
             data.frame(chromosome), 
             data.frame(geneName), 
             data.frame(start), 
             data.frame(end))

# Data frame rows are sorted based on chromosome number.
out <- out[order(out$chromosome),]

# Previously, we chose a motif size of 100 nucleotides. This value was chosen to
# meet the minimum default requirements for a few MEME motif database search 
# tools. Now, we want to ensure that every motif is within a region defined as a
# "peak", RATHER than including any matching site NEARBY to a matching site.
narrowPeaks <- read.delim(args[4], sep = "\t", header = F)

uniqueChr <- unique(out$chromosome)
hasPeak <- c()

# To reduce the chances of errors, each chromosome is handled individually.
for (chr in uniqueChr){
  outSubset <- out[which(out$chromosome == chr),]
  narrowPeakSubset <- narrowPeaks[which(narrowPeaks$V1 == chr),]

  # For each gene in each chromsome...
  for(i in 1:nrow(outSubset)) {
    row <- outSubset[i,]

    # Finds the start position (V2) and end position (V3) for each narrow peak.
    narrowPeakTemp <- narrowPeakSubset[which(narrowPeakSubset$V2 <= row$end),]
    narrowPeakTemp <- narrowPeakTemp[which(narrowPeakTemp$V3 >= row$start),]

    # A new dataframe row is created that keeps track of whether or not the 
    # matching site is found within the same region as a peak.
    if (nrow(narrowPeakTemp) >= 1){
      hasPeak <- append(hasPeak, TRUE)
    }
    else {
      hasPeak <- append(hasPeak, FALSE)
    }
  }

  print(paste0("Narrow Peak verification complete for ", chr))
}

# Appends the hasPeak variable to the dataframe.
out$hasPeak <- hasPeak
out <- out[which(out$hasPeak == T),] # Subsets the data to only include hasPeak
                                     # values of true.

write.csv(out, args[5], quote = F, row.names = F)
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
library(BCRANK)
library(Biostrings)

args <- commandArgs(trailingOnly = T) # End line command arguments.

# BCrank is used to discover 100 unique motifs in this program (restartSize 
# variable). To repeat this process, you can increase the seedSize variable. 
# The program is repeated using a new random seed a number of times equal to 
# the seed size. For example, a seedSize value of 12 will produce (12 * 100) = 
# 1200 unique motifs.
seedSize <- as.numeric(args[1])
restartSize <- as.numeric(args[2])

if (seedSize > 20){
  print("WARNING: Seed size is greater than 20! This is accepted by BCrank,
        but future steps of the program will not be able to read more than 20
        seeds. All remaining seed generations will be ignored by those steps.")
}



######################
# Program Start
######################
for (seed in 1:seedSize){
  set.seed(as.numeric(seed)) # Set the randomized seed variable.
  BCRANKout <- bcrank(args[3], 
                      restarts = restartSize, 
                      use.P1 = TRUE, 
                      use.P2 = TRUE)

  BCRANKS <- unlist(toptable(BCRANKout))

  # The output from BCRANK is saved to a new file.
  motifList <- ""
  for (i in 1:length(BCRANKout@toplist)){
    motifList <- paste0(motifList, BCRANKout@toplist[[i]]@final@consensus)
    motifList <- paste0(motifList, "\n")
  }
  write.table(data.frame(motifList), paste0(args[4], seed, "_BC_consensusSequences.txt"), col.names = F, row.names = F, quote = F)

  # We are also interested in comparing these motifs to known motif datasets 
  # using Memesuite tools. Memesuite requires a unique file format. The BCrank 
  # motifs are converted to a MEME file format in this section of code.
  outFile <- c("MEME version 5.1.1", "", "ALPHABET= ACGT", "") # Header

  for (i in 1:length(BCRANKout@toplist)){
    outFile <- append(outFile, paste("MOTIF", i, collapse = " "))
    outFile <- append(outFile, "letter-probability matrix: alength= 4")

    # Motifs are formatted as a nucleotide probability matrix (PWM). This "for" 
    # loop cycles through each of the probability matrices created by BCrank.
    for (PWM in 1:ncol(BCRANKout@toplist[[i]]@finalPWM)){
      outFile <- append(outFile, 
                        paste(t(BCRANKout@toplist[[i]]@finalPWM / BCRANKout@toplist[[i]]@finalNrMatch)[PWM,], 
                              collapse = " "))
    }

    # The newly formatted PWM is appended to a growing output file.
    outFile <- append(outFile, "")
  }  

  finalOut <- data.frame(outFile) # Convert the Meme output to a dataframe 
                                  # file type.

  # Write output.
  write.table(finalOut, 
              paste0(args[4], seed, "_BCRANK_motifs.meme"), 
              row.names = F, col.names = F, quote = F)
}
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
args <- commandArgs(trailingOnly = T) # End line command arguments.

genes <- read.delim(args[1], sep = "\t")
summits <- read.delim(args[2], header = F) # Output from MACS2

# List of chromosomes.
chrGroups <- c("chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9",
               "chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17",
               "chr18","chr19","chr20","chr21","chr22","chrX","chrY","chrM")

# Booleans to indicate whether or not we want to include the positive 
#and negative strands.
includePositive <- T
includeNegative <- T



##########################################################################
# Error check
# Determines if inputs and outputs are giving expected ranges of results

# Checks the gene file is BED formatted.
if (all(colnames(genes) != c("GeneName", "Chromosome", "Start", "Stop", 
                             "Symbol", "Strand"))){
  # Error output message.
  stop(paste0("Input BED file for reference genes not properly formatted.",
    "Make sure the file contains 6 columns labeled (GeneName, Chromosome, ",
    "Start, Stop, Symbol, Strand). Elements should be separated with a single ",
    "space."))

  # Script failed to terminate.
  print("Script did not end! Please manually terminate.")
}

# Checks the summits file is formatted correctly.
if (length(summits) != 10){
  stop("Input Narrow Peak Summits file not properly formatted. ",
    "Make sure the file contains 10 columns, NO header, and is tab separated.")
  print("Script did not end! Please manually terminate.")
}

# Runs a simulation of the code using a data subset (first 100 rows of GTFgenes)
# If any errors exist, they should be caught early.
outFile <- c()
genesTemp <- genes[which(genes$Chromosome == chrGroups[1]),]
summitsTempIntermediate <- summits[which(summits$V1 == chrGroups[1]),]
summitsTemp <- summitsTempIntermediate$V2 + summitsTempIntermediate$V10 
summitsQVals <- summitsTempIntermediate$V9
summitScore <- summitsTempIntermediate$V7

for (j in row.names(genesTemp)[1:100]){
  rowOfInterest <- genesTemp[j,]
  low <- rowOfInterest$Start - 1000
  high <- rowOfInterest$Start - 100

  conditional <- which(summitsTemp >= low & summitsTemp <= high)
  summitsList <- summitsTemp[conditional]
  summitsQValList <- summitsQVals[conditional]
  summitScoreList <- summitScore[conditional]

  if (length(summitsList) != 0){
    for (k in 1:length(summitsList)){
      outFile <- append(outFile, paste(c("chr1", as.character(rowOfInterest$GeneName), 
                                         summitsList[k], 
                                         summitsQValList[k], 
                                         as.character(rowOfInterest$Symbol), 
                                         as.character(rowOfInterest$Strand), 
                                         summitScoreList[k]), 
                                       collapse = " "))
    }
  }
}

# Prints a sample of the output file to be confirmed by the user.
print("#############################################")
print("SAMPLE upstreamPeaks.tsv Outfile")
print("Check that the output is formatted correctly.")
print("#############################################")
print(head(outFile))

testOutFile <- unlist(strsplit(outFile[1], ' '))
if (is.null(testOutFile[3])){
  stop(paste0("Test output returned NULL for chromosome start position. ",
    "Check input. Terminating command. It is possible this error occurred ",
    "because no peaks were found upstream to first 100 genes in your ",
    "reference gene list."))
  print("Script did not end! Please manually terminate.")
}
if (is.null(testOutFile[4])){
  stop(paste0("Test output returned NULL for chromosome end position. ",
  "Check input. Terminating command. It is possible this error occurred ",
  "because no peaks were found upstream to first 100 genes in your ",
  "reference gene list."))
  print("Script did not end! Please manually terminate.")
}
##########################################################################



# Main function. The output file variable is reset.
outFile <- c()

# Run the code for genes on the positive strand.
if (includePositive)
{
  genesStrand <- genes[which(genes$Strand == "+"),]

  # The goal of this function is to locate every instance of an accessible 
  # chromatin region that is within a 1000 base upstream region of genes in the 
  # human genome. To do this, we use the GTFgenes.tsv file to find start 
  # locations of genes, and find summits from the MACS2 output that are within 
  # our desired range of the gene. Each chromosome is calculated individually 
  # to avoid errors in which multiple genes on different chromosomes have the 
  # same base pair (bp) start position.
  for (chromosome in chrGroups){

    # Subset genes on the matching chromosome.
    genesTemp <- genesStrand[which(genesStrand$Chromosome == chromosome),]
    # Subset of peaks(summits) on the matching chromosome.
    summitsTempIntermediate <- summits[which(summits$V1 == chromosome),]
    # Column 10 (V10) of the NA_peaks file includes the bp distance from the 
    # start of the accessible chromatin (V2) to the peak. To calculate the 
    # summit's bp position, we can add the values from these columns together.
    summitsTemp <- summitsTempIntermediate$V2 + summitsTempIntermediate$V10
    # The values in column 9 (V9) include a significance score metric for the 
    # peak. It is saved in memory for future programs.
    summitsQVals <- summitsTempIntermediate$V9
    # The values in column 7 (V7) indicate the MACS2 score for a relative fold 
    # change of the peak.
    summitScore <- summitsTempIntermediate$V7

    # For each gene, we perform a caluclation to find summit locations.
    for (geneID in row.names(genesTemp)){

      rowOfInterest <- genesTemp[geneID,]
      # Here we calculate the range of acceptable bp locations for a summit to 
      # be located near a gene. Summits in this region of more likely to contain 
      # transcription factor binding sites (TFBS).
      low <- rowOfInterest$Start - 1000 # Lower bound bp region.
      high <- rowOfInterest$Start - 100 # Upper bound bp region. TFBSs are 
      # unlikely to be found within the 100bp upstream region of genes.

      # Conditional finds all summits within the accepted bp region.
      conditional <- which(summitsTemp >= low & summitsTemp <= high)
      # Afterwards, our previous variables are subset to include only the values 
      # in conditional.
      summitsList <- summitsTemp[conditional]
      summitsQValList <- summitsQVals[conditional]
      summitScoreList <- summitScore[conditional]

      # Checks that the table is not empty.
      if (length(summitsList) != 0){
        for (k in 1:length(summitsList)){
          # An output file is created that stores important variables delimited 
          # by spaces.
          outFile <- append(outFile, paste(c(chromosome, as.character(rowOfInterest$GeneName), 
                                             summitsList[k], 
                                             summitsQValList[k], 
                                             as.character(rowOfInterest$Symbol), 
                                             as.character(rowOfInterest$Strand),
                                             summitScoreList[k]), 
                                           collapse = "\t"))
        }
      }
    }

    # The chromosome variable is printed upon completion so the user has 
    # an easier time gauging the progress of the program.
    print(paste0(chromosome, " positive strand search completed."))
  }
}



# Run the above code for the negative strand.
if (includeNegative)
{
  genesStrand <- genes[which(genes$Strand == "-"),]

  for (chromosome in chrGroups){

    genesTemp <- genesStrand[which(genesStrand$Chromosome == chromosome),]
    summitsTempIntermediate <- summits[which(summits$V1 == chromosome),]
    summitsTemp <- summitsTempIntermediate$V2 + summitsTempIntermediate$V10
    summitsQVals <- summitsTempIntermediate$V9
    summitScore <- summitsTempIntermediate$V7

    for (geneID in row.names(genesTemp)){

      rowOfInterest <- genesTemp[geneID,]
      # Here we calculate the range of acceptable bp locations for a summit to 
      # be located near a gene. Summits in this region of more likely to contain 
      # transcription factor binding sites (TFBS).
      low <- rowOfInterest$Stop + 100 # Lower bound bp region.
      high <- rowOfInterest$Stop + 1000 # Upper bound bp region. TFBSs are 
      # unlikely to be found within the 100bp upstream region of genes.

      conditional <- which(summitsTemp >= low & summitsTemp <= high)
      summitsList <- summitsTemp[conditional]
      summitsQValList <- summitsQVals[conditional]
      summitScoreList <- summitScore[conditional]

      if (length(summitsList) != 0){
        for (k in 1:length(summitsList)){
          outFile <- append(outFile, paste(c(chromosome, as.character(rowOfInterest$GeneName), 
                                             summitsList[k], 
                                             summitsQValList[k], 
                                             as.character(rowOfInterest$Symbol), 
                                             as.character(rowOfInterest$Strand),
                                             summitScoreList[k]), 
                                           collapse = "\t"))
        }
      }
    }

    print(paste0(chromosome, " negative strand search completed."))
  }
}



# The output of the program is saved.
outFile <- data.frame(outFile)
outFile <- unique(outFile)
write.table(outFile, args[3], 
            row.names = F, 
            col.names = F, 
            quote = F)
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
library(Biostrings)

args <- commandArgs(trailingOnly = T) # End line command arguments.

HG38 <- args[1]

genome <- readDNAStringSet(HG38, format = "fasta")
outFile <- read.delim(args[2], sep = "\t", header = F)

# This variable establishes the size of the motif binding site region you'd 
# like to use. If you don't know what that means, read further.
MotifSize <- as.numeric(args[3])



##########################
##SEQUENCE DETERMINATION
##########################
# In this program, we'd like to find the DNA FASTA sequence around each 
# NarrowPeak summit location determined in the previous step 
# (NarrowPeakSummitTracker.R). To do this, we need to import and initialize 
# a reference genome that is compatible with the same gene position file 
# (GTFgenes). If you modify either GFFgenes.bed OR the files in HG38, you will
# need to verify the data in the genes.bed files matches the correct bp 
# locations in the HG38 file.
chr1 <- genome[1]; chr2 <- genome[2]; chr3 <- genome[3]; chr4 <- genome[4]
chr5 <- genome[5]; chr6 <- genome[6]; chr7 <- genome[7]; chr8 <- genome[8]
chr9 <- genome[9]; chr10 <- genome[10]; chr11 <- genome[11]
chr12 <- genome[12]; chr13 <- genome[13]; chr14 <- genome[14]
chr15 <- genome[15]; chr16 <- genome[16]; chr17 <- genome[17]
chr18 <- genome[18]; chr19 <- genome[19]; chr20 <- genome[20]
chr21 <- genome[21]; chr22 <- genome[22]; chrX <- genome[23]
chrY <- genome[24]; chrM <- genome[25]

# File formatting initialization.
motifFile <- c()
qVal <- c()
MotifSize <- round(MotifSize / 2)

# To make the FASTA file, we convert each acessible chromatin region one at 
# a time.
for (site in 1:nrow(outFile)){

  # Each row of the data is separated by spaces.
  # Subject[1] is the Chromosome.
  # Subject[2] is the GeneName.
  # Subject[3] is the SummitLocation.
  # Subject[4] is the QVal of the original MACS2 Summit.
  subject <- outFile[site,]

  # This segment of code identifies the DNA region surrounding each summit.
  # The size of the search space is determined by the variable "MotifSize", 
  # where the value of MotifSize is the length of the final motif in units 
  # of bp. Each chromsome is handled in a separate switch condition to prevent 
  # undesired bp start site duplications.
  startPos = as.numeric(as.character(subject[[3]])) - MotifSize
  endPos = as.numeric(as.character(subject[[3]])) + MotifSize

  # We must also account for niche scenarios in which the positions exceed
  # the size of the reference genome, due to the +- 1000 bp search range
  # from the NarrowPeakSummitTracker.R script.
  ogStartPos = startPos
  startPos = max(1, startPos)
  endPos = max(2, endPos)
  genomeSize <- switch (as.character(subject[[1]]),
                 "chr1" = length(chr1[[1]]),
                 "chr2" = length(chr2[[1]]),
                 "chr3" = length(chr3[[1]]),
                 "chr4" = length(chr4[[1]]),
                 "chr5" = length(chr5[[1]]),
                 "chr6" = length(chr6[[1]]),
                 "chr7" = length(chr7[[1]]),
                 "chr8" = length(chr8[[1]]),
                 "chr9" = length(chr9[[1]]),
                 "chr10" = length(chr10[[1]]),
                 "chr11" = length(chr11[[1]]),
                 "chr12" = length(chr12[[1]]),
                 "chr13" = length(chr13[[1]]),
                 "chr14" = length(chr14[[1]]),
                 "chr15" = length(chr15[[1]]),
                 "chr16" = length(chr16[[1]]),
                 "chr17" = length(chr17[[1]]),
                 "chr18" = length(chr18[[1]]),
                 "chr19" = length(chr19[[1]]),
                 "chr20" = length(chr20[[1]]),
                 "chr21" = length(chr21[[1]]),
                 "chr22" = length(chr22[[1]]),
                 "chrX" = length(chrX[[1]]),
                 "chrY" = length(chrY[[1]]),
                 "chrM" = length(chrM[[1]])
  )
  startPos = min(genomeSize - 1, startPos)
  endPos = min(genomeSize, endPos)

  # Finally, we collect the sequence based on the chromosome.
  seq <- switch (as.character(subject[[1]]),
                 "chr1" = BStringSet(chr1, start = startPos, end = endPos),
                 "chr2" = BStringSet(chr2, start = startPos, end = endPos),
                 "chr3" = BStringSet(chr3, start = startPos, end = endPos),
                 "chr4" = BStringSet(chr4, start = startPos, end = endPos),
                 "chr5" = BStringSet(chr5, start = startPos, end = endPos),
                 "chr6" = BStringSet(chr6, start = startPos, end = endPos),
                 "chr7" = BStringSet(chr7, start = startPos, end = endPos),
                 "chr8" = BStringSet(chr8, start = startPos, end = endPos),
                 "chr9" = BStringSet(chr9, start = startPos, end = endPos),
                 "chr10" = BStringSet(chr10, start = startPos, end = endPos),
                 "chr11" = BStringSet(chr11, start = startPos, end = endPos),
                 "chr12" = BStringSet(chr12, start = startPos, end = endPos),
                 "chr13" = BStringSet(chr13, start = startPos, end = endPos),
                 "chr14" = BStringSet(chr14, start = startPos, end = endPos),
                 "chr15" = BStringSet(chr15, start = startPos, end = endPos),
                 "chr16" = BStringSet(chr16, start = startPos, end = endPos),
                 "chr17" = BStringSet(chr17, start = startPos, end = endPos),
                 "chr18" = BStringSet(chr18, start = startPos, end = endPos),
                 "chr19" = BStringSet(chr19, start = startPos, end = endPos),
                 "chr20" = BStringSet(chr20, start = startPos, end = endPos),
                 "chr21" = BStringSet(chr21, start = startPos, end = endPos),
                 "chr22" = BStringSet(chr22, start = startPos, end = endPos),
                 "chrX" = BStringSet(chrX, start = startPos, end = endPos),
                 "chrY" = BStringSet(chrY, start = startPos, end = endPos),
                 "chrM" = BStringSet(chrM, start = startPos, end = endPos)
  )

  # All of the data is restructured into a FASTA file with a modified header.
  # The header is a colon separated file included the chromosomal location of 
  # each FASTA sequence and the name of the gene associated with it.
  newHeader <- paste(c(">", as.character(subject[[1]]), ":",
                       as.character(ogStartPos), "-",
                       as.character(endPos), ":",
                       as.character(subject[[5]]), ":",
                       as.character(subject[[6]]), ":",
                       subject[[7]]), collapse = '')

  # To prevent duplicates, we check if the header ID is already in the list,
  # and only keep FASTA sequences that are NOT repeats.
  if (newHeader %in% motifFile == F){
    motifFile <- append(motifFile, newHeader)
    # After the header has been added, we add the DNA sequence.
    motifFile <- append(motifFile, as.character(seq))

    # We keep track of the qValues so we can reorder the data later on.
    qVal <- append(qVal, c(as.numeric(as.character(subject[[4]])), as.numeric(as.character(subject[[4]]))))
  }
}

secondOut <- data.frame(motifFile) # Convert data to dataframe file format.
orderByQVAL <- secondOut[rev(order(as.numeric(qVal))),] # Order the FASTA 
# sequences based on QVal. This step is important for accurate results when 
# we eventually use the program "BCrank".
even <- seq(from=2, to=length(orderByQVAL), by=2)
odd <- even - 1
finalOutput <- orderByQVAL
finalOutput[even] <- orderByQVAL[odd]
finalOutput[odd] <- orderByQVAL[even]
finalOutput <- data.frame(finalOutput)

# Write the output FASTA file.
write.table(finalOutput, args[4], 
            row.names = F, col.names = F, quote = F)
R From line 8 of scripts/SummitFASTA.R
ShowHide 8 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/CalebPecka/ATAC-Seq-Pipeline
Name: atac-seq-pipeline
Version: 2
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Other Versions:
Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...