FilTar: Using RNA-Seq data to improve microRNA target prediction accuracy in animals


FilTar is a tool that integrates RNA-Seq data into pre-existing miRNA target prediction workflows in order to increase prediction accuracy.

It achieves this by:

  1. Removing transcripts which are not expressed, or are only poorly expressed, in a given cell type or tissue (a minimal sketch of this filtering step is given below)

  2. Generating 3'UTR annotations specific to a given cell type or tissue

It also operates as a fully functional wrapper around the pre-existing TargetScan7 and miRanda target prediction workflows.
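To make step 1 concrete, here is a minimal, illustrative sketch of expression-based filtering in Python (the file names quant.tsv and predictions.tsv and the TPM threshold are hypothetical; FilTar itself performs this filtering within its own workflow rules):

import csv

TPM_THRESHOLD = 1.0  # hypothetical cut-off; FilTar exposes an equivalent threshold as a configuration option

# Collect the transcripts whose expression passes the threshold
# (assumes a salmon-style table with 'Name' and 'TPM' columns)
expressed = set()
with open("quant.tsv") as quant:
    for row in csv.DictReader(quant, delimiter="\t"):
        if float(row["TPM"]) >= TPM_THRESHOLD:
            expressed.add(row["Name"])

# Keep only predictions whose target transcript is expressed in this tissue
with open("predictions.tsv") as preds, open("filtered_predictions.tsv", "w", newline="") as out:
    reader = csv.DictReader(preds, delimiter="\t")
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, delimiter="\t")
    writer.writeheader()
    for row in reader:
        if row["Gene ID"] in expressed:
            writer.writerow(row)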

Installation

Instructions on how to install FilTar can be found at the following location: https://tbradley27.github.io/FilTar/

Basic Usage

FilTar can be used by following two steps:

  1. Specify the options you would like to use to run FilTar by editing config/basic.yaml.

  2. Run the following command:

snakemake --use-conda --cores $N target_predictions.txt

After running the command (where $N is the number of CPU cores to use), all target predictions are contained inside target_predictions.txt.
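The output is plain tab-separated text, so it can be inspected with standard tools. For example, a minimal pandas sketch for loading it (the Mirbase ID column name follows the TargetScan-style output shown in the snippets below; the exact columns depend on the prediction algorithm configured):

import pandas as pd

# Load the FilTar output table (tab-separated text)
predictions = pd.read_csv("target_predictions.txt", sep="\t")

# Example: count predicted target sites per miRNA
print(predictions.groupby("Mirbase ID").size().sort_values(ascending=False).head())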

The following video presents a concise demonstration of basic FilTar usage:

https://www.youtube.com/watch?v=Xhl-nsg7_xo

More detailed instructions can be found inside the full documentation: https://tbradley27.github.io/FilTar/

Publication

The article describing FilTar was published in Volume 36, Issue 8 (pages 2410-2416) of the journal Bioinformatics (Oxford University Press). An online, open access version of the article is available via the DOI below.

DOI: 10.1093/bioinformatics/btaa007

PMCID: PMC7178423

PMID: 31930382

To cite the article, please use the following:

Thomas Bradley, Simon Moxon, FilTar: using RNA-Seq data to improve microRNA target prediction accuracy in animals, Bioinformatics, Volume 36, Issue 8, 15 April 2020, Pages 2410–2416, https://doi.org/10.1093/bioinformatics/btaa007

Getting Help

To ensure your enquiries are seen by as many people as possible who may share your problem, it is best to raise them publicly. The first port of call is the Biostars bioinformatics forum (https://www.biostars.org/). If using Biostars, please add the 'filtar' tag to your question so that I am notified and can answer promptly.

If this option doesn't work for whatever reason, I also accept correspondence via my academic email address: [email protected]

Reporting bugs, suggested enhancements or any other issues

The issues page of this repository is the best place to post these.

Contributions and Acknowledgements

Simon Moxon came up with the original idea and project proposal for FilTar. The concept was then extended and developed further by Simon Moxon and Thomas Bradley over the course of the latter's BBSRC (Biotechnology and Biological Sciences Research Council) PhD studentship on the Norwich Research Park (NRP) Bioscience Doctoral Training Partnership (DTP) programme. During this time, Thomas Bradley worked under the primary supervision of Simon Moxon, initially predominantly at the Earlham Institute and later predominantly at the School of Biological Sciences, University of East Anglia.

Full acknowledgements can be found within the preprint version of the article associated with FilTar.

Code Snippets
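The snippets below are extracted from the rules and scripts that make up the FilTar Snakemake workflow; where available, the source Snakefile and line number are noted beneath each snippet.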

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell


# Derive samtools' temporary-file prefix from the output BAM name
prefix = os.path.splitext(snakemake.output[0])[0]

shell(
    "samtools sort {snakemake.params} -@ {snakemake.threads} -o {snakemake.output[0]} "
    "-T {prefix} {snakemake.input[0]}")
import os

# ENA stores FASTQ files under two possible directory layouts depending on
# the accession length, so try the nested layout first and fall back to the
# flat one (hence the '||' between the two wget calls).
acc = snakemake.wildcards['accession']

if 'single_end' in snakemake.output[0]:
	os.system("wget -nv --directory-prefix=data/single_end/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{0}/00{1}/{2}/{2}.fastq.gz || wget -nv --directory-prefix=data/single_end/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{0}/{2}/{2}.fastq.gz".format(acc[0:6], acc[-1], acc))
elif 'paired_end' in snakemake.output[0]:
	mate = snakemake.wildcards['mate_number']
	os.system("wget -nv --directory-prefix=data/paired_end/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{0}/00{1}/{2}/{2}_{3}.fastq.gz || wget -nv --directory-prefix=data/paired_end/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{0}/{2}/{2}_{3}.fastq.gz".format(acc[0:6], acc[-1], acc, mate))
script: "download_fastq.py"
From line 24 of ENA/Snakefile

script: "download_fastq.py"
From line 30 of ENA/Snakefile
shell: "rsync -av rsync://ftp.ensembl.org/ensembl/pub/release-{config[ensembl_release]}/gtf/{params}/{wildcards.genus_species}.{wildcards.build}.{config[ensembl_release]}.chr.gtf.gz data/ && gunzip data/{wildcards.genus_species}.{wildcards.build}.{config[ensembl_release]}.chr.gtf.gz && sed 's/^chr//g' data/{wildcards.genus_species}.{wildcards.build}.{config[ensembl_release]}.chr.gtf > tmp && mv tmp data/{wildcards.genus_species}.{wildcards.build}.{config[ensembl_release]}.chr.gtf"
        shell: "sed -r 's/^MT//g' {input} | sed '/^\t/d'  > {output}" # delete mitochondiral records and records not assigned to a chromosome

rule get_transcript_ids:
        input:
                gtf=get_gtf_file
        output: "results/bed/{species}_chr{chrom}_all_transcripts.txt"
shell: "scripts/get_all_transcripts.sh {input} {wildcards.chrom} > {output}"
shell: "rsync -av rsync://ftp.ensembl.org/ensembl/pub/release-{config[ensembl_release]}/fasta/{params}/dna/{wildcards.genus_species}.{wildcards.build}.dna.toplevel.fa.gz data/"
shell: "rsync -av rsync://ftp.ensembl.org/ensembl/pub/release-{config[ensembl_release]}/fasta/{params}/dna/{wildcards.genus_species}.{wildcards.build}.dna.primary_assembly.fa.gz data/"
shell: "rsync -av rsync://ftp.ensembl.org/ensembl/pub/release-{config[ensembl_release]}/fasta/{params}/dna/{wildcards.genus_species}.{wildcards.build}.dna.chromosome.{wildcards.chrom}.fa.gz data/"
shell: "pigz -p {threads} -d {input}"
shell:
   "wget --directory-prefix=data/maf_hsa/ -nv http://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz100way/maf/chr{wildcards.chrom}.maf.gz && gunzip {output}.gz"
shell:
   "wget --directory-prefix=data/maf_mmu/ -nv http://hgdownload.cse.ucsc.edu/goldenPath/mm10/multiz60way/maf/chr{wildcards.chrom}.maf.gz && gunzip {output}.gz"
shell: "rsync -av rsync://ftp.ensembl.org/ensembl/pub/release-{config[ensembl_release]}/fasta/{params}/cdna/{wildcards.genus_species}.{wildcards.build}.cdna.all.fa.gz data"
shell: "gunzip {input}"
shell: 'wget https://sourceforge.net/projects/apatrap/files/APAtrap_Linux.zip/download && mv download scripts && unzip -d scripts/ scripts/download'
shell: "fasterq-dump -O data/single_end/ --threads {threads} {wildcards.accession}"
shell: "fasterq-dump -O data/paired_end/ --threads {threads} {wildcards.accession}"
shell: "gzip {input}"
shell: "gzip {input.file1} {input.file2}"
library(tidyverse)

# load in target predictions

target_predictions = readr::read_tsv(snakemake@input[['targets']])

utr_genomic_positions = readr::read_tsv(snakemake@input[['bed_file']], col_names=c('chromosome','start','stop','strand','transcript'), 
	col_types='ciicc')

utr_genomic_positions = utr_genomic_positions[!duplicated(utr_genomic_positions$transcript),]

print(target_predictions)
print(utr_genomic_positions)

# Convert 3'UTR-relative site coordinates into genomic coordinates
target_predictions = dplyr::inner_join(target_predictions, utr_genomic_positions, by=c(`Gene ID`='transcript'))
target_predictions$genomic_start = target_predictions$start + target_predictions$`UTR start`
target_predictions$genomic_end = target_predictions$start + target_predictions$`UTR end`

print(target_predictions %>% select(`Gene ID`,`Mirbase ID`,`UTR start`,`UTR end`,start,stop,genomic_start,genomic_end), width=Inf)

write.table(x=target_predictions, file=snakemake@output[[1]], sep='\t', quote=FALSE, col.names=TRUE, row.names=FALSE)
script: 'get_target_genomic_coordinates.R'
import os
import re
from functools import reduce
from Bio import AlignIO
from Bio.AlignIO import MafIO  # ensure the MafIO submodule is loaded

def add_alignment():
           global start_pos
           global end_pos
           if strand == -1:
               start_pos = start_pos[::-1] # reverse the exon order for minus-strand transcripts
               end_pos = end_pos[::-1]

           new_multiple_alignment = idx.get_spliced(start_pos, end_pos, strand) # splice the exon blocks via the MAF index
           for i in range(len(new_multiple_alignment)):
              new_multiple_alignment[i].id = re.sub(r'\..*', '', new_multiple_alignment[i].id) # strip chromosome information from the id
              new_multiple_alignment[i].id = accession + '\t' + new_multiple_alignment[i].id # label each alignment with the relevant transcript accession

           global big_alignment
           big_alignment += new_multiple_alignment.format('fasta')

if os.stat(snakemake.input['bed']).st_size == 0:
     target = open(snakemake.output[0], 'w')
else:
    if snakemake.wildcards['species'] == "hsa":    # map the FilTar species code onto the corresponding UCSC genome build
       build = "hg38"
    elif snakemake.wildcards['species'] == "mmu":
       build = "mm10"
    else:
       build = ''

    idx = AlignIO.MafIO.MafIndex(snakemake.input['maf_index'], snakemake.input['maf'], "{}.chr{}".format(build, snakemake.wildcards['chrom'])  )

    start_pos = []
    end_pos = []
    accession = 'empty'
    with open(snakemake.input['bed'] ) as f:
       big_alignment = ''
       for line in f:    #Open and loop through line-by-line the relevant BED file
          parts = line.split()  

          if accession == 'empty':
             accession = parts[4]   
             start_pos.append(int(parts[1]))
             end_pos.append(int(parts[2]))
             strand = (int(parts[3]))

          elif parts[4] == accession:       # If accession has multiple entries in bed, add additional genome co-ordinates to rel. lists
             start_pos.append(int(parts[1]))
             end_pos.append(int(parts[2]))
             strand = (int(parts[3]))

          else:
             add_alignment()

             start_pos = [] # initialise a new transcript record
             end_pos = []
             start_pos.append(int(parts[1]))
             end_pos.append(int(parts[2]))
             strand = (int(parts[3]))
             accession = parts[4]
       else:
         # for/else: flush the final transcript once the file is exhausted
         add_alignment()

       result = reduce(lambda x, y: x.replace(y, snakemake.config["TaxID"][y]), snakemake.config["TaxID"], big_alignment) # get NCBI taxonomic IDs
       result = result.replace('\n','') # convert from fasta to tsv
       result = result.replace('>','\n') # convert from fasta to tsv
       result = re.sub(r'(\s[0-9]{4,7})', r'\1\t', result) # convert from fasta to tsv
       result = re.sub('\n','',result, count=1) # remove leading empty line

       target = open(snakemake.output[0], 'w')

       for line in iter(result.splitlines()):
          pattern = re.compile(r'\.[0-9][0-9]?\sN+')
          pattern2 = re.compile(r'T[0-9]+\sN+')
          pattern3 = re.compile(r'^N+$') # an unknown reference transcript leaves trailing lines of Ns which need removing
          if 'delete' in line:
              pass
          elif 'unknown' in line:
              pass
          elif pattern.search(line):
              pass
          elif pattern2.search(line):
              pass
          elif pattern3.search(line):
              pass
          else:
              target.write(line + '\n')

       target.close()
from Bio import AlignIO
from Bio.AlignIO import MafIO  # ensure the MafIO submodule is loaded

# Map the FilTar species code onto the corresponding UCSC genome build
if snakemake.wildcards['species'] == "hsa":
	build = "hg38"
elif snakemake.wildcards['species'] == "mmu":
	build = "mm10"
else:
	build = ''

# Index the per-chromosome MAF file so that alignment blocks can later be
# retrieved by genomic coordinate
idx = AlignIO.MafIO.MafIndex(snakemake.output[0], snakemake.input[0], "{}.chr{}".format(build, snakemake.wildcards['chrom']))
script:
   "biopython_maf_processing.py"
script: 
   "biopython_maf_processing2.py"
# An empty input file is copied through unchanged; otherwise the FASTA
# records are merged with filtar::MergeFasta
if ( file.info(snakemake@input[[1]])$size == 0 ) {
	file.copy(from=snakemake@input[[1]], to=snakemake@output[[1]])
} else {
	output = filtar::MergeFasta(snakemake@input[[1]])
	write.table(output, snakemake@output[[1]], quote=FALSE, sep="\t", col.names=FALSE, row.names=FALSE)
}
shell: "scripts/get_bedtools_bed.sh {input} {output}"
shell: "bedtools getfasta -name -s -fi {input.dna} -bed {input.bed} -fo {output}"
script: "merge_fasta.R"
shell: "scripts/convert_fa_to_tsv2.sh {input} {params} {output}"
from Bio import SeqIO

records = list(SeqIO.parse(snakemake.input[0],'fasta'))
filtered_records = []

for record in records:
	if record.id in snakemake.config['mirnas']:
		filtered_records.append(record)
	elif len(snakemake.config['mirnas']) == 0: # use all records if mirna config entry is empty
		filtered_records.append(record)
	else:
		pass

SeqIO.write(filtered_records,snakemake.output[0],'fasta')
shell: "grep -A 1 {wildcards.species} {input} | awk '{{ print $1 }}' | sed 's/--//g' | sed '/^$/d' > {output}"
script: "filter_mirna_fasta.py"
united_quant = readr::read_tsv(
        file=snakemake@input[[1]],
        col_types=readr::cols(.default = 'd', Name = 'c')
)

united_quant2 = filtar::AvgSalmonQuant(united_quant)

avg_quant = united_quant2[,c('Name','avg')]
colnames(avg_quant) = c('Name','TPM')

write.table(
        x=avg_quant,
        file=snakemake@output[[1]],
        row.names=FALSE,
        col.names=TRUE,
        quote=FALSE,
        sep="\t"
)      
united_quant = readr::read_tsv(
        file=snakemake@input[[1]],
        col_types=readr::cols(.default = 'd', Name = 'c')
)

united_quant2 = filtar::AvgSalmonQuant(united_quant)

avg_quant = tibble::tibble(Name=united_quant2$Name, Length=20, EffectiveLength=20.00, TPM=united_quant2$avg, NumReads=20.00)

real_output = paste(snakemake@output[[1]],'quant.sf',sep="/")
dir.create(snakemake@output[[1]])

write.table(
        x=avg_quant,
        file=real_output,
        row.names=FALSE,
        col.names=TRUE,
        quote=FALSE,
        sep="\t"
)      
import os

# Two inputs (index + a single FASTQ) indicate single-end reads; otherwise
# the two FASTQ files are quantified as a paired-end library.
if len(snakemake.input) == 2:
        os.system('salmon quant --validateMappings --rangeFactorizationBins 4 --seqBias --posBias -p {} -i {} -l A -r {} -o {}'.format(snakemake.threads,snakemake.input['index'],snakemake.input['reads'][0],snakemake.output))
else:
        os.system('salmon quant --validateMappings --rangeFactorizationBins 4 --seqBias --posBias -p {} -i {} -l A -1 {} -2 {} -o {}'.format(snakemake.threads,snakemake.input['index'],snakemake.input['reads'][0],snakemake.input['reads'][1],snakemake.output))
shell:
    "salmon index --threads {threads} -t {input} -i {output} --type quasi -k 31"
shell:
    "salmon index --threads {threads} -t {input} -i {output} --type quasi -k 31"
script:
	"quant_salmon.py"
script:
	"quant_salmon.py"
shell: "grep 'expected' {input}/lib_format_counts.json | awk '{{print $2}}' | sed 's/\"//g' | sed 's/,//g' > {output}"
shell: "salmon quantmerge --quants {input} --names {input} -o {output}"
script: "get_average_quant.R"
From line 108 of salmon/Snakefile

shell: "salmon quantmerge --quants {input} --names {input} -o {output}"

script: "get_average_quant2.R"
From line 127 of salmon/Snakefile
data = readr::read_tsv(snakemake@input[[1]], col_names=FALSE)
data = tidyr::separate(data, X5, into=c('X5.1','X5.2'), remove=TRUE)
data = tidyr::separate(data, X6, into=c('X6.1','X6.2'), remove=TRUE)

colnames(data) = c('miRNA_ID','transcript_ID','score','energy(kCal/Mol)','miRNA_start','miRNA_end','3UTR_start','3UTR_end','alignment_length','percent_matches', 'percent_matches_and_wobbles')

write.table(
	x=data,
	file=snakemake@output[[1]],
	quote=FALSE,
	row.names=FALSE,
	col.names=TRUE,
	sep="\t"
)
shell: "sed 's/(+)//g' {input} | sed 's/(-)//g' > {output}"
shell: "miranda {input.mirna} {input.utr} {params} -sc {config[miRanda.minimum_alignment_score]} -en {config[miRanda.minimum_energy_score]} -scale {config[miRanda.5_prime_3_prime_scaling_factor]} -go {config[miRanda.alignment_gap_open_penalty]} -ge {config[miRanda.alignment_gap_extension_penalty]} > {output}"
shell: "scripts/convert_miRanda_to_tsv.sh {input} {output}"
shell: "cat {input} > {output}"
script: "add_miRanda_header.R"
if ( file.info(snakemake@input[['miRanda_scores']])$size == 0  ) {
	file.create(snakemake@output[[1]])	
} else {

	filtered_miRanda_scores = filtar::filter_miRanda_scores(snakemake@input[['miRanda_scores']], snakemake@input[['expression_values']],snakemake@params[['tpm_expression_threshold']])
	write.table(filtered_miRanda_scores, snakemake@output[[1]], row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
}
script: 'filter_miRanda_scores.R'
script: 'filter_miRanda_scores.R'
shell: 'cp {input} {output}'
if ( file.info(snakemake@input[['miRanda_scores']])$size == 0  ) {
	file.create(snakemake@output[[1]])	
} else {

	filtered_miRanda_scores = filtar::filter_miRanda_scores(snakemake@input[['miRanda_scores']], snakemake@input[['expression_values']],snakemake@params[['tpm_expression_threshold']])
	write.table(filtered_miRanda_scores, snakemake@output[[1]], row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
}
script: 'filter_miRanda_scores.R'
script: 'filter_miRanda_scores.R'
filtar::filter_mir_families(snakemake@input[['mature_mirnas']],snakemake@input[['mirna_families']], snakemake@config[["mirnas"]], snakemake@output[[1]])
filtar::filter_mature_mirs(snakemake@input[[1]], snakemake@config[["mirnas"]], snakemake@output[[1]])
species = snakemake@wildcards$species

test = filtar::get_mirna_context(snakemake@input$mirna_seed, snakemake@input$mature_mirnas, snakemake@config$tax_ids[[species]])

readr::write_tsv(test, snakemake@output[[1]], col_names=FALSE)
species = snakemake@wildcards$species

mirna_seeds = filtar::get_mirna_family(snakemake@input[[1]], snakemake@wildcards$species, snakemake@config$tax_ids[[species]])

write.table(mirna_seeds, snakemake@output[[1]], sep="\t", quote=FALSE, col.names=FALSE, row.names=FALSE)
shell:
   "{input.script} {input.data} > {output}"
script:
   "get_mirna_family.R"
shell:
   "{input.script} {input.data} {wildcards.species} {params} > {output}"
script:
   "get_mirna_context.R"
script: "filter_mir_for_context_scores.R"
script: "filter_mir_families.R"
	shell:
           "{input.script} {input.mirna_families} {input.msa} {output}"
shell:
    "{input.script} {input.msa_3UTR} {params.tax_id}  > {output}"
shell:
    "{input.script} {input.mirna_family} {input.mirna_sites} {input.branch_lengths} > {output}"
shell:
    "{input.script} {input.mirna_seeds} {input.CDS} > {output.eightmer_counts}"
shell:
    "{input.script} {input.mirnas} {input.msa} {input.PCTs} {input.CDS_lengths} {input.eightmer_counts} {output} {params.tax_id} {input.AIRs} {params.RNAplfold_dir}"
shell: "cat {input} | sed '1b;/Gene/d' > {output}"
shell: "cat {input} | sed '1b;/Gene/d' > {output}"
shell: "awk '{{ print $5}}' {input} | sort | uniq > {output}"
       shell:
           "cd scripts/targetscan7 && wget http://www.targetscan.org/vert_72/vert_72_data_download/targetscan_70.zip && unzip targetscan_70.zip && rm UTR_Sequences_sample.txt miR_Family_info_sample.txt\
targetscan_70_output.txt README_70.txt targetscan_70.zip"
      shell:
          "wget http://www.targetscan.org/vert_72/vert_72_data_download/targetscan_70_BL_PCT.zip && unzip targetscan_70_BL_PCT.zip && mv\
TargetScan7_BL_PCT/PCT_parameters data/ && mv TargetScan7_BL_PCT/targetscan_70_BL_bins.pl TargetScan7_BL_PCT/targetscan_70_BL_PCT.pl scripts/targetscan7 && rm -rf targetscan_70_BL_PCT.zip TargetScan7_BL_PCT/"
shell:
  "wget http://www.targetscan.org/vert_72/vert_72_data_download/TargetScan7_context_scores.zip && unzip TargetScan7_context_scores.zip && mv TargetScan7_context_scores/Agarwal_2015_parameters.txt TargetScan7_context_scores/TA_SPS_by_seed_region.txt TargetScan7_context_scores/targetscan_count_8mers.pl TargetScan7_context_scores/targetscan_70_context_scores.pl scripts/targetscan7/ && rm -rf TargetScan7_context_scores/ TargetScan7_context_scores.zip"
shell: "sed -e '40 s/9606/$ARGV[1]/g' {input} | sed -e '38 s;PCT_parameters;data\/PCT_parameters;g' > {output} && chmod +x {output}"
shell: "sed -e '55 s;PCT_parameters;data\/PCT_parameters;g' {input} > {output} && chmod +x {output}"
filtered_contextpp_scores = filtar::filter_contextpp_scores(snakemake@input[['contextpp_scores']], snakemake@input[['expression_values']], snakemake@params[['tpm_expression_threshold']])

write.table(filtered_contextpp_scores, snakemake@output[[1]], row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
script: 'filter_contextpp_scores.R'
script: 'filter_contextpp_scores.R'
shell: "cp {input} {output}" 
filtered_contextpp_scores = filtar::filter_contextpp_scores(snakemake@input[['contextpp_scores']], snakemake@input[['expression_values']], snakemake@params[['tpm_expression_threshold']])

write.table(filtered_contextpp_scores, snakemake@output[[1]], row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
script: 'filter_contextpp_scores.R'
script: 'filter_contextpp_scores.R'
shell:
   "trim_galore --output_dir results/trimmed_fastq/  --length {config[trim_galore.length]} --stringency {config[trim_galore.stringency]} {input}"
shell:
   "trim_galore --output_dir results/trimmed_fastq/ --length  {config[trim_galore.length]} --stringency {config[trim_galore.stringency]}  --paired {input[0]} {input[1]}"
input = readr::read_tsv(snakemake@input[[1]], col_names=FALSE, col_types='ciiic')

if (length(snakemake@config[['transcripts']]) == 0) {

	file.copy(from=snakemake@input[[1]],to=snakemake@output[[1]])

} else {

	for (transcript in snakemake@config[['transcripts']]) {
		if (!transcript %in% input$X5) {
			write(stringr::str_interp("The transcript identifier '${transcript}' is not a valid identifier for the selected species in Ensembl release ${snakemake@config[['ensembl_release']]}"), stderr())
		}
	}

	filtered_input = input[input$X5 %in% snakemake@config[['transcripts']],]

	if (dim(filtered_input)[1] == 0) {
		write("No valid transcript identifiers in input. Halting execution.", stderr())
		quit(save="no", status=1)
	}

	write.table(filtered_input,snakemake@output[[1]], col.names=FALSE, row.names=FALSE, quote=FALSE, sep="\t")
}
AIRs = filtar::get_canonical_AIRs(snakemake@input[[1]])

write.table(AIRs, snakemake@output[[1]], quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t")
utr_lengths = filtar::get_utr_lengths(snakemake@input[[1]]) 

write.table(
	x=utr_lengths,
	file=snakemake@output[[1]],
	sep="\t",
	quote=FALSE,
	row.names=FALSE,
	col.names=TRUE
)
shell:
    "{input.script} {input.gtf} {wildcards.feature} {output}"
script: 'get_utr_lengths.R'
script: 'get_canonical_AIR.R'
script: 'get_canonical_AIR.R'
shell: 'grep -E "^{wildcards.chrom}\s" {input} > {output}'
script: 'filter_bed6.R'
shell: 'grep -E "^{wildcards.chrom}\s" {input} > {output} || true'
if ( file.info(snakemake@input$normal_bed)$size == 0 | file.info(snakemake@input$extended_bed)$size == 0 ) {
	file.copy(from=snakemake@input$normal_bed, to=snakemake@output[[1]])	
} else {
	full_set_sorted = filtar::get_full_bed(snakemake@input$normal_bed, snakemake@input$extended_bed, snakemake@input$all_transcripts, snakemake@input$tx_quant)
        write.table(full_set_sorted, file=snakemake@output[[1]], quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t")
}
input = readr::read_tsv(snakemake@input[[1]], col_names=FALSE, col_types='ciiciciiiicc')

if (length(snakemake@config[['transcripts']]) == 0) {
        file.copy(from=snakemake@input[[1]],to=snakemake@output[[1]])
} else {
	transcripts = gsub('\\..*','', snakemake@config[['transcripts']])

	filtered_input = input[input$X4 %in% transcripts,]

	write.table(filtered_input,snakemake@output[[1]], col.names=FALSE, row.names=FALSE, quote=FALSE, sep="\t")
}
input = readr::read_tsv(snakemake@input[[1]], col_names=FALSE, col_types='ciiic')

if (length(snakemake@config[['transcripts']]) == 0) {

	file.copy(from=snakemake@input[[1]],to=snakemake@output[[1]])

} else {

	for (transcript in snakemake@config[['transcripts']]) {
		if (!transcript %in% input$X5) {
			write(stringr::str_interp("The transcript identifier '${transcript}' is not a valid identifier for the selected species in Ensembl release ${snakemake@config[['ensembl_release']]}"), stderr())
		}
	}

	filtered_input = input[input$X5 %in% snakemake@config[['transcripts']],]

	if (dim(filtered_input)[1] == 0) {
		write("No valid transcript identifiers in input. Halting execution.", stderr())
		quit(save="no", status=1)
	}

	write.table(filtered_input,snakemake@output[[1]], col.names=FALSE, row.names=FALSE, quote=FALSE, sep="\t")
}
united_bedgraph = readr::read_tsv(
	file=snakemake@input[[1]],
	col_names=FALSE,
	col_types=readr::cols(.default = 'd', X1 = 'c', X2 = 'i', X3 = 'i')
)

united_bedgraph = filtar::AvgBedgraph(united_bedgraph)

united_bedgraph = united_bedgraph[,c('X1','X2','X3','avg')]

write.table(
	x=united_bedgraph,
	file=snakemake@output[[1]],
	row.names=FALSE,
	col.names=FALSE,
	quote=FALSE,
	sep="\t"
)
# If the second input is effectively empty (a single byte), copy the first
# input through unchanged; otherwise build the AIR file from both inputs
if ( file.info(snakemake@input[[2]])$size == 1 ) {
	file.copy(from=snakemake@input[[1]], to=snakemake@output[[1]])
} else {
	AIR_file = filtar::get_AIR_file(snakemake@input[[1]], snakemake@input[[2]])
	write.table(AIR_file, snakemake@output[[1]], sep="\t", col.names=FALSE, row.names=FALSE, quote=FALSE)
}
utr_lengths = filtar::get_utr_lengths(snakemake@input[[1]]) 

write.table(
	x=utr_lengths,
	file=snakemake@output[[1]],
	sep="\t",
	quote=FALSE,
	row.names=FALSE,
	col.names=TRUE
)
import os

# Recover the index base name that hisat2 expects from the first index file
index_base_name = snakemake.input['index'].replace('.1.ht2','')

# A single FASTQ file is mapped as single-end (-U); two files as paired-end (-1/-2)
if len(snakemake.input['reads']) == 1:
	print('single_end_processing')
	os.system('hisat2 -x {} -p {} -U {} -S {} {}'.format(index_base_name,snakemake.threads,snakemake.input['reads'],snakemake.output,snakemake.params[0]))
else:
	print('paired_end_processing')
	os.system('hisat2 -x {} -p {} -1 {} -2 {} -S {} {}'.format(index_base_name,snakemake.threads,snakemake.input['reads'][0], snakemake.input['reads'][1], snakemake.output, snakemake.params[0]))
shell: "python {input.py_script} {input.gtf} > {output}"
shell: "python {input.py_script} {input.gtf} > {output}"
shell: "hisat2-build -f -p {threads} --ss {input.splice_sites} --exon {input.exons} {input.assembly} data/{wildcards.species}"
shell: "hisat2-build -f -p {threads} {input.assembly} data/{wildcards.species}"
script:
	"map_reads.py"
From line 115 of hisat2/Snakefile

shell: "samtools view -@ {threads} -Sb {input} > {output}"

wrapper:
    "0.27.0/bio/samtools/sort"
From line 134 of hisat2/Snakefile
import os

# Merge replicate bedgraphs into a single multi-sample bedgraph; with only
# one input there is nothing to merge, so copy it through unchanged.
if len(snakemake.input) > 1:
	os.system("bedtools unionbedg -i {} > {}".format(snakemake.input, snakemake.output))
else:
	os.system("cp {} {}".format(snakemake.input, snakemake.output))
shell:
    "{input.script} {input.gtf} {output}"
script: 'filter_bed6.R'
shell: 'grep -E "^{wildcards.chrom}\s" {input} > {output} || true'
shell:
    "gtfToGenePred {input.gtf} {output}"
shell:
    "genePredToBed {input.genepred} {output}"
shell:
    "genomeCoverageBed -bg -ibam {input.sorted_bam} -split > {output}"
shell: 'grep -E "^{wildcards.chrom}\s" {input} > {output}'
script: "merge_bedgraphs.py"
script:
        "get_average_bedgraph.R"
script: "merge_bedgraphs.py"
script:
        "get_average_bedgraph.R"
shell: 'grep -E "^{wildcards.chrom}\s" {input} > {output}'
script: "filter_bed12.R"
shell: "./{input.script} -i {input.bedgraphs} -p {config[APAtrap.min_proportion_of_valid_nucleotides_in_window]} -c {config[APAtrap.min_window_coverage]} -w {config[APAtrap.window_size]} -e {config[APAtrap.utr_extension_size]} -m {input.bed} -o {output}"
script:
    "extend_bed2.R"
shell: "cat {input} > {output}"
shell:
   "./{input.script} -i {input.bedgraphs} -g 1 -n 1 -d {config[APAtrap.min_cov_variation_between_APA_sites]} -c {config[APAtrap.min_average_cov]} -a {config[APAtrap.min_distance_between_APA_sites]} -w {config[APAtrap.predictAPA_window_size]} -u {input.bed}  -o {output}"
shell: "cat {input} | sed '1b;/Gene/d' > {output}"
script: 'get_utr_lengths.R'
script: 'get_utr_lengths.R'
script: "get_tissue_specific_APA_file.R"
shell:
    "{input.script} {input.gtf} CDS {output}"
script: "filter_bed6.R"
shell: 'grep -E "^{wildcards.chrom}\s" {input} > {output} || true'
sed 's/(+)//g' $1 | sed 's/(-)//g' | tr '\n' '\t' | sed 's/>/\n/g' | sed '/^$/d' | awk -v species=$2 '{OFS="\t"}{ print $1,species,$2}' > $3
# Check whether the miRanda output contains any hits (-q suppresses output)
if grep -q 'hit' "$1"
then
	# Extract the identifier lines that follow each hit
	grep -A 1 'hit' "$1" | sed '/--/d' | grep '>' | sed 's/>//' > "$2"
else
	touch "$2" # no hits: create an empty output file
fi
cat $1 | awk '$3 == "transcript"' | awk '{OFS="\t"}{print $1,$14,$16}' | sed 's/"//g' | sed 's/;//g' | sed 's/\t/\./2' | grep -E "^$2	" | awk '{print $2}'
awk '{OFS="\t"}{ print $1,$2,$3,$5,$5,$4}' $1 | sed 's/\t1$/\t+/g' | sed 's/\t-1$/\t-/g' > $2