Pipeline for the processing of 3' end sequencing libraries


Pipeline to infer poly(A) site clusters through processing of 3' end sequencing libraries prepared according to various protocols. The pipeline was used for the generation of the PolyASite atlas.

Pipeline schematic

Further information on the implemented processing can be found below and on PolyASite.

Pipeline DAG

Requirements

The pipeline was tested on an HPC environment managed by Slurm.

Conda environment

We recommend using a Conda environment that contains the necessary software.

The environment was created with:

conda env create \
 --name polyA_atlas_pipeline \
 --file snakemake_run_env_requirements.yaml

Activate the environment with:

source activate polyA_atlas_pipeline

Deactivate the environment with:

source deactivate

Prerequisites

The pipeline has three central elements: the Snakefile, a config file, and a sample table:

  • The Snakefile does not need to be modified unless bugs or intended updates require changes.

  • The config file (called config.yaml) needs to be adjusted to your needs. It requires a set of samples, an organism, a genome version and an annotation version. Organism and genome/annotation versions are edited directly in the config file; samples are provided indirectly, by indicating the path to the sample table. An example config file can be found here: tests/EXAMPLE_config.yaml.
    As described in the Run section below, you can also run Snakemake without modifying the config file, specifying the required information as command-line arguments instead.

  • The sample table lists the sample-specific information required for a successful run of the pipeline. Follow the format in tests/EXAMPLE_samples.tsv and include one line for each sample to be processed. A minimal sketch of both files is shown below.
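For orientation, here is a hypothetical sketch of the two files. All keys, columns and values below are illustrative assumptions; the authoritative formats are the two example files named above.

# Sketch only -- consult tests/EXAMPLE_config.yaml and tests/EXAMPLE_samples.tsv
# for the actual keys and columns.
cat > config.yaml <<'EOF'
samples: "samples.tsv"   # path to the sample table (assumed key name)
organism: "HomoSapiens"  # as used with --config in the Run section
genome: "GRCh38.90"      # genome/annotation version, as used with --config
EOF
printf 'sample_id\tprotocol\n' > samples.tsv   # placeholder header columns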

Run the pipeline

Local

Local execution is not recommended and should only be used for testing purposes.

snakemake \
 -p \
 --use-singularity \
 --singularity-args "--bind ${PWD}" \
 --configfile config.yaml \
 &>> run_update.Organism_genomeVersion_annotationVersion.log
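Before a real run, Snakemake's dry-run flag -n lists the jobs that would be executed without running anything:

snakemake \
 -n \
 -p \
 --configfile config.yaml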

Slurm

snakemake \
 -p \
 --use-singularity \
 --singularity-args "--bind ${PWD}" \
 --configfile config.yaml \
 --cluster-config cluster_config.json \
 --jobscript jobscript.sh \
 --cores 500 \
 --local-cores 10 \
 --cluster "sbatch --cpus-per-task {cluster.threads} \
 --mem {cluster.mem} --qos {cluster.queue} \
 --time {cluster.time} -o {params.cluster_log} \
 -p [NODE_ID] --export=JOB_NAME={rule} \
 --open-mode=append" \
 &>> run_update.Organism_genomeVersion_annotationVersion.log
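The sbatch template above pulls the values threads, mem, queue and time for each rule from cluster_config.json. A minimal sketch of such a file with placeholder values is shown below; the cluster_config.json shipped with the repository is the reference.

cat > cluster_config.json <<'EOF'
{
    "__default__": {
        "threads": "1",
        "mem": "4G",
        "queue": "6hours",
        "time": "05:00:00"
    }
}
EOF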

General notes on running the pipeline

Instant config changes

If you do not want to change the base config file, you can also specify the appropriate values in the Snakemake command itself, e.g.:

snakemake \
 -p \
 --use-singularity \
 --singularity-args "--bind ${PWD}" \
 --configfile config.yaml \
 --config organism=HomoSapiens genome=GRCh38.90 atlas.release_name=r2.0 -- \
 &>> run_update.Organism_genomeVersion_annotationVersion.log

Running only parts of the pipeline

With the Snakemake option --until you can specify a target rule at which the pipeline run stops. This is useful if you only want to run a subset of rules. For example, the pipeline includes the creation of bigWig and track-info files for displaying the data in the UCSC genome browser. If you don't need these files, run with --until complete_clustering, as in the example below.
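For example, combining the options from the commands above with --until:

snakemake \
 -p \
 --use-singularity \
 --singularity-args "--bind ${PWD}" \
 --configfile config.yaml \
 --until complete_clustering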

Generate a graph for the preprocessing of an individual sample from a specific protocol

When a target file is provided in the Snakemake command, Snakemake runs the pipeline only up to the point at which the desired file is created. This can be used to generate overview graphs of the processing of samples from specific protocols, in order to check their processing steps. For example:

snakemake \
 -p \
 --configfile config.yaml \
 --dag \
 samples/counts/SRX517313_GRCh38.90.ip3pSites.out \
 | dot \
 -T png \
 > graph.SAPAS.png
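If the per-file DAG becomes too large to read, Snakemake's --rulegraph option draws one node per rule instead of one node per output file:

snakemake \
 -p \
 --configfile config.yaml \
 --rulegraph \
 samples/counts/SRX517313_GRCh38.90.ip3pSites.out \
 | dot \
 -T png \
 > rulegraph.SAPAS.png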

Protocol-specific notes

3' READS

Pre-processing involves:

  1. Filtering of reads based on the 5' configuration

  2. 3' adapter trimming

  3. Reverse complementing

  4. 3' trimming of potentially remaining As from the poly(A) tail

Only reads that start with a specified number of random nucleotides followed by two Ts are considered; the number of random nucleotides has to be extracted from the corresponding GEO/SRA entries or publication. See here for more information.
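This 5' configuration can be spot-checked on the raw reads before running the pipeline. In the sketch below, the 4 random nucleotides, the gzipped-FASTQ input and the file name are assumptions for illustration only:

# Count how many of the first 1000 reads start with 4 random
# nucleotides followed by two Ts.
zcat sample.fastq.gz \
 | awk 'NR % 4 == 2' \
 | head -n 1000 \
 | grep -cE '^[ACGTN]{4}TT'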

According to this paper, sequencing can be done in the sense or antisense direction. The samples currently processed here were sequenced in the antisense direction. Future samples should be checked carefully in order to decide whether the current settings are appropriate.

SAPAS

Pre-processing involves:

  1. Combined 5' and 3' adapter trimming

  2. Trimming of remaining Cs at the 3' end (they result from template-switching reverse transcription)

  3. Reverse complementing

Sequencing libraries are prepared such that sequencing can be done either in the antisense direction (Illumina) or in the sense direction (454). So far, only samples from Illumina sequencing have been processed. If sense-direction samples need to be processed, the pipeline must be adapted accordingly.

In the supplementary material of the APASdb paper, the authors state that they only consider reads that have the expected 5' linker sequence 5'-TTTTCTTTTTTCTTTTTT-3'. However, a manual comparison of the first reads from an old sample (SRX026584) with one from the mentioned publication revealed that this linker is very rare in the samples from the APASdb publication. We therefore decided not to be too strict about the 5' linker. Very often, only poly(T) is found; for the newer samples, a stretch of Ts is therefore given as the 5' adapter and trimmed together with the 3' adapter.
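To judge linker abundance in a new sample, count the linker in the first reads; the file name and gzipped-FASTQ input below are placeholder assumptions:

# Count the APASdb 5' linker among the first 10000 reads.
zcat sample.fastq.gz \
 | head -n 40000 \
 | awk 'NR % 4 == 2' \
 | grep -c TTTTCTTTTTTCTTTTTT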

A-seq2

Processing is done according to this protocol, but without using the first nucleotide as barcode information.

A-seq

Pre-processing involves:

  1. 3' adapter trimming

  2. Filtering for valid reads, with additional consideration of a maximum read length (see below).

3'-seq (Mayr)

Pre-processing involves:

  1. 3' adapter trimming

  2. Additional trimming of As and Ns at the 3' end

  3. Filtering for valid reads, with additional consideration of a maximum read length (see below).

PAS-Seq

Pre-processing involves:

  1. Trimming of poly(T) at the 5' end

  2. 3' adapter trimming

  3. Trimming of additional Cs at the 3' end

  4. Reverse complementing

DRS

Current pre-processing involves:

  1. Reverse complementing

  2. Correction by 1 nt to obtain the true 3' end position (note that this is encoded directly in the Snakefile, not in the config file)

The protocol facilitates direct sequencing of the RNA 3' end. Due to an initial T-fill step that involves the incorporation of a blocking nucleotide (anything except a T), sequencing actually begins one nucleotide upstream of the RNA's 3'-most nucleotide. Therefore, a correction of 1 nt in the downstream direction of the read's reverse complement is necessary to obtain the true 3' end (see "Extended Experimental Procedures" in Ozsolak et al. for more information).
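As a schematic illustration of this correction (a sketch, not the pipeline's actual code), shifting single-nucleotide 3' end coordinates in a BED-like file by 1 nt downstream with respect to the strand would look like this:

# Plus-strand sites shift +1, minus-strand sites shift -1 (column 6 = strand).
awk 'BEGIN {OFS="\t"}
 $6 == "+" {$2 += 1; $3 += 1}
 $6 == "-" {$2 -= 1; $3 -= 1}
 {print}' ends.bed > ends.corrected.bed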

PolyA-seq

Pre-processing involves:

  1. 3' adapter trimming

  2. Reverse complementing

3P-Seq

Pre-processing involves:

  1. Reverse complementing if necessary

  2. Filtering reads: only proceed with reads that have at least 2 As at the 3' end

  3. Removal of additional As at the 3' end that might remain from the poly(A) tail

Note that processing for samples of this protocol is sample-specific. In particular, only a subset of samples requires reverse complementing; hence, each sample has to be checked manually to determine whether reverse complementing is required. Once this information is provided in the design file, the pipeline will process the samples accordingly.

An easy way to check whether a file needs to be reverse complemented is to count the occurrences of the poly(A) signal and of its reverse complement in the first reads:

zcat samples/GSM1268942/GSM1268942.fa.gz \
 | head -n10000 \
 | tail -n1000 \
 | grep TTTATT | wc -l
zcat samples/GSM1268942/GSM1268942.fa.gz \
 | head -n10000 \
 | tail -n1000 \
 | grep AATAAA | wc -l

Comparing these numbers should show a clear preference for one of the two signals.

2P-Seq

Pre-processing involves:

  1. 3' adapter trimming

  2. Reverse complementing

Maximum read length

Samples prepared with the protocols 3'-seq (Mayr) and A-seq have a restriction on the maximum read length for processed reads to count as valid. As these protocols require sequencing in the sense direction, the length restriction ensures that the 3' end of the transcript is reached.
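As an illustration of such a length filter (a sketch only; the 50-nt cutoff and file names are placeholders, and a single-line-per-record FASTA is assumed), over-long reads could be removed like this:

# Join header+sequence, keep records with sequences of at most 50 nt,
# then restore the FASTA layout.
zcat valid_reads.fa.gz \
 | paste - - \
 | awk -F'\t' 'length($2) <= 50' \
 | tr '\t' '\n' \
 | gzip > length_filtered.fa.gz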

Code Snippets

From line 104 of master/Snakefile:
shell:
	'''
	(gffread \
	-w {output.fasta} \
	-g {input.fasta} \
	{input.gtf}) \
	&> {log}
	'''
From line 131 of master/Snakefile:
shell:
	'''
	(python {input.script} \
	--trim \
	-i {input.fasta} \
	-o {output.fasta}) \
	&> {log}
	'''
From line 163 of master/Snakefile:
shell:
	'''
	(segemehl.x \
	-x {output.idx} \
	-d {input.fasta}) \
	&> {log}
	'''
From line 192 of master/Snakefile:
shell:
	"(segemehl.x \
	-x {output.idx} \
	-d {input.sequence}) \
	&> {log}"
From line 216 of master/Snakefile:
shell:
	'''
	(bash {input.script} \
	-f {input.gtf} \
	-c 3 \
	-p exon \
	-o {output.exons} ) \
	&> {log}
	'''
From line 244 of master/Snakefile:
shell:
	'''
	(Rscript {input.script} \
	--gtf {input.exons} \
	-o {output.exons}) \
	&> {log}
	'''
From line 269 of master/Snakefile:
shell:
	'''
	(samtools dict \
	-o {output.header} {input.genome}) \
	&> {log}
	'''
From line 312 of master/Snakefile:
shell:
	'''
	segemehl.x \
	-i {input.idx} \
	-d {input.genome} \
	-t {threads} \
	-q {input.reads} \
	-outfile {output.gmap}
	'''
From line 345 of master/Snakefile:
shell:
	'''
	samtools view \
	{input.gmap} \
	> {output.gmap} \
	2> {log}
	'''
From line 386 of master/Snakefile:
shell:
	'''
	segemehl.x \
	-i {input.idx} \
	-d {input.transcriptome} \
	-t {threads} \
	-q {input.reads} \
	-outfile {output.tmap}
	'''	
From line 418 of master/Snakefile:
shell:
	'''
	samtools view \
	{input.tmap} \
	> {output.tmap} \
	2> {log}
	'''
From line 457 of master/Snakefile:
shell:
	'''
	(perl {input.script} \
	--in {input.tmap} \
	--exons {input.exons} \
	--out {output.genout}) \
	&> {log}
	'''
From line 493 of master/Snakefile:
shell:
	'''
	(cat {input.header} \
	{input.t2gmap} \
	{input.gmap} \
	> {output.catmaps}) \
	&> {log}
	'''
From line 525 of master/Snakefile:
shell:
	'''
	(samtools sort \
	-n \
	-o {output.sorted} \
	{input.sam}) \
	&> {log}
	''' 
From line 565 of master/Snakefile:
shell:
	'''
	(perl {input.script} \
	--print-header \
	--keep-mm \
	--in {input.sorted} \
	--out {output.remove_inf}) \
	&> {log}
	'''
From line 597 of master/Snakefile:
shell:
	'''
	(samtools view \
	-b {input.remove_inf} \
	> {output.bam}) \
	&> {log}
	'''
From line 628 of master/Snakefile:
shell:
	'''
	(samtools sort \
	{input.bam} \
	> {output.bam}) \
	&> {log}
	'''
From line 674 of master/Snakefile:
shell:
	'''
	(samtools view {input.bam} \
	| python {input.script} \
	--processors {threads} \
	| gzip > {output.reads_bed}) 2>> {log}
	'''
From line 257 of master/Snakefile:
shell:
    '''
    mkdir -p {params.cluster_samples_log}
    mkdir -p {params.cluster_countings_log}
    '''
From line 280 of master/Snakefile:
shell:
    '''
    mkdir -p {params.cluster_atlas_log}
    '''
From line 301 of master/Snakefile:
shell:
    '''
    wget -O {output.temp_genome} \
    {params.url} \
    &> /dev/null &&
    gzip -cd {output.temp_genome} \
    > {output.genome} &&
    sed 's/\s.*//' {output.genome} \
    > {output.clean}
    '''
From line 327 of master/Snakefile:
shell:
    '''
    wget -O {output.temp_anno} \
    {params.url} \
    &> /dev/null &&
    gzip -cd {output.temp_anno} \
    > {output.anno}
    '''
From line 359 of master/Snakefile:
shell:
    '''
    perl {input.script} \
    --type_id={params.type_id} \
    {params.types} \
    {params.tr_supp_level_id} {params.tr_supp_level} \
    {input.anno} \
    > {output.filtered_anno}
    '''
From line 417 of master/Snakefile:
shell:
    '''
    python3 {input.script} \
    --srr_id {params.srr_id} \
    --outdir {params.outdir} \
    --paired \
    2> {log}
    '''
From line 451 of master/Snakefile:
shell:
    '''
    python3 {input.script} \
    --srr_id {params.srr_id} \
    --outdir {params.outdir} \
    2> {log}
    '''
From line 485 of master/Snakefile:
shell:
    '''
    cd {params.file_dir}
    IFS=',' read -ra SRR <<< "{params.sample_srr}"
    if [[ "${{#SRR[@]}}" > "1" ]];then
    first_file="${{SRR[0]}}.fastq.gz"
    for i in $(seq 1 $((${{#SRR[@]}}-1))); do curr_file="${{SRR[$i]}}.fastq.gz"; cat ${{curr_file}} >> ${{first_file}};done
    fi
    ln -fs {params.first_srr}.fastq.gz {params.sample_id}.fq.gz
    cd -
    '''
From line 530 of master/Snakefile:
shell:
    '''
    (zcat {input.sample_fq} \
    | {input.script} \
    | fastx_renamer -n COUNT -z \
    > {output.sample_fa}) \
    2> {log}
    '''
From line 558 of master/Snakefile:
run:
    import gzip
    n = 0
    with gzip.open(input.sample_fa, "rt") as infile:
        n = sum([1 for line in infile if line.startswith(">")])
    with open(output.raw_cnt, "w") as out:
        out.write("reads.raw.nr\t%i\n" % n)
From line 619 of master/Snakefile:
shell:
    '''
    (zcat {input.sample_fa} \
    | perl {input.script} \
    --adapter={params.adapt} \
    | gzip > {output.selected_5p}) 2> {log}
    '''
From line 663 of master/Snakefile:
shell:
    '''
    (zcat {input.sample_fa} \
    | perl {input.script} \
    --adapter={params.adapt} \
    | gzip > {output.trimmed_5p}) 2> {log}
    '''
From line 700 of master/Snakefile:
shell:
    '''
    zcat {input.sample_fa} \
    | perl {input.script} \
    --minLen={params.minLen} \
    --nuc={params.adapt} \
    | gzip > {output.nuc_trimmed}
    '''
From line 736 of master/Snakefile:
shell:
    '''
    cutadapt \
    -g {params.adapt} \
    --minimum-length {params.minLen} \
    -o {output.no_5p_adapter} \
    {input.in_fa} \
    &> {log}
    '''
From line 774 of master/Snakefile:
shell:
    '''
    cutadapt \
    -a {params.adapt} \
    {params.five_p_adapt} \
    --minimum-length {params.minLen} \
    -o {output.no_3p_adapter} \
    {input.in_fa} \
    &> {log}
    '''
From line 815 of master/Snakefile:
shell:
    '''
    zcat {input.no_3p_adapter} \
    | perl {input.script} \
    --minLen={params.minLen} \
    --nuc={params.adapt} \
    | gzip > {output.nuc_trimmed}
    '''
From line 846 of master/Snakefile:
shell:
    '''
    zcat {input.input_seqs} \
    | fastx_reverse_complement -z \
    -o {output.rev_cmpl} \
    &> {log}
    '''
From line 879 of master/Snakefile:
shell:
    '''
    (zcat {input.in_fa} \
    | perl {input.script} \
    --adapter={params.adapt} \
    | gzip > {output.selected_3p}) 2> {log}
    '''
From line 917 of master/Snakefile:
shell:
    '''
    cutadapt \
    --adapter {params.adapt} \
    --minimum-length {params.minLen} \
    --overlap {params.min_overlap} \
    -e {params.error_rate} \
    -o {output.no_polyAtail} \
    {input.no_3p_adapter} \
    &> {log}
    '''
From line 949 of master/Snakefile:
run:
    import gzip
    n = 0
    with gzip.open(input.in_fa, "rt") as infile:
        n = sum([1 for line in infile if line.startswith(">")])
    with open(output.trimmed_cnt, "w") as out:
        with open(input.prev_cnt, "r") as cnt:
            out.write("%s" % cnt.read() )
        out.write("reads.trim.out\t%i\n" % n)
From line 995 of master/Snakefile:
shell:
    '''
    (zcat {input.valid_rds_in} \
    | perl {input.script_filter} \
    --max {params.maxN} --nuc N \
    | perl {input.script_filter} \
    --max {params.maxAcontent} --nuc A  \
    | perl {input.script_last} \
    | gzip > {output.valid_reads}) 2> {log}
    '''
From line 1046 of master/Snakefile:
shell:
    '''
    (zcat {input.valid_rds_in} \
    | perl {input.script_len_filter} --max={params.maxLen} \
    | perl {input.script_filter} \
    --max={params.maxN} --nuc=N \
    | perl {input.script_filter} \
    --max={params.maxAcontent} --nuc=A  \
    | perl {input.script_last} \
    | gzip > {output.valid_reads}) 2> {log}
    '''
From line 1081 of master/Snakefile:
run:
    import gzip
    n = 0
    with gzip.open(input.in_fa, "rt") as infile:
        n = sum([1 for line in infile if line.startswith(">")])
    with open(output.valid_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("reads.valid.nr\t%i\n" % n)
From line 1151 of master/Snakefile:
shell:
    '''
    (python {input.script} \
    --bed {input.reads_bed} \
    | gzip > {output.unique_bed}) 2>> {log}
    '''
From line 1189 of master/Snakefile:
run:
    import gzip
    unique = 0
    mapped = {}
    with gzip.open(input.reads_bed, "rt") as in_all:
        total_mapped = {line.split("\t")[3]:1 for line in in_all.readlines()}
    with gzip.open(input.unique_bed, "rt") as in_bed:
        unique = sum([1 for line in in_bed])
    multi = len(total_mapped) - unique
    with open(output.mapped_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("reads.mapped.uniqueMappers.nr\t%i\n" % unique)
        out.write("reads.mapped.multiMappers.nr\t%i\n" % multi)
From line 1242 of master/Snakefile:
shell:
    '''
    (perl {input.script} \
    {params.exclude_chr} \
    --correction={params.correction} \
    --strict \
    --min_align={params.min_align} \
    {input.unique_bed} \
    | gzip > {output.end_sites}) 2>> {log}
    '''
From line 1278 of master/Snakefile:
run:
    import gzip
    plus = 0
    plus_reads = 0
    minus = 0
    minus_reads = 0
    with gzip.open(input.end_sites, "rt") as in_bed:
        for line in in_bed:
            F = line.rstrip().split("\t")
            if F[5] == "+":
                plus += 1
                plus_reads += float(F[4])
            else:
                minus += 1
                minus_reads += float(F[4])
    with open(output.sites_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("sites.highconfidence.number.plus\t%i\n" % plus)
        out.write("sites.highconfidence.number.minus\t%i\n" % minus)
        out.write("sites.highconfidence.reads.plus\t%i\n" % plus_reads)
        out.write("sites.highconfidence.reads.minus\t%i\n" % minus_reads)
From line 1335 of master/Snakefile:
shell:
    '''
    (perl {input.script} \
    --genome={input.genome} \
    --upstream={params.upstream_ext} \
    --downstream={params.downstream_ext} \
    {input.ends} \
    | gzip > {output.seqs}) 2>> {log}
    '''
From line 1380 of master/Snakefile:
shell:
    '''
    (perl {input.script} \
    --upstream_len={params.upstream_ext} \
    --downstream_len={params.downstream_ext} \
    --consecutive_As={params.consec_As} \
    --total_As={params.tot_As} \
    {params.ds_patterns} \
    {input.seqs} \
    | gzip > {output.ip_assigned}) 2>> {log}
    '''
From line 1415 of master/Snakefile:
run:
    import gzip
    plus = 0
    plus_reads = 0
    minus = 0
    minus_reads = 0
    with gzip.open(input.end_sites, "rt") as in_bed:
        for line in in_bed:
            F = line.rstrip().split("\t")
            if F[3] == "IP":
                if F[5] == "+":
                    plus += 1
                    plus_reads += float(F[4])
                else:
                    minus += 1
                    minus_reads += float(F[4])
    with open(output.ip_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("sites.highconfidence.internalpriming.number.plus\t%i\n" % plus)
        out.write("sites.highconfidence.internalpriming.number.minus\t%i\n" % minus)
        out.write("sites.highconfidence.internalpriming.reads.plus\t%i\n" % plus_reads)
        out.write("sites.highconfidence.internalpriming.reads.minus\t%i\n" % minus_reads)
From line 1453 of master/Snakefile:
shell:
    '''
    echo '#########################\n \
          Pre-processing completed.\n#########################\n \
          Created "{input.counts}"' \
          > {output.prepro_cmplt}
    '''
From line 1503 of master/Snakefile:
shell:
    '''
    (perl {input.script} \
    --noip \
    {input.files} \
    | gzip > {output.pooled_sites}) 2>> {log}
    '''
From line 1528 of master/Snakefile:
run:
    import gzip
    n = 0
    with gzip.open(input.pooled_sites, "rt") as infile:
        n = sum([1 for line in infile if not line.startswith("#")])
    with open(output.pooled_sites_cnt, "w") as out:
        out.write("3pSites.pooled:\t%i\n" % n)
From line 1568 of master/Snakefile:
shell:
    '''
    (perl {input.script} \
    {params.signals} \
    --genome={params.genome} \
    {input.pooled_sites} \
    | gzip > {output.sites_with_pas}) 2>> {log}
    '''
From line 1618 of master/Snakefile:
shell:
    '''
    perl {input.script} \
    --cutoff={params.cutoff} \
    --upstream={params.upstream_reg} \
    --downstream={params.downstream_reg} \
    --sample={wildcards.sample} \
    {input.sites_with_pas} \
    > {output.sites_filtered} 2>> {log}
    '''
From line 1676 of master/Snakefile:
run:
    import gzip
    with gzip.open(output.table_filtered, "wt") as out_file, gzip.open(input.table_adjusted, "rt") as infile:
        for line in infile:
            if line.startswith("#"):
                out_file.write(line)
                continue
            line_list = line.rstrip().split("\t")
            read_sum = sum( [1 for i in line_list[3:-2] if float(i) > 0] )
            if read_sum > 0:
                # this site has still read support
                out_file.write(line)
From line 1712 of master/Snakefile:
run:
    import gzip
    sites = 0
    reads = 0
    pas = 0
    pas_reads = 0
    col = 0
    with gzip.open(input.noBG_sites,"rt") as all_sites:
        for line in all_sites:
            if line.startswith("#"):
                if params.sample in line:
                    F = line.rstrip().split(";")
                    col = int(F[0].lstrip("#"))
            else:
                if col == 0:
                    print("Column for sample could not be identified!")
                    print(params.sample)
                    exit()
                else:
                    line_list = line.rstrip().split("\t")
                    if line_list[col] != "0":
                        sites += 1
                        reads += int(line_list[col])
                        if line_list[-2] != "NA":
                            pas += 1
From line 1769 of master/Snakefile:
run:
    import gzip
    n = 0
    with gzip.open(input.noBG_sites, "rt") as infile:
        n = sum([1 for line in infile if not line.startswith("#")])
    with open(output.noBG_sites_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("3pSites.noBG:\t%i\n" % n)
From line 1808 of master/Snakefile:
shell:
    '''
    (perl {input.script} \
    --upstream={params.upstream_ext} \
    --downstream={params.downstream_ext} \
    {input.table_filtered} \
    | gzip > {output.primary_clusters}) 2> {log}
    '''
From line 1837 of master/Snakefile:
run:
    import gzip
    n = 0
    with gzip.open(input.clusters, "rt") as infile:
        n = sum([1 for line in infile if not line.startswith("#")])
    with open(output.clusters_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("clusters.primary:\t%i\n" % n)
From line 1888 of master/Snakefile:
shell:
    '''
    (perl {input.script} \
    --minDistToPAS={params.minDistToPAS} \
    --maxsize={params.maxsize} \
    {input.primary_clusters} \
    | gzip > {output.merged_clusters}) 2> {log}
    '''
From line 1917 of master/Snakefile:
run:
    import gzip
    n = 0
    with gzip.open(input.clusters, "rt") as infile:
        n = sum([1 for line in infile if not line.startswith("#")])
    with open(output.clusters_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("clusters.merged:\t%i\n" % n)
From line 1957 of master/Snakefile:
shell:
    '''
    python {input.script} \
    --verbose \
    --gtf {input.anno} \
    --ds-range {params.downstream_region} \
    --input {input.merged_clusters} \
    | gzip > {output.clusters_annotated} \
    2> {log}
    '''
From line 1997 of master/Snakefile:
shell:
    '''
    python {input.script} \
    --verbose \
    --design={params.design_file} \
    --in {input.clusters_annotated} \
    --out {output.clusters_temp} \
    2> {log} &&
    gzip -c {output.clusters_temp} \
    > {output.clusters_support}
    '''
From line 2037 of master/Snakefile:
shell:
    '''
    python {input.script} \
    -i {input.clusters} \
    -s {params.id} \
    -o {output.samples_temp} \
    2> {log} &&
    gzip -c {output.samples_temp} \
    > {output.samples_bed}
    '''
From line 2071 of master/Snakefile:
shell:
    '''
    sortBed \
    -i {input.bed} \
    | gzip \
    > {output.sorted_bed}
    '''
From line 2100 of master/Snakefile:
run:
    import gzip
    n = 0
    p = 0
    annos = {'TE': 0,
            'EX': 0,
            'IN': 0,
            'DS': 0,
            'AE': 0,
            'AI': 0,
            'AU': 0,
            'IG': 0}

    with gzip.open(input.clusters_bed, "rt") as infile:
        for line in infile:
            # Count clusters
            n += 1
            # Count clusters with PAS
            if not "NA" in line:
                p += 1
            # For each cluster get annotation
            a = line.split('\t')[9]
            annos[a] += 1
    with open(output.clusters_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("{}".format(cnt.read() ))
        out.write("clusters.all:\t{:d}\n".format(n))
        out.write("clusters.PAS.nr:\t{:d}\n".format(p))
        out.write("clusters.PAS.percent:\t{:d}\n".format(int(p/n*100))) # For put in mongo we need int
        for k in annos.keys():
            out.write("clusters.annos.%s:\t%s\n" % (k, annos[k]))
From line 2157 of master/Snakefile:
shell:
    '''
    echo '#########################\n \
    Clustering completed.\n \
    #########################\n \
    Created "{input.noBG_cnt}"\n \
    "{input.cluster_stats}"\n \
    "{input.clusters_bed}"\n \
    "{input.samples_bed}"\n' \
    > {output.clst_cmplt}
    '''
From line 2185 of master/Snakefile:
shell:
    '''
    wget -O {output.chr_sizes_ucsc} \
    {params.url} \
    &> /dev/null
    '''
From line 2223 of master/Snakefile:
shell:
    '''
    python {input.script} \
    -i {input.clusters} \
    -s {params.id} \
    --chr-names {params.chr_names} \
    -p {output.plus} \
    -m {output.minus} \
    2> {log}
    '''
From line 2259 of master/Snakefile:
shell:
    '''
    sortBed \
    -i {input.ucsc_bed} \
    > {output.sorted_bed}
    '''
From line 2291 of master/Snakefile:
shell:
    '''
    bedGraphToBigWig \
    {input.ucsc_bed} \
    {input.chr_sizes} \
    {output.bigWig}
    '''
From line 2325 of master/Snakefile:
run:
    with open(output.track_info, "wt") as out:
        out.write('track type=bigWig name="%s: poly(A) clusters plus strand %s" \
                   visibility="full" color="4,177,216" maxHeightPixels="128:60:8"\
                   bigDataUrl="%s/%s"\n'\
                   % (params.name, params.atlas_public_name, params.url, params.plus))
        out.write('track type=bigWig name="%s: poly(A) clusters minus strand %s" \
                   visibility="full" color="241,78,50" maxHeightPixels="128:60:8"\
                   bigDataUrl="%s/%s"\n' \
                   % (params.name, params.atlas_public_name, params.url, params.minus))
From line 2352 of master/Snakefile:
shell:
    '''
    echo '#########################\n \
    Track files completed.\n \
    #########################\n \
    Created "{input.atlas_track}"\n \
    "{input.sample_tracks}"\n' \
    > {output.tracks_cmplt}
    '''

Name: polyasite_workflow
URL: https://github.com/zavolanlab/polyAsite_workflow
License: Apache License 2.0