Pipeline for the processing of 3' end sequencing libraries


Pipeline to infer poly(A) site clusters through processing of 3' end sequencing libraries prepared according to various protocols. The pipeline was used for the generation of the PolyASite atlas.

Pipeline schematic

Further information on the implemented processing can be found below and on PolyASite.

Pipeline DAG

Requirements

The pipeline was tested on an HPC environment managed by Slurm.

Conda environment

We recommend using a Conda environment that contains the necessary software.

The environment was created with:

conda env create \
 --name polyA_atlas_pipeline \
 --file snakemake_run_env_requirements.yaml

Activate the environment with:

source activate polyA_atlas_pipeline

Deactivate the environment with:

source deactivate

Prerequisites

The pipeline has three central elements: the Snakefile, a config file, and a sample table:

  • The Snakefile does not need to be modified unless bugs or intended updates require changes.

  • The config file (called config.yaml) needs to be adjusted to your needs. It requires a set of samples, an organism, a genome version and an annotation version. Organism and genome/annotation versions are edited directly in the config file; samples are provided indirectly, by indicating the path to the sample table. An example config file can be found here: tests/EXAMPLE_config.yaml.
    As described in the Run section below, you can also run Snakemake without modifying the config file, specifying the required information as command-line arguments instead.

  • The sample table lists the sample-specific information required for a successful run of the pipeline. Follow the format in tests/EXAMPLE_samples.tsv and include one line for each sample to be processed. A minimal sketch of both files is shown below.
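For orientation, here is a hypothetical sketch of the two files. All keys, columns and values below are illustrative assumptions; the authoritative formats are the two example files named above.

# Sketch only -- consult tests/EXAMPLE_config.yaml and tests/EXAMPLE_samples.tsv
# for the actual keys and columns.
cat > config.yaml <<'EOF'
samples: "samples.tsv"   # path to the sample table (assumed key name)
organism: "HomoSapiens"  # as used with --config in the Run section
genome: "GRCh38.90"      # genome/annotation version, as used with --config
EOF
printf 'sample_id\tprotocol\n' > samples.tsv   # placeholder header columns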

Run the pipeline

Local

Local execution is not recommended and should only be used for testing purposes.

snakemake \
 -p \
 --use-singularity \
 --singularity-args "--bind ${PWD}" \
 --configfile config.yaml \
 &>> run_update.Organism_genomeVersion_annotationVersion.log
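Before a real run, Snakemake's dry-run flag -n lists the jobs that would be executed without running anything:

snakemake \
 -n \
 -p \
 --configfile config.yaml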

Slurm

snakemake \
 -p \
 --use-singularity \
 --singularity-args "--bind ${PWD}" \
 --configfile config.yaml \
 --cluster-config cluster_config.json \
 --jobscript jobscript.sh \
 --cores 500 \
 --local-cores 10 \
 --cluster "sbatch --cpus-per-task {cluster.threads} \
 --mem {cluster.mem} --qos {cluster.queue} \
 --time {cluster.time} -o {params.cluster_log} \
 -p [NODE_ID] --export=JOB_NAME={rule} \
 --open-mode=append" \
 &>> run_update.Organism_genomeVersion_annotationVersion.log
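The sbatch template above pulls the values threads, mem, queue and time for each rule from cluster_config.json. A minimal sketch of such a file with placeholder values is shown below; the cluster_config.json shipped with the repository is the reference.

cat > cluster_config.json <<'EOF'
{
    "__default__": {
        "threads": "1",
        "mem": "4G",
        "queue": "6hours",
        "time": "05:00:00"
    }
}
EOF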

General notes on running the pipeline

Instant config changes

If you do not want to change the base config file, you can also specify the appropriate values in the Snakemake command itself, e.g.:

snakemake \
 -p \
 --use-singularity \
 --singularity-args "--bind ${PWD}" \
 --configfile config.yaml \
 --config organism=HomoSapiens genome=GRCh38.90 atlas.release_name=r2.0 -- \
 &>> run_update.Organism_genomeVersion_annotationVersion.log

Running only parts of the pipeline

With the Snakemake option --until you can specify a target rule at which the pipeline run stops. This is useful if you only want to run a subset of rules. For example, the pipeline includes the creation of bigWig and track-info files for displaying the data in the UCSC genome browser. If you don't need these files, run with --until complete_clustering, as in the example below.
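For example, combining the options from the commands above with --until:

snakemake \
 -p \
 --use-singularity \
 --singularity-args "--bind ${PWD}" \
 --configfile config.yaml \
 --until complete_clustering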

Generate a graph for the preprocessing of an individual sample from a specific protocol

When a target file is provided in the Snakemake command, Snakemake runs the pipeline only up to the point at which the desired file is created. This can be used to generate overview graphs of the processing of samples from specific protocols, in order to check their processing steps. For example:

snakemake \
 -p \
 --configfile config.yaml \
 --dag \
 samples/counts/SRX517313_GRCh38.90.ip3pSites.out \
 | dot \
 -T png \
 > graph.SAPAS.png
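If the per-file DAG becomes too large to read, Snakemake's --rulegraph option draws one node per rule instead of one node per output file:

snakemake \
 -p \
 --configfile config.yaml \
 --rulegraph \
 samples/counts/SRX517313_GRCh38.90.ip3pSites.out \
 | dot \
 -T png \
 > rulegraph.SAPAS.png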

Protocol-specific notes

3' READS

Pre-processing involves:

  1. Filtering of reads based on the 5' configuration

  2. 3' adapter trimming

  3. Reverse complementing

  4. 3' trimming of potentially remaining As from the poly(A) tail

Only reads that start with a specified number of random nucleotides followed by two Ts are considered; the number of random nucleotides has to be extracted from the corresponding GEO/SRA entries or publication. See here for more information.
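This 5' configuration can be spot-checked on the raw reads before running the pipeline. In the sketch below, the 4 random nucleotides, the gzipped-FASTQ input and the file name are assumptions for illustration only:

# Count how many of the first 1000 reads start with 4 random
# nucleotides followed by two Ts.
zcat sample.fastq.gz \
 | awk 'NR % 4 == 2' \
 | head -n 1000 \
 | grep -cE '^[ACGTN]{4}TT'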

According to this paper, sequencing can be done in the sense or antisense direction. The samples currently processed here were sequenced in the antisense direction. Future samples should be checked carefully in order to decide whether the current settings are appropriate.

SAPAS

Pre-processing involves:

  1. Combined 5' and 3' adapter trimming

  2. Trimming of remaining Cs at the 3' end (they result from template-switching reverse transcription)

  3. Reverse complementing

Sequencing libraries are prepared such that sequencing can be done either in the antisense direction (Illumina) or in the sense direction (454). So far, only samples from Illumina sequencing have been processed. If sense-direction samples need to be processed, the pipeline must be adapted accordingly.

In the supplementary material of the APASdb paper, the authors state that they only consider reads that have the expected 5' linker sequence 5'-TTTTCTTTTTTCTTTTTT-3'. However, a manual comparison of the first reads from an old sample (SRX026584) with one from the mentioned publication revealed that this linker is very rare in the samples from the APASdb publication. We therefore decided not to be too strict about the 5' linker. Very often, only poly(T) is found; for the newer samples, a stretch of Ts is therefore given as the 5' adapter and trimmed together with the 3' adapter.
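To judge linker abundance in a new sample, count the linker in the first reads; the file name and gzipped-FASTQ input below are placeholder assumptions:

# Count the APASdb 5' linker among the first 10000 reads.
zcat sample.fastq.gz \
 | head -n 40000 \
 | awk 'NR % 4 == 2' \
 | grep -c TTTTCTTTTTTCTTTTTT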

A-seq2

Processing is done according to this protocol, but without using the first nucleotide as barcode information.

A-seq

Pre-processing involves:

  1. 3' adapter trimming

  2. Filtering for valid reads, with additional consideration of a maximum read length (see below).

3'-seq (Mayr)

Pre-processing involves:

  1. 3' adapter trimming

  2. Additional trimming of As and Ns at the 3' end

  3. Filtering for valid reads, with additional consideration of a maximum read length (see below).

PAS-Seq

Pre-processing involves:

  1. Trimming of poly(T) at the 5' end

  2. 3' adapter trimming

  3. Trimming of additional Cs at the 3' end

  4. Reverse complementing

DRS

Current pre-processing involves:

  1. Reverse complementing

  2. Correction by 1 nt to obtain the true 3' end position (note that this is encoded directly in the Snakefile, not in the config file)

The protocol facilitates direct sequencing of the RNA 3' end. Due to an initial T-fill step that involves the incorporation of a blocking nucleotide (anything except a T), sequencing actually begins one nucleotide upstream of the RNA's 3'-most nucleotide. Therefore, a correction of 1 nt in the downstream direction of the read's reverse complement is necessary to obtain the true 3' end (see "Extended Experimental Procedures" in Ozsolak et al. for more information).
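As a schematic illustration of this correction (a sketch, not the pipeline's actual code), shifting single-nucleotide 3' end coordinates in a BED-like file by 1 nt downstream with respect to the strand would look like this:

# Plus-strand sites shift +1, minus-strand sites shift -1 (column 6 = strand).
awk 'BEGIN {OFS="\t"}
 $6 == "+" {$2 += 1; $3 += 1}
 $6 == "-" {$2 -= 1; $3 -= 1}
 {print}' ends.bed > ends.corrected.bed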

PolyA-seq

Pre-processing involves:

  1. 3' adapter trimming

  2. Reverse complementing

3P-Seq

Pre-processing involves:

  1. Reverse complementing if necessary

  2. Filtering reads: only proceed with reads that have at least 2 As at the 3' end

  3. Removal of additional As at the 3' end that might remain from the poly(A) tail

Note that processing for samples of this protocol is sample-specific. In particular, only a subset of samples requires reverse complementing; hence, each sample has to be checked manually to determine whether reverse complementing is required. Once this information is provided in the design file, the pipeline will process the samples accordingly.

An easy way to check whether a file needs to be reverse complemented is to count the occurrences of the poly(A) signal and of its reverse complement in the first reads:

zcat samples/GSM1268942/GSM1268942.fa.gz \
 | head -n10000 \
 | tail -n1000 \
 | grep TTTATT | wc -l
zcat samples/GSM1268942/GSM1268942.fa.gz \
 | head -n10000 \
 | tail -n1000 \
 | grep AATAAA | wc -l

Comparing these numbers should show a clear preference for one of the two signals.

2P-Seq

Pre-processing involves:

  1. 3' adapter trimming

  2. Reverse complementing

Maximum read length

Samples prepared with the protocols 3'-seq (Mayr) and A-seq have a restriction on the maximum read length for processed reads to count as valid. As these protocols require sequencing in the sense direction, the length restriction ensures that the 3' end of the transcript is reached.
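As an illustration of such a length filter (a sketch only; the 50-nt cutoff and file names are placeholders, and a single-line-per-record FASTA is assumed), over-long reads could be removed like this:

# Join header+sequence, keep records with sequences of at most 50 nt,
# then restore the FASTA layout.
zcat valid_reads.fa.gz \
 | paste - - \
 | awk -F'\t' 'length($2) <= 50' \
 | tr '\t' '\n' \
 | gzip > length_filtered.fa.gz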

Code Snippets

From line 104 of master/Snakefile:
shell:
	'''
	(gffread \
	-w {output.fasta} \
	-g {input.fasta} \
	{input.gtf}) \
	&> {log}
	'''
From line 131 of master/Snakefile:
shell:
	'''
	(python {input.script} \
	--trim \
	-i {input.fasta} \
	-o {output.fasta}) \
	&> {log}
	'''
From line 163 of master/Snakefile:
shell:
	'''
	(segemehl.x \
	-x {output.idx} \
	-d {input.fasta}) \
	&> {log}
	'''
From line 192 of master/Snakefile:
shell:
	"(segemehl.x \
	-x {output.idx} \
	-d {input.sequence}) \
	&> {log}"
From line 216 of master/Snakefile:
shell:
	'''
	(bash {input.script} \
	-f {input.gtf} \
	-c 3 \
	-p exon \
	-o {output.exons} ) \
	&> {log}
	'''
From line 244 of master/Snakefile:
shell:
	'''
	(Rscript {input.script} \
	--gtf {input.exons} \
	-o {output.exons}) \
	&> {log}
	'''
From line 269 of master/Snakefile:
shell:
	'''
	(samtools dict \
	-o {output.header} {input.genome}) \
	&> {log}
	'''
From line 312 of master/Snakefile:
shell:
	'''
	segemehl.x \
	-i {input.idx} \
	-d {input.genome} \
	-t {threads} \
	-q {input.reads} \
	-outfile {output.gmap}
	'''
From line 345 of master/Snakefile:
shell:
	'''
	samtools view \
	{input.gmap} \
	> {output.gmap} \
	2> {log}
	'''
From line 386 of master/Snakefile:
shell:
	'''
	segemehl.x \
	-i {input.idx} \
	-d {input.transcriptome} \
	-t {threads} \
	-q {input.reads} \
	-outfile {output.tmap}
	'''	
From line 418 of master/Snakefile:
shell:
	'''
	samtools view \
	{input.tmap} \
	> {output.tmap} \
	2> {log}
	'''
From line 457 of master/Snakefile:
shell:
	'''
	(perl {input.script} \
	--in {input.tmap} \
	--exons {input.exons} \
	--out {output.genout}) \
	&> {log}
	'''
From line 493 of master/Snakefile:
shell:
	'''
	(cat {input.header} \
	{input.t2gmap} \
	{input.gmap} \
	> {output.catmaps}) \
	&> {log}
	'''
From line 525 of master/Snakefile:
shell:
	'''
	(samtools sort \
	-n \
	-o {output.sorted} \
	{input.sam}) \
	&> {log}
	''' 
From line 565 of master/Snakefile:
shell:
	'''
	(perl {input.script} \
	--print-header \
	--keep-mm \
	--in {input.sorted} \
	--out {output.remove_inf}) \
	&> {log}
	'''
From line 597 of master/Snakefile:
shell:
	'''
	(samtools view \
	-b {input.remove_inf} \
	> {output.bam}) \
	&> {log}
	'''
From line 628 of master/Snakefile:
shell:
	'''
	(samtools sort \
	{input.bam} \
	> {output.bam}) \
	&> {log}
	'''
From line 674 of master/Snakefile:
shell:
	'''
	(samtools view {input.bam} \
	| python {input.script} \
	--processors {threads} \
	| gzip > {output.reads_bed}) 2>> {log}
	'''
From line 257 of master/Snakefile:
shell:
    '''
    mkdir -p {params.cluster_samples_log}
    mkdir -p {params.cluster_countings_log}
    '''
From line 280 of master/Snakefile:
shell:
    '''
    mkdir -p {params.cluster_atlas_log}
    '''
From line 301 of master/Snakefile:
shell:
    '''
    wget -O {output.temp_genome} \
    {params.url} \
    &> /dev/null &&
    gzip -cd {output.temp_genome} \
    > {output.genome} &&
    sed 's/\s.*//' {output.genome} \
    > {output.clean}
    '''
From line 327 of master/Snakefile:
shell:
    '''
    wget -O {output.temp_anno} \
    {params.url} \
    &> /dev/null &&
    gzip -cd {output.temp_anno} \
    > {output.anno}
    '''
From line 359 of master/Snakefile:
shell:
    '''
    perl {input.script} \
    --type_id={params.type_id} \
    {params.types} \
    {params.tr_supp_level_id} {params.tr_supp_level} \
    {input.anno} \
    > {output.filtered_anno}
    '''
From line 417 of master/Snakefile:
shell:
    '''
    python3 {input.script} \
    --srr_id {params.srr_id} \
    --outdir {params.outdir} \
    --paired \
    2> {log}
    '''
From line 451 of master/Snakefile:
shell:
    '''
    python3 {input.script} \
    --srr_id {params.srr_id} \
    --outdir {params.outdir} \
    2> {log}
    '''
From line 485 of master/Snakefile:
shell:
    '''
    cd {params.file_dir}
    IFS=',' read -ra SRR <<< "{params.sample_srr}"
    if [[ "${{#SRR[@]}}" > "1" ]];then
    first_file="${{SRR[0]}}.fastq.gz"
    for i in $(seq 1 $((${{#SRR[@]}}-1))); do curr_file="${{SRR[$i]}}.fastq.gz"; cat ${{curr_file}} >> ${{first_file}};done
    fi
    ln -fs {params.first_srr}.fastq.gz {params.sample_id}.fq.gz
    cd -
    '''
From line 530 of master/Snakefile:
shell:
    '''
    (zcat {input.sample_fq} \
    | {input.script} \
    | fastx_renamer -n COUNT -z \
    > {output.sample_fa}) \
    2> {log}
    '''
From line 558 of master/Snakefile:
run:
    import gzip
    n = 0
    with gzip.open(input.sample_fa, "rt") as infile:
        n = sum([1 for line in infile if line.startswith(">")])
    with open(output.raw_cnt, "w") as out:
        out.write("reads.raw.nr\t%i\n" % n)
From line 619 of master/Snakefile:
shell:
    '''
    (zcat {input.sample_fa} \
    | perl {input.script} \
    --adapter={params.adapt} \
    | gzip > {output.selected_5p}) 2> {log}
    '''
From line 663 of master/Snakefile:
shell:
    '''
    (zcat {input.sample_fa} \
    | perl {input.script} \
    --adapter={params.adapt} \
    | gzip > {output.trimmed_5p}) 2> {log}
    '''
From line 700 of master/Snakefile:
shell:
    '''
    zcat {input.sample_fa} \
    | perl {input.script} \
    --minLen={params.minLen} \
    --nuc={params.adapt} \
    | gzip > {output.nuc_trimmed}
    '''
From line 736 of master/Snakefile:
shell:
    '''
    cutadapt \
    -g {params.adapt} \
    --minimum-length {params.minLen} \
    -o {output.no_5p_adapter} \
    {input.in_fa} \
    &> {log}
    '''
From line 774 of master/Snakefile:
shell:
    '''
    cutadapt \
    -a {params.adapt} \
    {params.five_p_adapt} \
    --minimum-length {params.minLen} \
    -o {output.no_3p_adapter} \
    {input.in_fa} \
    &> {log}
    '''
From line 815 of master/Snakefile:
shell:
    '''
    zcat {input.no_3p_adapter} \
    | perl {input.script} \
    --minLen={params.minLen} \
    --nuc={params.adapt} \
    | gzip > {output.nuc_trimmed}
    '''
From line 846 of master/Snakefile:
shell:
    '''
    zcat {input.input_seqs} \
    | fastx_reverse_complement -z \
    -o {output.rev_cmpl} \
    &> {log}
    '''
From line 879 of master/Snakefile:
shell:
    '''
    (zcat {input.in_fa} \
    | perl {input.script} \
    --adapter={params.adapt} \
    | gzip > {output.selected_3p}) 2> {log}
    '''
From line 917 of master/Snakefile:
shell:
    '''
    cutadapt \
    --adapter {params.adapt} \
    --minimum-length {params.minLen} \
    --overlap {params.min_overlap} \
    -e {params.error_rate} \
    -o {output.no_polyAtail} \
    {input.no_3p_adapter} \
    &> {log}
    '''
From line 949 of master/Snakefile:
run:
    import gzip
    n = 0
    with gzip.open(input.in_fa, "rt") as infile:
        n = sum([1 for line in infile if line.startswith(">")])
    with open(output.trimmed_cnt, "w") as out:
        with open(input.prev_cnt, "r") as cnt:
            out.write("%s" % cnt.read() )
        out.write("reads.trim.out\t%i\n" % n)
From line 995 of master/Snakefile:
shell:
    '''
    (zcat {input.valid_rds_in} \
    | perl {input.script_filter} \
    --max {params.maxN} --nuc N \
    | perl {input.script_filter} \
    --max {params.maxAcontent} --nuc A  \
    | perl {input.script_last} \
    | gzip > {output.valid_reads}) 2> {log}
    '''
From line 1046 of master/Snakefile:
shell:
    '''
    (zcat {input.valid_rds_in} \
    | perl {input.script_len_filter} --max={params.maxLen} \
    | perl {input.script_filter} \
    --max={params.maxN} --nuc=N \
    | perl {input.script_filter} \
    --max={params.maxAcontent} --nuc=A  \
    | perl {input.script_last} \
    | gzip > {output.valid_reads}) 2> {log}
    '''
From line 1081 of master/Snakefile:
run:
    import gzip
    n = 0
    with gzip.open(input.in_fa, "rt") as infile:
        n = sum([1 for line in infile if line.startswith(">")])
    with open(output.valid_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("reads.valid.nr\t%i\n" % n)
From line 1151 of master/Snakefile:
shell:
    '''
    (python {input.script} \
    --bed {input.reads_bed} \
    | gzip > {output.unique_bed}) 2>> {log}
    '''
From line 1189 of master/Snakefile:
run:
    import gzip
    unique = 0
    mapped = {}
    with gzip.open(input.reads_bed, "rt") as in_all:
        total_mapped = {line.split("\t")[3]:1 for line in in_all.readlines()}
    with gzip.open(input.unique_bed, "rt") as in_bed:
        unique = sum([1 for line in in_bed])
    multi = len(total_mapped) - unique
    with open(output.mapped_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("reads.mapped.uniqueMappers.nr\t%i\n" % unique)
        out.write("reads.mapped.multiMappers.nr\t%i\n" % multi)
From line 1242 of master/Snakefile:
shell:
    '''
    (perl {input.script} \
    {params.exclude_chr} \
    --correction={params.correction} \
    --strict \
    --min_align={params.min_align} \
    {input.unique_bed} \
    | gzip > {output.end_sites}) 2>> {log}
    '''
From line 1278 of master/Snakefile:
run:
    import gzip
    plus = 0
    plus_reads = 0
    minus = 0
    minus_reads = 0
    with gzip.open(input.end_sites, "rt") as in_bed:
        for line in in_bed:
            F = line.rstrip().split("\t")
            if F[5] == "+":
                plus += 1
                plus_reads += float(F[4])
            else:
                minus += 1
                minus_reads += float(F[4])
    with open(output.sites_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("sites.highconfidence.number.plus\t%i\n" % plus)
        out.write("sites.highconfidence.number.minus\t%i\n" % minus)
        out.write("sites.highconfidence.reads.plus\t%i\n" % plus_reads)
        out.write("sites.highconfidence.reads.minus\t%i\n" % minus_reads)
From line 1335 of master/Snakefile:
shell:
    '''
    (perl {input.script} \
    --genome={input.genome} \
    --upstream={params.upstream_ext} \
    --downstream={params.downstream_ext} \
    {input.ends} \
    | gzip > {output.seqs}) 2>> {log}
    '''
From line 1380 of master/Snakefile:
shell:
    '''
    (perl {input.script} \
    --upstream_len={params.upstream_ext} \
    --downstream_len={params.downstream_ext} \
    --consecutive_As={params.consec_As} \
    --total_As={params.tot_As} \
    {params.ds_patterns} \
    {input.seqs} \
    | gzip > {output.ip_assigned}) 2>> {log}
    '''
From line 1415 of master/Snakefile:
run:
    import gzip
    plus = 0
    plus_reads = 0
    minus = 0
    minus_reads = 0
    with gzip.open(input.end_sites, "rt") as in_bed:
        for line in in_bed:
            F = line.rstrip().split("\t")
            if F[3] == "IP":
                if F[5] == "+":
                    plus += 1
                    plus_reads += float(F[4])
                else:
                    minus += 1
                    minus_reads += float(F[4])
    with open(output.ip_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("sites.highconfidence.internalpriming.number.plus\t%i\n" % plus)
        out.write("sites.highconfidence.internalpriming.number.minus\t%i\n" % minus)
        out.write("sites.highconfidence.internalpriming.reads.plus\t%i\n" % plus_reads)
        out.write("sites.highconfidence.internalpriming.reads.minus\t%i\n" % minus_reads)
From line 1453 of master/Snakefile:
shell:
    '''
    echo '#########################\n \
          Pre-processing completed.\n#########################\n \
          Created "{input.counts}"' \
          > {output.prepro_cmplt}
    '''
From line 1503 of master/Snakefile:
shell:
    '''
    (perl {input.script} \
    --noip \
    {input.files} \
    | gzip > {output.pooled_sites}) 2>> {log}
    '''
From line 1528 of master/Snakefile:
run:
    import gzip
    n = 0
    with gzip.open(input.pooled_sites, "rt") as infile:
        n = sum([1 for line in infile if not line.startswith("#")])
    with open(output.pooled_sites_cnt, "w") as out:
        out.write("3pSites.pooled:\t%i\n" % n)
From line 1568 of master/Snakefile:
shell:
    '''
    (perl {input.script} \
    {params.signals} \
    --genome={params.genome} \
    {input.pooled_sites} \
    | gzip > {output.sites_with_pas}) 2>> {log}
    '''
From line 1618 of master/Snakefile:
shell:
    '''
    perl {input.script} \
    --cutoff={params.cutoff} \
    --upstream={params.upstream_reg} \
    --downstream={params.downstream_reg} \
    --sample={wildcards.sample} \
    {input.sites_with_pas} \
    > {output.sites_filtered} 2>> {log}
    '''
From line 1676 of master/Snakefile:
run:
    import gzip
    with gzip.open(output.table_filtered, "wt") as out_file, gzip.open(input.table_adjusted, "rt") as infile:
        for line in infile:
            if line.startswith("#"):
                out_file.write(line)
                continue
            line_list = line.rstrip().split("\t")
            read_sum = sum( [1 for i in line_list[3:-2] if float(i) > 0] )
            if read_sum > 0:
                # this site has still read support
                out_file.write(line)
From line 1712 of master/Snakefile:
run:
    import gzip
    sites = 0
    reads = 0
    pas = 0
    pas_reads = 0
    col = 0
    with gzip.open(input.noBG_sites,"rt") as all_sites:
        for line in all_sites:
            if line.startswith("#"):
                if params.sample in line:
                    F = line.rstrip().split(";")
                    col = int(F[0].lstrip("#"))
            else:
                if col == 0:
                    print("Column for sample could not be identified!")
                    print(params.sample)
                    exit()
                else:
                    line_list = line.rstrip().split("\t")
                    if line_list[col] != "0":
                        sites += 1
                        reads += int(line_list[col])
                        if line_list[-2] != "NA":
                            pas += 1
From line 1769 of master/Snakefile:
run:
    import gzip
    n = 0
    with gzip.open(input.noBG_sites, "rt") as infile:
        n = sum([1 for line in infile if not line.startswith("#")])
    with open(output.noBG_sites_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("3pSites.noBG:\t%i\n" % n)
From line 1808 of master/Snakefile:
shell:
    '''
    (perl {input.script} \
    --upstream={params.upstream_ext} \
    --downstream={params.downstream_ext} \
    {input.table_filtered} \
    | gzip > {output.primary_clusters}) 2> {log}
    '''
From line 1837 of master/Snakefile:
run:
    import gzip
    n = 0
    with gzip.open(input.clusters, "rt") as infile:
        n = sum([1 for line in infile if not line.startswith("#")])
    with open(output.clusters_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("clusters.primary:\t%i\n" % n)
From line 1888 of master/Snakefile:
shell:
    '''
    (perl {input.script} \
    --minDistToPAS={params.minDistToPAS} \
    --maxsize={params.maxsize} \
    {input.primary_clusters} \
    | gzip > {output.merged_clusters}) 2> {log}
    '''
From line 1917 of master/Snakefile:
run:
    import gzip
    n = 0
    with gzip.open(input.clusters, "rt") as infile:
        n = sum([1 for line in infile if not line.startswith("#")])
    with open(output.clusters_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("%s" % cnt.read() )
        out.write("clusters.merged:\t%i\n" % n)
From line 1957 of master/Snakefile:
shell:
    '''
    python {input.script} \
    --verbose \
    --gtf {input.anno} \
    --ds-range {params.downstream_region} \
    --input {input.merged_clusters} \
    | gzip > {output.clusters_annotated} \
    2> {log}
    '''
From line 1997 of master/Snakefile:
shell:
    '''
    python {input.script} \
    --verbose \
    --design={params.design_file} \
    --in {input.clusters_annotated} \
    --out {output.clusters_temp} \
    2> {log} &&
    gzip -c {output.clusters_temp} \
    > {output.clusters_support}
    '''
From line 2037 of master/Snakefile:
shell:
    '''
    python {input.script} \
    -i {input.clusters} \
    -s {params.id} \
    -o {output.samples_temp} \
    2> {log} &&
    gzip -c {output.samples_temp} \
    > {output.samples_bed}
    '''
From line 2071 of master/Snakefile:
shell:
    '''
    sortBed \
    -i {input.bed} \
    | gzip \
    > {output.sorted_bed}
    '''
From line 2100 of master/Snakefile:
run:
    import gzip
    n = 0
    p = 0
    annos = {'TE': 0,
            'EX': 0,
            'IN': 0,
            'DS': 0,
            'AE': 0,
            'AI': 0,
            'AU': 0,
            'IG': 0}

    with gzip.open(input.clusters_bed, "rt") as infile:
        for line in infile:
            # Count clusters
            n += 1
            # Count clusters with PAS
            if not "NA" in line:
                p += 1
            # For each cluster get annotation
            a = line.split('\t')[9]
            annos[a] += 1
    with open(output.clusters_cnt, "w") as out, open(input.prev_cnt, "r") as cnt:
        out.write("{}".format(cnt.read() ))
        out.write("clusters.all:\t{:d}\n".format(n))
        out.write("clusters.PAS.nr:\t{:d}\n".format(p))
        out.write("clusters.PAS.percent:\t{:d}\n".format(int(p/n*100))) # For put in mongo we need int
        for k in annos.keys():
            out.write("clusters.annos.%s:\t%s\n" % (k, annos[k]))
From line 2157 of master/Snakefile:
shell:
    '''
    echo '#########################\n \
    Clustering completed.\n \
    #########################\n \
    Created "{input.noBG_cnt}"\n \
    "{input.cluster_stats}"\n \
    "{input.clusters_bed}"\n \
    "{input.samples_bed}"\n' \
    > {output.clst_cmplt}
    '''
From line 2185 of master/Snakefile:
shell:
    '''
    wget -O {output.chr_sizes_ucsc} \
    {params.url} \
    &> /dev/null
    '''
From line 2223 of master/Snakefile:
shell:
    '''
    python {input.script} \
    -i {input.clusters} \
    -s {params.id} \
    --chr-names {params.chr_names} \
    -p {output.plus} \
    -m {output.minus} \
    2> {log}
    '''
From line 2259 of master/Snakefile:
shell:
    '''
    sortBed \
    -i {input.ucsc_bed} \
    > {output.sorted_bed}
    '''
From line 2291 of master/Snakefile:
shell:
    '''
    bedGraphToBigWig \
    {input.ucsc_bed} \
    {input.chr_sizes} \
    {output.bigWig}
    '''
From line 2325 of master/Snakefile:
run:
    with open(output.track_info, "wt") as out:
        out.write('track type=bigWig name="%s: poly(A) clusters plus strand %s" \
                   visibility="full" color="4,177,216" maxHeightPixels="128:60:8"\
                   bigDataUrl="%s/%s"\n'\
                   % (params.name, params.atlas_public_name, params.url, params.plus))
        out.write('track type=bigWig name="%s: poly(A) clusters minus strand %s" \
                   visibility="full" color="241,78,50" maxHeightPixels="128:60:8"\
                   bigDataUrl="%s/%s"\n' \
                   % (params.name, params.atlas_public_name, params.url, params.minus))
From line 2352 of master/Snakefile:
shell:
    '''
    echo '#########################\n \
    Track files completed.\n \
    #########################\n \
    Created "{input.atlas_track}"\n \
    "{input.sample_tracks}"\n' \
    > {output.tracks_cmplt}
    '''

Name: polyasite_workflow
URL: https://github.com/zavolanlab/polyAsite_workflow
License: Apache License 2.0