Polyadenylation aware read trimming and alternative polyadenylation site analysis

public public 1yr ago Version: v0.2 0 bookmarks

Description

Purpose of the work flow is to run RNA-seq trimming and alignment to a given reference steps and perform initial polyadenylation sites' analysis.

Dependencies

  • julia v0.6.1

  • snakemake 5.2.2

  • BBMAP 38.22

  • STAR 2.6.1a

  • seqkit 0.8.1

  • sambamba 0.6.6

You can install the dependencies manually or through conda environment as indicated below. If you choose to install the required software manually please directly to the step 6 of the Workflow setup.

Workflow setup

  1. Install conda :
 wget -P miniconda https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh &&
 chmod 755 ./miniconda/Miniconda3-latest-Linux-x86_64.sh &&
 ./miniconda/Miniconda3-latest-Linux-x86_64.sh
  1. Add path of miniconda to .bashrc if not selected option to add automatically during installation:
 cd ~/ && pth=$(pwd) &&
 echo "PATH=$PATH:$pth/miniconda3/bin" >> ~/.bashrc
  1. Add channels to conda:
 conda config --add channels conda-forge
 conda config --add channels bioconda
 conda config --add channels anaconda
 conda config --add channels defaults
  1. Create your conda environment:
 conda env create -f envs/3EndD.yaml -n 3EndD
  1. Activate created environment:
 source activate 3EndD
  1. Install the required julia packages:
 bash envs/build_julia_pkgs.sh

Analysis

  • Run snakemake as follows:
 snakemake --configfile config.yaml -j 24 -k -p

This would ran an analysis on a small dataset matching human brain RNA-seq data of 21th chromosome.

PolyAAnalysis tests

  • Run julia :
 source activate 3EndD;
 julia PolyAAnalysis.jl/tests/runtests.jl

Notes for analysis configuration config.yaml

  • Configuration of your working environment:
 INPUT-DIR: dir with fastq files
 SAMPLES: sample names, part of a file name. See example bellow. 
 LANE-REGEX: python regex for finding same sample files split by lanes. eg. use "L\\d\\d\\d_" to find any SAMPLE1_R[1,2]_L00[1,2,3]_001.fastq.gz SAMPLE2_R[1,2]_L00[1,2,3]_001.fastq.gz in INPUT-DIR. Sample names should be unique.
 memory_java: for human 60 GB is recommended.
 threads_julia: specify threads for `julia`. Usually no more then 8. Some scripts scales only reading GZ | BAM files.
 threads_star: specify threads for `STAR` aligner.
 threads_sambamba: specify threads for sambamba tools.
 SCRATCH: location of tmp directory for faster computation (SSD).
 REFERENCE: genome in fasta format.
 GFF3: annotation files, should be the same version.
 GTF: annotation files, should be the same version.
 TRANSCRIPTS: Optional, generated from fasta and annotations if not provided.
  • Supported file names:
 SAMPLE1_L001_R1_001.fastq.gz
 SAMPLE1_L001_R2_001.fastq.gz
 SAMPLE1_L001_1.fq.gz
 SAMPLE1_L001_2.fq.gz
 SAMPLE1_L001_R1_001.sfq
 SAMPLE1_L001_R2_001.sfq
 SAMPLE1_L001_1.sfq
 SAMPLE1_L001_1.sfq

SAMPLES and LANE-REGEX options in your config would be:

 SAMPLES: SAMPLE1
 LANE-REGEX: "L\\d\\d\\d_"
  • If you want to normalize all samples by lowest read number found:
 SUBSAMPLING:
 run: true
  • If you want to subsample to specific read number:
 SUBSAMPLING:
 run: true
 subsample_to: N
  • Additional parameters which are not listed in config can be passed as a string for BBDUK, STAR and PolyA annotator programs.
 additional_params: "passed as a string for specified programs."
  • Specific parameters for polyA | termination sites annotator:
 ANNOTATE-TS:
 k: distance from the cluster center allowed.
 m: minimum distance between clusters allowed. Two adjacent clusters with distance <= m will be merged.
 additional_params: pass -c|--cluster to run clustering, pass -v|--verbose to print proceeding of clustering.
 mappingquality: Only reads with greater than this mapping value will pass.
 strandedness: List your SAMPLES bellow providing for each strandedness.
 HBR_100_Collibri_chr21small_A: If R1 read is the same as a gene sequence: "+". If R2 read is the same as a gene sequence: "-"

Annotated BED files for termination sites

Workflow generates annotated transcription termination sites in ANNOTATE-POLYA directory.

For *annotated_polyA_clusters.bed files:

Column Nr Field Name Explanation
1 Chr Chromosome
2 Start Start (0-based)
3 End End
4 GeneName Gene Name
5 ClusterSize Detected number of reads representing this termination site.
6 Strand Strand
7 Feature Feature
8 ClusterCenter Position of cluster center. 1-based.
9 Biotype Biotype
10 ClusterMedian Median of frequencies of reads representing each termination site in cluster.
11 ClusterMean Mean of frequencies of reads representing each termination site in cluster.
12 ClusterMin Min of frequencies of reads representing each termination site in cluster.
13 ClusterMax Max of frequencies of reads representing each termination site in cluster.
14 Cluster1stQuartile 1st quartile of frequencies of reads representing each termination site in cluster.
15 Cluster3rdQuartile 3rd quartile of frequencies of reads representing each termination site in cluster.
16 TSMedian Median length polyA tail.
17 TSMean Mean length polyA tail.
18 TSMin Min length polyA tail.
19 TSMax Max length polyA tail.
20 TS1stQuartile 1st quartile of lengths of polyA tail.
21 TS3rdQuartile 3rd quartile of lengths of polyA tail.

In addition for *annotated_polyA_clusters_plus_coverage.bed files

Column Nr Field Name Explanation
22 - Sambamba added Read count at Chr:Start-End
23 - Sambamba added mean coverage at Chr:Start-End
24 - Sambamba added Sample Name if any.

For *annotated_polyA.bed files:

Column Nr Field Name Explanation
1 Chr Chromosome
2 Start Start (0-based)
3 End End
4 GeneName Gene Name
5 Count Detected number of reads representing this termination site.
6 Strand Strand
7 Feature Feature
8 Median Median length polyA tail.
9 Min Min length polyA tail.
10 Max Max length polyA tail.
11 Biotype Biotype

Code Snippets

24
25
26
27
28
29
30
for fastq in "$@"; do
        { zcat $fastq 2>/dev/null || cat $fastq; } | \
            #sed -n '1~4p' | \
            sed -n '1,${p;n;n;n;}' | \
            awk '{if (substr($0,0,1)!="@") {exit 1}; nreads+=1;} END {printf nreads " "}'

done
53
54
shell:
    "echo $(git log) |  cut -d ' ' -f 2 &>> {output}"
59
60
61
run:
    with open(logs + "/config.yaml", 'w') as outfile:
        yaml.dump(CONFIG, outfile, default_flow_style=False)
8
9
run:
    install_slimfastq()
17
18
run:
    get_sample_files_named_pipe_script(wildcards.stem)
26
27
shell:
    "./{input} > {output}"
35
36
shell:
    "./{input} > {output}"
18
19
20
shell:
    "fastqc {input} -o {params.out_dir} -t {threads} " +
    "-k {params.kmer_size} 2> {log}"
30
31
shell:
    "multiqc {input} -o "+MULTIQC_DIR+" -n fastqc_report_raw_reads"
41
42
shell:
    "multiqc {input} -o "+MULTIQC_DIR+" -n fastqc_report_trimmed_reads"
52
53
shell:
    "multiqc {input} -o "+MULTIQC_DIR+" -n fastqc_report_processed_polyA_reads --force "
30
31
32
33
34
35
36
37
38
39
40
shell:
    "bbduk.sh in={input[0]} out={output[1]} " +
            "in2={input[1]} out2={output[2]} " +
            "ref={params.ref} threads={threads} " +
            "ktrim={params.ktrim} k={params.k} " +
            "mink={params.mink} hdist={params.hdist} " +
            "minlength={params.minlength} maxns={params.maxns} " +
            "qtrim={params.qtrim} trimq={params.trimq} " +
            "{params.add} threads={threads} " +
            "stats={output[0]} overwrite=t " +
            "-Xmx{params.m}g 2> {log}"
63
64
65
66
67
68
69
70
shell:
    "minimum={params.target_count}; " +
    "seqkit sample -s 11 -j {threads} --two-pass " +
    "-n {params.target_count} {input[0]} -o {output[0]} " +
    "2> {log[0]}; " +
    "seqkit sample -s 11 -j {threads} --two-pass " +
    "-n {params.target_count} {input[1]} -o {output[1]} " +
    "2> {log[1]}"
78
79
shell:
    "cat {input} > {output}"
86
87
shell:
    "cp {input} {output}"
 99
100
101
shell:
    "./scripts/fastq_num_reads.sh {input} > {output}; " +
    "sed -i -e 's/^/{params.prefix}/' {output} ; echo >>  {output}"
120
121
shell:
    """minimum=`python ./scripts/""" +
21
22
23
24
shell:
    "julia --depwarn=no PolyAAnalysis.jl/scripts/mark_poly_A.jl -i -p {threads} " +
    "-a {input[0]} -b {input[1]} -o {params.output_stem} " +
    "-r {params.gz} 2>&1 | tee -a {log}"
45
46
47
48
shell:
    "julia --depwarn=no PolyAAnalysis.jl/scripts/mark_poly_A.jl -i -p {threads} " +
    "-a {input[0]} -b {input[1]} -o {params.output_stem} " +
    "-g {params.ref} -f {params.gff} 2>&1 | tee -a {log}"
68
69
70
71
shell:
    "julia --depwarn=no PolyAAnalysis.jl/scripts/mark_poly_A.jl -i -p {threads} " +
    "-a {input[0]} -b {input[1]} -o {params.output_stem} " +
    "-g {params.ref} 2>&1 | tee -a {log}"
20
21
22
23
24
25
26
27
shell:
    "if [ ! -d {params.genomeDir} ]; then " +
    " mkdir {params.genomeDir}; fi && " +
    " STAR --runMode genomeGenerate " +
        "--runThreadN {threads}  " +
        "--genomeDir {params.genomeDir} " +
        "--genomeFastaFiles {input.ref} " +
        "--sjdbGTFfile {input.gtf} 2> {log}"
49
50
51
52
53
54
55
56
57
58
59
60
61
62
shell:
    "STAR --outTmpDir {params.tmp} --runThreadN {threads} " +
        "--genomeDir {input.genomeDir} --runMode alignReads " +
        "--outSAMunmapped Within " +
        "--outFileNamePrefix {params.prefix} " +
        "--readFilesIn {input[0]} {input[1]} --readFilesCommand zcat " +
        "--genomeLoad NoSharedMemory " +
        "--outSAMstrandField intronMotif --outSAMattributes All " +
        "--outSAMtype BAM Unsorted " +
        "--chimSegmentMin {params.chimSegmentMin} " +
        "{params.additional}; " +
        "mv {params.starPrefix}.bam {output[0]}; " +
        "mv {params.prefix}Log.final.out {output[1]}; " +
        "mv {params.prefix}Log.out {log}"
77
78
79
shell:
    "sambamba sort --tmpdir={params.temp_dir_sambamba} -p " +
    "-t{threads}  -o {output[0]} {input[0]} 2> {log}"
92
93
shell:
    "sambamba index -p -t{threads} {input[0]} 2> {log}"
26
27
28
29
30
shell:
    "export JULIA_NUM_THREADS={threads}; julia --depwarn=no " +
    "PolyAAnalysis.jl/scripts/annotate_polyA.jl -b {input} -o {params.pref} " +
    "-g {params.gff} -k {params.k} -m {params.m} -q {params.q} " +
    "-s {params.s} {params.add_params} &> {log}"
45
46
47
48
shell:
    "sambamba depth region -F 'mapping_quality > {params.q}' " +
    "-t {threads} -L {input[1]} " +
    "{input[0]} | sed '/^#/ d' > {output}"
ShowHide 20 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/ThermofisherAndBabraham/polyAterminus
Name: polyaterminus
Version: v0.2
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...