Polyadenylation aware read trimming and alternative polyadenylation site analysis

public 1yr ago Version: v0.2 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

Description

Purpose of the work flow is to run RNA-seq trimming and alignment to a given reference steps and perform initial polyadenylation sites' analysis.

Dependencies

julia v0.6.1
snakemake 5.2.2
BBMAP 38.22
STAR 2.6.1a
seqkit 0.8.1
sambamba 0.6.6

You can install the dependencies manually or through conda environment as indicated below. If you choose to install the required software manually please directly to the step 6 of the Workflow setup.

Workflow setup

Install conda :

 wget -P miniconda https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh &&
 chmod 755 ./miniconda/Miniconda3-latest-Linux-x86_64.sh &&
 ./miniconda/Miniconda3-latest-Linux-x86_64.sh

Add path of miniconda to .bashrc if not selected option to add automatically during installation:

 cd ~/ && pth=$(pwd) &&
 echo "PATH=$PATH:$pth/miniconda3/bin" >> ~/.bashrc

Add channels to conda:

 conda config --add channels conda-forge
 conda config --add channels bioconda
 conda config --add channels anaconda
 conda config --add channels defaults

Create your conda environment:

 conda env create -f envs/3EndD.yaml -n 3EndD

Activate created environment:

 source activate 3EndD

Install the required julia packages:

 bash envs/build_julia_pkgs.sh

Analysis

Run snakemake as follows:

 snakemake --configfile config.yaml -j 24 -k -p

This would ran an analysis on a small dataset matching human brain RNA-seq data of 21th chromosome.

PolyAAnalysis tests

Run julia :

 source activate 3EndD;
 julia PolyAAnalysis.jl/tests/runtests.jl

Notes for analysis configuration `config.yaml`

Configuration of your working environment:

 INPUT-DIR: dir with fastq files
 SAMPLES: sample names, part of a file name. See example bellow. 
 LANE-REGEX: python regex for finding same sample files split by lanes. eg. use "L\\d\\d\\d_" to find any SAMPLE1_R[1,2]_L00[1,2,3]_001.fastq.gz SAMPLE2_R[1,2]_L00[1,2,3]_001.fastq.gz in INPUT-DIR. Sample names should be unique.
 memory_java: for human 60 GB is recommended.
 threads_julia: specify threads for `julia`. Usually no more then 8. Some scripts scales only reading GZ | BAM files.
 threads_star: specify threads for `STAR` aligner.
 threads_sambamba: specify threads for sambamba tools.
 SCRATCH: location of tmp directory for faster computation (SSD).
 REFERENCE: genome in fasta format.
 GFF3: annotation files, should be the same version.
 GTF: annotation files, should be the same version.
 TRANSCRIPTS: Optional, generated from fasta and annotations if not provided.

Supported file names:

 SAMPLE1_L001_R1_001.fastq.gz
 SAMPLE1_L001_R2_001.fastq.gz

 SAMPLE1_L001_1.fq.gz
 SAMPLE1_L001_2.fq.gz

 SAMPLE1_L001_R1_001.sfq
 SAMPLE1_L001_R2_001.sfq

 SAMPLE1_L001_1.sfq
 SAMPLE1_L001_1.sfq

SAMPLES and LANE-REGEX options in your config would be:

 SAMPLES: SAMPLE1
 LANE-REGEX: "L\\d\\d\\d_"

If you want to normalize all samples by lowest read number found:

 SUBSAMPLING:
 run: true

If you want to subsample to specific read number:

 SUBSAMPLING:
 run: true
 subsample_to: N

Additional parameters which are not listed in config can be passed as a string for BBDUK, STAR and PolyA annotator programs.

 additional_params: "passed as a string for specified programs."

Specific parameters for polyA | termination sites annotator:

 ANNOTATE-TS:
 k: distance from the cluster center allowed.
 m: minimum distance between clusters allowed. Two adjacent clusters with distance <= m will be merged.
 additional_params: pass -c|--cluster to run clustering, pass -v|--verbose to print proceeding of clustering.
 mappingquality: Only reads with greater than this mapping value will pass.
 strandedness: List your SAMPLES bellow providing for each strandedness.
 HBR_100_Collibri_chr21small_A: If R1 read is the same as a gene sequence: "+". If R2 read is the same as a gene sequence: "-"

Annotated BED files for termination sites

Workflow generates annotated transcription termination sites in ANNOTATE-POLYA directory.

For *annotated_polyA_clusters.bed files:

Column Nr	Field Name	Explanation
1	Chr	Chromosome
2	Start	Start (0-based)
3	End	End
4	GeneName	Gene Name
5	ClusterSize	Detected number of reads representing this termination site.
6	Strand	Strand
7	Feature	Feature
8	ClusterCenter	Position of cluster center. 1-based.
9	Biotype	Biotype
10	ClusterMedian	Median of frequencies of reads representing each termination site in cluster.
11	ClusterMean	Mean of frequencies of reads representing each termination site in cluster.
12	ClusterMin	Min of frequencies of reads representing each termination site in cluster.
13	ClusterMax	Max of frequencies of reads representing each termination site in cluster.
14	Cluster1stQuartile	1st quartile of frequencies of reads representing each termination site in cluster.
15	Cluster3rdQuartile	3rd quartile of frequencies of reads representing each termination site in cluster.
16	TSMedian	Median length polyA tail.
17	TSMean	Mean length polyA tail.
18	TSMin	Min length polyA tail.
19	TSMax	Max length polyA tail.
20	TS1stQuartile	1st quartile of lengths of polyA tail.
21	TS3rdQuartile	3rd quartile of lengths of polyA tail.

In addition for *annotated_polyA_clusters_plus_coverage.bed files

Column Nr	Field Name	Explanation
22	-	Sambamba added Read count at Chr:Start-End
23	-	Sambamba added mean coverage at Chr:Start-End
24	-	Sambamba added Sample Name if any.

For *annotated_polyA.bed files:

Column Nr	Field Name	Explanation
1	Chr	Chromosome
2	Start	Start (0-based)
3	End	End
4	GeneName	Gene Name
5	Count	Detected number of reads representing this termination site.
6	Strand	Strand
7	Feature	Feature
8	Median	Median length polyA tail.
9	Min	Min length polyA tail.
10	Max	Max length polyA tail.
11	Biotype	Biotype

Code Snippets

for fastq in "$@"; do
        { zcat $fastq 2>/dev/null || cat $fastq; } | \
            #sed -n '1~4p' | \
            sed -n '1,${p;n;n;n;}' | \
            awk '{if (substr($0,0,1)!="@") {exit 1}; nreads+=1;} END {printf nreads " "}'

done

Shell From line 24 of scripts/fastq_num_reads.sh

shell:
    "echo $(git log) |  cut -d ' ' -f 2 &>> {output}"

SnakeMake From line 53 of master/Snakefile

run:
    with open(logs + "/config.yaml", 'w') as outfile:
        yaml.dump(CONFIG, outfile, default_flow_style=False)

SnakeMake From line 59 of master/Snakefile

run:
    install_slimfastq()

SnakeMake From line 8 of snakefiles/1_0_download.smk

run:
    get_sample_files_named_pipe_script(wildcards.stem)

SnakeMake From line 17 of snakefiles/1_0_download.smk

shell:
    "./{input} > {output}"

SnakeMake From line 26 of snakefiles/1_0_download.smk

shell:
    "./{input} > {output}"

SnakeMake From line 35 of snakefiles/1_0_download.smk

shell:
    "fastqc {input} -o {params.out_dir} -t {threads} " +
    "-k {params.kmer_size} 2> {log}"

SnakeMake FastQC From line 18 of snakefiles/2_0_fastqc.smk

shell:
    "multiqc {input} -o "+MULTIQC_DIR+" -n fastqc_report_raw_reads"

SnakeMake MultiQC From line 30 of snakefiles/2_0_fastqc.smk

shell:
    "multiqc {input} -o "+MULTIQC_DIR+" -n fastqc_report_trimmed_reads"

SnakeMake MultiQC From line 41 of snakefiles/2_0_fastqc.smk

shell:
    "multiqc {input} -o "+MULTIQC_DIR+" -n fastqc_report_processed_polyA_reads --force "

SnakeMake MultiQC From line 52 of snakefiles/2_0_fastqc.smk

shell:
    "bbduk.sh in={input[0]} out={output[1]} " +
            "in2={input[1]} out2={output[2]} " +
            "ref={params.ref} threads={threads} " +
            "ktrim={params.ktrim} k={params.k} " +
            "mink={params.mink} hdist={params.hdist} " +
            "minlength={params.minlength} maxns={params.maxns} " +
            "qtrim={params.qtrim} trimq={params.trimq} " +
            "{params.add} threads={threads} " +
            "stats={output[0]} overwrite=t " +
            "-Xmx{params.m}g 2> {log}"

SnakeMake BBMap From line 30 of snakefiles/3_0_processing.smk

shell:
    "minimum={params.target_count}; " +
    "seqkit sample -s 11 -j {threads} --two-pass " +
    "-n {params.target_count} {input[0]} -o {output[0]} " +
    "2> {log[0]}; " +
    "seqkit sample -s 11 -j {threads} --two-pass " +
    "-n {params.target_count} {input[1]} -o {output[1]} " +
    "2> {log[1]}"

SnakeMake seqkit From line 63 of snakefiles/3_0_processing.smk

shell:
    "cat {input} > {output}"

SnakeMake From line 78 of snakefiles/3_0_processing.smk

shell:
    "cp {input} {output}"

SnakeMake From line 86 of snakefiles/3_0_processing.smk

shell:
    "./scripts/fastq_num_reads.sh {input} > {output}; " +
    "sed -i -e 's/^/{params.prefix}/' {output} ; echo >>  {output}"

SnakeMake From line 99 of snakefiles/3_0_processing.smk

shell:
    """minimum=`python ./scripts/""" +

SnakeMake From line 120 of snakefiles/3_0_processing.smk

shell:
    "julia --depwarn=no PolyAAnalysis.jl/scripts/mark_poly_A.jl -i -p {threads} " +
    "-a {input[0]} -b {input[1]} -o {params.output_stem} " +
    "-r {params.gz} 2>&1 | tee -a {log}"

SnakeMake From line 21 of snakefiles/3_1_trimming_polyA.smk

shell:
    "julia --depwarn=no PolyAAnalysis.jl/scripts/mark_poly_A.jl -i -p {threads} " +
    "-a {input[0]} -b {input[1]} -o {params.output_stem} " +
    "-g {params.ref} -f {params.gff} 2>&1 | tee -a {log}"

SnakeMake From line 45 of snakefiles/3_1_trimming_polyA.smk

shell:
    "julia --depwarn=no PolyAAnalysis.jl/scripts/mark_poly_A.jl -i -p {threads} " +
    "-a {input[0]} -b {input[1]} -o {params.output_stem} " +
    "-g {params.ref} 2>&1 | tee -a {log}"

SnakeMake From line 68 of snakefiles/3_1_trimming_polyA.smk

shell:
    "if [ ! -d {params.genomeDir} ]; then " +
    " mkdir {params.genomeDir}; fi && " +
    " STAR --runMode genomeGenerate " +
        "--runThreadN {threads}  " +
        "--genomeDir {params.genomeDir} " +
        "--genomeFastaFiles {input.ref} " +
        "--sjdbGTFfile {input.gtf} 2> {log}"

SnakeMake STAR From line 20 of snakefiles/4_0_alignment.smk

shell:
    "STAR --outTmpDir {params.tmp} --runThreadN {threads} " +
        "--genomeDir {input.genomeDir} --runMode alignReads " +
        "--outSAMunmapped Within " +
        "--outFileNamePrefix {params.prefix} " +
        "--readFilesIn {input[0]} {input[1]} --readFilesCommand zcat " +
        "--genomeLoad NoSharedMemory " +
        "--outSAMstrandField intronMotif --outSAMattributes All " +
        "--outSAMtype BAM Unsorted " +
        "--chimSegmentMin {params.chimSegmentMin} " +
        "{params.additional}; " +
        "mv {params.starPrefix}.bam {output[0]}; " +
        "mv {params.prefix}Log.final.out {output[1]}; " +
        "mv {params.prefix}Log.out {log}"

SnakeMake STAR From line 49 of snakefiles/4_0_alignment.smk

shell:
    "sambamba sort --tmpdir={params.temp_dir_sambamba} -p " +
    "-t{threads}  -o {output[0]} {input[0]} 2> {log}"

SnakeMake Sambamba From line 77 of snakefiles/4_0_alignment.smk

shell:
    "sambamba index -p -t{threads} {input[0]} 2> {log}"

SnakeMake Sambamba From line 92 of snakefiles/4_0_alignment.smk

shell:
    "export JULIA_NUM_THREADS={threads}; julia --depwarn=no " +
    "PolyAAnalysis.jl/scripts/annotate_polyA.jl -b {input} -o {params.pref} " +
    "-g {params.gff} -k {params.k} -m {params.m} -q {params.q} " +
    "-s {params.s} {params.add_params} &> {log}"

SnakeMake From line 26 of snakefiles/5_0_annotate_ts.smk

shell:
    "sambamba depth region -F 'mapping_quality > {params.q}' " +
    "-t {threads} -L {input[1]} " +
    "{input[0]} | sed '/^#/ d' > {output}"