Alternative Splicing Analysis Using PACBIO Targeted Long-Reads Sequencing Data
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Description
This snakemake pipeline is for alternative splicing analysis using PACBIO targed long-reads sequencing data. The pipeline may be run on an HPC or in a local environment.
Major steps in the workflow include:
-
General data processing using ISOSEQ3
-
Alignment and splicing analysis of non-collaplsed reads using GMAP and SQANTI3
-
Further QC filtering and splicing analysis of collapsed reads using cDNA_cupcake and SQANTI3
-
Generateing customozied tappAS annotation file with sample merged data using SQANTI3
Expected results include:
-
Detection and quantification of transcripts in targed regions with non-collapsed reads
-
The concise version of detection of transcripts in targed regions with collapsed reads
-
Customized tappAS annotation file
Software Requirements
User's guide
I. Input requirements
-
Edited config/config.yaml
-
Indexed bam files of demultiplexed subreads data
-
Reference genome file and annotation files
II. Editing the config.yaml
Basic parameters:
-
subread: Path to subread data stored directory
-
targeted: Genomic coordinates of the targed region; chromosome:start_position-end_position
-
ref: hg38, hg19, mm10, etc
-
genome: Path to referece genome fasta file
-
gtf: Path to reference annotation gtf file
-
gff3: Path to reference annotation gff3 file
-
tappas_gff3: Path to standard tappAS reference annotation gff3 file
-
tofu: Path to the directory where the tool tofu in cDNA_cupcake package is located
-
sqanti3: Path to the directory where the SQANTI3 package is located
Optional parameters:
-
out: Path to desired directory to store output files, defalut: working directory
-
gmap_ref: Path to GMAP index directory if available
III. To run
- Clone the repository to your working directory
git clone https://github.com/NCI-CGR/Pacbio-Alternative-Splicing.git
-
Install required software
-
Create conda environment using the provided yaml file and activate it after installation:
conda env create -f pacbioAS_env.yaml
conda activate pacbioAS
-
Edit and save config/config.yaml
-
To run on an HPC using slurm job scheduler: Edit config/cluster_config.yaml according to your HPC information Run run.sh to initiate running of the pipeline
-
To run in a local environment:
export PYTHONPATH=$PYTHONPATH:{PATH}/cDNA_Cupcake/sequence/:{PATH}/conda/envs/sqanti3/lib/python3.7/site-packages/ snakemake -p --cores 16 --keep-going --rerun-incomplete --jobs 300 --latency-wait 120 all
-
Look in log directory for logs for each rule
-
To view the snakemkae rule graph:
snakemake --rulegraph | dot -T png > pacbioAS.png
IV. Example output
. user/defined/output_dir
├── ccs_bam # ccs reads
│ ├── {sample_A}_ccs.bam
│ ├── {sample_A}_ccs.bam.pbi
│ ├── {sample_A}_ccs.other_files
│ ├── {sample_B}_ccs.bam
│ ├── {sample_B}_ccs.bam.pbi
│ ├── {sample_B}_ccs.other_files
│ └── merged_ccs.bam
├── cluster # clusted unpolished reads
│ ├── {sample_A}.unpolished.bam
│ ├── {sample_A}.unpolished.bam.pbi
│ ├── {sample_A}.unpolished.cluster_report.csv
│ ├── {sample_A}.unpolished.fasta.gz
│ ├── {sample_A}.unpolished.other_files
│ ├── {sample_B}.unpolished.bam
│ ├── {sample_B}.unpolished.bam.pbi
│ ├── {sample_B}.unpolished.cluster_report.csv
│ ├── {sample_B}.unpolished.fasta.gz
│ ├── {sample_B}.unpolished.other_files
│ ├── merged.unpolished.bam
│ ├── merged.unpolished.bam.pbi
│ ├── merged.unpolished.cluster_report.csv
│ ├── merged.unpolished.fasta.gz
│ └── merged.unpolished.other_files
├── collapsed # collapsed reads
│ ├── {sample_A}.collapsed.abundance.txt
│ ├── {sample_A}.collapsed.fasta
│ ├── {sample_A}.collapsed.filtered.abundance.txt
│ ├── {sample_A}.collapsed.filtered.rep.fasta
│ ├── {sample_A}.collapsed.filtered.rep.fq
│ ├── {sample_A}.collapsed.filtered.other_files
│ ├── {sample_A}.collapsed.rep.fq
│ ├── {sample_A}.collapsed.other_files
│ ├── {sample_B}.collapsed.abundance.txt
│ ├── {sample_B}.collapsed.fasta
│ ├── {sample_B}.collapsed.filtered.abundance.txt
│ ├── {sample_B}.collapsed.filtered.rep.fasta
│ ├── {sample_B}.collapsed.filtered.rep.fq
│ ├── {sample_B}.collapsed.filtered.other_files
│ ├── {sample_B}.collapsed.rep.fq
│ ├── {sample_B}.collapsed.other_files
│ ├── merged.collapsed.abundance.txt
│ ├── merged.collapsed.fasta
│ ├── merged.collapsed.filtered.abundance.txt
│ ├── merged.collapsed.filtered.rep.fasta
│ ├── merged.collapsed.filtered.rep.fq
│ ├── merged.collapsed.filtered.other_files
│ ├── merged.collapsed.rep.fq
│ └── merged.collapsed.other_files
├── kallisto_index
│ ├── complete.txt
│ └── index
├── polished # clustered polished reads
│ ├── {sample_A}.polished.bam
│ ├── {sample_A}.polished.bam.pbi
│ ├── {sample_A}.polished.cluster_report.csv
│ ├── {sample_A}.polished.hq.fasta
│ ├── {sample_A}.polished.hq.fastq
│ ├── {sample_A}.polished.hq.fastq.gz
│ ├── {sample_A}.polished.other_files
│ ├── {sample_B}.polished.bam
│ ├── {sample_B}.polished.bam.pbi
│ ├── {sample_B}.polished.cluster_report.csv
│ ├── {sample_B}.polished.hq.fasta
│ ├── {sample_B}.polished.hq.fastq
│ ├── {sample_B}.polished.other_files
│ ├── merged.polished.bam
│ ├── merged.polished.bam.pbi
│ ├── merged.polished.cluster_report.csv
│ ├── merged.polished.hq.fasta
│ ├── merged.polished.hq.fastq
│ └── merged.polished.other_files
├── sqanti # splicing analysis with non-collpased reads
│ ├── GMST
│ ├── {sample_A}.polished.hq_classification.txt
│ ├── {sample_A}.polished.hq_corrected.fasta
│ ├── {sample_A}.polished.hq_corrected.gtf
│ ├── {sample_A}.polished.hq_corrected_sorted.sam
│ ├── {sample_A}.polished.hq_junctions.txt
│ ├── {sample_A}.polished.hq_sqanti_report.pdf
│ ├── {sample_A}.polished.hq_other_files
│ ├── {sample_B}.polished.hq_classification.txt
│ ├── {sample_B}.polished.hq_corrected.fasta
│ ├── {sample_B}.polished.hq_corrected.gtf
│ ├── {sample_B}.polished.hq_corrected_sorted.sam
│ ├── {sample_B}.polished.hq_junctions.txt
│ ├── {sample_B}.polished.hq_sqanti_report.pdf
│ ├── {sample_B}.polished.hq_other_files
│ ├── merged.polished.hq_classification.txt
│ ├── merged.polished.hq_corrected.fasta
│ ├── merged.polished.hq_corrected.gtf
│ ├── merged.polished.hq_corrected.sorted.sam
│ ├── merged.polished.hq_junctions.txt
│ ├── merged.polished.hq_sqanti_report.pdf
│ ├── merged.polished.hq_other_files
│ └── other_files
├── sqanti_r2 # splicing analysis with collapsed
│ ├── GMST
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite_classification.txt
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite.fasta
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite.gtf
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite_junctions.txt
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite_reasons.txt
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite_sqanti_report.pdf
│ ├── {sample_A}.collapsed.filtered.rep_classification.txt
│ ├── {sample_A}.collapsed.filtered.rep_corrected.fasta
│ ├── {sample_A}.collapsed.filtered.rep_corrected.gtf
│ ├── {sample_A}.collapsed.filtered.rep_corrected_sorted.sam
│ ├── {sample_A}.collapsed.filtered.rep_junctions.txt
│ ├── {sample_A}.collapsed.filtered.rep_sqanti_report.pdf
│ ├── {sample_A}.collapsed.filtered.rep_other_files
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite_classification.txt
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite.fasta
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite.gtf
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite_junctions.txt
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite_reasons.txt
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite_sqanti_report.pdf
│ ├── {sample_B}.collapsed.filtered.rep_classification.txt
│ ├── {sample_B}.collapsed.filtered.rep_corrected.fasta
│ ├── {sample_B}.collapsed.filtered.rep_corrected.gtf
│ ├── {sample_B}.collapsed.filtered.rep_corrected_sorted.sam
│ ├── {sample_B}.collapsed.filtered.rep_junctions.txt
│ ├── {sample_B}.collapsed.filtered.rep_sqanti_report.pdf
│ ├── {sample_B}.collapsed.filtered.rep_other_files
│ ├── merged.collapsed.filtered.rep_classification.txt
│ ├── merged.collapsed.filtered.rep_corrected.fasta
│ ├── merged.collapsed.filtered.rep_corrected.gtf
│ ├── merged.collapsed.filtered.rep_corrected.sorted.sam
│ ├── merged.collapsed.filtered.rep_junctions.txt
│ ├── merged.collapsed.filtered.rep_sqanti_report.pdf
│ ├── merged.collapsed.filtered.rep_tappAS_annot_from_SQANTI3.gff3
│ ├── merged.collapsed.filtered.rep_other_files
│ └── other_files
├── subread
│ ├── {sample_A}.bam
│ ├── {sample_A}.bam.pbi
│ ├── {sample_A}.subreadset.xml
│ ├── {sample_B}.bam
│ ├── {sample_B}.bam.pbi
│ ├── {sample_B}.subreadset.xml
│ ├── merged.bam
│ └── merged.bam.pbi
└── targeted # targeted non-collapsed reads
├── {sample_A}.polished.hq_corrected_sorted_targeted.sam
├── {sample_B}.polished.hq_corrected_sorted_targeted.sam
└── merged.polished.hq_corrected.sorted.targeted.sam
Code Snippets
12 13 14 15 16 | shell: """ samtools merge {output[0]} {input[0]} 2>log/merge_ccs.err samtools merge {output[1]} {input[1]} 2>log/merge_subread.err """ |
25 26 27 28 | shell: """ isoseq3 cluster {input} {output[0]} -j 32 2>log/cluter.err """ |
42 43 44 45 46 47 48 | shell: """ isoseq3 polish -j 16 {input[0]} {input[1]} {output[0]} 2>log/polish.err gunzip -c {output[1]} > {output[2]} 2>log/gunzip.err gunzip -c {output[3]} > {output[4]} 2>log/gunzip2.err isoseq3 summarize {output[0]} {output[5]} 2>log/cluter_report.err """ |
58 59 60 61 62 | shell: """ python {sqanti3}/sqanti3_qc.py {input[0]} {gtf} {genome} --aligner_choice=gmap -x {gref} --skipORF --force_id_ignore -t 16 -d {out}sqanti/ 2>log/sqanti.err samtools sort {output[0]} -o {output[1]} 2>log/sort.err """ |
70 71 72 73 74 75 76 | run: samfile = pysam.AlignmentFile(str(input), "r") osamfile = pysam.AlignmentFile(str(output), "w", template=samfile) for read in samfile: if read.reference_name == cn and read.reference_start > st and read.reference_start < ed: osamfile.write(read) |
88 89 90 91 92 | shell: """ python {tofu}/collapse_isoforms_by_sam.py --input {input[0]} --fq -s {input[1]} --dun-merge-5-shorter -o {out}collapsed/merged 2>log/collapsed.err fastq_to_fasta -i {output[0]} -o {output[3]} 2>log/fastq_tofasta2.err """ |
102 103 104 105 | shell: """ python {tofu}/get_abundance_post_collapse.py {out}collapsed/merged.collapsed {input[1]} 2>log/count.err """ |
116 117 118 119 120 | shell: """ python {tofu}/filter_away_subset.py {out}collapsed/merged.collapsed 2>log/filter.err fastq_to_fasta -i {output[1]} -o {output[2]} 2>log/fastq_tofasta3.err """ |
131 132 133 134 135 136 | shell: """ python {sqanti3}/sqanti3_qc.py {input} {gtf} {genome} --skipORF --aligner_choice=gmap -x {gref} -t 16 -d {out}sqanti_r2/ --isoAnnotLite --gff3 {tappas_gff3} 2>log/sqanti_r2.err samtools sort {output[1]} -o {output[2]} 2>log/sort2.err touch {output[3]} """ |
145 146 147 148 149 | shell: """ kallisto index -i {out}kallisto_index/index {input[0]} 2>log/kallisto_index.err touch {output} """ |
9 10 11 12 | shell: """ ccs {input} {output} --noPolish --minPasses 1 2>log/{wildcards.sample}_ccs.err """ |
21 22 23 24 | shell: """ isoseq3 cluster {input} {output} -j 32 2>log/{wildcards.sample}_cluter.err """ |
38 39 40 41 42 43 44 | shell: """ isoseq3 polish -j 16 {input[0]} {input[1]} {output[0]} 2>log/{wildcards.sample}_polish.err gunzip -c {output[1]} > {output[2]} 2>log/{wildcards.sample}_gunzip.err gunzip -c {output[3]} > {output[4]} 2>log/{wildcards.sample}_gunzip2.err isoseq3 summarize {output[0]} {output[5]} 2>log/{wildcards.sample}_cluter_report.err """ |
52 53 54 55 56 57 58 | run: global gref if gref == '': shell("""gmap_build -d {ref} {genome} -t 16 -D {out}gmap_index 2>log/gmap_index.err;touch {output}""") gref = out + "gmap_index/" + ref else: shell("""touch {output}""") |
68 69 70 71 72 | shell: """ python {sqanti3}/sqanti3_qc.py {input[0]} {gtf} {genome} --aligner_choice=gmap -x {gref} --skipORF --force_id_ignore -t 16 -d {out}sqanti/ 2>log/{wildcards.sample}_sqanti.err samtools sort {output[0]} -o {output[1]} 2>log/{wildcards.sample}_sort.err """ |
80 81 82 83 84 85 86 | run: samfile = pysam.AlignmentFile(str(input), "r") osamfile = pysam.AlignmentFile(str(output), "w", template=samfile) for read in samfile: if read.reference_name == cn and read.reference_start > st and read.reference_start < ed: osamfile.write(read) |
99 100 101 102 103 104 | shell: """ python {tofu}/collapse_isoforms_by_sam.py --input {input[0]} --fq -s {input[1]} --dun-merge-5-shorter -o {out}collapsed/{wildcards.sample} 2>log/{wildcards.sample}_collapsed.err fastq_to_fasta -i {output[0]} -o {output[3]} 2>log/{wildcards.sample}_fastq_tofasta2.err touch {output[4]} """ |
114 115 116 117 | shell: """ python {tofu}/get_abundance_post_collapse.py {out}collapsed/{wildcards.sample}.collapsed {input[1]} 2>log/{wildcards.sample}_count.err """ |
128 129 130 131 132 | shell: """ python {tofu}/filter_away_subset.py {out}collapsed/{wildcards.sample}.collapsed 2>log/{wildcards.sample}_filter.err fastq_to_fasta -i {output[1]} -o {output[2]} 2>log/{wildcards.sample}_fastq_tofasta3.err """ |
143 144 145 146 147 | shell: """ python {sqanti3}/sqanti3_qc.py {input} {gtf} {genome} --skipORF --aligner_choice=gmap -x {gref} -t 16 -d {out}sqanti_r2/ 2>log/{wildcards.sample}_sqanti_r2.err samtools sort {output[1]} -o {output[2]} 2>log/{wildcards.sample}_sort2.err """ |
156 157 158 159 160 | shell: """ python {sqanti3}/sqanti3_RulesFilter.py {input[0]} {input[1]} {gff3} 2>log/{wildcards.sample}_sqanti_r2_filter.err touch {output} """ |
Support
- Future updates
Related Workflows





