Alternative Splicing Analysis Using PacBio Targeted Long-Read Sequencing Data

Description

This Snakemake pipeline performs alternative splicing analysis on PacBio targeted long-read sequencing data. The pipeline may be run on an HPC or in a local environment.

Major steps in the workflow include:

  1. General data processing using ISOSEQ3

  2. Alignment and splicing analysis of non-collapsed reads using GMAP and SQANTI3

  3. Further QC filtering and splicing analysis of collapsed reads using cDNA_Cupcake and SQANTI3

  4. Generating a customized tappAS annotation file from sample-merged data using SQANTI3

Expected results include:

  • Detection and quantification of transcripts in targeted regions, using non-collapsed reads

  • A concise set of transcripts detected in targeted regions, based on collapsed reads

  • Customized tappAS annotation file

Glossary of related terms

Software Requirements

The following tools are invoked by the pipeline (all appear in the code snippets below): Snakemake, ISOSEQ3 (ccs, isoseq3), GMAP, SQANTI3, cDNA_Cupcake (tofu), samtools, kallisto, fastq_to_fasta, and pysam. A conda environment file (pacbioAS_env.yaml) is provided for installation.

User's guide

I. Input requirements

  • Edited config/config.yaml

  • Indexed BAM files of demultiplexed subread data (an example layout follows below)

  • Reference genome file and annotation files
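
For example, the subread directory is expected to hold per-sample files similar to those shown under subread/ in the example output below:

    {sample_A}.bam
    {sample_A}.bam.pbi
    {sample_A}.subreadset.xml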

II. Editing the config.yaml

Basic parameters:

  • subread: Path to the directory where the demultiplexed subread data are stored

  • targeted: Genomic coordinates of the targeted region, formatted as chromosome:start_position-end_position

  • ref: Reference genome build, e.g. hg38, hg19, mm10

  • genome: Path to the reference genome fasta file

  • gtf: Path to reference annotation gtf file

  • gff3: Path to reference annotation gff3 file

  • tappas_gff3: Path to standard tappAS reference annotation gff3 file

  • tofu: Path to the directory containing the tofu tool from the cDNA_Cupcake package

  • sqanti3: Path to the directory where the SQANTI3 package is located

Optional parameters (an example config.yaml follows these lists):

  • out: Path to the desired output directory; default: working directory. Include a trailing slash, since rules append subdirectory names directly (e.g. {out}sqanti/)

  • gmap_ref: Path to GMAP index directory if available
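
A hypothetical example config.yaml; every path below is a placeholder to be replaced with your own:

  subread: /data/project/subreads/        # directory holding demultiplexed subread BAMs (placeholder)
  targeted: "chr1:100000-200000"          # placeholder coordinates
  ref: hg38
  genome: /refs/hg38/genome.fa
  gtf: /refs/hg38/annotation.gtf
  gff3: /refs/hg38/annotation.gff3
  tappas_gff3: /refs/tappas/tappas_annotation.gff3
  tofu: /opt/cDNA_Cupcake/cupcake/tofu/
  sqanti3: /opt/SQANTI3/
  out: /data/project/output/              # optional; note the trailing slash
  gmap_ref: /refs/gmap_index/hg38         # optional; omit to have the pipeline build an index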

III. To run

  • Clone the repository to your working directory
git clone https://github.com/NCI-CGR/Pacbio-Alternative-Splicing.git
  • Install required software

  • Create conda environment using the provided yaml file and activate it after installation:

conda env create -f pacbioAS_env.yaml
conda activate pacbioAS
  • Edit and save config/config.yaml

  • To run on an HPC using the Slurm job scheduler: edit config/cluster_config.yaml according to your HPC setup, then run run.sh to launch the pipeline. A sketch of cluster_config.yaml follows below.
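
A minimal illustrative config/cluster_config.yaml for Slurm; the field names here are assumptions, so follow the template shipped with the repository:

    __default__:            # defaults applied to every rule
        partition: norm     # placeholder Slurm partition name
        time: "24:00:00"
        mem: 32g
        threads: 16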

  • To run in a local environment:

    export PYTHONPATH=$PYTHONPATH:{PATH}/cDNA_Cupcake/sequence/:{PATH}/conda/envs/sqanti3/lib/python3.7/site-packages/
    snakemake -p --cores 16 --keep-going --rerun-incomplete --jobs 300 --latency-wait 120 all
    
  • Check the log directory for the log files of each rule

  • To view the Snakemake rule graph:

snakemake --rulegraph | dot -T png > pacbioAS.png

(Rule graph image: pacbioAS.png)

IV. Example output

. user/defined/output_dir
├── ccs_bam # ccs reads
│ ├── {sample_A}_ccs.bam
│ ├── {sample_A}_ccs.bam.pbi
│ ├── {sample_A}_ccs.other_files
│ ├── {sample_B}_ccs.bam
│ ├── {sample_B}_ccs.bam.pbi
│ ├── {sample_B}_ccs.other_files
│ └── merged_ccs.bam
├── cluster # clustered unpolished reads
│ ├── {sample_A}.unpolished.bam
│ ├── {sample_A}.unpolished.bam.pbi
│ ├── {sample_A}.unpolished.cluster_report.csv
│ ├── {sample_A}.unpolished.fasta.gz
│ ├── {sample_A}.unpolished.other_files
│ ├── {sample_B}.unpolished.bam
│ ├── {sample_B}.unpolished.bam.pbi
│ ├── {sample_B}.unpolished.cluster_report.csv
│ ├── {sample_B}.unpolished.fasta.gz
│ ├── {sample_B}.unpolished.other_files
│ ├── merged.unpolished.bam
│ ├── merged.unpolished.bam.pbi
│ ├── merged.unpolished.cluster_report.csv
│ ├── merged.unpolished.fasta.gz
│ └── merged.unpolished.other_files
├── collapsed # collapsed reads
│ ├── {sample_A}.collapsed.abundance.txt
│ ├── {sample_A}.collapsed.fasta
│ ├── {sample_A}.collapsed.filtered.abundance.txt
│ ├── {sample_A}.collapsed.filtered.rep.fasta
│ ├── {sample_A}.collapsed.filtered.rep.fq
│ ├── {sample_A}.collapsed.filtered.other_files
│ ├── {sample_A}.collapsed.rep.fq
│ ├── {sample_A}.collapsed.other_files
│ ├── {sample_B}.collapsed.abundance.txt
│ ├── {sample_B}.collapsed.fasta
│ ├── {sample_B}.collapsed.filtered.abundance.txt
│ ├── {sample_B}.collapsed.filtered.rep.fasta
│ ├── {sample_B}.collapsed.filtered.rep.fq
│ ├── {sample_B}.collapsed.filtered.other_files
│ ├── {sample_B}.collapsed.rep.fq
│ ├── {sample_B}.collapsed.other_files
│ ├── merged.collapsed.abundance.txt
│ ├── merged.collapsed.fasta
│ ├── merged.collapsed.filtered.abundance.txt
│ ├── merged.collapsed.filtered.rep.fasta
│ ├── merged.collapsed.filtered.rep.fq
│ ├── merged.collapsed.filtered.other_files
│ ├── merged.collapsed.rep.fq
│ └── merged.collapsed.other_files
├── kallisto_index
│ ├── complete.txt
│ └── index
├── polished # clustered polished reads
│ ├── {sample_A}.polished.bam
│ ├── {sample_A}.polished.bam.pbi
│ ├── {sample_A}.polished.cluster_report.csv
│ ├── {sample_A}.polished.hq.fasta
│ ├── {sample_A}.polished.hq.fastq
│ ├── {sample_A}.polished.hq.fastq.gz
│ ├── {sample_A}.polished.other_files
│ ├── {sample_B}.polished.bam
│ ├── {sample_B}.polished.bam.pbi
│ ├── {sample_B}.polished.cluster_report.csv
│ ├── {sample_B}.polished.hq.fasta
│ ├── {sample_B}.polished.hq.fastq
│ ├── {sample_B}.polished.other_files
│ ├── merged.polished.bam
│ ├── merged.polished.bam.pbi
│ ├── merged.polished.cluster_report.csv
│ ├── merged.polished.hq.fasta
│ ├── merged.polished.hq.fastq
│ └── merged.polished.other_files
├── sqanti # splicing analysis with non-collapsed reads
│ ├── GMST
│ ├── {sample_A}.polished.hq_classification.txt
│ ├── {sample_A}.polished.hq_corrected.fasta
│ ├── {sample_A}.polished.hq_corrected.gtf
│ ├── {sample_A}.polished.hq_corrected_sorted.sam
│ ├── {sample_A}.polished.hq_junctions.txt
│ ├── {sample_A}.polished.hq_sqanti_report.pdf
│ ├── {sample_A}.polished.hq_other_files
│ ├── {sample_B}.polished.hq_classification.txt
│ ├── {sample_B}.polished.hq_corrected.fasta
│ ├── {sample_B}.polished.hq_corrected.gtf
│ ├── {sample_B}.polished.hq_corrected_sorted.sam
│ ├── {sample_B}.polished.hq_junctions.txt
│ ├── {sample_B}.polished.hq_sqanti_report.pdf
│ ├── {sample_B}.polished.hq_other_files
│ ├── merged.polished.hq_classification.txt
│ ├── merged.polished.hq_corrected.fasta
│ ├── merged.polished.hq_corrected.gtf
│ ├── merged.polished.hq_corrected.sorted.sam
│ ├── merged.polished.hq_junctions.txt
│ ├── merged.polished.hq_sqanti_report.pdf
│ ├── merged.polished.hq_other_files
│ └── other_files
├── sqanti_r2 # splicing analysis with collapsed reads
│ ├── GMST
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite_classification.txt
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite.fasta
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite.gtf
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite_junctions.txt
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite_reasons.txt
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite_sqanti_report.pdf
│ ├── {sample_A}.collapsed.filtered.rep_classification.txt
│ ├── {sample_A}.collapsed.filtered.rep_corrected.fasta
│ ├── {sample_A}.collapsed.filtered.rep_corrected.gtf
│ ├── {sample_A}.collapsed.filtered.rep_corrected_sorted.sam
│ ├── {sample_A}.collapsed.filtered.rep_junctions.txt
│ ├── {sample_A}.collapsed.filtered.rep_sqanti_report.pdf
│ ├── {sample_A}.collapsed.filtered.rep_other_files
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite_classification.txt
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite.fasta
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite.gtf
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite_junctions.txt
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite_reasons.txt
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite_sqanti_report.pdf
│ ├── {sample_B}.collapsed.filtered.rep_classification.txt
│ ├── {sample_B}.collapsed.filtered.rep_corrected.fasta
│ ├── {sample_B}.collapsed.filtered.rep_corrected.gtf
│ ├── {sample_B}.collapsed.filtered.rep_corrected_sorted.sam
│ ├── {sample_B}.collapsed.filtered.rep_junctions.txt
│ ├── {sample_B}.collapsed.filtered.rep_sqanti_report.pdf
│ ├── {sample_B}.collapsed.filtered.rep_other_files
│ ├── merged.collapsed.filtered.rep_classification.txt
│ ├── merged.collapsed.filtered.rep_corrected.fasta
│ ├── merged.collapsed.filtered.rep_corrected.gtf
│ ├── merged.collapsed.filtered.rep_corrected.sorted.sam
│ ├── merged.collapsed.filtered.rep_junctions.txt
│ ├── merged.collapsed.filtered.rep_sqanti_report.pdf
│ ├── merged.collapsed.filtered.rep_tappAS_annot_from_SQANTI3.gff3
│ ├── merged.collapsed.filtered.rep_other_files
│ └── other_files
├── subread
│ ├── {sample_A}.bam
│ ├── {sample_A}.bam.pbi
│ ├── {sample_A}.subreadset.xml
│ ├── {sample_B}.bam
│ ├── {sample_B}.bam.pbi
│ ├── {sample_B}.subreadset.xml
│ ├── merged.bam
│ └── merged.bam.pbi
└── targeted # targeted non-collapsed reads
 ├── {sample_A}.polished.hq_corrected_sorted_targeted.sam
 ├── {sample_B}.polished.hq_corrected_sorted_targeted.sam
 └── merged.polished.hq_corrected.sorted.targeted.sam

Code Snippets

The following rule bodies are excerpted from the workflow's Snakefile. Names in curly braces ({input}, {output}, {wildcards.sample}, and config-derived values such as {out} and {gref}) are Snakemake placeholders.

Merge the per-sample CCS BAMs and subread BAMs:
shell:
      """
      samtools merge {output[0]} {input[0]} 2>log/merge_ccs.err
      samtools merge {output[1]} {input[1]} 2>log/merge_subread.err
      """

Cluster the merged CCS reads:
shell:
      """
      isoseq3 cluster {input} {output[0]} -j 32 2>log/cluter.err
      """

Polish the merged clustered reads, decompress the HQ output, and summarize the cluster report:
shell:
      """
      isoseq3 polish -j 16 {input[0]} {input[1]} {output[0]} 2>log/polish.err
      gunzip -c {output[1]} > {output[2]} 2>log/gunzip.err
      gunzip -c {output[3]} > {output[4]} 2>log/gunzip2.err
      isoseq3 summarize {output[0]} {output[5]} 2>log/cluter_report.err
      """

Run SQANTI3 QC (GMAP alignment) on the merged polished HQ transcripts and sort the corrected SAM:
shell:
      """
      python {sqanti3}/sqanti3_qc.py {input[0]} {gtf} {genome} --aligner_choice=gmap -x {gref} --skipORF --force_id_ignore -t 16 -d {out}sqanti/ 2>log/sqanti.err
      samtools sort {output[0]} -o {output[1]} 2>log/sort.err
      """

Restrict the merged sorted SAM to reads starting within the targeted region:
run:
      # cn, st, and ed are defined elsewhere in the Snakefile; they hold the
      # chromosome, start, and end of the "targeted" region from config.yaml.
      samfile = pysam.AlignmentFile(str(input), "r")
      osamfile = pysam.AlignmentFile(str(output), "w", template=samfile)

      # Keep only reads whose alignment start lies inside the targeted region.
      for read in samfile:
          if read.reference_name == cn and st < read.reference_start < ed:
              osamfile.write(read)
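
The names cn, st, and ed used above are defined elsewhere in the Snakefile; a minimal sketch of how the "targeted" config value could be parsed into them (hypothetical, not code from the repository):

      # e.g. targeted "chr1:100000-200000" -> cn="chr1", st=100000, ed=200000
      cn, coords = config["targeted"].split(":")
      st, ed = (int(x) for x in coords.split("-"))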

Collapse redundant isoforms (cDNA_Cupcake) from the merged reads and their targeted, sorted SAM, then convert the representative FASTQ to FASTA:
shell:
      """
      python {tofu}/collapse_isoforms_by_sam.py --input {input[0]} --fq -s {input[1]} --dun-merge-5-shorter -o {out}collapsed/merged 2>log/collapsed.err
      fastq_to_fasta -i {output[0]} -o {output[3]} 2>log/fastq_tofasta2.err
      """

Compute post-collapse isoform abundance:
shell:
      """
      python {tofu}/get_abundance_post_collapse.py {out}collapsed/merged.collapsed {input[1]} 2>log/count.err
      """

Filter away subset isoforms and convert the filtered FASTQ to FASTA:
shell:
      """
      python {tofu}/filter_away_subset.py {out}collapsed/merged.collapsed 2>log/filter.err
      fastq_to_fasta -i {output[1]} -o {output[2]} 2>log/fastq_tofasta3.err
      """

Run second-round SQANTI3 QC on the merged collapsed isoforms; --isoAnnotLite with the tappAS reference GFF3 produces the customized tappAS annotation file:
shell:
      """
      python {sqanti3}/sqanti3_qc.py {input} {gtf} {genome} --skipORF --aligner_choice=gmap -x {gref} -t 16 -d {out}sqanti_r2/ --isoAnnotLite --gff3 {tappas_gff3} 2>log/sqanti_r2.err
      samtools sort {output[1]} -o {output[2]} 2>log/sort2.err
      touch {output[3]}
      """

Build a kallisto index from the corrected transcripts:
shell:
      """
      kallisto index -i {out}kallisto_index/index {input[0]} 2>log/kallisto_index.err
      touch {output}
      """

Generate per-sample CCS reads from subreads:
shell:
      """
      ccs {input} {output} --noPolish --minPasses 1 2>log/{wildcards.sample}_ccs.err
      """

Cluster each sample's CCS reads:
shell:
      """
      isoseq3 cluster {input} {output} -j 32 2>log/{wildcards.sample}_cluter.err
      """

Polish each sample's clustered reads, decompress the HQ output, and summarize the cluster report:
shell:
      """
      isoseq3 polish -j 16 {input[0]} {input[1]} {output[0]} 2>log/{wildcards.sample}_polish.err
      gunzip -c {output[1]} > {output[2]} 2>log/{wildcards.sample}_gunzip.err
      gunzip -c {output[3]} > {output[4]} 2>log/{wildcards.sample}_gunzip2.err
      isoseq3 summarize {output[0]} {output[5]} 2>log/{wildcards.sample}_cluter_report.err
      """

Build a GMAP index if gmap_ref is not set in config.yaml:
run:
      global gref
      if gref == '':
         shell("""gmap_build -d {ref} {genome} -t 16 -D {out}gmap_index 2>log/gmap_index.err;touch {output}""")
         gref = out + "gmap_index/" + ref
      else:
          shell("""touch {output}""")

Run per-sample SQANTI3 QC and sort the corrected SAM:
shell:
      """
      python {sqanti3}/sqanti3_qc.py {input[0]} {gtf} {genome} --aligner_choice=gmap -x {gref} --skipORF --force_id_ignore -t 16 -d {out}sqanti/ 2>log/{wildcards.sample}_sqanti.err
      samtools sort {output[0]} -o {output[1]} 2>log/{wildcards.sample}_sort.err
      """

Restrict each sample's sorted SAM to reads starting within the targeted region:
run:
      # As above: cn, st, and ed hold the chromosome, start, and end of the
      # "targeted" region from config.yaml.
      samfile = pysam.AlignmentFile(str(input), "r")
      osamfile = pysam.AlignmentFile(str(output), "w", template=samfile)

      # Keep only reads whose alignment start lies inside the targeted region.
      for read in samfile:
          if read.reference_name == cn and st < read.reference_start < ed:
              osamfile.write(read)

Collapse each sample's isoforms and convert the representative FASTQ to FASTA:
shell:
      """
      python {tofu}/collapse_isoforms_by_sam.py --input {input[0]} --fq -s {input[1]} --dun-merge-5-shorter -o {out}collapsed/{wildcards.sample} 2>log/{wildcards.sample}_collapsed.err
      fastq_to_fasta -i {output[0]} -o {output[3]} 2>log/{wildcards.sample}_fastq_tofasta2.err
      touch {output[4]}
      """

Compute each sample's post-collapse isoform abundance:
shell:
      """
      python {tofu}/get_abundance_post_collapse.py {out}collapsed/{wildcards.sample}.collapsed {input[1]} 2>log/{wildcards.sample}_count.err
      """

Filter away subset isoforms for each sample:
shell:
      """
      python {tofu}/filter_away_subset.py {out}collapsed/{wildcards.sample}.collapsed 2>log/{wildcards.sample}_filter.err
      fastq_to_fasta -i {output[1]} -o {output[2]} 2>log/{wildcards.sample}_fastq_tofasta3.err
      """

Run per-sample second-round SQANTI3 QC:
shell:
      """
      python {sqanti3}/sqanti3_qc.py {input} {gtf} {genome} --skipORF --aligner_choice=gmap -x {gref} -t 16 -d {out}sqanti_r2/ 2>log/{wildcards.sample}_sqanti_r2.err
      samtools sort {output[1]} -o {output[2]} 2>log/{wildcards.sample}_sort2.err
      """

Apply SQANTI3 rules filtering to each sample's classification:
shell:
      """
      python {sqanti3}/sqanti3_RulesFilter.py {input[0]} {input[1]} {gff3} 2>log/{wildcards.sample}_sqanti_r2_filter.err
      touch {output}
      """