Alternative Splicing Analysis Using PacBio Targeted Long-Read Sequencing Data

Description

This Snakemake pipeline performs alternative splicing analysis on PacBio targeted long-read sequencing data. The pipeline may be run on an HPC or in a local environment.

Major steps in the workflow include:

  1. General data processing using ISOSEQ3

  2. Alignment and splicing analysis of non-collapsed reads using GMAP and SQANTI3

  3. Further QC filtering and splicing analysis of collapsed reads using cDNA_Cupcake and SQANTI3

  4. Generating a customized tappAS annotation file from sample-merged data using SQANTI3

Expected results include:

  • Detection and quantification of transcripts in targeted regions, using non-collapsed reads

  • A concise set of transcripts detected in targeted regions, based on collapsed reads

  • Customized tappAS annotation file

Glossary of related terms

Software Requirements

The following tools are invoked by the pipeline (all appear in the code snippets below): Snakemake, ISOSEQ3 (ccs, isoseq3), GMAP, SQANTI3, cDNA_Cupcake (tofu), samtools, kallisto, fastq_to_fasta, and pysam. A conda environment file (pacbioAS_env.yaml) is provided for installation.

User's guide

I. Input requirements

  • Edited config/config.yaml

  • Indexed BAM files of demultiplexed subread data (an example layout follows below)

  • Reference genome file and annotation files
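
For example, the subread directory is expected to hold per-sample files similar to those shown under subread/ in the example output below:

    {sample_A}.bam
    {sample_A}.bam.pbi
    {sample_A}.subreadset.xml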

II. Editing the config.yaml

Basic parameters:

  • subread: Path to the directory where the demultiplexed subread data are stored

  • targeted: Genomic coordinates of the targeted region, formatted as chromosome:start_position-end_position

  • ref: Reference genome build, e.g. hg38, hg19, mm10

  • genome: Path to the reference genome fasta file

  • gtf: Path to reference annotation gtf file

  • gff3: Path to reference annotation gff3 file

  • tappas_gff3: Path to standard tappAS reference annotation gff3 file

  • tofu: Path to the directory containing the tofu tool from the cDNA_Cupcake package

  • sqanti3: Path to the directory where the SQANTI3 package is located

Optional parameters (an example config.yaml follows these lists):

  • out: Path to the desired output directory; default: working directory. Include a trailing slash, since rules append subdirectory names directly (e.g. {out}sqanti/)

  • gmap_ref: Path to GMAP index directory if available
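
A hypothetical example config.yaml; every path below is a placeholder to be replaced with your own:

  subread: /data/project/subreads/        # directory holding demultiplexed subread BAMs (placeholder)
  targeted: "chr1:100000-200000"          # placeholder coordinates
  ref: hg38
  genome: /refs/hg38/genome.fa
  gtf: /refs/hg38/annotation.gtf
  gff3: /refs/hg38/annotation.gff3
  tappas_gff3: /refs/tappas/tappas_annotation.gff3
  tofu: /opt/cDNA_Cupcake/cupcake/tofu/
  sqanti3: /opt/SQANTI3/
  out: /data/project/output/              # optional; note the trailing slash
  gmap_ref: /refs/gmap_index/hg38         # optional; omit to have the pipeline build an index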

III. To run

  • Clone the repository to your working directory
git clone https://github.com/NCI-CGR/Pacbio-Alternative-Splicing.git
  • Install required software

  • Create conda environment using the provided yaml file and activate it after installation:

conda env create -f pacbioAS_env.yaml
conda activate pacbioAS
  • Edit and save config/config.yaml

  • To run on an HPC using the Slurm job scheduler: edit config/cluster_config.yaml according to your HPC setup, then run run.sh to launch the pipeline. A sketch of cluster_config.yaml follows below.
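
A minimal illustrative config/cluster_config.yaml for Slurm; the field names here are assumptions, so follow the template shipped with the repository:

    __default__:            # defaults applied to every rule
        partition: norm     # placeholder Slurm partition name
        time: "24:00:00"
        mem: 32g
        threads: 16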

  • To run in a local environment:

    export PYTHONPATH=$PYTHONPATH:{PATH}/cDNA_Cupcake/sequence/:{PATH}/conda/envs/sqanti3/lib/python3.7/site-packages/
    snakemake -p --cores 16 --keep-going --rerun-incomplete --jobs 300 --latency-wait 120 all
    
  • Check the log directory for the log files of each rule

  • To view the Snakemake rule graph:

snakemake --rulegraph | dot -T png > pacbioAS.png

(Rule graph image: pacbioAS.png)

IV. Example output

. user/defined/output_dir
├── ccs_bam # ccs reads
│ ├── {sample_A}_ccs.bam
│ ├── {sample_A}_ccs.bam.pbi
│ ├── {sample_A}_ccs.other_files
│ ├── {sample_B}_ccs.bam
│ ├── {sample_B}_ccs.bam.pbi
│ ├── {sample_B}_ccs.other_files
│ └── merged_ccs.bam
├── cluster # clustered unpolished reads
│ ├── {sample_A}.unpolished.bam
│ ├── {sample_A}.unpolished.bam.pbi
│ ├── {sample_A}.unpolished.cluster_report.csv
│ ├── {sample_A}.unpolished.fasta.gz
│ ├── {sample_A}.unpolished.other_files
│ ├── {sample_B}.unpolished.bam
│ ├── {sample_B}.unpolished.bam.pbi
│ ├── {sample_B}.unpolished.cluster_report.csv
│ ├── {sample_B}.unpolished.fasta.gz
│ ├── {sample_B}.unpolished.other_files
│ ├── merged.unpolished.bam
│ ├── merged.unpolished.bam.pbi
│ ├── merged.unpolished.cluster_report.csv
│ ├── merged.unpolished.fasta.gz
│ └── merged.unpolished.other_files
├── collapsed # collapsed reads
│ ├── {sample_A}.collapsed.abundance.txt
│ ├── {sample_A}.collapsed.fasta
│ ├── {sample_A}.collapsed.filtered.abundance.txt
│ ├── {sample_A}.collapsed.filtered.rep.fasta
│ ├── {sample_A}.collapsed.filtered.rep.fq
│ ├── {sample_A}.collapsed.filtered.other_files
│ ├── {sample_A}.collapsed.rep.fq
│ ├── {sample_A}.collapsed.other_files
│ ├── {sample_B}.collapsed.abundance.txt
│ ├── {sample_B}.collapsed.fasta
│ ├── {sample_B}.collapsed.filtered.abundance.txt
│ ├── {sample_B}.collapsed.filtered.rep.fasta
│ ├── {sample_B}.collapsed.filtered.rep.fq
│ ├── {sample_B}.collapsed.filtered.other_files
│ ├── {sample_B}.collapsed.rep.fq
│ ├── {sample_B}.collapsed.other_files
│ ├── merged.collapsed.abundance.txt
│ ├── merged.collapsed.fasta
│ ├── merged.collapsed.filtered.abundance.txt
│ ├── merged.collapsed.filtered.rep.fasta
│ ├── merged.collapsed.filtered.rep.fq
│ ├── merged.collapsed.filtered.other_files
│ ├── merged.collapsed.rep.fq
│ └── merged.collapsed.other_files
├── kallisto_index
│ ├── complete.txt
│ └── index
├── polished # clustered polished reads
│ ├── {sample_A}.polished.bam
│ ├── {sample_A}.polished.bam.pbi
│ ├── {sample_A}.polished.cluster_report.csv
│ ├── {sample_A}.polished.hq.fasta
│ ├── {sample_A}.polished.hq.fastq
│ ├── {sample_A}.polished.hq.fastq.gz
│ ├── {sample_A}.polished.other_files
│ ├── {sample_B}.polished.bam
│ ├── {sample_B}.polished.bam.pbi
│ ├── {sample_B}.polished.cluster_report.csv
│ ├── {sample_B}.polished.hq.fasta
│ ├── {sample_B}.polished.hq.fastq
│ ├── {sample_B}.polished.other_files
│ ├── merged.polished.bam
│ ├── merged.polished.bam.pbi
│ ├── merged.polished.cluster_report.csv
│ ├── merged.polished.hq.fasta
│ ├── merged.polished.hq.fastq
│ └── merged.polished.other_files
├── sqanti # splicing analysis with non-collapsed reads
│ ├── GMST
│ ├── {sample_A}.polished.hq_classification.txt
│ ├── {sample_A}.polished.hq_corrected.fasta
│ ├── {sample_A}.polished.hq_corrected.gtf
│ ├── {sample_A}.polished.hq_corrected_sorted.sam
│ ├── {sample_A}.polished.hq_junctions.txt
│ ├── {sample_A}.polished.hq_sqanti_report.pdf
│ ├── {sample_A}.polished.hq_other_files
│ ├── {sample_B}.polished.hq_classification.txt
│ ├── {sample_B}.polished.hq_corrected.fasta
│ ├── {sample_B}.polished.hq_corrected.gtf
│ ├── {sample_B}.polished.hq_corrected_sorted.sam
│ ├── {sample_B}.polished.hq_junctions.txt
│ ├── {sample_B}.polished.hq_sqanti_report.pdf
│ ├── {sample_B}.polished.hq_other_files
│ ├── merged.polished.hq_classification.txt
│ ├── merged.polished.hq_corrected.fasta
│ ├── merged.polished.hq_corrected.gtf
│ ├── merged.polished.hq_corrected.sorted.sam
│ ├── merged.polished.hq_junctions.txt
│ ├── merged.polished.hq_sqanti_report.pdf
│ ├── merged.polished.hq_other_files
│ └── other_files
├── sqanti_r2 # splicing analysis with collapsed reads
│ ├── GMST
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite_classification.txt
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite.fasta
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite.gtf
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite_junctions.txt
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite_reasons.txt
│ ├── {sample_A}.collapsed.filtered.rep_classification.filtered_lite_sqanti_report.pdf
│ ├── {sample_A}.collapsed.filtered.rep_classification.txt
│ ├── {sample_A}.collapsed.filtered.rep_corrected.fasta
│ ├── {sample_A}.collapsed.filtered.rep_corrected.gtf
│ ├── {sample_A}.collapsed.filtered.rep_corrected_sorted.sam
│ ├── {sample_A}.collapsed.filtered.rep_junctions.txt
│ ├── {sample_A}.collapsed.filtered.rep_sqanti_report.pdf
│ ├── {sample_A}.collapsed.filtered.rep_other_files
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite_classification.txt
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite.fasta
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite.gtf
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite_junctions.txt
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite_reasons.txt
│ ├── {sample_B}.collapsed.filtered.rep_classification.filtered_lite_sqanti_report.pdf
│ ├── {sample_B}.collapsed.filtered.rep_classification.txt
│ ├── {sample_B}.collapsed.filtered.rep_corrected.fasta
│ ├── {sample_B}.collapsed.filtered.rep_corrected.gtf
│ ├── {sample_B}.collapsed.filtered.rep_corrected_sorted.sam
│ ├── {sample_B}.collapsed.filtered.rep_junctions.txt
│ ├── {sample_B}.collapsed.filtered.rep_sqanti_report.pdf
│ ├── {sample_B}.collapsed.filtered.rep_other_files
│ ├── merged.collapsed.filtered.rep_classification.txt
│ ├── merged.collapsed.filtered.rep_corrected.fasta
│ ├── merged.collapsed.filtered.rep_corrected.gtf
│ ├── merged.collapsed.filtered.rep_corrected.sorted.sam
│ ├── merged.collapsed.filtered.rep_junctions.txt
│ ├── merged.collapsed.filtered.rep_sqanti_report.pdf
│ ├── merged.collapsed.filtered.rep_tappAS_annot_from_SQANTI3.gff3
│ ├── merged.collapsed.filtered.rep_other_files
│ └── other_files
├── subread
│ ├── {sample_A}.bam
│ ├── {sample_A}.bam.pbi
│ ├── {sample_A}.subreadset.xml
│ ├── {sample_B}.bam
│ ├── {sample_B}.bam.pbi
│ ├── {sample_B}.subreadset.xml
│ ├── merged.bam
│ └── merged.bam.pbi
└── targeted # targeted non-collapsed reads
 ├── {sample_A}.polished.hq_corrected_sorted_targeted.sam
 ├── {sample_B}.polished.hq_corrected_sorted_targeted.sam
 └── merged.polished.hq_corrected.sorted.targeted.sam

Code Snippets

The following rule bodies are excerpted from the workflow's Snakefile. Names in curly braces ({input}, {output}, {wildcards.sample}, and config-derived values such as {out} and {gref}) are Snakemake placeholders.

Merge the per-sample CCS BAMs and subread BAMs:
shell:
      """
      samtools merge {output[0]} {input[0]} 2>log/merge_ccs.err
      samtools merge {output[1]} {input[1]} 2>log/merge_subread.err
      """

Cluster the merged CCS reads:
shell:
      """
      isoseq3 cluster {input} {output[0]} -j 32 2>log/cluter.err
      """

Polish the merged clustered reads, decompress the HQ output, and summarize the cluster report:
shell:
      """
      isoseq3 polish -j 16 {input[0]} {input[1]} {output[0]} 2>log/polish.err
      gunzip -c {output[1]} > {output[2]} 2>log/gunzip.err
      gunzip -c {output[3]} > {output[4]} 2>log/gunzip2.err
      isoseq3 summarize {output[0]} {output[5]} 2>log/cluter_report.err
      """

Run SQANTI3 QC (GMAP alignment) on the merged polished HQ transcripts and sort the corrected SAM:
shell:
      """
      python {sqanti3}/sqanti3_qc.py {input[0]} {gtf} {genome} --aligner_choice=gmap -x {gref} --skipORF --force_id_ignore -t 16 -d {out}sqanti/ 2>log/sqanti.err
      samtools sort {output[0]} -o {output[1]} 2>log/sort.err
      """

Restrict the merged sorted SAM to reads starting within the targeted region:
run:
      # cn, st, and ed are defined elsewhere in the Snakefile; they hold the
      # chromosome, start, and end of the "targeted" region from config.yaml.
      samfile = pysam.AlignmentFile(str(input), "r")
      osamfile = pysam.AlignmentFile(str(output), "w", template=samfile)

      # Keep only reads whose alignment start lies inside the targeted region.
      for read in samfile:
          if read.reference_name == cn and st < read.reference_start < ed:
              osamfile.write(read)
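
The names cn, st, and ed used above are defined elsewhere in the Snakefile; a minimal sketch of how the "targeted" config value could be parsed into them (hypothetical, not code from the repository):

      # e.g. targeted "chr1:100000-200000" -> cn="chr1", st=100000, ed=200000
      cn, coords = config["targeted"].split(":")
      st, ed = (int(x) for x in coords.split("-"))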

Collapse redundant isoforms (cDNA_Cupcake) from the merged reads and their targeted, sorted SAM, then convert the representative FASTQ to FASTA:
shell:
      """
      python {tofu}/collapse_isoforms_by_sam.py --input {input[0]} --fq -s {input[1]} --dun-merge-5-shorter -o {out}collapsed/merged 2>log/collapsed.err
      fastq_to_fasta -i {output[0]} -o {output[3]} 2>log/fastq_tofasta2.err
      """

Compute post-collapse isoform abundance:
shell:
      """
      python {tofu}/get_abundance_post_collapse.py {out}collapsed/merged.collapsed {input[1]} 2>log/count.err
      """

Filter away subset isoforms and convert the filtered FASTQ to FASTA:
shell:
      """
      python {tofu}/filter_away_subset.py {out}collapsed/merged.collapsed 2>log/filter.err
      fastq_to_fasta -i {output[1]} -o {output[2]} 2>log/fastq_tofasta3.err
      """

Run second-round SQANTI3 QC on the merged collapsed isoforms; --isoAnnotLite with the tappAS reference GFF3 produces the customized tappAS annotation file:
shell:
      """
      python {sqanti3}/sqanti3_qc.py {input} {gtf} {genome} --skipORF --aligner_choice=gmap -x {gref} -t 16 -d {out}sqanti_r2/ --isoAnnotLite --gff3 {tappas_gff3} 2>log/sqanti_r2.err
      samtools sort {output[1]} -o {output[2]} 2>log/sort2.err
      touch {output[3]}
      """

Build a kallisto index from the corrected transcripts:
shell:
      """
      kallisto index -i {out}kallisto_index/index {input[0]} 2>log/kallisto_index.err
      touch {output}
      """

Generate per-sample CCS reads from subreads:
shell:
      """
      ccs {input} {output} --noPolish --minPasses 1 2>log/{wildcards.sample}_ccs.err
      """

Cluster each sample's CCS reads:
shell:
      """
      isoseq3 cluster {input} {output} -j 32 2>log/{wildcards.sample}_cluter.err
      """

Polish each sample's clustered reads, decompress the HQ output, and summarize the cluster report:
shell:
      """
      isoseq3 polish -j 16 {input[0]} {input[1]} {output[0]} 2>log/{wildcards.sample}_polish.err
      gunzip -c {output[1]} > {output[2]} 2>log/{wildcards.sample}_gunzip.err
      gunzip -c {output[3]} > {output[4]} 2>log/{wildcards.sample}_gunzip2.err
      isoseq3 summarize {output[0]} {output[5]} 2>log/{wildcards.sample}_cluter_report.err
      """

Build a GMAP index if gmap_ref is not set in config.yaml:
run:
      global gref
      if gref == '':
         shell("""gmap_build -d {ref} {genome} -t 16 -D {out}gmap_index 2>log/gmap_index.err;touch {output}""")
         gref = out + "gmap_index/" + ref
      else:
          shell("""touch {output}""")

Run per-sample SQANTI3 QC and sort the corrected SAM:
shell:
      """
      python {sqanti3}/sqanti3_qc.py {input[0]} {gtf} {genome} --aligner_choice=gmap -x {gref} --skipORF --force_id_ignore -t 16 -d {out}sqanti/ 2>log/{wildcards.sample}_sqanti.err
      samtools sort {output[0]} -o {output[1]} 2>log/{wildcards.sample}_sort.err
      """

Restrict each sample's sorted SAM to reads starting within the targeted region:
run:
      # As above: cn, st, and ed hold the chromosome, start, and end of the
      # "targeted" region from config.yaml.
      samfile = pysam.AlignmentFile(str(input), "r")
      osamfile = pysam.AlignmentFile(str(output), "w", template=samfile)

      # Keep only reads whose alignment start lies inside the targeted region.
      for read in samfile:
          if read.reference_name == cn and st < read.reference_start < ed:
              osamfile.write(read)

Collapse each sample's isoforms and convert the representative FASTQ to FASTA:
shell:
      """
      python {tofu}/collapse_isoforms_by_sam.py --input {input[0]} --fq -s {input[1]} --dun-merge-5-shorter -o {out}collapsed/{wildcards.sample} 2>log/{wildcards.sample}_collapsed.err
      fastq_to_fasta -i {output[0]} -o {output[3]} 2>log/{wildcards.sample}_fastq_tofasta2.err
      touch {output[4]}
      """

Compute each sample's post-collapse isoform abundance:
shell:
      """
      python {tofu}/get_abundance_post_collapse.py {out}collapsed/{wildcards.sample}.collapsed {input[1]} 2>log/{wildcards.sample}_count.err
      """

Filter away subset isoforms for each sample:
shell:
      """
      python {tofu}/filter_away_subset.py {out}collapsed/{wildcards.sample}.collapsed 2>log/{wildcards.sample}_filter.err
      fastq_to_fasta -i {output[1]} -o {output[2]} 2>log/{wildcards.sample}_fastq_tofasta3.err
      """

Run per-sample second-round SQANTI3 QC:
shell:
      """
      python {sqanti3}/sqanti3_qc.py {input} {gtf} {genome} --skipORF --aligner_choice=gmap -x {gref} -t 16 -d {out}sqanti_r2/ 2>log/{wildcards.sample}_sqanti_r2.err
      samtools sort {output[1]} -o {output[2]} 2>log/{wildcards.sample}_sort2.err
      """

Apply SQANTI3 rules filtering to each sample's classification:
shell:
      """
      python {sqanti3}/sqanti3_RulesFilter.py {input[0]} {input[1]} {gff3} 2>log/{wildcards.sample}_sqanti_r2_filter.err
      touch {output}
      """