polya_liftover sc/snRNAseq Snakemake Workflow using PolyA_DB and UCSC Liftover with Cellranger.

public 1yr ago Version: Version 1

polya_liftover - sc/snRNAseq Snakemake Workflow

A Snakemake workflow for using PolyA_DB and UCSC Liftover with Cellranger.

Some genes are not accurately annotated in the reference genome. Here, we use information provided by PolyA_DB v3.2 to update the gene coordinates, then the UCSC Liftover tool to convert them to a more recent genome assembly. Next, we use Cellranger to create the reference and the count matrix. Finally, by taking advantage of Snakemake's integrated Conda and Singularity support, we can run the whole workflow in an isolated environment.

Code Snippets

shell:
    "wget --no-verbose -O resources/cellranger.tar.gz '{params.url}' &> {log} && "
    "tar -xzf resources/cellranger.tar.gz -C resources &>> {log} && "
    "rm -rf resources/cellranger.tar.gz "

SnakeMake cellranger From line 14 of rules/cellranger_resources.smk

shell:
    "(wget "
    "--no-verbose -O- "
    "{params.url} | "
    "gunzip > {output.gtf}) "
    "2> {log}"

SnakeMake From line 29 of rules/cellranger_resources.smk

shell:
    "{input.bin} "
    "mkgtf "
    "{input.gtf} "
    "{output.gtf} "
    "--attribute=gene_biotype:protein_coding "
    "--attribute=gene_biotype:lncRNA "
    "--attribute=gene_biotype:IG_C_gene "
    "--attribute=gene_biotype:IG_D_gene "
    "--attribute=gene_biotype:IG_J_gene "
    "--attribute=gene_biotype:IG_LV_gene "
    "--attribute=gene_biotype:IG_V_gene "
    "--attribute=gene_biotype:IG_V_pseudogene "
    "--attribute=gene_biotype:IG_J_pseudogene "
    "--attribute=gene_biotype:IG_C_pseudogene "
    "--attribute=gene_biotype:TR_C_gene "
    "--attribute=gene_biotype:TR_D_gene "
    "--attribute=gene_biotype:TR_J_gene "
    "--attribute=gene_biotype:TR_V_gene "
    "--attribute=gene_biotype:TR_V_pseudogene "
    "--attribute=gene_biotype:TR_J_pseudogene "
    "--attribute=transcript_biotype:protein_coding "
    "--attribute=transcript_biotype:lncRNA "
    "--attribute=transcript_biotype:IG_C_gene "
    "--attribute=transcript_biotype:IG_D_gene "
    "--attribute=transcript_biotype:IG_J_gene "
    "--attribute=transcript_biotype:IG_LV_gene "
    "--attribute=transcript_biotype:IG_V_gene "
    "--attribute=transcript_biotype:IG_V_pseudogene "
    "--attribute=transcript_biotype:IG_J_pseudogene "
    "--attribute=transcript_biotype:IG_C_pseudogene "
    "--attribute=transcript_biotype:TR_C_gene "
    "--attribute=transcript_biotype:TR_D_gene "
    "--attribute=transcript_biotype:TR_J_gene "
    "--attribute=transcript_biotype:TR_V_gene "
    "--attribute=transcript_biotype:TR_V_pseudogene "
    "--attribute=transcript_biotype:TR_J_pseudogene "

SnakeMake From line 47 of rules/cellranger_resources.smk
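
The attribute filters in the rule above are the cross product of two GTF attribute keys and sixteen biotypes. As a minimal sketch, the same flags could be generated in Python rather than written out by hand. The names `BIOTYPES`, `ATTRIBUTE_KEYS` and `mkgtf_attribute_flags` are illustrative assumptions and are not part of the workflow.

```python
from itertools import product

# Biotypes kept by the mkgtf rule, in the order they appear in the rule.
BIOTYPES = [
    "protein_coding", "lncRNA",
    "IG_C_gene", "IG_D_gene", "IG_J_gene", "IG_LV_gene", "IG_V_gene",
    "IG_V_pseudogene", "IG_J_pseudogene", "IG_C_pseudogene",
    "TR_C_gene", "TR_D_gene", "TR_J_gene", "TR_V_gene",
    "TR_V_pseudogene", "TR_J_pseudogene",
]

# GTF attribute keys the rule filters on.
ATTRIBUTE_KEYS = ["gene_biotype", "transcript_biotype"]


def mkgtf_attribute_flags(keys=ATTRIBUTE_KEYS, biotypes=BIOTYPES):
    """Return one --attribute=key:value flag per (key, biotype) pair."""
    return [f"--attribute={key}:{biotype}" for key, biotype in product(keys, biotypes)]


flags = mkgtf_attribute_flags()
print(" ".join(flags))
```

The resulting string could, for example, be supplied through a rule's `params` so that the biotype list lives in one place.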

shell:
    "(wget "
    "--no-verbose -O- "
    "{params.url} | "
    "gunzip > {output.fa}) "
    "2> {log}"

SnakeMake From line 95 of rules/cellranger_resources.smk

shell:
    "{input.bin} "
    "mkref "
    "--genome=converted_filtered_genome "
    "--genes={input.gtf} "
    "--fasta={input.fa} "
    "--memgb={params.mem} "
    "&> {log} && "
    "mv converted_filtered_genome {output.ref} "

SnakeMake From line 14 of rules/cellranger.smk

shell:
    "{input.bin} "
    "count "
    "--nosecondary "
    "{params.introns} "
    "--id {wildcards.sample}_{wildcards.lane} "
    "--transcriptome {input.genome}  "
    "--fastqs data "
    "--sample {wildcards.sample} "
    "--lanes {wildcards.lane} "
    "--expect-cells {params.n_cells} "
    "--localcores {threads} "
    "--localmem {params.mem} "
    "&> {log} && "
    "rm -rf results/counts/{wildcards.sample}_{wildcards.lane} && "
    "mv {wildcards.sample}_{wildcards.lane} results/counts "

SnakeMake From line 49 of rules/cellranger.smk

shell:
    "unzip "
    "{input.bcl_zip} "
    "-d results/bcl2fastq "
    "&> {log}"

SnakeMake bcl2fastq-nextseq demultiplexer From line 17 of rules/input_resources.smk

shell:
    "mv "
    "{input} "
    "{output} "
    "&> {log}"

SnakeMake From line 33 of rules/input_resources.smk

script:
    "../scripts/convert_to_bed.py"

SnakeMake From line 51 of rules/input_resources.smk

shell:
    "(wget "
    "--no-verbose -O- "
    "{params.url} | "
    "gunzip > {output.over_9_to_10}) "
    "2> {log}"

SnakeMake From line 13 of rules/lift_resources.smk

shell:
    "(wget "
    "--no-verbose -O- "
    "{params.url} | "
    "gunzip > {output.over_10_to_39}) "
    "2> {log}"

SnakeMake From line 30 of rules/lift_resources.smk

shell:
    "liftOver "
    "{input.bed} "
    "{input.chain} "
    "{output.bed} "
    "{output.unmapped} "
    "&> {log} "

SnakeMake ucsc-liftover From line 17 of rules/lift.smk

shell:
    "liftOver "
    "{input.bed} "
    "{input.chain} "
    "{output.bed} "
    "{output.unmapped} "
    "&> {log} "

SnakeMake ucsc-liftover From line 39 of rules/lift.smk
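
The two liftOver rules above take the same four arguments, and the two chain files downloaded earlier suggest they run back to back. The sketch below shows one way such a chain of calls could be assembled, assuming the second call consumes the BED file produced by the first. The function names and all file names (`genes.bed`, `first.chain`, `second.chain`, `lifted/`) are hypothetical placeholders, not paths used by the workflow.

```python
import subprocess
from pathlib import Path


def liftover_cmd(bed, chain, out_bed, unmapped):
    """Build one liftOver call: input BED, chain file, output BED, unmapped file."""
    return ["liftOver", str(bed), str(chain), str(out_bed), str(unmapped)]


def chained_liftover_cmds(bed, chains, workdir):
    """Build one command per chain file, feeding each output BED into the next step."""
    workdir = Path(workdir)
    current = Path(bed)
    cmds = []
    for step, chain in enumerate(chains, start=1):
        out_bed = workdir / f"step{step}.bed"
        unmapped = workdir / f"step{step}.unmapped"
        cmds.append(liftover_cmd(current, chain, out_bed, unmapped))
        current = out_bed  # the next step lifts the coordinates produced here
    return cmds


def run_chained_liftover(bed, chains, workdir):
    """Run the commands in order, stopping at the first failure (needs liftOver on PATH)."""
    for cmd in chained_liftover_cmds(bed, chains, workdir):
        subprocess.run(cmd, check=True)


# Hypothetical file names, for illustration only.
cmds = chained_liftover_cmds("genes.bed", ["first.chain", "second.chain"], "lifted")
for cmd in cmds:
    print(" ".join(cmd))
```

In the workflow itself Snakemake handles this hand-off through the rules' `input` and `output` declarations, so the sketch is only meant to make the data flow explicit.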

shell:
    "workflow/scripts/move_coordinates.bash "
    "-b {input.bed} "
    "-g {input.gtf} "
    "-o {output.gtf} "
    "&> {log}"

SnakeMake From line 58 of rules/lift.smk

if __name__ == "__main__":
    from helpers.get_logger import get_logger

    LOG = snakemake.log[0]  # noqa: F821
    PARAMS = snakemake.params  # noqa: F821
    OUTPUT = snakemake.output  # noqa: F821

    logger = get_logger(__name__, LOG)

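    # Write one space-separated line per gene: chromosome, start, end, name.
    # Both coordinates are shifted down by one here; move_coordinates.bash
    # adds one back to the end coordinate after liftOver.
    with open(OUTPUT["bed"], "w") as file:
        lines = [
            f"chr{pos['chr']} {pos['start']-1} {pos['end']-1} {name}\n"
            for name, pos in PARAMS["genes"].items()
        ]
        file.writelines(lines)
        logger.info(f"Converted to BED file as:\n{lines}")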

Python helpers From line 3 of scripts/convert_to_bed.py
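
The script above shifts both the start and the end coordinate down by one, and `move_coordinates.bash` later adds one back to the lifted end. Standard BED intervals are 0-based and half-open, so they would normally keep the 1-based end unchanged; the convention here differs but is consistent within the workflow. The sketch below illustrates the round trip. The gene name and coordinates are made up, and `gene_to_bed_line` and `bed_end_to_gtf_end` are illustrative helpers rather than workflow functions.

```python
def gene_to_bed_line(name, pos):
    """Format one gene the way convert_to_bed.py does (start and end both minus one)."""
    return f"chr{pos['chr']} {pos['start'] - 1} {pos['end'] - 1} {name}"


def bed_end_to_gtf_end(bed_end):
    """Undo the shift applied above, as move_coordinates.bash does with ((NEWEND+=1))."""
    return bed_end + 1


# Hypothetical gene entry, shaped like the `genes` param read by the script.
genes = {"GeneA": {"chr": 1, "start": 1000, "end": 2000}}

line = gene_to_bed_line("GeneA", genes["GeneA"])
chrom, start, end, name = line.split()
restored_end = bed_end_to_gtf_end(int(end))
print(line)
print(restored_end)
```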

set -e -x

# Get parameters from snakemake
while getopts ":b:g:o:" opt; do
  case "$opt" in
    b) BEDFILE="$OPTARG" ;;
    g) GTF="$OPTARG" ;;
    o) OUT="$OPTARG" ;;
    :) echo 'All arguments must be provided' >&2
       exit 1 ;;
    \?) echo 'An illegal option was provided' >&2
        exit 1 ;;
  esac
done

# We want to iteratively modify the GTF while preserving the original file,
# so we create a temporary copy, modify that, then move it to the output
TMPGTF=$(mktemp)
cp "$GTF" "$TMPGTF"

# Read BED file, splitting on expected fields
while read -r CHR NEWSTART NEWEND NAME; do
    # Extract current end from GTF
    OLDEND=$(awk -v name="$NAME" '$3 == "gene" && $0 ~ name {print $5}' "$TMPGTF")
    # Increment NEWEND as GTF is 1-indexed, while BED is 0-indexed
    ((NEWEND+=1))
    # For any line that contains `name`,
    # if field 5 (feature end) is OLDEND,
    # replace it with NEWEND
    # Also, we don't want to use sponge, so old-fashioned tmp files
    TMPFILE=$(mktemp)
    awk \
      -v oldend="$OLDEND" \
      -v newend="$NEWEND" \
      -v name="$NAME" \
      'BEGIN{FS=OFS="\t"} $5 == oldend && $0 ~ name {$5 = newend} 1' \
      "$TMPGTF" > "$TMPFILE" &&
    mv "$TMPFILE" "$TMPGTF"
done < "$BEDFILE"

mv "$TMPGTF" "$OUT"
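
For readers who prefer Python to awk, the update performed inside the loop above can be sketched as follows. This is an approximation under stated assumptions, not a drop-in replacement: it matches the gene name as a plain substring of the attribute column (the awk version matches a regular expression against the whole line), and it uses the first matching `gene` record to find the old end. The function name `update_gene_end` and the sample records are hypothetical.

```python
def update_gene_end(gtf_lines, name, bed_end):
    """Replace the old gene end with bed_end + 1 on every record that mentions `name`."""
    new_end = str(bed_end + 1)  # same +1 shift as ((NEWEND+=1)) in the script
    rows = [line.rstrip("\n").split("\t") for line in gtf_lines]

    # Find the current end (field 5) of the gene record for this name.
    old_end = None
    for fields in rows:
        if len(fields) >= 9 and fields[2] == "gene" and name in fields[8]:
            old_end = fields[4]
            break
    if old_end is None:
        return list(gtf_lines)  # gene not found: leave the GTF untouched

    updated = []
    for fields in rows:
        # Header or malformed lines have fewer than 9 fields and pass through unchanged.
        if len(fields) >= 9 and fields[4] == old_end and name in fields[8]:
            fields = fields[:4] + [new_end] + fields[5:]
        updated.append("\t".join(fields) + "\n")
    return updated


# Hypothetical GTF records, for illustration only.
gtf = [
    "#!genome-build example\n",
    '1\tsrc\tgene\t100\t500\t.\t+\t.\tgene_id "G1"; gene_name "GeneA";\n',
    '1\tsrc\texon\t100\t300\t.\t+\t.\tgene_id "G1"; gene_name "GeneA";\n',
    '1\tsrc\texon\t400\t500\t.\t+\t.\tgene_id "G1"; gene_name "GeneA";\n',
    '1\tsrc\tgene\t100\t500\t.\t+\t.\tgene_id "G2"; gene_name "GeneB";\n',
]
result = update_gene_end(gtf, "GeneA", 599)
print("".join(result))
```

As in the bash script, only records whose end equals the old gene end are moved, so internal exons keep their coordinates while the gene record and its last exon are extended.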