polya_liftover sc/snRNAseq Snakemake Workflow using PolyA_DB and UCSC Liftover with Cellranger.

public public 1yr ago Version: Version 1 0 bookmarks

polya_liftover - sc/snRNAseq Snakemake Workflow

A Snakemake workflow for using PolyA_DB and UCSC Liftover with Cellranger.

Some genes are not accurately annotated in the reference genome. Here, we use information provide by the PolyA_DB v3.2 to update the coordinates, then the USCS Liftover tool to update to a more recent genome. Next, we use Cellranger to create the reference and count matrix. Finally, by taking advantage of the integrated Conda and Singularity support, we can run the whole thing in an isolated environment.

Code Snippets

14
15
16
17
shell:
    "wget --no-verbose -O resources/cellranger.tar.gz '{params.url}' &> {log} && "
    "tar -xzf resources/cellranger.tar.gz -C resources &> {log} && "
    "rm -rf resources/cellranger.tar.gz "
29
30
31
32
33
34
shell:
    "wget "
    "--no-verbose -O- "
    "{params.url} | "
    "gunzip > {output.gtf} "
    "2> {log}"
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
shell:
    "{input.bin} "
    "mkgtf "
    "{input.gtf} "
    "{output.gtf} "
    "--attribute=gene_biotype:protein_coding "
    "--attribute=gene_biotype:lncRNA "
    "--attribute=gene_biotype:IG_C_gene "
    "--attribute=gene_biotype:IG_D_gene "
    "--attribute=gene_biotype:IG_J_gene "
    "--attribute=gene_biotype:IG_LV_gene "
    "--attribute=gene_biotype:IG_V_gene "
    "--attribute=gene_biotype:IG_V_pseudogene "
    "--attribute=gene_biotype:IG_J_pseudogene "
    "--attribute=gene_biotype:IG_C_pseudogene "
    "--attribute=gene_biotype:TR_C_gene "
    "--attribute=gene_biotype:TR_D_gene "
    "--attribute=gene_biotype:TR_J_gene "
    "--attribute=gene_biotype:TR_V_gene "
    "--attribute=gene_biotype:TR_V_pseudogene "
    "--attribute=gene_biotype:TR_J_pseudogene "
    "--attribute=transcript_biotype:protein_coding "
    "--attribute=transcript_biotype:lncRNA "
    "--attribute=transcript_biotype:IG_C_gene "
    "--attribute=transcript_biotype:IG_D_gene "
    "--attribute=transcript_biotype:IG_J_gene "
    "--attribute=transcript_biotype:IG_LV_gene "
    "--attribute=transcript_biotype:IG_V_gene "
    "--attribute=transcript_biotype:IG_V_pseudogene "
    "--attribute=transcript_biotype:IG_J_pseudogene "
    "--attribute=transcript_biotype:IG_C_pseudogene "
    "--attribute=transcript_biotype:TR_C_gene "
    "--attribute=transcript_biotype:TR_D_gene "
    "--attribute=transcript_biotype:TR_J_gene "
    "--attribute=transcript_biotype:TR_V_gene "
    "--attribute=transcript_biotype:TR_V_pseudogene "
    "--attribute=transcript_biotype:TR_J_pseudogene "
 95
 96
 97
 98
 99
100
shell:
    "wget "
    "--no-verbose -O- "
    "{params.url} | "
    "gunzip > {output.fa} "
    "2> {log}"
14
15
16
17
18
19
20
21
22
shell:
    "{input.bin} "
    "mkref "
    "--genome=converted_filtered_genome "
    "--genes={input.gtf} "
    "--fasta={input.fa} "
    "--memgb={params.mem} "
    "&> {log} && "
    "mv converted_filtered_genome {output.ref} "
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
shell:
    "{input.bin} "
    "count "
    "--nosecondary "
    "{params.introns} "
    "--id {wildcards.sample}_{wildcards.lane} "
    "--transcriptome {input.genome}  "
    "--fastqs data "
    "--sample {wildcards.sample} "
    "--lanes {wildcards.lane} "
    "--expect-cells {params.n_cells} "
    "--localcores {threads} "
    "--localmem {params.mem} "
    "&> {log} && "
    "rm -rf results/counts/{wildcards.sample}_{wildcards.lane} && "
    "mv {wildcards.sample}_{wildcards.lane} results/counts "
17
18
19
20
21
shell:
    "unzip "
    "{input.bcl_zip} "
    "-d results/bcl2fastq "
    "&> {log}"
33
34
35
36
37
shell:
    "mv "
    "{input} "
    "{output} "
    "&> {log}"
51
52
script:
    "../scripts/convert_to_bed.py"
13
14
15
16
17
18
shell:
    "wget "
    "--no-verbose -O- "
    "{params.url} | "
    "gunzip > {output.over_9_to_10} "
    "2> {log}"
30
31
32
33
34
35
shell:
    "wget "
    "--no-verbose -O- "
    "{params.url} | "
    "gunzip > {output.over_10_to_39} "
    "2> {log}"
17
18
19
20
21
22
23
shell:
    "liftOver "
    "{input.bed} "
    "{input.chain} "
    "{output.bed} "
    "{output.unmapped} "
    "&> {log} "
39
40
41
42
43
44
45
shell:
    "liftOver "
    "{input.bed} "
    "{input.chain} "
    "{output.bed} "
    "{output.unmapped} "
    "&> {log} "
58
59
60
61
62
63
shell:
    "workflow/scripts/move_coordinates.bash "
    "-b {input.bed} "
    "-g {input.gtf} "
    "-o {output.gtf} "
    "&> {log}"
SnakeMake From line 58 of rules/lift.smk
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
if __name__ == "__main__":
    from helpers.get_logger import get_logger

    LOG = snakemake.log[0]  # noqa: F821
    PARAMS = snakemake.params  # noqa: F821
    OUTPUT = snakemake.output  # noqa: F821

    logger = get_logger(__name__, LOG)

    with open(OUTPUT["bed"], "w") as file:
        lines = [
            f"chr{pos['chr']} {pos['start']-1} {pos['end']-1} {name}\n"
            for name, pos in PARAMS["genes"].items()
        ]
        file.writelines(lines)
        logger.info(f"Converted to BED file as:\n{lines}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
set -e -x

# Get parameters from snakemake
while getopts ":b:g:o:" opt; do
  case "$opt" in
    b) BEDFILE="$OPTARG" ;;
    g) GTF="$OPTARG" ;;
    o) OUT="$OPTARG" ;;
    :) echo 'All arguments must be provided' >&2
       exit 1 ;;
    \?) echo 'An illegal option was provided' >&2
        exit 1 ;;
  esac
done

# We want to iteratively modify while preserving the original file
# So we create a tmpfile, modify that, then move it to output
TMPGTF=$(mktemp)
cp $GTF $TMPGTF

# Read Bedfile, splitting on expected fields
while read -r CHR NEWSTART NEWEND NAME; do
    # Extract current end from GTF
    OLDEND=$(awk -v name="$NAME" '$3 == "gene" && $0 ~ name {print $5}' "$TMPGTF")
    # Increment NEWEND as GTF is 1-indexed, while BED is 0-indexed
    ((NEWEND+=1))
    # For anyline that contains `name`
    # If field 5 (feature end) is OLDEND,
    # Replace with NEWWEND
    # Also, we don't want to use sponge, so old fashioned tmp files
    TMPFILE=$(mktemp)
    awk \
      -v oldend="$OLDEND" \
      -v newend="$NEWEND" \
      -v name="$NAME" \
      'BEGIN{FS=OFS="\t"} $5 == oldend && $0 ~ name {$5 = newend} 1' \
      "$TMPGTF" > "$TMPFILE" &&
    mv "$TMPFILE" "$TMPGTF"
done < "$BEDFILE"

mv "$TMPGTF" "$OUT"
ShowHide 11 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/IMS-Bio2Core-Facility/polya_liftover
Name: polya_liftover
Version: Version 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...