Identification of 3' UTR Variants Disrupting Transcript Function Pipeline


Identify variants occurring in 3’ UTRs that could disrupt the function of a transcript.

This project involves:

  • retrieving the reference genome using genomepy (see the sketch after this list)

  • extracting 3' UTR regions from the gene annotation

  • extracting PAS hexamers from PolyA_DB3

  • computing the MAPS score for different 3' UTR variant groups, locally or on the cloud using hailctl dataproc
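
A minimal sketch of the genome-retrieval step using genomepy's Python API; the pipeline itself calls the genomepy CLI, as shown in the Code Snippets below. Assembly name, provider and directory here are illustrative:

import genomepy

# download the assembly together with its gene annotation
genomepy.install_genome(
    'GRCh38',                     # assembly name, as set in configs/config.yml
    provider='Ensembl',           # assumption: Ensembl/UCSC/NCBI all work
    annotation=True,              # also fetch the gene annotation
    genomes_dir='output/genomes', # hypothetical output directory
)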

Evaluation Metrics

The different 3' UTR variant groups are evaluated with the MAPS (mutability-adjusted proportion of singletons) score, which measures the overall depletion of singletons, as described in the gnomAD flagship paper. A higher MAPS score (compared to a baseline) suggests stronger evolutionary selection against the variant group.
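
To make the metric concrete, here is a toy sketch of the MAPS idea. The actual computation lives in utr3variants.maps and the gnomAD utilities; all numbers below are made up and the calibration is simplified to a linear fit:

import numpy as np
import pandas as pd

# hypothetical calibration set (gnomAD uses synonymous variants):
# per mutational context: mutation rate, singleton count, total variants
syn = pd.DataFrame({
    'mu': [1e-8, 5e-8, 2e-7],
    'singletons': [700, 550, 300],
    'n_variants': [1000, 1000, 1000],
})
# expected proportion of singletons as a function of mutability
slope, intercept = np.polyfit(syn.mu, syn.singletons / syn.n_variants, 1)

# hypothetical variant group to evaluate (e.g. PAS hexamer variants)
grp = pd.DataFrame({
    'mu': [1e-8, 5e-8],
    'singletons': [450, 380],
    'n_variants': [500, 500],
})
expected = (slope * grp.mu + intercept) * grp.n_variants

# MAPS: excess of observed over expected singletons, per variant
maps = (grp.singletons.sum() - expected.sum()) / grp.n_variants.sum()
print(f'MAPS = {maps:.3f}')  # > 0 suggests selection against the group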

Repository Files

The repository mainly contains a Snakemake workflow in workflow/ and a python package, utr3variants, with the functions used in the pipeline scripts. Input paths, URLs and parameters are collected in configs/config.yml and should be adapted by the user. Additionally, there is a qc/ directory for exploratory analyses on the input databases. A sketch of the layout is shown below.
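
A sketch of the layout, restricted to the paths mentioned in this README (everything else omitted):

.
├── Snakefile             # pipeline entry point
├── configs/
│   └── config.yml        # input paths, URLs and parameters
├── workflow/
│   ├── envs/             # conda environment definitions
│   ├── rules/            # Snakemake rule files
│   └── scripts/          # scripts called by the rules
├── utr3variants/         # python package with pipeline functions
└── qc/                   # exploratory analyses on input databases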

The Snakemake file structure under workflow/ is organised as recommended by the Snakemake documentation. It contains Snakefiles with pipeline rules in rules/ and python scripts in scripts/. The Snakefiles in rules/ are included by the Snakefile in the repository root and cannot be run independently. When the pipeline runs locally, Snakemake objects are passed to the python scripts. Scripts that are submitted to GCP, however, do not handle Snakemake objects and can also be run independently. How the python scripts are used is best demonstrated in the Snakemake rules, where input and output files are defined. More information on scripts in Snakemake pipelines is documented here.
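
A minimal sketch of the rule/script pattern; rule, file and key names here are illustrative, not taken from this pipeline:

# in a Snakefile under workflow/rules/
rule count_table:
    input:
        counts='results/counts.tsv'
    output:
        table='results/table.tsv'
    script: '../scripts/example.py'

# in workflow/scripts/example.py — Snakemake injects a global `snakemake`
# object carrying the rule's input, output, params, config, etc.
import pandas as pd

df = pd.read_table(snakemake.input.counts)
df.to_csv(snakemake.output.table, sep='\t', index=False)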

Pipeline

The workflow is implemented as a Snakemake pipeline. A good resource on how Snakemake works is its official documentation.

Installation

The dependencies are available as a conda environment.

conda env create -f workflow/envs/environment.yml
conda activate utr-variants

The pipeline can run hail operations locally or on the Google Cloud Platform (GCP). In order to run locally, you need to install the gcs-connector in the new environment.

curl -sSL https://broad.io/install-gcs-connector | python
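
An optional quick check that hail can reach Google Cloud Storage through the connector; the path below is a public gnomAD bucket and only serves as an example:

python -c "import hail as hl; hl.init(); print(hl.hadoop_exists('gs://gcp-public-data--gnomad/release'))"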

Configuration

The configs/config.yml file contains links to hail datasets and local files. Local input files and output directories can be modified accordingly. For GCP runs, the keys bucket and cluster need to be specified, and the key local needs to be set to false in config.yml.

bucket: bucket_name
cluster: cluster_name
local: false

The key bucket specifies the GCP storage bucket your output will be stored in, while cluster defines the name of the hailctl dataproc cluster. Before calling the Snakemake pipeline on GCP, you need to start a dataproc session, preferably with a compute timeout.

hailctl dataproc start cluster_name --max-age=2h --packages gnomad
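
The --max-age flag makes sure the cluster shuts down after the given time. Once the pipeline has finished, the cluster can also be stopped manually:

hailctl dataproc stop cluster_name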

You don't need to start a dataproc session if you run the pipeline locally, with local set to true. In that case, the bucket and cluster keys do not need to be specified.
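
For a purely local run, the relevant part of configs/config.yml reduces to the following sketch (the remaining keys stay unchanged):

local: true
# bucket and cluster may be omitted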

Call the Pipeline

The pipeline is configured to be called from the root directory of this repository. For a dry run, you can supply the -n argument:

snakemake -n

In order to execute the steps listed in the dry run, use -j or --cores to set the number of cores available to the pipeline:

snakemake -j 10

Plotting rules have a separately defined conda environment for R dependencies (workflow/envs/utr-variants-r.yml). In order to use this environment, call the pipeline as follows:

snakemake --use-conda -j10

This will create the environment once and activate it for every rule call that requires it. You can also precompute the environment by calling

snakemake --use-conda --conda-create-envs-only

See Snakemake's documentation for more information on integrating conda environments.

Workflow Graph

It is useful to get a visual representation of what will be computed. For that, you can call the dependency rule, which creates graphs of the dependencies between rules (more general) and between jobs (specific to wildcards).

snakemake dependency -fj

Here, -f forces the pipeline to overwrite any pre-existing dependency graphs.

The output files can be found in the output_root directory specified in configs/config.yml. Below are some examples of a local run.

Rule graph for GCP run: [rulegraph image]

Job graph for GCP run: [dag image]

Code Snippets

The following snippets are excerpts from the pipeline's Snakemake rules and scripts.

shell:
    """
    # genomepy creates <genome_dir>/<assembly>/, so strip two levels off the output path
    genome_dir=$(dirname "{output[0]}")
    genome_dir=$(dirname "$genome_dir")
    echo downloading genome to: $genome_dir
    genomepy install {wildcards.assembly} \
        -p {params.provider} --annotation \
        -g $genome_dir >> {log} 2>&1
    """
shell:
    """
    wget --timestamping {params.url} -O {output}.gz
    gunzip {output}.gz
    """
shell:
    """
    # use the system temp directory (mktemp is more portable than the deprecated tempfile)
    tmpdir="$(dirname "$(mktemp)")"
    # TODO: download from GCP bucket
    wget --no-check-certificate -nc -P $tmpdir {params.url}
    unzip "$tmpdir"/human_pas.zip -d $(dirname {output})
    """
shell:
    """
    wget -nc -P $(dirname {output}) {params.url}
    """
shell:
    """
    wget -nc -P $(dirname {output}) {params.url}
    gzip -d {output}.gz
    """
shell:
    """
    INTERVAL_PATH="gs://{config[bucket]}/$(basename {input})"
    gsutil cp {input} $INTERVAL_PATH
    hailctl dataproc submit {config[cluster]} \
        --pyfiles {params.annotate_gnomad},{params.maps} \
        {params.script}  \
            -o gs://{output.maps} \
            --intervals $INTERVAL_PATH \
            --annotations {params.annotations} \
            --gnomAD_ht {config[gnomAD][gnomAD_ht]} \
            --context_ht {config[gnomAD][context_ht]} \
            --mutation_ht {config[gnomAD][mutation_rate_ht]} \
            --genome_assembly {config[genome_assembly]} \
            --chr_subset {params.chr_subset} \
            --skip_checks {config[gnomAD][skip_checks]}
    gsutil rm $INTERVAL_PATH
    """
script: '../scripts/prepare_gnomad.py'
script: '../scripts/count_singletons.py'
run:
    import pandas as pd
    from utr3variants.utils import extract_annotations

    df = pd.read_table(input.counts, sep='\t')
    df = extract_annotations(df, 'target', ALL_ANNOTATIONS)

    if 'hexamer_motif' in ALL_ANNOTATIONS:
        df = df[df.hexamer_motif != 'AAAAAA']  # remove A-rich hexamers
        # bin hexamers
        df['hexamer_motif_bin'] = df.hexamer_motif
        df.loc[
            (df.feature == 'hexamer')
            & ~df.hexamer_motif.isin(['AATAAA', 'ATTAAA']),
            'hexamer_motif_bin',
        ] = 'other hexamer'

    if 'percent_expressed' in ALL_ANNOTATIONS:
        df['percent_expressed'] = pd.cut(
            pd.to_numeric(df.percent_expressed, errors='coerce'), 10
        ).astype(str)

    # fill in blank values
    df['database'].fillna('gnomAD', inplace=True)
    df['feature'].fillna('other variant', inplace=True)
    # for anno in [x for x in ALL_ANNOTATIONS if x not in ['feature', 'database']]:
    #    df[anno].fillna(df.feature, inplace=True)

    df.to_csv(output.counts, sep='\t', index=False)
script: '../scripts/maps.R'
shell:
    """
    zcat {input} | grep three_prime_UTR | sortBed |\
     mergeBed -c 3,6,7 -o distinct,distinct,distinct > {output.utr}
    # TODO: determine 3UTR start depending on strand
    """
script: '../scripts/extract_polyadb.py'
script: '../scripts/extract_polyasite2.py'
script: '../scripts/merge_utr_intervals.py'
import hail as hl
import pandas as pd
from utr3variants.annotate_gnomad import annotate_by_intervals
from utr3variants.maps import count_for_maps


if __name__ == '__main__':
    ref = snakemake.config['genome_assembly']
    interval_file = snakemake.input['intervals'].__str__()
    gnomad_path = snakemake.input['gnomAD'].__str__()
    mutation_path = snakemake.config['gnomAD']['mutation_rate_ht']

    counts_out = snakemake.output.counts

    hl.init(
        local=f'local[{snakemake.threads}]',
        log=snakemake.log['hail'],
        default_reference=ref,
    )
    mutation_ht = hl.read_table(mutation_path)
    ht = hl.read_table(gnomad_path)
    intervals = hl.import_bed(interval_file, min_partitions=100)

    ht = annotate_by_intervals(ht, intervals, columns=['target'])

    count_ht = count_for_maps(
        ht,
        mutation_ht,
        additional_grouping=['target'],
        skip_mut_check=snakemake.config['gnomAD']['skip_checks'],
    )

    print('save...')
    count_ht.export(counts_out)

    # Preview of counts table
    count_df = pd.read_table(counts_out)
    print(count_df.head())
import re
from typing import Iterable
from typing import List
import numpy as np
import pandas as pd
import pybedtools
from pysam import FastaFile  # pylint: disable=no-name-in-module
from utr3variants.utils import extract_annotations, get_most_expressed

ANNO_COLUMN_MAP = {
    # Mapping of snakemake wildcard to column name in database
    'hexamer_motif': 'PAS Signal',
    'conservation': 'Conservation',
    'percent_expressed': 'PSE',
    'expression': 'Mean RPM',
    'most_expressed': 'most_expressed',
}


def create_signal_interval(
    feature: pybedtools.Interval, signal: str, sequence: str, name: str = None
) -> Iterable[pybedtools.Interval]:
    """
    Create pybedtools.Interval objects of signal locations within given interval

    :param feature: interval in which to search for signal
    :param signal: sequence of signal motif
    :param sequence: sequence of feature interval
    :param name: value to overwrite in feature.name if specified
    :return: generator of signal locations
    """
    name = feature.name if name is None else name
    return (
        pybedtools.Interval(
            chrom=feature.chrom,
            start=feature.end - i.end()
            if feature.strand == '-'
            else feature.start + i.start(),
            end=feature.end - i.start()
            if feature.strand == '-'
            else feature.start + i.end(),
            name=name,
            score=feature.score,
            strand=feature.strand,
        )
        for i in re.finditer(signal, sequence, re.IGNORECASE)
    )


def get_hexamers(
    feature: pybedtools.Interval, sequence: str, hexamer_column: int
) -> List[pybedtools.Interval]:
    """
    Retrieve all hexamers of regions from within given interval

    :param feature: interval with interval.name containing annotation
        of form <PAS signal>|<conservation>
    :param sequence: sequence of feature interval
    :param hexamer_column: column index of hexamer in feature name
    :return: list of hexamer coordinates
    """
    name_split = feature.name.split('|')
    hexamer_motif = name_split[hexamer_column]

    if hexamer_motif == 'NoPAS':
        return []
    if hexamer_motif == 'OtherPAS':
        signal_sequences = [
            'AGTAAA',
            'TATAAA',
            'CATAAA',
            'GATAAA',
            'AATATA',
            'AATACA',
            'AATAGA',
            'AAAAAG',
            'ACTAAA',
        ]
    elif hexamer_motif == 'Arich':
        signal_sequences = ['AAAAAA']
    else:
        signal_sequences = [hexamer_motif]
    dna_signals = [s.replace('U', 'T') for s in signal_sequences]

    hexamer_list = []
    for signal in dna_signals:
        name_split[hexamer_column] = signal  # rename hexamer annotation
        intervals = create_signal_interval(
            feature, signal, sequence, name='|'.join(name_split)
        )
        hexamer_list.extend(intervals)
    return hexamer_list


if __name__ == '__main__':
    genome = snakemake.config['assembly_ucsc']
    annotations = snakemake.params.annotations
    pas_db = snakemake.input.db.__str__()
    fasta_file = snakemake.input.fasta.__str__()
    output = snakemake.output

    print('read database')
    df = pd.read_csv(pas_db, sep='\t')

    # cleanup columns
    df['Start'] = df['Position'] - 1
    df['PSE'] = df['PSE'].str.rstrip('%').astype('float') / 100
    df['score'] = '.'

    df = get_most_expressed(
        df,
        aggregate_column='Gene Symbol',
        expression_column='Mean RPM',
        interval_columns=['Chromosome', 'Position', 'Strand'],
        new_column='most_expressed',
    )
    annotations.append('most_expressed')

    # put annotation into name
    db_cols = [ANNO_COLUMN_MAP[a] for a in annotations]
    df['Name'] = df[db_cols].astype(str).agg('|'.join, axis=1)

    # create bedtools object/table
    print('Create BedTools object')
    # BedTool object needed for hexamer extraction
    bed_cols = [
        'Chromosome',  # chrom
        'Start',  # start
        'Position',  # end
        'Name',  # name
        'score',  # score
        'Strand',  # strand
    ]
    bed = pybedtools.BedTool.from_dataframe(df[bed_cols])

    # convert back to dataframe with annotations
    intervals_df = extract_annotations(
        bed.to_dataframe(),
        annotation_string='name',
        annotation_columns=annotations,
        database='PolyA_DB',
        feature='PAS',
    )

    if 'hexamer_motif' in annotations:
        print('extract hexamers')
        bed_40nt_us = bed.slop(  # pylint: disable=unexpected-keyword-arg
            l=40, r=0, s=True, genome=genome
        ).sequence(fi=fasta_file, s=True, fullHeader=True)
        sequences = FastaFile(bed_40nt_us.seqfn)

        hexamer_intervals = []
        hex_column = annotations.index('hexamer_motif')
        for f in bed_40nt_us:
            seq = sequences.fetch(f'{f.chrom}:{f.start}-{f.stop}({f.strand})')
            hex_intervals = get_hexamers(
                feature=f, sequence=seq, hexamer_column=hex_column
            )
            hexamer_intervals.extend(hex_intervals)

        hex_df = pybedtools.BedTool(hexamer_intervals).to_dataframe()
        hex_df = extract_annotations(
            hex_df,
            annotation_string='name',
            annotation_columns=annotations,
            database='PolyA_DB',
            feature='hexamer',
        )
        intervals_df['hexamer_motif'] = np.nan
        intervals_df = pd.concat([intervals_df, hex_df])

    print('save...')
    # get stats
    intervals_df['feature'].value_counts().to_csv(output.stats, sep='\t')
    intervals_df.to_csv(output.intervals, index=False, sep='\t')
from collections import OrderedDict
import pandas as pd
import pybedtools
from utr3variants.utils import (
    encode_annotations,
    extract_annotations,
    convert_chromosome,
    get_most_expressed,
)

COLUMN_TYPES = OrderedDict(
    chrom=str,
    start=int,
    end=int,
    cluster_id=str,
    score=float,
    strand=str,
    percent_expressed=float,
    n_protocols=int,
    expression=float,
    cluster_annotation=str,
    hexamer_motif=str,
)

MERGED_COLUMN_TYPES = OrderedDict(
    chrom=str,
    start=int,
    end=int,
    name=str,
    score=float,
    strand=str,
    gchrom=str,
    gstart=int,
    gend=int,
    gname=str,
    gscore=str,
    gstrand=str,
    g13=str,
    g14=str,
    g15=str,
    g16=str,
    g17=str,
    g18=str,
    g19=str,
)


if __name__ == '__main__':
    genome = snakemake.config['assembly_ucsc']
    annotations = snakemake.params.annotations
    pas_db = snakemake.input.db.__str__()
    genes_file = snakemake.input.genes.__str__()
    output = snakemake.output

    df_db = pd.read_csv(pas_db, sep='\t', names=COLUMN_TYPES.keys(), dtype=COLUMN_TYPES)

    # match chromosome style to genome annotation
    df_db['chrom'] = df_db.chrom.apply(lambda x: convert_chromosome(x, style='chr'))

    if 'most_expressed' not in annotations:
        df_db = encode_annotations(df_db, annotations, 'name')
    else:
        df_db = encode_annotations(
            df_db,
            annotation_columns=[x for x in annotations if x != 'most_expressed'],
            encode_column='name',
        )

        bed_cols = ['chrom', 'start', 'end', 'name', 'score', 'strand']
        bed = pybedtools.BedTool.from_dataframe(df_db[bed_cols])

        print('Annotate overlapping genes')
        genes_bed = pybedtools.BedTool(genes_file).sort()
        df_db = (
            bed.sort()
            .closest(genes_bed, s=True, fu=True, D='ref')
            .to_dataframe(names=MERGED_COLUMN_TYPES.keys(), dtype=MERGED_COLUMN_TYPES)
        )

        print('Annotate most expressed PAS')
        df_db = get_most_expressed(
            df_db,
            aggregate_column='gname',
            expression_column='score',
            interval_columns=['chrom', 'start', 'end', 'strand'],
            new_column='most_expressed',
        )
        df_db['name'] = df_db['name'] + '|' + df_db['most_expressed'].map(str)

        # drop unnecessary columns
        df_db.drop(
            columns=[x for x in df_db.columns if x.startswith('g')], inplace=True
        )

    intervals_df = extract_annotations(
        df_db,
        annotation_string='name',
        annotation_columns=annotations,
        database='PolyASite2',
        feature='PAS',
    )

    if 'hexamer_motif' in annotations:
        print('Extract hexamers')
        hexamers_df = intervals_df[intervals_df['hexamer_motif'] != ''].copy()

        # split by different signals per PAS by ;
        hexamers_df['hexamer_motif_split'] = hexamers_df['hexamer_motif'].str.split(';')
        hexamers_df = hexamers_df.explode('hexamer_motif_split')

        # split each signal by @ into different columns
        hexamers_df[['hexamer_motif', 'rel_pos', 'hex_start']] = hexamers_df[
            'hexamer_motif_split'
        ].str.split('@', expand=True)

        # retrieve hexamer interval
        hexamers_df.hex_start = hexamers_df.hex_start.astype(int)
        hexamers_df.start = (hexamers_df.hex_start - 1).where(
            hexamers_df.strand == '+', hexamers_df.hex_start - 6
        )
        hexamers_df.end = hexamers_df.hex_start.where(
            hexamers_df.strand == '-', hexamers_df.hex_start - 1 + 6
        )

        # remove unnecessary columns
        hexamers_df.drop(
            columns=['hexamer_motif_split', 'rel_pos', 'hex_start'], inplace=True
        )

        # append hexamer regions translated into BED coordinates
        hexamers_df['database'] = 'PolyASite2'
        hexamers_df['feature'] = 'hexamer'
        intervals_df['hexamer_motif'] = ''
        intervals_df = pd.concat([intervals_df, hexamers_df])

    print('save...')
    intervals_df['feature'].value_counts().to_csv(output.stats, sep='\t')  # get stats
    intervals_df.to_csv(output.intervals, index=False, sep='\t')
library(data.table)
library(ggplot2)
library(tools)  # provides toTitleCase
source(snakemake@input$utils)

count_dt <- fread(snakemake@input$counts)

# get reference MAPS
dt_csq <- maps_reference(count_dt)

# get aggregation MAPS
aggregations <- strsplit(snakemake@wildcards$aggregation, '-')[[1]]
dt <- maps(count_dt, grouping = aggregations)
setorder(dt, -maps)
fwrite(dt, snakemake@output$tsv, sep = '\t')

# Plot only top n variants
dt <- dt[variant_count > snakemake@params$variant_count_min]

new_columns <- c('anno_x', 'shape', 'facet')[seq_along(aggregations)]

# Flatten for 3'UTR and other variants
if (length(aggregations) == 3) {
  if ('database' %in% aggregations) {
    flatten_values <- c('GENCODE', 'gnomAD')
    #dt_lines <- dt[database %in% flatten_values]
    dt <- dt[!database %in% flatten_values]
  } else if ('feature' %in% aggregations) {
    flatten_values <- c('3UTR', 'other variant')
    #dt_lines <- dt[feature %in% flatten_values]
    dt <- dt[!feature %in% flatten_values]
  }
  #setnames(dt_lines, old = aggregations, new = new_columns)
}

setnames(dt, old = aggregations, new = new_columns)
dt[, anno_x := factor(anno_x, levels = unique(anno_x), ordered = TRUE)]

title <- paste0('MAPS on ', snakemake@wildcards$chr_subset,
                ' (', snakemake@params$chr_subset, ')')
dodge_width <- 0.5

if ('shape' %in% names(dt)) {
  p <- ggplot(dt, aes(reorder(anno_x, -maps), maps, color = shape, group = shape))
} else {
  p <- ggplot(dt, aes(reorder(anno_x, -maps), maps))
}

if ('facet' %in% names(dt)) {
  p <- p +
    facet_grid('facet~.') +
    ggtitle(title, subtitle = paste('Facet:', toTitleCase(aggregations[3])))
}

if (exists('dt_lines')) {
  p <- p + geom_hline(
    aes(yintercept = maps, color = shape),
    data = dt_lines[, .(maps, shape)]
  )
}

p <- p +
  geom_point(position = position_dodge(width = dodge_width)) +
  labs(
    title = title,
    x = toTitleCase(aggregations[1]),
    y = 'MAPS',
    color = toTitleCase(aggregations[2])
  ) +
  geom_hline(  # add lines for reference
    aes(yintercept = maps, group = consequence),
    data = dt_csq[, .(maps, consequence)],
    linetype = 'dashed',
    color = 'grey50'
  ) +
  geom_text(
    aes(x = 0, y = maps, label = consequence, vjust = 1.3),
    hjust = -0.05,
    data = dt_csq,
    inherit.aes = FALSE,
    size = 2.5
  ) +
  geom_text(
    aes(y = min(maps - maps_sem) - 0.03, label = variant_count),
    size = 3,
    angle = 30,
    position = position_dodge(dodge_width)
  ) +
  geom_errorbar(
    aes(ymin = maps - maps_sem, ymax = maps + maps_sem),
    width = 0.15,
    position = position_dodge(width = dodge_width)
  ) +
  #scale_color_brewer(palette = 'Set1') +
  theme_classic() +
  theme(
    legend.position = 'bottom',
    axis.text.x = element_text(angle = 90, hjust = 1),
    axis.ticks = element_blank(),
  )

ggsave(snakemake@output$png, plot = p, width = 8, height = 6)
import tempfile
from functools import reduce
import pandas as pd
from pandas.api.types import CategoricalDtype
import pybedtools
from utr3variants.utils import (
    encode_annotations,
    extract_annotations,
    convert_chromosome,
)

BED_COLS = ['chrom', 'start', 'end', 'name', 'score', 'strand']
CANONICAL_CHROMOSOMES = [str(x) for x in list(range(1, 23))] + ['X', 'Y']
CANONICAL_CHROMOSOMES_CHR = [f'chr{chr}' for chr in CANONICAL_CHROMOSOMES]

chr_dtype = CategoricalDtype(CANONICAL_CHROMOSOMES, ordered=True)


def rename_interval(
    interval: pybedtools.Interval,
    prefix: str = None,
    name: str = None,
    sep='|',
    keep_col=-1,
) -> pybedtools.Interval:
    """
    Change name of interval by adding a prefix and/or overwriting the old name

    :param interval:
    :param prefix: prefix to add to existing name if specified
    :param name: value to overwrite old name, depending on value of keep_col
    :param sep: character to separate prefix from rest of name
    :param keep_col: column index of annotation column to keep in interval.name list
        separated by sep. Overwrite all annotations with name if keep_col=-1 (default)
        e.g. for interval.name = 'AAUAAA|No' and keep_col = 0: keep 'AAUAAA', drop 'No'
    """
    if name is None:
        name = interval.name
        if keep_col >= 0:
            name = name.split(sep)[keep_col]
    if prefix is not None:
        name = sep.join(filter(None, [prefix, name]))
    interval.name = name
    return interval


if __name__ == '__main__':
    chr_style = snakemake.params.chr_style_gnomAD
    annotations = snakemake.params.annotations
    chainfile = snakemake.input.chainfile.__str__()

    utrs = pd.read_table(
        snakemake.input.utrs.__str__(), sep='\t', dtype=str, names=BED_COLS
    )  # pybedtools.BedTool(snakemake.input.utrs.__str__())
    polya_db = pd.read_table(snakemake.input.PolyA_DB.__str__(), sep='\t', dtype=str)
    polya_site = pd.read_table(
        snakemake.input.PolyASite2.__str__(), sep='\t', dtype=str
    )

    # liftover PolyASite2 to hg19
    pasite_anno = [x for x in annotations if x in polya_site.columns]
    polya_site = encode_annotations(polya_site, pasite_anno, 'name')
    polya_site_bed = pybedtools.BedTool.from_dataframe(polya_site[BED_COLS]).liftover(
        chainfile
    )
    polya_site = extract_annotations(
        polya_site_bed.to_dataframe(dtype=str), 'name', pasite_anno
    )
    polya_site = polya_site[polya_site.chrom.isin(CANONICAL_CHROMOSOMES_CHR)]

    # convert chromosome format to match gnomAD dataset
    print('Convert chromosome styles')
    utrs.chrom = utrs.chrom.apply(lambda x: convert_chromosome(x, chr_style))
    polya_db.chrom = polya_db.chrom.apply(lambda x: convert_chromosome(x, chr_style))
    polya_site.chrom = polya_site.chrom.apply(
        lambda x: convert_chromosome(x, chr_style)
    )

    # subtract from 3'UTR
    utrs_bed = pybedtools.BedTool.from_dataframe(utrs)
    polya_db_bed = pybedtools.BedTool.from_dataframe(polya_db)
    polya_site_bed = pybedtools.BedTool.from_dataframe(polya_site)
    with tempfile.TemporaryDirectory() as tmp_dir:
        other_utr = (
            utrs_bed.subtract(polya_db_bed)  # pylint: disable=too-many-function-args
            .subtract(polya_site_bed)
            .saveas(f'{tmp_dir}/other_utr.bed')
            .to_dataframe(dtype=str)
        )
        other_utr['database'] = 'GENCODE'
        other_utr['feature'] = '3UTR'

    # merge all intervals via pandas
    df = reduce(
        lambda df_left, df_right: pd.merge(
            df_left,
            df_right,
            how='outer',
            on=list(set(df_left.columns) & set(df_right.columns)),
        ),
        [other_utr, polya_db, polya_site],
    )

    # sort intervals
    df['chrom'] = df['chrom'].astype(str).astype(chr_dtype)
    df.sort_values(by=['chrom', 'start', 'end', 'strand'], inplace=True)

    print('Create hail-parsable locus intervals')
    df['locus_interval'] = (
        '('
        + df.chrom.astype(str)
        + ':'
        + df.start.map(str)
        + '-'
        + df.end.map(str)
        + ']'
    )

    # Encode annotations
    df = encode_annotations(df, annotations, 'name', verbose=True)

    # save
    print('save...')
    df[['locus_interval', 'strand'] + annotations].to_csv(
        snakemake.output.intervals, sep='\t', index=False
    )
    df[['chrom', 'start', 'end', 'name', 'score', 'strand']].to_csv(
        snakemake.output.bed, sep='\t', index=False, header=False
    )
import hail as hl
from utr3variants.annotate_gnomad import filter_gnomad, annotate_for_maps


if __name__ == '__main__':
    hl.init(
        local=f'local[{snakemake.threads}]',
        log=snakemake.log['hail'],
        default_reference=snakemake.config['genome_assembly'],
    )
    context_ht = hl.read_table(snakemake.config['gnomAD']['context_ht'])
    ht = hl.read_table(snakemake.config['gnomAD']['gnomAD_ht'])

    subset_interval = hl.parse_locus_interval(snakemake.params['chr_subset'])
    print('Filter gnomAD')
    ht = filter_gnomad(ht, [subset_interval])
    print('Annotate')
    ht = annotate_for_maps(ht, context_ht)

    print('save...')
    ht.write(snakemake.output['gnomAD_ht'])
shell:
    """
    snakemake --dag | dot -Tsvg -Grankdir=TB > {output.dag[0]}
    snakemake --dag | dot -Tpng -Grankdir=TB > {output.dag[1]}
    snakemake --rulegraph | dot -Tsvg -Grankdir=TB > {output.rulegraph[0]}
    snakemake --rulegraph | dot -Tpng -Grankdir=TB > {output.rulegraph[1]}
    """

Repository: https://github.com/populationgenomics/utr-variants