Identification of 3' UTR Variants Disrupting Transcript Function Pipeline


Identify variants occurring in 3’ UTRs that could disrupt the function of a transcript.

This project involves:

  • retrieving the reference genome using genomepy (see the sketch after this list)

  • extracting 3' UTR regions from the gene annotation

  • extracting PAS hexamers from PolyA_DB3

  • computing the MAPS score for different 3' UTR variant groups, locally or on the cloud using hailctl dataproc
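
A minimal sketch of the genome-retrieval step using genomepy's Python API; the pipeline itself calls the genomepy CLI, as shown in the Code Snippets below. Assembly name, provider and directory here are illustrative:

import genomepy

# download the assembly together with its gene annotation
genomepy.install_genome(
    'GRCh38',                     # assembly name, as set in configs/config.yml
    provider='Ensembl',           # assumption: Ensembl/UCSC/NCBI all work
    annotation=True,              # also fetch the gene annotation
    genomes_dir='output/genomes', # hypothetical output directory
)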

Evaluation Metrics

The different 3' UTR variant groups are evaluated with the MAPS (mutability-adjusted proportion of singletons) score, which measures the overall depletion of singletons, as described in the gnomAD flagship paper. A higher MAPS score (compared to a baseline) suggests stronger evolutionary selection against the variant group.
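
To make the metric concrete, here is a toy sketch of the MAPS idea. The actual computation lives in utr3variants.maps and the gnomAD utilities; all numbers below are made up and the calibration is simplified to a linear fit:

import numpy as np
import pandas as pd

# hypothetical calibration set (gnomAD uses synonymous variants):
# per mutational context: mutation rate, singleton count, total variants
syn = pd.DataFrame({
    'mu': [1e-8, 5e-8, 2e-7],
    'singletons': [700, 550, 300],
    'n_variants': [1000, 1000, 1000],
})
# expected proportion of singletons as a function of mutability
slope, intercept = np.polyfit(syn.mu, syn.singletons / syn.n_variants, 1)

# hypothetical variant group to evaluate (e.g. PAS hexamer variants)
grp = pd.DataFrame({
    'mu': [1e-8, 5e-8],
    'singletons': [450, 380],
    'n_variants': [500, 500],
})
expected = (slope * grp.mu + intercept) * grp.n_variants

# MAPS: excess of observed over expected singletons, per variant
maps = (grp.singletons.sum() - expected.sum()) / grp.n_variants.sum()
print(f'MAPS = {maps:.3f}')  # > 0 suggests selection against the group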

Repository Files

The repository mainly contains a Snakemake workflow in workflow/ and a python package, utr3variants, with the functions used in the pipeline scripts. Input paths, URLs and parameters are collected in configs/config.yml and should be adapted by the user. Additionally, there is a qc/ directory for exploratory analyses on the input databases. A sketch of the layout is shown below.
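
A sketch of the layout, restricted to the paths mentioned in this README (everything else omitted):

.
├── Snakefile             # pipeline entry point
├── configs/
│   └── config.yml        # input paths, URLs and parameters
├── workflow/
│   ├── envs/             # conda environment definitions
│   ├── rules/            # Snakemake rule files
│   └── scripts/          # scripts called by the rules
├── utr3variants/         # python package with pipeline functions
└── qc/                   # exploratory analyses on input databases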

The Snakemake file structure under workflow/ is organised as recommended by the Snakemake documentation. It contains Snakefiles with pipeline rules in rules/ and python scripts in scripts/. The Snakefiles in rules/ are included by the Snakefile in the repository root and cannot be run independently. When the pipeline runs locally, Snakemake objects are passed to the python scripts. Scripts that are submitted to GCP, however, do not handle Snakemake objects and can also be run independently. How the python scripts are used is best demonstrated in the Snakemake rules, where input and output files are defined. More information on scripts in Snakemake pipelines is documented here.
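
A minimal sketch of the rule/script pattern; rule, file and key names here are illustrative, not taken from this pipeline:

# in a Snakefile under workflow/rules/
rule count_table:
    input:
        counts='results/counts.tsv'
    output:
        table='results/table.tsv'
    script: '../scripts/example.py'

# in workflow/scripts/example.py — Snakemake injects a global `snakemake`
# object carrying the rule's input, output, params, config, etc.
import pandas as pd

df = pd.read_table(snakemake.input.counts)
df.to_csv(snakemake.output.table, sep='\t', index=False)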

Pipeline

The workflow is implemented as a Snakemake pipeline. A good resource on how Snakemake works is its official documentation.

Installation

The dependencies are available as a conda environment.

conda env create -f workflow/envs/environment.yml
conda activate utr-variants

The pipeline can run hail operations locally or on the Google Cloud Platform (GCP). In order to run locally, you need to install the gcs-connector in the new environment.

curl -sSL https://broad.io/install-gcs-connector | python
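
An optional quick check that hail can reach Google Cloud Storage through the connector; the path below is a public gnomAD bucket and only serves as an example:

python -c "import hail as hl; hl.init(); print(hl.hadoop_exists('gs://gcp-public-data--gnomad/release'))"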

Configuration

The configs/config.yml file contains links to hail datasets and local files. Local input files and output directories can be modified accordingly. For GCP runs, the keys bucket and cluster need to be specified, and the key local needs to be set to false in config.yml.

bucket: bucket_name
cluster: cluster_name
local: false

The key bucket specifies the GCP storage bucket your output will be stored in, while cluster defines the name of the hailctl dataproc cluster. Before calling the Snakemake pipeline on GCP, you need to start a dataproc session, preferably with a compute timeout.

hailctl dataproc start cluster_name --max-age=2h --packages gnomad
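
The --max-age flag makes sure the cluster shuts down after the given time. Once the pipeline has finished, the cluster can also be stopped manually:

hailctl dataproc stop cluster_name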

You don't need to start a dataproc session if you run the pipeline locally, with local set to true. In that case, the bucket and cluster keys do not need to be specified.
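
For a purely local run, the relevant part of configs/config.yml reduces to the following sketch (the remaining keys stay unchanged):

local: true
# bucket and cluster may be omitted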

Call the Pipeline

The pipeline is configured to be called from the root directory of this repository. For a dry run, you can supply the -n argument:

snakemake -n

In order to execute the steps listed in the dry run, use -j or --cores to set the number of cores available to the pipeline:

snakemake -j 10

Plotting rules have a separately defined conda environment for R dependencies (workflow/envs/utr-variants-r.yml). In order to use this environment, call the pipeline as follows:

snakemake --use-conda -j10

This will create the environment once and activate it for every rule call that requires it. You can also precompute the environment by calling

snakemake --use-conda --conda-create-envs-only

See Snakemake's documentation for more information on integrating conda environments.

Workflow Graph

It is useful to get a visual representation of what will be computed. For that, you can call the dependency rule, which creates graphs of the dependencies between rules (more general) and between jobs (specific to wildcards).

snakemake dependency -fj

Here, -f forces the pipeline to overwrite any pre-existing dependency graphs.

The output files can be found in the output_root directory specified in configs/config.yml. Below are some examples of a local run.

Rule graph for GCP run: [rulegraph image]

Job graph for GCP run: [dag image]

Code Snippets

The following snippets are excerpts from the pipeline's Snakemake rules and scripts.

shell:
    """
    # genomepy creates <genome_dir>/<assembly>/, so strip two levels off the output path
    genome_dir=$(dirname "{output[0]}")
    genome_dir=$(dirname "$genome_dir")
    echo downloading genome to: $genome_dir
    genomepy install {wildcards.assembly} \
        -p {params.provider} --annotation \
        -g $genome_dir >> {log} 2>&1
    """
shell:
    """
    wget --timestamping {params.url} -O {output}.gz
    gunzip {output}.gz
    """
shell:
    """
    # use the system temp directory (mktemp is more portable than the deprecated tempfile)
    tmpdir="$(dirname "$(mktemp)")"
    # TODO: download from GCP bucket
    wget --no-check-certificate -nc -P $tmpdir {params.url}
    unzip "$tmpdir"/human_pas.zip -d $(dirname {output})
    """
shell:
    """
    wget -nc -P $(dirname {output}) {params.url}
    """
shell:
    """
    wget -nc -P $(dirname {output}) {params.url}
    gzip -d {output}.gz
    """
shell:
    """
    INTERVAL_PATH="gs://{config[bucket]}/$(basename {input})"
    gsutil cp {input} $INTERVAL_PATH
    hailctl dataproc submit {config[cluster]} \
        --pyfiles {params.annotate_gnomad},{params.maps} \
        {params.script}  \
            -o gs://{output.maps} \
            --intervals $INTERVAL_PATH \
            --annotations {params.annotations} \
            --gnomAD_ht {config[gnomAD][gnomAD_ht]} \
            --context_ht {config[gnomAD][context_ht]} \
            --mutation_ht {config[gnomAD][mutation_rate_ht]} \
            --genome_assembly {config[genome_assembly]} \
            --chr_subset {params.chr_subset} \
            --skip_checks {config[gnomAD][skip_checks]}
    gsutil rm $INTERVAL_PATH
    """
script: '../scripts/prepare_gnomad.py'
script: '../scripts/count_singletons.py'
run:
    import pandas as pd
    from utr3variants.utils import extract_annotations

    df = pd.read_table(input.counts, sep='\t')
    df = extract_annotations(df, 'target', ALL_ANNOTATIONS)

    if 'hexamer_motif' in ALL_ANNOTATIONS:
        df = df[df.hexamer_motif != 'AAAAAA']  # remove A-rich hexamers
        # bin hexamers
        df['hexamer_motif_bin'] = df.hexamer_motif
        df.loc[
            (df.feature == 'hexamer')
            & ~df.hexamer_motif.isin(['AATAAA', 'ATTAAA']),
            'hexamer_motif_bin',
        ] = 'other hexamer'

    if 'percent_expressed' in ALL_ANNOTATIONS:
        df['percent_expressed'] = pd.cut(
            pd.to_numeric(df.percent_expressed, errors='coerce'), 10
        ).astype(str)

    # fill in blank values
    df['database'].fillna('gnomAD', inplace=True)
    df['feature'].fillna('other variant', inplace=True)
    # for anno in [x for x in ALL_ANNOTATIONS if x not in ['feature', 'database']]:
    #    df[anno].fillna(df.feature, inplace=True)

    df.to_csv(output.counts, sep='\t', index=False)
script: '../scripts/maps.R'
shell:
    """
    zcat {input} | grep three_prime_UTR | sortBed |\
     mergeBed -c 3,6,7 -o distinct,distinct,distinct > {output.utr}
    # TODO: determine 3UTR start depending on strand
    """
script: '../scripts/extract_polyadb.py'
script: '../scripts/extract_polyasite2.py'
script: '../scripts/merge_utr_intervals.py'
import hail as hl
import pandas as pd
from utr3variants.annotate_gnomad import annotate_by_intervals
from utr3variants.maps import count_for_maps


if __name__ == '__main__':
    ref = snakemake.config['genome_assembly']
    interval_file = snakemake.input['intervals'].__str__()
    gnomad_path = snakemake.input['gnomAD'].__str__()
    mutation_path = snakemake.config['gnomAD']['mutation_rate_ht']

    counts_out = snakemake.output.counts

    hl.init(
        local=f'local[{snakemake.threads}]',
        log=snakemake.log['hail'],
        default_reference=ref,
    )
    mutation_ht = hl.read_table(mutation_path)
    ht = hl.read_table(gnomad_path)
    intervals = hl.import_bed(interval_file, min_partitions=100)

    ht = annotate_by_intervals(ht, intervals, columns=['target'])

    count_ht = count_for_maps(
        ht,
        mutation_ht,
        additional_grouping=['target'],
        skip_mut_check=snakemake.config['gnomAD']['skip_checks'],
    )

    print('save...')
    count_ht.export(counts_out)

    # Preview of counts table
    count_df = pd.read_table(counts_out)
    print(count_df.head())
import re
from typing import Iterable
from typing import List
import numpy as np
import pandas as pd
import pybedtools
from pysam import FastaFile  # pylint: disable=no-name-in-module
from utr3variants.utils import extract_annotations, get_most_expressed

ANNO_COLUMN_MAP = {
    # Mapping of snakemake wildcard to column name in database
    'hexamer_motif': 'PAS Signal',
    'conservation': 'Conservation',
    'percent_expressed': 'PSE',
    'expression': 'Mean RPM',
    'most_expressed': 'most_expressed',
}


def create_signal_interval(
    feature: pybedtools.Interval, signal: str, sequence: str, name: str = None
) -> Iterable[pybedtools.Interval]:
    """
    Create pybedtools.Interval objects of signal locations within given interval

    :param feature: interval in which to search for signal
    :param signal: sequence of signal motif
    :param sequence: sequence of feature interval
    :param name: value to overwrite in feature.name if specified
    :return: generator of signal locations
    """
    name = feature.name if name is None else name
    return (
        pybedtools.Interval(
            chrom=feature.chrom,
            start=feature.end - i.end()
            if feature.strand == '-'
            else feature.start + i.start(),
            end=feature.end - i.start()
            if feature.strand == '-'
            else feature.start + i.end(),
            name=name,
            score=feature.score,
            strand=feature.strand,
        )
        for i in re.finditer(signal, sequence, re.IGNORECASE)
    )


def get_hexamers(
    feature: pybedtools.Interval, sequence: str, hexamer_column: int
) -> List[pybedtools.Interval]:
    """
    Retrieve all hexamers of regions from within given interval

    :param feature: interval with interval.name containing annotation
        of form <PAS signal>|<conservation>
    :param sequence: sequence of feature interval
    :param hexamer_column: column index of hexamer in feature name
    :return: list of hexamer coordinates
    """
    name_split = feature.name.split('|')
    hexamer_motif = name_split[hexamer_column]

    if hexamer_motif == 'NoPAS':
        return []
    if hexamer_motif == 'OtherPAS':
        signal_sequences = [
            'AGTAAA',
            'TATAAA',
            'CATAAA',
            'GATAAA',
            'AATATA',
            'AATACA',
            'AATAGA',
            'AAAAAG',
            'ACTAAA',
        ]
    elif hexamer_motif == 'Arich':
        signal_sequences = ['AAAAAA']
    else:
        signal_sequences = [hexamer_motif]
    dna_signals = [s.replace('U', 'T') for s in signal_sequences]

    hexamer_list = []
    for signal in dna_signals:
        name_split[hexamer_column] = signal  # rename hexamer annotation
        intervals = create_signal_interval(
            feature, signal, sequence, name='|'.join(name_split)
        )
        hexamer_list.extend(intervals)
    return hexamer_list


if __name__ == '__main__':
    genome = snakemake.config['assembly_ucsc']
    annotations = snakemake.params.annotations
    pas_db = snakemake.input.db.__str__()
    fasta_file = snakemake.input.fasta.__str__()
    output = snakemake.output

    print('read database')
    df = pd.read_csv(pas_db, sep='\t')

    # cleanup columns
    df['Start'] = df['Position'] - 1
    df['PSE'] = df['PSE'].str.rstrip('%').astype('float') / 100
    df['score'] = '.'

    df = get_most_expressed(
        df,
        aggregate_column='Gene Symbol',
        expression_column='Mean RPM',
        interval_columns=['Chromosome', 'Position', 'Strand'],
        new_column='most_expressed',
    )
    annotations.append('most_expressed')

    # put annotation into name
    db_cols = [ANNO_COLUMN_MAP[a] for a in annotations]
    df['Name'] = df[db_cols].astype(str).agg('|'.join, axis=1)

    # create bedtools object/table
    print('Create BedTools object')
    # BedTool object needed for hexamer extraction
    bed_cols = [
        'Chromosome',  # chrom
        'Start',  # start
        'Position',  # end
        'Name',  # name
        'score',  # score
        'Strand',  # strand
    ]
    bed = pybedtools.BedTool.from_dataframe(df[bed_cols])

    # convert back to dataframe with annotations
    intervals_df = extract_annotations(
        bed.to_dataframe(),
        annotation_string='name',
        annotation_columns=annotations,
        database='PolyA_DB',
        feature='PAS',
    )

    if 'hexamer_motif' in annotations:
        print('extract hexamers')
        bed_40nt_us = bed.slop(  # pylint: disable=unexpected-keyword-arg
            l=40, r=0, s=True, genome=genome
        ).sequence(fi=fasta_file, s=True, fullHeader=True)
        sequences = FastaFile(bed_40nt_us.seqfn)

        hexamer_intervals = []
        hex_column = annotations.index('hexamer_motif')
        for f in bed_40nt_us:
            seq = sequences.fetch(f'{f.chrom}:{f.start}-{f.stop}({f.strand})')
            hex_intervals = get_hexamers(
                feature=f, sequence=seq, hexamer_column=hex_column
            )
            hexamer_intervals.extend(hex_intervals)

        hex_df = pybedtools.BedTool(hexamer_intervals).to_dataframe()
        hex_df = extract_annotations(
            hex_df,
            annotation_string='name',
            annotation_columns=annotations,
            database='PolyA_DB',
            feature='hexamer',
        )
        intervals_df['hexamer_motif'] = np.nan
        intervals_df = pd.concat([intervals_df, hex_df])

    print('save...')
    # get stats
    intervals_df['feature'].value_counts().to_csv(output.stats, sep='\t')
    intervals_df.to_csv(output.intervals, index=False, sep='\t')
from collections import OrderedDict
import pandas as pd
import pybedtools
from utr3variants.utils import (
    encode_annotations,
    extract_annotations,
    convert_chromosome,
    get_most_expressed,
)

COLUMN_TYPES = OrderedDict(
    chrom=str,
    start=int,
    end=int,
    cluster_id=str,
    score=float,
    strand=str,
    percent_expressed=float,
    n_protocols=int,
    expression=float,
    cluster_annotation=str,
    hexamer_motif=str,
)

MERGED_COLUMN_TYPES = OrderedDict(
    chrom=str,
    start=int,
    end=int,
    name=str,
    score=float,
    strand=str,
    gchrom=str,
    gstart=int,
    gend=int,
    gname=str,
    gscore=str,
    gstrand=str,
    g13=str,
    g14=str,
    g15=str,
    g16=str,
    g17=str,
    g18=str,
    g19=str,
)


if __name__ == '__main__':
    genome = snakemake.config['assembly_ucsc']
    annotations = snakemake.params.annotations
    pas_db = snakemake.input.db.__str__()
    genes_file = snakemake.input.genes.__str__()
    output = snakemake.output

    df_db = pd.read_csv(pas_db, sep='\t', names=COLUMN_TYPES.keys(), dtype=COLUMN_TYPES)

    # match chromosome style to genome annotation
    df_db['chrom'] = df_db.chrom.apply(lambda x: convert_chromosome(x, style='chr'))

    if 'most_expressed' not in annotations:
        df_db = encode_annotations(df_db, annotations, 'name')
    else:
        df_db = encode_annotations(
            df_db,
            annotation_columns=[x for x in annotations if x != 'most_expressed'],
            encode_column='name',
        )

        bed_cols = ['chrom', 'start', 'end', 'name', 'score', 'strand']
        bed = pybedtools.BedTool.from_dataframe(df_db[bed_cols])

        print('Annotate overlapping genes')
        genes_bed = pybedtools.BedTool(genes_file).sort()
        df_db = (
            bed.sort()
            .closest(genes_bed, s=True, fu=True, D='ref')
            .to_dataframe(names=MERGED_COLUMN_TYPES.keys(), dtype=MERGED_COLUMN_TYPES)
        )

        print('Annotate most expressed PAS')
        df_db = get_most_expressed(
            df_db,
            aggregate_column='gname',
            expression_column='score',
            interval_columns=['chrom', 'start', 'end', 'strand'],
            new_column='most_expressed',
        )
        df_db['name'] = df_db['name'] + '|' + df_db['most_expressed'].map(str)

        # drop unnecessary columns
        df_db.drop(
            columns=[x for x in df_db.columns if x.startswith('g')], inplace=True
        )

    intervals_df = extract_annotations(
        df_db,
        annotation_string='name',
        annotation_columns=annotations,
        database='PolyASite2',
        feature='PAS',
    )

    if 'hexamer_motif' in annotations:
        print('Extract hexamers')
        hexamers_df = intervals_df[intervals_df['hexamer_motif'] != ''].copy()

        # split by different signals per PAS by ;
        hexamers_df['hexamer_motif_split'] = hexamers_df['hexamer_motif'].str.split(';')
        hexamers_df = hexamers_df.explode('hexamer_motif_split')

        # split each signal by @ into different columns
        hexamers_df[['hexamer_motif', 'rel_pos', 'hex_start']] = hexamers_df[
            'hexamer_motif_split'
        ].str.split('@', expand=True)

        # retrieve hexamer interval
        hexamers_df.hex_start = hexamers_df.hex_start.astype(int)
        hexamers_df.start = (hexamers_df.hex_start - 1).where(
            hexamers_df.strand == '+', hexamers_df.hex_start - 6
        )
        hexamers_df.end = hexamers_df.hex_start.where(
            hexamers_df.strand == '-', hexamers_df.hex_start - 1 + 6
        )

        # remove unnecessary columns
        hexamers_df.drop(
            columns=['hexamer_motif_split', 'rel_pos', 'hex_start'], inplace=True
        )

        # append hexamer regions translated into BED coordinates
        hexamers_df['database'] = 'PolyASite2'
        hexamers_df['feature'] = 'hexamer'
        intervals_df['hexamer_motif'] = ''
        intervals_df = pd.concat([intervals_df, hexamers_df])

    print('save...')
    intervals_df['feature'].value_counts().to_csv(output.stats, sep='\t')  # get stats
    intervals_df.to_csv(output.intervals, index=False, sep='\t')
library(data.table)
library(ggplot2)
library(tools)  # provides toTitleCase
source(snakemake@input$utils)

count_dt <- fread(snakemake@input$counts)

# get reference MAPS
dt_csq <- maps_reference(count_dt)

# get aggregation MAPS
aggregations <- strsplit(snakemake@wildcards$aggregation, '-')[[1]]
dt <- maps(count_dt, grouping = aggregations)
setorder(dt, -maps)
fwrite(dt, snakemake@output$tsv, sep = '\t')

# Plot only top n variants
dt <- dt[variant_count > snakemake@params$variant_count_min]

new_columns <- c('anno_x', 'shape', 'facet')[seq_along(aggregations)]

# Flatten for 3'UTR and other variants
if (length(aggregations) == 3) {
  if ('database' %in% aggregations) {
    flatten_values <- c('GENCODE', 'gnomAD')
    #dt_lines <- dt[database %in% flatten_values]
    dt <- dt[!database %in% flatten_values]
  } else if ('feature' %in% aggregations) {
    flatten_values <- c('3UTR', 'other variant')
    #dt_lines <- dt[feature %in% flatten_values]
    dt <- dt[!feature %in% flatten_values]
  }
  #setnames(dt_lines, old = aggregations, new = new_columns)
}

setnames(dt, old = aggregations, new = new_columns)
dt[, anno_x := factor(anno_x, levels = unique(anno_x), ordered = TRUE)]

title <- paste0('MAPS on ', snakemake@wildcards$chr_subset,
                ' (', snakemake@params$chr_subset, ')')
dodge_width <- 0.5

if ('shape' %in% names(dt)) {
  p <- ggplot(dt, aes(reorder(anno_x, -maps), maps, color = shape, group = shape))
} else {
  p <- ggplot(dt, aes(reorder(anno_x, -maps), maps))
}

if ('facet' %in% names(dt)) {
  p <- p +
    facet_grid('facet~.') +
    ggtitle(title, subtitle = paste('Facet:', toTitleCase(aggregations[3])))
}

if (exists('dt_lines')) {
  p <- p + geom_hline(
    aes(yintercept = maps, color = shape),
    data = dt_lines[, .(maps, shape)]
  )
}

p <- p +
  geom_point(position = position_dodge(width = dodge_width)) +
  labs(
    title = title,
    x = toTitleCase(aggregations[1]),
    y = 'MAPS',
    color = toTitleCase(aggregations[2])
  ) +
  geom_hline(  # add lines for reference
    aes(yintercept = maps, group = consequence),
    data = dt_csq[, .(maps, consequence)],
    linetype = 'dashed',
    color = 'grey50'
  ) +
  geom_text(
    aes(x = 0, y = maps, label = consequence, vjust = 1.3),
    hjust = -0.05,
    data = dt_csq,
    inherit.aes = FALSE,
    size = 2.5
  ) +
  geom_text(
    aes(y = min(maps - maps_sem) - 0.03, label = variant_count),
    size = 3,
    angle = 30,
    position = position_dodge(dodge_width)
  ) +
  geom_errorbar(
    aes(ymin = maps - maps_sem, ymax = maps + maps_sem),
    width = 0.15,
    position = position_dodge(width = dodge_width)
  ) +
  #scale_color_brewer(palette = 'Set1') +
  theme_classic() +
  theme(
    legend.position = 'bottom',
    axis.text.x = element_text(angle = 90, hjust = 1),
    axis.ticks = element_blank(),
  )

ggsave(snakemake@output$png, plot = p, width = 8, height = 6)
import tempfile
from functools import reduce
import pandas as pd
from pandas.api.types import CategoricalDtype
import pybedtools
from utr3variants.utils import (
    encode_annotations,
    extract_annotations,
    convert_chromosome,
)

BED_COLS = ['chrom', 'start', 'end', 'name', 'score', 'strand']
CANONICAL_CHROMOSOMES = [str(x) for x in list(range(1, 23))] + ['X', 'Y']
CANONICAL_CHROMOSOMES_CHR = [f'chr{chr}' for chr in CANONICAL_CHROMOSOMES]

chr_dtype = CategoricalDtype(CANONICAL_CHROMOSOMES, ordered=True)


def rename_interval(
    interval: pybedtools.Interval,
    prefix: str = None,
    name: str = None,
    sep='|',
    keep_col=-1,
) -> pybedtools.Interval:
    """
    Change name of interval by adding a prefix and/or overwriting the old name

    :param interval:
    :param prefix: prefix to add to existing name if specified
    :param name: value to overwrite old name, depending on value of keep_col
    :param sep: character to separate prefix from rest of name
    :param keep_col: column index of annotation column to keep in interval.name list
        separated by sep. Overwrite all annotations with name if keep_col=-1 (default)
        e.g. for interval.name = 'AAUAAA|No' and keep_col = 0: keep 'AAUAAA', drop 'No'
    """
    if name is None:
        name = interval.name
        if keep_col >= 0:
            name = name.split(sep)[keep_col]
    if prefix is not None:
        name = sep.join(filter(None, [prefix, name]))
    interval.name = name
    return interval


if __name__ == '__main__':
    chr_style = snakemake.params.chr_style_gnomAD
    annotations = snakemake.params.annotations
    chainfile = snakemake.input.chainfile.__str__()

    utrs = pd.read_table(
        snakemake.input.utrs.__str__(), sep='\t', dtype=str, names=BED_COLS
    )  # pybedtools.BedTool(snakemake.input.utrs.__str__())
    polya_db = pd.read_table(snakemake.input.PolyA_DB.__str__(), sep='\t', dtype=str)
    polya_site = pd.read_table(
        snakemake.input.PolyASite2.__str__(), sep='\t', dtype=str
    )

    # liftover PolyASite2 to hg19
    pasite_anno = [x for x in annotations if x in polya_site.columns]
    polya_site = encode_annotations(polya_site, pasite_anno, 'name')
    polya_site_bed = pybedtools.BedTool.from_dataframe(polya_site[BED_COLS]).liftover(
        chainfile
    )
    polya_site = extract_annotations(
        polya_site_bed.to_dataframe(dtype=str), 'name', pasite_anno
    )
    polya_site = polya_site[polya_site.chrom.isin(CANONICAL_CHROMOSOMES_CHR)]

    # convert chromosome format to match gnomAD dataset
    print('Convert chromosome styles')
    utrs.chrom = utrs.chrom.apply(lambda x: convert_chromosome(x, chr_style))
    polya_db.chrom = polya_db.chrom.apply(lambda x: convert_chromosome(x, chr_style))
    polya_site.chrom = polya_site.chrom.apply(
        lambda x: convert_chromosome(x, chr_style)
    )

    # subtract from 3'UTR
    utrs_bed = pybedtools.BedTool.from_dataframe(utrs)
    polya_db_bed = pybedtools.BedTool.from_dataframe(polya_db)
    polya_site_bed = pybedtools.BedTool.from_dataframe(polya_site)
    with tempfile.TemporaryDirectory() as tmp_dir:
        other_utr = (
            utrs_bed.subtract(polya_db_bed)  # pylint: disable=too-many-function-args
            .subtract(polya_site_bed)
            .saveas(f'{tmp_dir}/other_utr.bed')
            .to_dataframe(dtype=str)
        )
        other_utr['database'] = 'GENCODE'
        other_utr['feature'] = '3UTR'

    # merge all intervals via pandas
    df = reduce(
        lambda df_left, df_right: pd.merge(
            df_left,
            df_right,
            how='outer',
            on=list(set(df_left.columns) & set(df_right.columns)),
        ),
        [other_utr, polya_db, polya_site],
    )

    # sort intervals
    df['chrom'] = df['chrom'].astype(str).astype(chr_dtype)
    df.sort_values(by=['chrom', 'start', 'end', 'strand'], inplace=True)

    print('Create hail-parsable locus intervals')
    df['locus_interval'] = (
        '('
        + df.chrom.astype(str)
        + ':'
        + df.start.map(str)
        + '-'
        + df.end.map(str)
        + ']'
    )

    # Encode annotations
    df = encode_annotations(df, annotations, 'name', verbose=True)

    # save
    print('save...')
    df[['locus_interval', 'strand'] + annotations].to_csv(
        snakemake.output.intervals, sep='\t', index=False
    )
    df[['chrom', 'start', 'end', 'name', 'score', 'strand']].to_csv(
        snakemake.output.bed, sep='\t', index=False, header=False
    )
import hail as hl
from utr3variants.annotate_gnomad import filter_gnomad, annotate_for_maps


if __name__ == '__main__':
    hl.init(
        local=f'local[{snakemake.threads}]',
        log=snakemake.log['hail'],
        default_reference=snakemake.config['genome_assembly'],
    )
    context_ht = hl.read_table(snakemake.config['gnomAD']['context_ht'])
    ht = hl.read_table(snakemake.config['gnomAD']['gnomAD_ht'])

    subset_interval = hl.parse_locus_interval(snakemake.params['chr_subset'])
    print('Filter gnomAD')
    ht = filter_gnomad(ht, [subset_interval])
    print('Annotate')
    ht = annotate_for_maps(ht, context_ht)

    print('save...')
    ht.write(snakemake.output['gnomAD_ht'])
shell:
    """
    snakemake --dag | dot -Tsvg -Grankdir=TB > {output.dag[0]}
    snakemake --dag | dot -Tpng -Grankdir=TB > {output.dag[1]}
    snakemake --rulegraph | dot -Tsvg -Grankdir=TB > {output.rulegraph[0]}
    snakemake --rulegraph | dot -Tpng -Grankdir=TB > {output.rulegraph[1]}
    """

Repository: https://github.com/populationgenomics/utr-variants