Mutational antigenic profiling of SARS-CoV-2 RBD against REGN-COV2 and LY-CoV016


Analysis of mutational antigenic profiling of barcoded codon variants of SARS-CoV-2 RBD against the antibodies in REGN-COV2 and LY-CoV016.

Study and analysis by Tyler Starr, Allie Greaney, Jesse Bloom, and co-authors. Details in the paper here. The code is in the GitHub repository at https://github.com/jbloomlab/SARS-CoV-2-RBD_MAP_clinical_Abs.

Summary of workflow and results

For a summary of the workflow and links to key results files, click here. Reading this summary is the best way to understand the analysis.

Running the analysis

The analysis consists of three components, all of which are contained in this repository:

  1. Instructions to build the computing environment.

  2. The required input data.

  3. The computer code and a Snakemake file to run it.

Configure .git to not track Jupyter notebook metadata

To simplify git tracking of Jupyter notebooks, we have added the filter described here, which strips notebook metadata, to .gitattributes and .gitconfig. The first time you check out this repo, run the following command to use this configuration (see here):

git config --local include.path ../.gitconfig

Then don't worry about it anymore.
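For reference, the filter wiring looks roughly like the sketch below. This is an illustration only: the filter name (jupyternotebook) is a placeholder, the exact clean command in this repo's .gitattributes and .gitconfig may differ, and the sketch assumes the jq tool is available.

In .gitattributes:

*.ipynb filter=jupyternotebook

In .gitconfig:

[filter "jupyternotebook"]
    clean = jq --indent 1 '.metadata = {} | .cells[].metadata = {}'
    smudge = cat
    required = true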

Build the computing environment

First, set up the computing environment, which is partially done via conda. Ensure you have conda installed; if not, install it via Miniconda as described here. The fully pinned environment is specified in environment.yml, and an unpinned version is in environment_unpinned.yml. If the environment already exists, you can activate it with:

conda activate SARS-CoV-2-RBD_MAP

If you need to build the environment, first create it with:

conda env create -f environment.yml

Then activate it as above.

Input data

The input data are specified in ./data/; see the README in that subdirectory for more details.

Note that this pipeline assumes you are using the FASTQ files on the Fred Hutch server. Otherwise, you need to download them from the SRA; they are BioSample SAMN16850904 under BioProject PRJNA639956.

Running the code

The analysis consists of Jupyter notebooks in the top-level directory along with some additional code in Snakefile. You can run the analysis by using Snakemake to run Snakefile, specifying the conda environment, as in:

snakemake --use-conda --conda-prefix /fh/fast/bloom_j/software/miniconda3/envs/SARS-CoV-2-RBD_MAP -R make_summary -j 1

However, you probably want to use a cluster for the computationally intensive parts of the analysis. To run using the cluster configuration for the Fred Hutch server, simply run the bash script run_Hutch_cluster.bash, which executes Snakefile in a way that takes advantage of the Hutch server resources. You likely want to submit run_Hutch_cluster.bash itself to the cluster (since it takes a while to run) with:

sbatch -t 7-0 run_Hutch_cluster.bash

Configuring the analysis

The configuration for the analysis is specified in config.yaml. This file defines key variables for the analysis and should be relatively self-explanatory. You should modify the analysis by changing this configuration file; do not hard-code crucial experiment-specific variables within the notebooks or Snakefile.

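As a minimal illustration of how the configuration is consumed (a sketch, not the repo's actual code), a notebook or script can load config.yaml with PyYAML, which Snakemake itself also uses for its config files. The key names shown are ones referenced in the Snakefile snippets below; the full set of keys is defined in config.yaml itself:

import yaml

# load the analysis configuration
with open('config.yaml') as f:
    config = yaml.safe_load(f)

# a few keys referenced by the Snakefile
print(config['summary_dir'])
print(config['codon_variant_table_url'])
print(config['supp_data_dir'])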

Cluster configuration

There is a cluster configuration file cluster.yaml that configures Snakefile for the Fred Hutch cluster. The run_Hutch_cluster.bash script uses this configuration to run Snakefile. If you are using a cluster other than the Fred Hutch one, you will need to modify the cluster configuration file.

Notebooks that perform the analysis

The Jupyter notebooks that perform most of the analysis are in this top-level directory with the extension *.ipynb. These notebooks read the key configuration values from config.yaml.

There is also a ./scripts/ subdirectory with related scripts.

The notebooks need to be run in the order described in the workflow and results summary. This will occur automatically if you run them via Snakefile as described above.

Results

Results are placed in the ./results/ subdirectory. Many of the files created in this subdirectory are not tracked in the git repo because they are very large. However, key results files are tracked, as is a summary that shows the code and results. Click here to see that summary.
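As a hedged illustration of working with the tracked outputs (the file path below is hypothetical; the actual names and locations are set in config.yaml and listed in the workflow summary), a results CSV such as the mutation escape-fraction file can be inspected with pandas, which the analysis environment is expected to include:

import pandas as pd

# hypothetical path to the mutation escape-fraction CSV produced by the pipeline
escape_fracs = pd.read_csv('results/escape_scores/escape_fracs.csv')

# show the available columns and the first few rows
print(escape_fracs.columns.tolist())
print(escape_fracs.head())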

The large results files are tracked via git-lfs. This requires git-lfs to be installed, which it is in the conda environment specified by environment.yml. The following command was then run:

git lfs install

You may need to run this if you are tracking these files and haven't installed git-lfs in your user account. Then the large results files were added for tracking with:

git lfs track <FILENAME>

Updating the conda environment

environment.yml contains a fully pinned conda environment. An environment without all of the versions pinned is in environment_unpinned.yml. If you need to update the environment, the suggested way is to add the new requirement to environment_unpinned.yml, then build, activate, and export that environment. These three steps can be done with the following commands:

conda env create -f environment_unpinned.yml
conda activate SARS-CoV-2-RBD_MAP
conda env export > environment.yml

Creating "subset" repos and uploading data to the SRA

Currently this repo contains analyses of many antibodies and sera, and should remain private since collaborators do not want all of these data to be public.

For papers, you can make a public "subset" repo by following the instructions in ./subset_data/. After making a subset repo, you can upload sequencing data to the Sequence Read Archive (SRA) following the instructions in ./SRA_upload/.

Code Snippets

import argparse
import os
import subprocess


def main():

    parser = argparse.ArgumentParser(
                description='Run Jupyter notebook, create Markdown output')
    parser.add_argument('nb', help='Name of Jupyter notebook.')
    parser.add_argument('md_output', help='Name of created Markdown output.')
    args = parser.parse_args()

    nb = args.nb
    if not (os.path.isfile(nb) and os.path.splitext(nb)[-1] == '.ipynb'):
        raise IOError(f"not a valid Jupyter notebook: {nb}")

    md_output = args.md_output
    if os.path.splitext(md_output)[-1] not in {'.md', '.markdown'}:
        raise IOError(f"not a valid Markdown file: {md_output}")

    if os.path.isfile(md_output):
        os.remove(md_output)

    # execute the notebook in place, with no timeout on cell execution
    subprocess.check_call(['jupyter', 'nbconvert',
                           '--to', 'notebook',
                           '--execute',
                           '--inplace',
                           '--ExecutePreprocessor.timeout=-1',
                           nb,
                           ])

    # convert the executed notebook to Markdown at the requested location
    subprocess.check_call(['jupyter', 'nbconvert',
                           '--output-dir', os.path.dirname(md_output),
                           '--output', os.path.basename(md_output),
                           '--to', 'markdown',
                           nb,
                           ])


if __name__ == '__main__':
    main()
Python, from line 10 of scripts/run_nb.py
run:
    def path(f):
        """Get path relative to `summary_dir`."""
        return os.path.relpath(f, config['summary_dir'])
    with open(output.summary, 'w') as f:
        f.write(textwrap.dedent(f"""
        # Summary

        Analysis run by [Snakefile]({path(workflow.snakefile)})
        using [this config file]({path(workflow.configfiles[0])}).
        See the [README in the top directory]({path('README.md')})
        for details.

        Here is the rule graph of the computational workflow:
        ![{path(input.rulegraph)}]({path(input.rulegraph)})

        Here is the Markdown output of each notebook in the workflow:

        1. Get codon-variant-table from [here]({config['codon_variant_table_url']}).

        2. Count variants and then
           [aggregate counts]({path(input.aggregate_variant_counts)}) to create
           the [variant counts file]({path(input.variant_counts)}).

        3. [Analyze sequencing counts to cells ratio]({path(input.counts_to_cells_ratio)});
           this prints a list of any samples where this ratio is too low. Also
           creates [a CSV]({path(input.counts_to_cells_csv)}) with the
           sequencing counts, number of sorted cells, and ratios for
           all samples.

        4. [Escape scores from variant counts]({path(input.counts_to_scores)}).

        5. [Escape fractions for mutations and homologs]({path(input.scores_to_frac_escape)});
           creating [mutation escape fraction file]({path(input.escape_fracs)})
           and [homolog escape fraction file]({path(input.escape_fracs_homologs)}).

        6. [Call sites of strong escape]({path(input.call_strong_escape_sites)}),
           and write to [a CSV file]({path(input.strong_escape_sites)}).

        7. Plot [escape profiles]({path(input.escape_profiles)}).

        8. [Multidimensional scaling]({path(input.mds_escape_profiles)}) on escape profiles.

        9. Map escape profiles to ``*.pdb`` files using [this notebook]({path(input.output_pdbs)}).

        10. Output a list of RBD structural contacts for each antibody with a high-resolution structure using [this notebook]({path(input.structural_contacts)}).

        11. [Count mutations in GISAID RBD sequences]({path(input.gisaid_rbd_mutations)})
            to create [this counts file]({path(input.gisaid_mutation_counts)}).

        12. [Analyze GISAID mutations at sites of escape]({path(input.natural_mutations)}).

        13. [Escape from binding by homologs]({path(input.homolog_escape)}).

        14. [Analyze viral escape selections]({path(input.escape_selections)}).

        15. [Make supplementary data files]({path(input.make_supp_data)}),
            which are [here]({path(config['supp_data_dir'])}). These include
            `dms-view` input files.

        16. Make custom plots for this project that fall outside the core pipeline, summarized in [this notebook]({path(input.custom_plots)}).

        """
        ).strip())
Snakemake, from line 93 of Snakefile
shell:
    "snakemake --forceall --rulegraph | dot -Tsvg > {output}"
Snakemake, from line 164 of Snakefile
shell:
    """
    R -e \"rmarkdown::render(input=\'{params.nb}\')\";
    mv {params.md} {output.md};
    mv {params.md_files} {output.md_files}
    """
Snakemake, from line 180 of Snakefile
shell:
    """
    R -e \"rmarkdown::render(input=\'{params.nb}\')\";
    mv {params.md} {output.md}
    """
Snakemake, from line 198 of Snakefile
shell:
    "python scripts/run_nb.py {params.nb} {output.nb_markdown}"
Snakemake, from line 215 of Snakefile. The same two-line pattern (a shell call to scripts/run_nb.py with the rule's notebook and Markdown output) is repeated for the other notebook-running rules, at lines 227, 238, 249, 262, 274, 287, 304, 316, 331, 349, 362, and 381 of Snakefile.
run:
    # parse sample and library from `sample_lib` wildcard
    lib = params.sample_lib.split('_')[-1]
    sample = params.sample_lib[: -len(lib) - 1]
    assert sample == (barcode_runs_expandR1
                      .set_index('sample_lib')
                      .at[params.sample_lib, 'sample']
                      )
    assert lib == (barcode_runs_expandR1
                   .set_index('sample_lib')
                   .at[params.sample_lib, 'library']
                   )
    # initialize `CodonVariantTable` (used to get valid barcodes)
    wt_seqrecord = Bio.SeqIO.read(input.wt_seq, 'fasta')
    geneseq = str(wt_seqrecord.seq)
    primary_target = wt_seqrecord.name
    variants=dms_variants.codonvarianttable.CodonVariantTable(
                geneseq=geneseq,
                barcode_variant_file=input.variant_table,
                substitutions_are_codon=True,
                substitutions_col='codon_substitutions',
                primary_target=primary_target)
    # initialize `IlluminaBarcodeParser`
    parser = dms_variants.illuminabarcodeparser.IlluminaBarcodeParser(
                valid_barcodes=variants.valid_barcodes(lib),
                **config['illumina_barcode_parser_params'])
    # parse barcodes
    counts, fates = parser.parse(input.r1s,
                                 add_cols={'library': lib,
                                           'sample': sample})
    # write files
    counts.to_csv(output.counts, index=False)
    fates.to_csv(output.fates, index=False)
Snakemake, from line 398 of Snakefile
run:
    urllib.request.urlretrieve(config['codon_variant_table_url'],
                               output.codon_variant_table)
Snakemake, from line 436 of Snakefile
run:
    urllib.request.urlretrieve(config['mut_bind_expr_url'], output.file)
Snakemake, from line 444 of Snakefile
run:
    urllib.request.urlretrieve(config['variant_expr_url'], output.file)
Snakemake, from line 451 of Snakefile
run:
    urllib.request.urlretrieve(config['variant_bind_url'], output.file)
Snakemake, from line 458 of Snakefile
