Mutational antigenic profiling of SARS-CoV-2 RBD against REGN-COV2 and LY-CoV016
Analysis of mutational antigenic profiling of barcoded codon variants of SARS-CoV-2 RBD against the antibodies in REGN-COV2 and LY-CoV016.
Study and analysis by Tyler Starr, Allie Greaney, Jesse Bloom, and co-authors. Details in the paper here.
Summary of workflow and results
For a summary of the workflow and links to key results files, click here. Reading this summary is the best way to understand the analysis.
Running the analysis
The analysis consists of three components, all of which are contained in this repository:
- Instructions to build the computing environment.
- The required input data.
- The computer code and a Snakemake file to run it.
Configure .git to not track Jupyter notebook metadata
To simplify git tracking of Jupyter notebooks, we have added the filter described here to strip notebook metadata to .gitattributes and .gitconfig. The first time you check out this repo, run the following command to use this configuration (see here):
git config --local include.path ../.gitconfig
Then don't worry about it anymore.
Build the computing environment
First, set up the computing environment, which is partially done via conda. Ensure you have conda installed; if not, install it via Miniconda as described here.
The fully pinned environment is specified in environment.yml, and an unpinned version is in environment_unpinned.yml.
If the environment already exists, you can activate it with:
conda activate SARS-CoV-2-RBD_MAP
If you need to build the environment, then first build it with:
conda env create -f environment.yml
Then activate it as above.
Input data
The input data are specified in ./data/; see the README in that subdirectory for more details.
Note that this pipeline assumes you are using the FASTQ files on the Fred Hutch server. Otherwise, you need to download them from the SRA; they are BioSample SAMN16850904 under BioProject PRJNA639956.
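If you are not on the Hutch server and do need to get the FASTQ files from the SRA, one possible approach (a sketch only, not part of this pipeline; it assumes the pysradb Python package and the SRA Toolkit's fasterq-dump are installed) is to look up the run accessions for the BioProject and download them:

```python
# Sketch only: fetch run accessions for the BioProject listed above and
# download the FASTQs. Assumes `pysradb` and the SRA Toolkit are installed;
# the metadata column name may differ between pysradb versions.
import subprocess
from pysradb.sraweb import SRAweb

db = SRAweb()
metadata = db.sra_metadata('PRJNA639956')  # BioProject given in this README

for run in metadata['run_accession']:
    subprocess.run(['fasterq-dump', run, '--outdir', 'SRA_fastqs'], check=True)
```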
Running the code
The analysis consists of Jupyter notebooks in the top-level directory along with some additional code in Snakefile. You can run the analysis by using Snakemake to run Snakefile, specifying the conda environment, as in:
snakemake --use-conda --conda-prefix /fh/fast/bloom_j/software/miniconda3/envs/SARS-CoV-2-RBD_MAP -R make_summary -j 1
However, you probably want to use a cluster to help with the computationally intensive parts of the analysis. To run using the cluster configuration for the Fred Hutch server, simply run the bash script run_Hutch_cluster.bash, which executes Snakefile in a way that takes advantage of the Hutch server resources. You likely want to submit run_Hutch_cluster.bash itself to the cluster (since it takes a while to run) with:
sbatch -t 7-0 run_Hutch_cluster.bash
Configuring the analysis
The configuration for the analysis is specified in config.yaml. This file defines key variables for the analysis, and should be relatively self-explanatory. You should modify the analysis by changing this configuration file; do not hard-code crucial experiment-specific variables within the notebooks or Snakefile.
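For orientation, here is a minimal sketch (not code copied from the repository) of how the configuration values are consumed: the Snakefile loads config.yaml into the global config dict via Snakemake's configfile directive, and a notebook can read the same file with PyYAML:

```python
# Minimal sketch, assuming only that config.yaml is valid YAML.
# In Snakefile, the equivalent is the directive:
#     configfile: 'config.yaml'
# after which rules access entries such as config['codon_variant_table_url']
# (a key used by the download rules quoted in the Code Snippets below).
import yaml

with open('config.yaml') as f:
    config = yaml.safe_load(f)

print(config['codon_variant_table_url'])
```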
In general:
- add new samples to data/barcode_runs.csv
- specify new combinations of samples for escape-profile plotting via data/escape_profiles_config.yaml
- specify combinations of samples for multi-dimensional scaling via data/mds_config.yaml
- to output structural mappings, add information to data/output_pdbs_config.yaml about which conditions should be mapped to which PDBs, and add information to data/structural_annotation_config.yaml about which chains to analyze for defining structural contacts between RBD and antibody or other ligands.
- to write supplementary data and dms_view input data, set the make_supp_data flag in data/escape_profiles_config.yaml to true; dms_view input files will then be written to results/supp_data for the condition sets with make_supp_data as true, for those conditions with PDB mappings in data/output_pdbs_config.yaml (see the sketch after this list).
- to update the GISAID sequence set used to look at natural mutations, update the file pointed to by gisaid_spikes in config.yaml as described in data/README.md.
- to plot information about viral escape selections, add them to data/escape_selection_results.yaml.
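As a rough illustration of the supplementary-data point above, the snippet below (a sketch that assumes each top-level entry in data/escape_profiles_config.yaml is a condition set whose settings may include a boolean make_supp_data flag; see the data/ README for the exact schema) lists which condition sets would get dms_view output written to results/supp_data:

```python
# Sketch only: report which condition sets request supplementary / dms_view output.
# Assumes each top-level key in the YAML maps to a dict of settings that may
# contain a boolean `make_supp_data` flag.
import yaml

with open('data/escape_profiles_config.yaml') as f:
    escape_profiles_config = yaml.safe_load(f)

for condition_set, settings in escape_profiles_config.items():
    if settings.get('make_supp_data', False):
        print(f"{condition_set}: dms_view input will be written to results/supp_data")
```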
Cluster configuration
There is a cluster configuration file cluster.yaml that configures Snakefile for the Fred Hutch cluster. The run_Hutch_cluster.bash script uses this configuration to run Snakefile. If you are using a different cluster than the Fred Hutch one, you will need to modify the cluster configuration file.
Notebooks that perform the analysis
The Jupyter notebooks that perform most of the analysis are in this top-level directory with the extension *.ipynb. These notebooks read the key configuration values from config.yaml.
There is also a ./scripts/ subdirectory with related scripts.
The notebooks need to be run in the order described in the workflow and results summary. This will occur automatically if you run them via Snakefile as described above.
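That ordering is enforced by the input/output dependencies between the rules in Snakefile. As a minimal sketch of what one notebook-running rule looks like (the rule name and file paths here are illustrative, not copied from the repo), each rule declares the files it needs and produces, and then calls scripts/run_nb.py:

```python
# Illustrative notebook rule; the real rules in Snakefile differ in names and
# input/output files but share this shape and the same shell command
# (see the Code Snippets section below).
rule counts_to_scores:
    input:
        variant_counts='results/counts/variant_counts.csv'  # illustrative path
    output:
        nb_markdown='results/summary/counts_to_scores.md'    # illustrative path
    params:
        nb='counts_to_scores.ipynb'
    shell:
        "python scripts/run_nb.py {params.nb} {output.nb_markdown}"
```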
Results
Results are placed in the ./results/ subdirectory. Many of the files created in this subdirectory are not tracked in the git repo as they are very large. However, key results files are tracked, as well as a summary that shows the code and results. Click here to see that summary.
The large results files are tracked via git-lfs. This requires git-lfs to be installed, which it is in the conda environment specified by environment.yml.
The following commands were then run:
git lfs install
You may need to run this if you are tracking these files and haven't installed git-lfs in your user account.
Then the large results files were added for tracking with:
git lfs track <FILENAME>
Updating the conda environment
environment.yml contains a fully pinned conda environment. An environment without all of the versions pinned is in environment_unpinned.yml. If you need to update the environment, the suggested way to do it is to add the new requirement to environment_unpinned.yml, then build, activate, and export that environment. Those last three steps can be done with the following commands:
conda env create -f environment_unpinned.yml
conda activate SARS-CoV-2-RBD_MAP
conda env export > environment.yml
Creating "subset" repos and uploading data to the SRA
Currently this repo contains analyses of many antibodies and sera, and should remain private since collaborators do not want all of these data to be public.
For papers, you can make a public "subset" repo by following the instructions in ./subset_data/. After making a subset repo, you can upload sequencing data to the Sequence Read Archive (SRA) following the instructions in ./SRA_upload/.
Code Snippets
```python
import argparse
import os
import subprocess


def main():
    parser = argparse.ArgumentParser(
            description='Run Jupyter notebook, create Markdown output')
    parser.add_argument('nb', help='Name of Jupyter notebook.')
    parser.add_argument('md_output', help='Name of created Markdown output.')
    args = parser.parse_args()

    nb = args.nb
    if not (os.path.isfile(nb) and os.path.splitext(nb)[-1] == '.ipynb'):
        raise IOError(f"not a valid Jupyter notebook: {nb}")

    md_output = args.md_output
    if os.path.splitext(md_output)[-1] not in {'.md', '.markdown'}:
        raise IOError(f"not a valid Markdown file: {md_output}")
    if os.path.isfile(md_output):
        os.remove(md_output)

    # execute the notebook in place
    subprocess.check_call(['jupyter', 'nbconvert',
                           '--to', 'notebook',
                           '--execute',
                           '--inplace',
                           '--ExecutePreprocessor.timeout=-1',
                           nb,
                           ])

    # convert the executed notebook to Markdown
    subprocess.check_call(['jupyter', 'nbconvert',
                           '--output-dir', os.path.dirname(md_output),
                           '--output', os.path.basename(md_output),
                           '--to', 'markdown',
                           nb,
                           ])


if __name__ == '__main__':
    main()
```
```python
run:
    def path(f):
        """Get path relative to `summary_dir`."""
        return os.path.relpath(f, config['summary_dir'])
    with open(output.summary, 'w') as f:
        f.write(textwrap.dedent(f"""
            # Summary

            Analysis run by [Snakefile]({path(workflow.snakefile)})
            using [this config file]({path(workflow.configfiles[0])}).
            See the [README in the top directory]({path('README.md')})
            for details.

            Here is the rule graph of the computational workflow:
            ![{path(input.rulegraph)}]({path(input.rulegraph)})

            Here is the Markdown output of each notebook in the workflow:

            1. Get codon-variant-table from [here]({config['codon_variant_table_url']}).

            2. Count variants and then [aggregate counts]({path(input.aggregate_variant_counts)}) to create [variant counts file]({path(input.variant_counts)}).

            3. [Analyze sequencing counts to cells ratio]({path(input.counts_to_cells_ratio)}); this prints a list of any samples where this ratio is too low. Also creates [a CSV]({path(input.counts_to_cells_csv)}) with the sequencing counts, number of sorted cells, and ratios for all samples.

            4. [Escape scores from variant counts]({path(input.counts_to_scores)}).

            5. [Escape fractions for mutations and homologs]({path(input.scores_to_frac_escape)}); creating [mutation escape fraction file]({path(input.escape_fracs)}) and [homolog escape fraction file]({path(input.escape_fracs_homologs)}).

            6. [Call sites of strong escape]({path(input.call_strong_escape_sites)}), and write to [a CSV file]({path(input.strong_escape_sites)}).

            7. Plot [escape profiles]({path(input.escape_profiles)}).

            9. [Multidimensional scaling]({path(input.mds_escape_profiles)}) on escape profiles.

            10. Map escape profiles to ``*.pdb`` files using [this notebook]({path(input.output_pdbs)})

            11. Output a list of RBD structural contacts for each antibody with a high-resolution structure using [this notebook]({path(input.structural_contacts)})

            12. [Count mutations in GISAID RBD sequences]({path(input.gisaid_rbd_mutations)}) to create [this counts file]({path(input.gisaid_mutation_counts)}).

            13. [Analyze GISAID mutations at sites of escape]({path(input.natural_mutations)}).

            14. [Escape from binding by homologs]({path(input.homolog_escape)}).

            15. [Analyze viral escape selections]({path(input.escape_selections)}).

            16. [Make supplementary data files]({path(input.make_supp_data)}), which are [here]({path(config['supp_data_dir'])}). These include `dms-view` input files.

            17. Make custom plots for this project that fall out of the core pipeline, summarized in [this notebook]({path(input.custom_plots)})

            """
            ).strip())
```
```python
shell:
    "snakemake --forceall --rulegraph | dot -Tsvg > {output}"
```
```python
shell:
    """
    R -e \"rmarkdown::render(input=\'{params.nb}\')\";
    mv {params.md} {output.md};
    mv {params.md_files} {output.md_files}
    """
```
```python
shell:
    """
    R -e \"rmarkdown::render(input=\'{params.nb}\')\";
    mv {params.md} {output.md}
    """
```
```python
shell:
    # this same shell command is used by each of the notebook-running rules
    "python scripts/run_nb.py {params.nb} {output.nb_markdown}"
```
```python
run:
    # parse sample and library from `sample_lib` wildcard
    lib = params.sample_lib.split('_')[-1]
    sample = params.sample_lib[: -len(lib) - 1]
    assert sample == (barcode_runs_expandR1
                      .set_index('sample_lib')
                      .at[params.sample_lib, 'sample']
                      )
    assert lib == (barcode_runs_expandR1
                   .set_index('sample_lib')
                   .at[params.sample_lib, 'library']
                   )
    # initialize `CodonVariantTable` (used to get valid barcodes)
    wt_seqrecord = Bio.SeqIO.read(input.wt_seq, 'fasta')
    geneseq = str(wt_seqrecord.seq)
    primary_target = wt_seqrecord.name
    variants = dms_variants.codonvarianttable.CodonVariantTable(
                    geneseq=geneseq,
                    barcode_variant_file=input.variant_table,
                    substitutions_are_codon=True,
                    substitutions_col='codon_substitutions',
                    primary_target=primary_target)
    # initialize `IlluminaBarcodeParser`
    parser = dms_variants.illuminabarcodeparser.IlluminaBarcodeParser(
                    valid_barcodes=variants.valid_barcodes(lib),
                    **config['illumina_barcode_parser_params'])
    # parse barcodes
    counts, fates = parser.parse(input.r1s,
                                 add_cols={'library': lib, 'sample': sample})
    # write files
    counts.to_csv(output.counts, index=False)
    fates.to_csv(output.fates, index=False)
```
```python
run:
    urllib.request.urlretrieve(config['codon_variant_table_url'],
                               output.codon_variant_table)
```
```python
run:
    urllib.request.urlretrieve(config['mut_bind_expr_url'], output.file)
```
```python
run:
    urllib.request.urlretrieve(config['variant_expr_url'], output.file)
```
```python
run:
    urllib.request.urlretrieve(config['variant_bind_url'], output.file)
```