HIV immunoadaptation (HIVIA) tool

public public 1yr ago 0 bookmarks

This is the code basis for the paper 'Insights to HIV-1 coreceptor usage by estimating HLA adaptation using Bayesian generalized linear mixed models'. Since HLA information contains private sensitive data that we are not allowed to publish, we can't publish neihter the training HLA data nor the final models.

Content

Snakemake pipeline

The repository contains

  • a pipeline for learning HLA adaptation based on the p24 protein and the HLA I and II alleles, as well as clinical factors like age, sex, and ethnicity
 ./Snakefile
  • a prediction pipeline for predicting the HLA adaptation based on the HLA profile and the p24 sequence, as well as clinical factors.
./Snakefile_predict

The folder ./rules contains the different parts of the pipeline. The folder ./scripts contains the source code for the rules.

Config files

For each pipeline there is a corresponding config file:

./config.yml
./config_predict.yml

Environments

All environments are stored in the directory ./envs . The conda environment is defined in the file

./envs/conda/HIVIA.yml

The Snakemake environment is defined in the directory

./envs/snakemake/

There are different environments depending on the computational environment (GRID Engine, SLURM, cluster, etc. )

Run pipepline

The command to run the pipeline (on a cluster) is as follows

snakemake -n --configfile config.yml --profile envs/snakemake/[cluster] --nolock

This is a dry run (option -n). Remove the option -n to execute the workflow.

Code for manuscript figures and tables

The file ./notebooks/HIVIA.R contains the code to produce all figures and tables and p-values in the paper.

Key idea

The key elements of this project is to compute the adaptation score of a viral sequence based on its HLA profile, age, sex, and ethnicity. The adaptation score is computed as follows:

For each frequent variant site si, two Bayesian generalized linear mixed models are computed:

  • HLA model

    • age

    • sex

    • ethnicity

    • phylogeny

    • HLA information (binarized)

  • baseline model

    • age

    • sex

    • ethnicity

    • phylogeny

The modeling is performed with the script ./scripts/modeling/fit_glmm.R . The computation of the adaptation score is performed in the script ./scripts/eval/compute_adaptation.R. The call for fitting the HLA model using Bayesian GLMM is the following:

fit<- brm(formula = 'y ~ HLA_allele_1 + ... + HLA_allele_i + ... HLA_allele_n + age + sex + ethnicity + (1|patient_id)', cov_ranef = list(patient_id = ape::vcv.phylo(tree)) , data = df, family = 'categorical', prior = c(set_prior('horseshoe(df = 1, par_ratio = 0.01)', class="b")), seed = 100, control = list(adapt_delta = 0.99))

Code Snippets

19
20
shell:
  'Rscript {params.script_dir}/modeling/stratified_kCV_folds.R -i {input.df} -s {wildcards.seed} -d {params.dir} -k {params.k} --group {params.group} --log {log}'
30
31
shell: 
  "Rscript {params.script_dir}/modeling/get_rand_test.R -i {input.rand_df} -t {input.test_df} -o {output} --cv cv" 
44
45
46
47
48
49
50
run:
  hla = ",".join(params.hla)
  exec = 'Rscript {params.script_dir}/modeling/convert_train_hla2binary.R'
  exec += ' -i {input} -c {wildcards.hla_cutoff} -o {output}'
  exec += ' --hla ' + hla
  exec += ' -l {log}'
  shell(exec)
63
64
65
66
run:
  hla = ",".join(params.hla)
  exec = 'Rscript {params.script_dir}/eval/convert_test_hla2binary.R -i {input.df} --hla ' + hla + ' -o {output}'
  shell(exec)
 93
 94
 95
 96
 97
 98
 99
100
101
102
run:
  import os
  my_env=dict(os.environ)
  ENV_VARS=['HOME', 'TMPDIR','TMP','HOSTNAME','PATH','SGE_STDERR_PATH', 'SGE_O_WORKDIR' ,'SGE_O_HOST','SGE_O_HOME', 'PYTHONPATH']
  #with open(log[0], 'w') as l:
  #  [l.write(env_var+":"+ my_env.get(env_var, "")+"\n") for env_var in ENV_VARS]
  for key, value in my_env.items():
    print(key, value)

  fixed = ",".join(params.fixed)
130
131
shell: 
  "Rscript {params.script_dir}/eval/CV_classify_SAAV.R -f {input.fit} -t {input.test} -r {input.rand} -o {output} --model {wildcards.model} --model_version {wildcards.model_version} -p {wildcards.protein} -s {wildcards.saav} --seed {wildcards.seed}"
156
157
158
run:
	with open(output[0], 'w') as f: 
		f.write('\n'.join(input))
167
168
169
run:
  exec = "Rscript {params.script_dir}/eval/compute_adaptation_score.R -i {input} --output_adapt {output} --output_pred {params.output_pred} --outcome_type {wildcards.outcome_type} --rand"
  shell(exec)
10
11
12
13
14
run:
  path = wildcards.run_id + '/'+'models/' +  wildcards.protein + '/' + wildcards.outcome_type +'/final_models/' + wildcards.model_version + '/*'
  SAAVs_paths=glob.glob(path)
  SAAVs = [output.split('/')[-1] for output in SAAVs_paths]
  pd.DataFrame(SAAVs,columns=['sites']).to_csv(output[0], index = None)
27
28
29
30
31
32
33
run:
  hla = ",".join(params.hla)
  exec = 'Rscript {params.script_dir}/modeling/convert_train_hla2binary.R'
  exec += ' -i {input} -c {wildcards.hla_cutoff} -o {output}'
  exec += ' --hla ' + hla
  exec += ' -l {log}'
  shell(exec)
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
run: 
  import os
  my_env=dict(os.environ)
  ENV_VARS=['HOME', 'TMPDIR','TMP','HOSTNAME','PATH','SGE_STDERR_PATH', 'SGE_O_WORKDIR' ,'SGE_O_HOST','SGE_O_HOME', 'PYTHONPATH']
  for key, value in my_env.items():
    print(key, value)

  fixed = ",".join(params.fixed)
  #random = ",".join(params.random)
  #outcome = ",".join(params.outcome)

  # build fit
  exec = "Rscript {params.script_dir}/modeling/fit_glmm.R"
  exec += " -i {input.df} -t {input.tree} --method {params.method}"
  exec += " --model {wildcards.model} -p {params.prior} -s {wildcards.seed}"
  exec += " --adapt_delta {params.adapt_delta} -o {output}"
  exec += " -f " + fixed
  exec += " -r {params.random}"
  exec += " -y {params.outcome}"
  exec += " --coeffs {params.coeffs}"
  exec += " --outcome_type {wildcards.outcome_type}"
  exec += " --model_version {wildcards.model_version}"
  exec += " --protein {wildcards.protein}"
  exec += " --saav {wildcards.saav}"
  exec += " --ref_map {params.ref_map}"
  exec += " --par_ratio {params.par_ratio}"
  shell(exec)
16
17
shell:
  "Rscript {params.script_dir}/prep/coreceptor_data/clean_coreceptor_data.R -i {input} -o {output} -l {log}"
SnakeMake From line 16 of rules/prep.smk
31
32
shell:
	"Rscript {params.script_dir}/prep/sequence_data/subtype_sequences.R -i {input.sequences} -o {output.subtypes} -a {params.api}"
SnakeMake From line 31 of rules/prep.smk
42
43
shell:
  'Rscript {params.script_dir}/prep/sequence_data/clean_subtype_table.R -s {input.subtypes} -o {output.cleaned_subtypes}'
SnakeMake From line 42 of rules/prep.smk
95
96
shell:
  "Rscript {params.script_dir}/prep/sequence_data/get_reference_sequence.R -a {output.alignment_file} -o {output.refseq_file} -t {params.alignment_type} -s {params.suborganism} --subtype {params.subtype} -d {params.region_definition} -r {params.region} -b {params.basetype} -y {params.year} -f {params.format} -q {params.query} --api_source {params.api_source}"
SnakeMake From line 95 of rules/prep.smk
109
110
shell:
  "mafft --add {input.selected_sequences} --quiet {input.reference_sequence}  > {output.alignment}"
132
133
shell:
  "Rscript {params.script_dir}/prep/sequence_data/get_reference_sequence2.R -i {params.id} -n {params.protein} -d {params.db} -o {output}"
SnakeMake From line 132 of rules/prep.smk
152
153
shell:
  "Rscript {params.script_dir}/prep/sequence_data/get_codon_alignment.R -i {input.dna_alignment} -f {params.frame} -c {params.compensating_num} -t {params.input_type} --output_format {params.output_format} --output_seqs_type {params.output_seqs_type} -n {params.protein} --output_dir {params.output_dir} --api_source {params.api_source}"
SnakeMake From line 152 of rules/prep.smk
167
168
169
170
171
172
173
174
175
run: 
  exec = "Rscript {params.script_dir}/prep/sequence_data/clean_codon_align.R "
  exec+= "-i {input.codon_align_aa} "
  exec+= "-r {output.ref_seq} "
  exec+= "-a {output.alignment} "
  exec+= "-l {log} "
  exec+= "-s {output.seq_freq_mat} "
  exec+= "-p {output.pos_freq_mat} "
  shell(exec)
SnakeMake From line 167 of rules/prep.smk
188
189
shell:
  "mafft --add {input.ref_seq_aa} --quiet --mapout {input.codon_align_aa_ref_seq} >  {output.align}"
199
200
shell: 
  'Rscript {params.script_dir}/prep/sequence_data/remove_refseq.R -i {input} -o {output} -r {params.refseq_name}'
SnakeMake From line 199 of rules/prep.smk
219
220
shell:
  "Rscript {params.script_dir}/prep/sequence_data/make_SAAV_table.R -a {input.alignment} -r {input.ref_map} -d {wildcards.saav_cutoff} -o {output} --outcome_type {wildcards.outcome_type}" 
SnakeMake From line 219 of rules/prep.smk
264
265
266
267
268
269
270
271
272
273
274
275
276
run:
  fixed = ",".join(params.fixed)
  hla = ",".join(params.hla)

  exec = 'Rscript {params.script_dir}/modeling/select_variables.R'
  exec += ' -i {input} -o {output}'
  exec += ' -f ' + fixed
  exec += ' -r {params.random}'
  exec += ' -y {params.outcome}'
  exec += ' --id  {params.id_column}'
  exec += ' --hla ' + hla

  shell(exec)
SnakeMake From line 264 of rules/prep.smk
291
292
shell:
  'raxmlHPC -s {input} -w $(pwd)/{params.dir} -n {params.run_id} -m {params.model} -p {params.seed} --silent'
ShowHide 22 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/annahake/HIVIA_TOOL
Name: hivia_tool
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: GNU General Public License v2.0
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...