HIV immunoadaptation (HIVIA) tool

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

This is the code basis for the paper 'Insights to HIV-1 coreceptor usage by estimating HLA adaptation using Bayesian generalized linear mixed models'. Since HLA information contains private sensitive data that we are not allowed to publish, we can't publish neihter the training HLA data nor the final models.

Content

Snakemake pipeline

The repository contains

a pipeline for learning HLA adaptation based on the p24 protein and the HLA I and II alleles, as well as clinical factors like age, sex, and ethnicity

 ./Snakefile

a prediction pipeline for predicting the HLA adaptation based on the HLA profile and the p24 sequence, as well as clinical factors.

./Snakefile_predict

The folder ./rules contains the different parts of the pipeline. The folder ./scripts contains the source code for the rules.

Config files

For each pipeline there is a corresponding config file:

./config.yml
./config_predict.yml

Environments

All environments are stored in the directory ./envs . The conda environment is defined in the file

./envs/conda/HIVIA.yml

The Snakemake environment is defined in the directory

./envs/snakemake/

There are different environments depending on the computational environment (GRID Engine, SLURM, cluster, etc. )

Run pipepline

The command to run the pipeline (on a cluster) is as follows

snakemake -n --configfile config.yml --profile envs/snakemake/[cluster] --nolock

This is a dry run (option -n). Remove the option -n to execute the workflow.

Code for manuscript figures and tables

The file ./notebooks/HIVIA.R contains the code to produce all figures and tables and p-values in the paper.

Key idea

The key elements of this project is to compute the adaptation score of a viral sequence based on its HLA profile, age, sex, and ethnicity. The adaptation score is computed as follows:

For each frequent variant site si, two Bayesian generalized linear mixed models are computed:

HLA model
- age
- sex
- ethnicity
- phylogeny
- HLA information (binarized)
baseline model
- age
- sex
- ethnicity
- phylogeny

The modeling is performed with the script ./scripts/modeling/fit_glmm.R . The computation of the adaptation score is performed in the script ./scripts/eval/compute_adaptation.R. The call for fitting the HLA model using Bayesian GLMM is the following:

fit<- brm(formula = 'y ~ HLA_allele_1 + ... + HLA_allele_i + ... HLA_allele_n + age + sex + ethnicity + (1|patient_id)', cov_ranef = list(patient_id = ape::vcv.phylo(tree)) , data = df, family = 'categorical', prior = c(set_prior('horseshoe(df = 1, par_ratio = 0.01)', class="b")), seed = 100, control = list(adapt_delta = 0.99))

Code Snippets

shell:
  'Rscript {params.script_dir}/modeling/stratified_kCV_folds.R -i {input.df} -s {wildcards.seed} -d {params.dir} -k {params.k} --group {params.group} --log {log}'

SnakeMake From line 19 of rules/CV_model.smk

shell: 
  "Rscript {params.script_dir}/modeling/get_rand_test.R -i {input.rand_df} -t {input.test_df} -o {output} --cv cv" 

SnakeMake From line 30 of rules/CV_model.smk

run:
  hla = ",".join(params.hla)
  exec = 'Rscript {params.script_dir}/modeling/convert_train_hla2binary.R'
  exec += ' -i {input} -c {wildcards.hla_cutoff} -o {output}'
  exec += ' --hla ' + hla
  exec += ' -l {log}'
  shell(exec)

SnakeMake From line 44 of rules/CV_model.smk

run:
  hla = ",".join(params.hla)
  exec = 'Rscript {params.script_dir}/eval/convert_test_hla2binary.R -i {input.df} --hla ' + hla + ' -o {output}'
  shell(exec)

SnakeMake From line 63 of rules/CV_model.smk

run:
  import os
  my_env=dict(os.environ)
  ENV_VARS=['HOME', 'TMPDIR','TMP','HOSTNAME','PATH','SGE_STDERR_PATH', 'SGE_O_WORKDIR' ,'SGE_O_HOST','SGE_O_HOME', 'PYTHONPATH']
  #with open(log[0], 'w') as l:
  #  [l.write(env_var+":"+ my_env.get(env_var, "")+"\n") for env_var in ENV_VARS]
  for key, value in my_env.items():
    print(key, value)

  fixed = ",".join(params.fixed)

SnakeMake From line 93 of rules/CV_model.smk

shell: 
  "Rscript {params.script_dir}/eval/CV_classify_SAAV.R -f {input.fit} -t {input.test} -r {input.rand} -o {output} --model {wildcards.model} --model_version {wildcards.model_version} -p {wildcards.protein} -s {wildcards.saav} --seed {wildcards.seed}"

SnakeMake From line 130 of rules/CV_model.smk

run:
	with open(output[0], 'w') as f: 
		f.write('\n'.join(input))

SnakeMake From line 156 of rules/CV_model.smk

run:
  exec = "Rscript {params.script_dir}/eval/compute_adaptation_score.R -i {input} --output_adapt {output} --output_pred {params.output_pred} --outcome_type {wildcards.outcome_type} --rand"
  shell(exec)

SnakeMake From line 167 of rules/CV_model.smk

run:
  path = wildcards.run_id + '/'+'models/' +  wildcards.protein + '/' + wildcards.outcome_type +'/final_models/' + wildcards.model_version + '/*'
  SAAVs_paths=glob.glob(path)
  SAAVs = [output.split('/')[-1] for output in SAAVs_paths]
  pd.DataFrame(SAAVs,columns=['sites']).to_csv(output[0], index = None)

SnakeMake From line 10 of rules/final_models.smk

run:
  hla = ",".join(params.hla)
  exec = 'Rscript {params.script_dir}/modeling/convert_train_hla2binary.R'
  exec += ' -i {input} -c {wildcards.hla_cutoff} -o {output}'
  exec += ' --hla ' + hla
  exec += ' -l {log}'
  shell(exec)

SnakeMake From line 27 of rules/final_models.smk

run: 
  import os
  my_env=dict(os.environ)
  ENV_VARS=['HOME', 'TMPDIR','TMP','HOSTNAME','PATH','SGE_STDERR_PATH', 'SGE_O_WORKDIR' ,'SGE_O_HOST','SGE_O_HOME', 'PYTHONPATH']
  for key, value in my_env.items():
    print(key, value)

  fixed = ",".join(params.fixed)
  #random = ",".join(params.random)
  #outcome = ",".join(params.outcome)

  # build fit
  exec = "Rscript {params.script_dir}/modeling/fit_glmm.R"
  exec += " -i {input.df} -t {input.tree} --method {params.method}"
  exec += " --model {wildcards.model} -p {params.prior} -s {wildcards.seed}"
  exec += " --adapt_delta {params.adapt_delta} -o {output}"
  exec += " -f " + fixed
  exec += " -r {params.random}"
  exec += " -y {params.outcome}"
  exec += " --coeffs {params.coeffs}"
  exec += " --outcome_type {wildcards.outcome_type}"
  exec += " --model_version {wildcards.model_version}"
  exec += " --protein {wildcards.protein}"
  exec += " --saav {wildcards.saav}"
  exec += " --ref_map {params.ref_map}"
  exec += " --par_ratio {params.par_ratio}"
  shell(exec)

SnakeMake From line 57 of rules/final_models.smk

shell:
  "Rscript {params.script_dir}/prep/coreceptor_data/clean_coreceptor_data.R -i {input} -o {output} -l {log}"

SnakeMake From line 16 of rules/prep.smk

shell:
	"Rscript {params.script_dir}/prep/sequence_data/subtype_sequences.R -i {input.sequences} -o {output.subtypes} -a {params.api}"

SnakeMake From line 31 of rules/prep.smk

shell:
  'Rscript {params.script_dir}/prep/sequence_data/clean_subtype_table.R -s {input.subtypes} -o {output.cleaned_subtypes}'

SnakeMake From line 42 of rules/prep.smk

shell:
  "Rscript {params.script_dir}/prep/sequence_data/get_reference_sequence.R -a {output.alignment_file} -o {output.refseq_file} -t {params.alignment_type} -s {params.suborganism} --subtype {params.subtype} -d {params.region_definition} -r {params.region} -b {params.basetype} -y {params.year} -f {params.format} -q {params.query} --api_source {params.api_source}"

SnakeMake From line 95 of rules/prep.smk

shell:
  "mafft --add {input.selected_sequences} --quiet {input.reference_sequence}  > {output.alignment}"

SnakeMake MAFFT API (EBI) From line 109 of rules/prep.smk

shell:
  "Rscript {params.script_dir}/prep/sequence_data/get_reference_sequence2.R -i {params.id} -n {params.protein} -d {params.db} -o {output}"

SnakeMake From line 132 of rules/prep.smk

shell:
  "Rscript {params.script_dir}/prep/sequence_data/get_codon_alignment.R -i {input.dna_alignment} -f {params.frame} -c {params.compensating_num} -t {params.input_type} --output_format {params.output_format} --output_seqs_type {params.output_seqs_type} -n {params.protein} --output_dir {params.output_dir} --api_source {params.api_source}"

SnakeMake From line 152 of rules/prep.smk

run: 
  exec = "Rscript {params.script_dir}/prep/sequence_data/clean_codon_align.R "
  exec+= "-i {input.codon_align_aa} "
  exec+= "-r {output.ref_seq} "
  exec+= "-a {output.alignment} "
  exec+= "-l {log} "
  exec+= "-s {output.seq_freq_mat} "
  exec+= "-p {output.pos_freq_mat} "
  shell(exec)

SnakeMake From line 167 of rules/prep.smk

shell:
  "mafft --add {input.ref_seq_aa} --quiet --mapout {input.codon_align_aa_ref_seq} >  {output.align}"

SnakeMake MAFFT API (EBI) From line 188 of rules/prep.smk

shell: 
  'Rscript {params.script_dir}/prep/sequence_data/remove_refseq.R -i {input} -o {output} -r {params.refseq_name}'

SnakeMake From line 199 of rules/prep.smk

shell:
  "Rscript {params.script_dir}/prep/sequence_data/make_SAAV_table.R -a {input.alignment} -r {input.ref_map} -d {wildcards.saav_cutoff} -o {output} --outcome_type {wildcards.outcome_type}" 

SnakeMake From line 219 of rules/prep.smk

run:
  fixed = ",".join(params.fixed)
  hla = ",".join(params.hla)

  exec = 'Rscript {params.script_dir}/modeling/select_variables.R'
  exec += ' -i {input} -o {output}'
  exec += ' -f ' + fixed
  exec += ' -r {params.random}'
  exec += ' -y {params.outcome}'
  exec += ' --id  {params.id_column}'
  exec += ' --hla ' + hla

  shell(exec)

SnakeMake From line 264 of rules/prep.smk

shell:
  'raxmlHPC -s {input} -w $(pwd)/{params.dir} -n {params.run_id} -m {params.model} -p {params.seed} --silent'

SnakeMake raxml-ng From line 291 of rules/prep.smk

ShowHide 22 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/annahake/HIVIA_TOOL

Name: hivia_tool

Version: 1

Badge:

Insert copied code into your website to add a link to this workflow.

License: GNU General Public License v2.0

Keywords:

raxml-ng MAFFT API (EBI) Snakemake

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free