3DGB is a workflow to build 3D models of genomes from HiC data

public 1yr ago Version: v1.0 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic
Lack of a description for a new keyword tool/pypi/iced .
Lack of a description for a new keyword tool/pypi/pastis .

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

3D structure of the Neurospora crassa genome at 50 kb resolution

3D genome builder (3DGB) is a workflow to build 3D models of genomes from HiC raw data and to integrate omics data on the produced models for further visual exploration. 3DGB bundles HiC-Pro , PASTIS and custom Python scripts into a unified Snakemake workflow with limited inputs (see Preparing Required Files ). 3DGB produces annotated 3D models of genome in PDB and G3D formats.

Download this repository

git clone https://github.com/data-fun/3d-genome-builder.git
cd 3d-genome-builder

Install dependencies

Singularity

Download the latest version here

Install Singularity:

sudo apt install -y ./singularity-container_3.8.7_amd64.deb

Verify version:

$ singularity --version
singularity version 3.8.7

Conda environment

Install conda .

Install mamba:

conda install mamba -n base -c conda-forge

Create conda environment and install dependendies:

mamba env create -f binder/environment.yml

Load conda environment:

conda activate 3DGB

Download HiC-Pro Singularity image

wget --ciphers=DEFAULT:@SECLEVEL=1 https://zerkalo.curie.fr/partage/HiC-Pro/hicpro_3.1.0_ubuntu.img -P images

Verify HiC-Pro version with:

$ singularity exec images/hicpro_3.1.0_ubuntu.img HiC-Pro --version
[...]
HiC-Pro version 3.1.0

and bowtie2 version:

$ singularity exec images/hicpro_3.1.0_ubuntu.img bowtie2 --version 2>/dev/null | head -n 1
/usr/local/conda/envs/hicpro/bin/bowtie2-align-s version 2.4.4

Prepare required files

Create the config file

Create and edit a configuration file in yaml format. See for instance the template config_template.yml

Add the reference genome

The reference genome fasta file must be located in WORKING_DIR/genome.fasta where WORKING_DIR is the name of the working directory as specified in your config file.

Add FASTQ files (optional)

If you already have fastq files stored locally or some fastq files are not available on GEO or SRA, you can use these files providing they are in the proper directory structure:

3D structure of the chromosome 13 of S. cerevisiae at 5 kb resolution

WORKING_DIR/
├── fastq_files
│ ├── ID1
│ │ ├── ID1_R1.fastq.gz
│ │ └── ID1_R2.fastq.gz
│ ├── ID2
│ │ ├── ID2_R1.fastq.gz
│ │ └── ID2_R2.fastq.gz
│ ├── ID3
│ │ ├── ID3_R1.fastq.gz
│ │ └── ID3_R2.fastq.gz
│ └── ID4
│ ├── ID4_R1.fastq.gz
│ └── ID4_R2.fastq.gz
└── genome.fasta

WORKING_DIR is the name of the working directory as specified in your config file.
Paired-end fastq files are in the directory WORKING_DIR/fastq_files/IDx with IDx the identifier of the paired fastq files. Fastq identifiers are reported in the config file. Please note fastq files have to follow the pattern <sample ID>_R<1 or 2>.fastq.gz .

Note

Please strictly follow this file organization as it is required by the 3DGB workflow.

Build model

Run 3DGB:

snakemake --profile smk_profile -j 4 --configfile YOUR-CONFIG.yml

Note

Adapt YOUR-CONFIG.yml to the exact name of the config file you created.

Option -j 4 tells Snakemake to use up to 4 cores. If you are more cores available, you can increase this value ( e.g. -j 16 ).

Or with debugging options:

snakemake --profile smk_profile_debug -j 4 --configfile YOUR-CONFIG.yml --verbose

Depending on the number and size of fastq files, the 3D construction will take a couple of hours to run.

For troubleshooting, have a look to log files in WORKING_DIR/logs , where WORKING_DIR is the name of the working directory as specified in your config file.

Map quantitative values on the 3D model

To map quantitative values on the model run:

python ./scripts/map_parameter.py --pdb path/to/structure.pdb --bedgraph path/to/annotation.bedgraph --output path/to/output.pdb

Quantitative values should be formatted in a 4-column bedgraph file (chromosome/start/stop/value):

chr1	0	50000	116.959
chr1	50000	100000	48.4495
chr1	100000	150000	22.8726
chr1	150000	200000	84.3106
chr1	200000	250000	113.109

Each bead of the model will be assigned a quantitative value. The resolution in the bedgraph file should match the resolution used to build the model.

Get results

Upon completion, the WORKING_DIR should look like this:

WORKING_DIR/
├── contact_maps
├── dense_matrix
├── fastq_files
├── HiC-Pro
├── logs
├── pastis
├── sequence
└── structure

The following paths contain the most interesting results:

WORKING_DIR/contact_maps/*.png : contact maps.
WORKING_DIR/HiC-Pro/output/hic_results/pic/*/*.pdf : graphical summaries of read alignments produced by Hi-C Pro.
WORKING_DIR/pastis/structure_RESOLUTION.pdb : raw 3D models (in PDB format) produced by Pastis.
WORKING_DIR/structure/RESOLUTION/structure_cleaned.* : final (annotated) 3D models in PDB and G3D formats.

Note

WORKING_DIR is the name of the working directory as specified in your config file.

RESOLUTION is the resolution of the Hi-C data specified in the config file.

Examples

Visualize 3D model structures

To visualize 3D model structures (.pdb and .g3d files), follow this quick tutorial .

Build DAG graph

For visualization purpose, you can build the graph of all computational steps involved in the 3D construction of the genome.

snakemake --profile smk_profile --configfile YOUR-CONFIG.yml --rulegraph | dot -Tpdf > rules.pdf

where YOUR-CONFIG.yml should be replaced by the name of the config file you created.

With wildcards:

snakemake --profile smk_profile --configfile YOUR-CONFIG.yml --dag | dot -Tpdf > dag.pdf

Code Snippets

import argparse

from biopandas.pdb import PandasPdb


def get_cli_arguments():
    """Command line argument parser.

    Returns
    -------
    argparse.Namespace
        Object containing arguments.
    """
    parser = argparse.ArgumentParser(add_help=False)
    required = parser.add_argument_group("required arguments")
    optional = parser.add_argument_group("optional arguments")
    required.add_argument(
        "--input-pdb",
        action="store",
        type=str,
        help="Input PDB file containing the 3D structure of the genome.",
        required=True,
    )
    required.add_argument(
        "--output-pdb",
        action="store",
        type=str,
        help="Output PDB file containing the completed 3D structure of the genome.",
        required=True,
    )
    # Add help.
    optional.add_argument(
        "-h",
        "--help",
        action="help",
        default=argparse.SUPPRESS,
        help="Show this help message and exit.",
    )
    return parser.parse_args()


def interpolate_missing_coordinates(pdb_name_in, pdb_name_out):
    """Interpolate missing coordinates in PDB file containing a 3D genome structure.

    3D coordinates of the missing beads are interpolated one by one.

    Parameters
    ----------
    pdb_name_in : str
        PDB file containing the 3D structure of the genome
    pdb_name_out : str
        Output PDB file containing the annotated 3D structure of the genome
    """
    pdb = PandasPdb().read_pdb(pdb_name_in)
    atoms = pdb.df["ATOM"]
    print(f"Found {atoms.shape[0]} beads in {pdb_name_in}")
    # Interpolate missing coordinates chromosome by chromosome.
    coord_columns= ["x_coord", "y_coord", "z_coord"]
    for residue_number in atoms["residue_number"].unique():
        missing_beads = (
            atoms.query(f"residue_number == {residue_number}")
            [coord_columns]
            .isna()
            .sum()   # Sum NAs over columns (coordinates).
            .sum()   # Sum NAs over lines.
        ) // 3       # 1     missing atom has 3 missing coordinates.
        chromosome = atoms.query(f"residue_number == {residue_number}")
        if not missing_beads:
            continue
        print(f"Chromosome {residue_number} has {missing_beads} missing beads")

        # Interpolate coordinates on the three coordinates (x, y, z).
        # Only interpolation between existing beads.
        # No extrapolation at chromosome ends.
        interpolate_chromosome = chromosome.copy()
        interpolate_chromosome[coord_columns] = chromosome[coord_columns].interpolate(
            method="pchip", axis=0, limit_area="inside"
        )
        # Print number of missing beads that will later remove.
        deleted_atoms_number = interpolate_chromosome["x_coord"].isna().sum()
        print(
            f"Removed {deleted_atoms_number} beads from chromosome {residue_number} extremities"
        )
        # Save interpolated chromosome into the full structure.
        atoms.loc[atoms["residue_number"] == residue_number, coord_columns] = interpolate_chromosome[coord_columns]
    # Remove in one step all beads with missing coordinates
    # located at chromosome extremities.
    atoms = atoms.dropna(subset=coord_columns)
    print(
        f"Found {atoms.shape[0]} beads after interpolation and removal of extremities missing beads."
    )

    pdb.df["ATOM"] = atoms
    pdb.to_pdb(pdb_name_out)
    print(f"Wrote {pdb_name_out}")


if __name__ == "__main__":
    ARGS = get_cli_arguments()

    interpolate_missing_coordinates(ARGS.input_pdb, ARGS.output_pdb)

Python biopandas From line 6 of scripts/add_missing_beads.py

import argparse
import math
from pathlib import Path

from Bio import SeqIO
from biopandas.pdb import PandasPdb


def is_file(parser, file_path):
    """Check file exists.

    Parameters
    ----------
    parser : argparse.ArgumentParser
        Command line argument parser.
    file_path : str
        File path
    Returns
    -------
    str
        File path.
    """
    if not Path(file_path).is_file():
        parser.error(f"The file {file_path} does not exist")
    else:
        return file_path


def get_cli_arguments():
    """Command line argument parser.

    Returns
    -------
    argparse.Namespace
        Object containing arguments.
    """
    parser = argparse.ArgumentParser(add_help=False)
    required = parser.add_argument_group("required arguments")
    optional = parser.add_argument_group("optional arguments")
    required.add_argument(
        "--pdb",
        action="store",
        type=lambda name: is_file(parser, name),
        help="PDB file containing the 3D structure of the genome.",
        required=True,
    )
    required.add_argument(
        "--fasta",
        action="store",
        type=lambda name: is_file(parser, name),
        help="FASTA file containing the sequence of the genome.",
        required=True,
    )
    required.add_argument(
        "--resolution",
        action="store",
        type=int,
        help="HiC resolution from which the structure has been generated.",
        required=True,
    )
    required.add_argument(
        "--output",
        action="store",
        type=str,
        help="Output PDB file containing the annotated 3D structure of the genome.",
        required=True,
    )
    # Add help.
    optional.add_argument(
        "-h",
        "--help",
        action="help",
        default=argparse.SUPPRESS,
        help="Show this help message and exit.",
    )
    return parser.parse_args()


def extract_chromosome_length(fasta_name):
    """Extract chromosome length from a FASTA file.

    Parameters
    ----------
    fasta_name : str
        Name of Fasta file containing the sequence of the genome.

    Returns
    -------
    list
        List of chromosome lengths.
    """
    chromosome_length_lst = []
    with open(fasta_name, "r") as fasta_file:
        print(f"Reading {fasta_name}")
        for record in SeqIO.parse(fasta_file, "fasta"):
            length = len(record.seq)
            print(f"Found chromosome {record.id} with {length} bases")
            chromosome_length_lst.append(length)
    return chromosome_length_lst


def assign_chromosome_number(
    pdb_name_in, chromosome_length, HiC_resolution, pdb_name_out
):
    """Assign chromosome numbers to the whole genome 3D structure.

    Note:
    - The PDB file produced by Pastis is not readable by Biopython
    because the residue number column is missing.
    - We use instead the biopandas library: http://rasbt.github.io/biopandas/

    Parameters
    ----------
    pdb_name_in : str
        PDB file containing the 3D structure of the genome.
    chromosome_length : list
        List with chromosome lengths.
    HiC_resolution : int
        HiC resolution.
    pdb_name_out : str
        Output PDB file containing the annotated 3D structure of the genome.
    """
    coordinates = PandasPdb().read_pdb(pdb_name_in)
    coordinates_df = coordinates.df["ATOM"]
    # Verify the number of beads in the structure
    # and the number of beads deduced from the sequence and resolution
    # are the same.
    print(f"Number of beads read from structure: {coordinates_df.shape[0]}")
    beads_per_chromosome = [
        math.ceil(length / HiC_resolution) for length in chromosome_length
    ]
    print(
        f"Number of beads deduced from sequence and HiC resolution: {sum(beads_per_chromosome)}"
    )
    if coordinates_df.shape[0] != sum(beads_per_chromosome):
        raise ValueError(
            "Number of beads in structure and "
            "number of beads deduced from sequence and resolution "
            "are different."
        )
    # Add residue (chromosome) number.
    residue_number = []
    for index, bead_number in enumerate(beads_per_chromosome):
        residue_number = residue_number + [index + 1] * bead_number
    coordinates_df["residue_number"] = residue_number

    # Add residue name based on residue number.
    coordinates_df["residue_name"] = coordinates_df["residue_number"].apply(
        lambda x: f"C{x:02d}"
    )

    # Add chain (chromosome) name.
    # Chromosome 1 -> Chain A
    # Chromosome 2 -> Chain B
    # ...
    # Chromosome 26 -> Chain Z
    # Chromosome 27 -> Chain A
    # Chromosome 28 -> Chain B
    # ...
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    # Handle case where there are more than 26 chromosomes.
    repeat_factor = math.ceil(len(chromosome_length) / len(letters))
    residue_to_chain = dict(
        zip(coordinates_df["residue_number"].unique(), letters * repeat_factor)
    )
    # Map residue number to chain id.
    coordinates_df["chain_id"] = coordinates_df["residue_number"].map(residue_to_chain)

    # Fix atom number starting at 1 (instead of 0).
    coordinates_df["atom_number"] = coordinates_df["atom_number"] + 1

    # Write new PDB file to disk.
    coordinates.df["ATOM"] = coordinates_df
    coordinates.to_pdb(path=pdb_name_out, records=None, gz=False, append_newline=True)
    print(f"Wrote {pdb_name_out}")


if __name__ == "__main__":
    ARGS = get_cli_arguments()

    # Read FASTA file and get chromosome length.
    CHROMOSOME_LENGTH = extract_chromosome_length(ARGS.fasta)

    # Assign chromosome number.
    assign_chromosome_number(ARGS.pdb, CHROMOSOME_LENGTH, ARGS.resolution, ARGS.output)

Python Biopython biopandas From line 12 of scripts/assign_chromosomes.py

import argparse

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np


def get_cli_arguments():
    """Parse command line.

    Returns
    -------
    argparse.Namespace
        Object containing arguments
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--contacts",
        action="store",
        type=str,
        help="Input contact file",
        required=True,
    )
    parser.add_argument(
        "--map",
        action="store",
        type=str,
        help="Output contact map image",
        required=True,
    )
    return parser.parse_args()


def create_contact_map(contacts_file, map_file):
    """Create contact map image.

    Parameters
    ----------
    contacts_file : str
        Contact file
    map_file : str
        Contact map image
    """
    contacts = np.loadtxt(contacts_file)
    plt.imshow(np.log2(contacts+1), cmap="hot")
    plt.colorbar()
    plt.savefig(map_file, dpi=300)


if __name__ == "__main__":
    ARGS = get_cli_arguments()
    create_contact_map(ARGS.contacts, ARGS.map)

Python numpy matplotlib From line 3 of scripts/build_contact_maps.py

import argparse

from Bio import SeqIO

def get_cli_arguments():
    """Command line argument parser.

    Returns
    -------
    argparse.Namespace
        Object containing arguments
    """
    parser = argparse.ArgumentParser(add_help=False)
    required = parser.add_argument_group("required arguments")
    optional = parser.add_argument_group("optional arguments")
    required.add_argument(
        "--fasta",
        action="store",
        type=str,
        help="Fasta file containing the sequence of the genome.",
        required=True,
    )
    required.add_argument(
        "--output",
        action="store",
        type=str,
        help="Output text file containing chromosome sizes.",
        required=True,
    )
    # Add help.
    optional.add_argument(
        "-h",
        "--help",
        action="help",
        default=argparse.SUPPRESS,
        help="Show this help message and exit.",
    )
    return parser.parse_args()


if __name__ == "__main__":
    # Parse command line arguments.
    ARGS = get_cli_arguments()

    print(f"Opening {ARGS.fasta}")
    with open(ARGS.fasta, "r") as fasta_file, open(ARGS.output, "w") as size_file:
        for record in SeqIO.parse(fasta_file, "fasta"):
            print(f"{record.id}: {len(record.seq)} bases")
            size_file.write(f"{record.id}\t{len(record.seq)}\n")

Python Biopython From line 8 of scripts/calculate_chromosome_sizes.py

import argparse
import math
import pandas as pd
from pathlib import Path

from Bio import SeqIO
from biopandas.pdb import PandasPdb


def is_file(parser, file_path):
    """Check file exists.

    Parameters
    ----------
    parser : argparse.ArgumentParser
        Command line argument parser
    file_path : str
        File path
    Returns
    -------
    str
        File path    
    """
    if not Path(file_path).is_file():
        parser.error(f"The file {file_path} does not exist")
    else:
        return file_path


def get_cli_arguments():
    """Command line argument parser.

    Returns
    -------
    argparse.Namespace
        Object containing arguments
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--pdb",
        action="store",
        type=lambda name: is_file(parser, name),
        help="PDB file containing the 3D structure of the genome",
        required=True,
    )
    parser.add_argument(
        "--fasta",
        action="store",
        type=lambda name: is_file(parser, name),
        help="Fasta file containing the sequence of the genome",
        required=True,
    )
    parser.add_argument(
        "--resolution", action="store", type=int, help="HiC resolution", required=True,
    )
    parser.add_argument(
        "--output",
        action="store",
        type=str,
        help="Output g3d file containing the annotated 3D structure of the genome",
        required=True,
    )
    return parser.parse_args()


def extract_chromosome_length(fasta_name):
    """Extract chromosome length from a FASTA file.

    Parameters
    ----------
    fasta_name : str
        Name of Fasta file containing the sequence of the genome

    Returns
    -------
    list
        List of chromosome lengthes
    """
    chromosome_length_lst = []
    with open(fasta_name, "r") as fasta_file:
        print(f"Reading {fasta_name}")
        for record in SeqIO.parse(fasta_file, "fasta"):
            length = len(record.seq)
            print(f"Found chromosome {record.id} with {length} bases")
            chromosome_length_lst.append(length)
    return chromosome_length_lst


def convert_to_g3d(pdb_name_in, chromosome_length, HiC_resolution, g3d_name_out):
    """Assign chromosome numbers to the whole genome 3D structure.

    Note:
    - The PDB file produced by Pastis is not readable by Biopython
    because the residue number column is missing.
    - We use instead the biopandas library. http://rasbt.github.io/biopandas/

    Parameters
    ----------
    pdb_name_in : str
        PDB file containing the 3D structure of the genome
    chromosome_length : list
        List with chromosome lengths
    HiC_resolution : int
        HiC resolution
    g3d_name_out : str
        Output g3d file containing the annotated 3D structure of the genome
    """
    pdb = PandasPdb().read_pdb(pdb_name_in)
    atoms = pdb.df["ATOM"]
    print(f"Number of beads read from structure: {pdb.df['ATOM'].shape[0]}")

    # Deduce chromosomal position of each atoms
    bp_coordinates = list()
    for i in range(len(chromosome_length)):
        bp_coordinates_chrom_x = list(range(0, chromosome_length[i], HiC_resolution))
        bp_coordinates = bp_coordinates + bp_coordinates_chrom_x

    # Get the index of the beads without the ones already removed from the pdb file (by interpolate_missing_coordinates.py )
    atom_number = list(atoms["atom_number"]-1)

    # Delete the bp coordinates values corresponding to removed beads
    bp_coordinates = pd.Series(bp_coordinates)
    bp_coordinates = bp_coordinates.iloc[atom_number]
    bp_coordinates.reset_index(drop=True, inplace=True)

    # Extract chromosomes and 3d coordinates information
    g3d_data = { 'chrom': atoms["residue_number"],
                'locus':bp_coordinates,
                'X3D_x': atoms["x_coord"],
                'X3D_y': atoms["y_coord"],
                'X3D_z': atoms["z_coord"] }
    result = pd.DataFrame(g3d_data)

    #save as g3d file
    result.to_csv(g3d_name_out, sep="\t", index=False)

if __name__ == "__main__":
    ARGS = get_cli_arguments()

    # Read Fasta file and get chromosome length
    CHROMOSOME_LENGTH = extract_chromosome_length(ARGS.fasta)

    # Assign chromosome number
    convert_to_g3d(ARGS.pdb, CHROMOSOME_LENGTH, ARGS.resolution, ARGS.output)

Python Pandas Biopython biopandas From line 9 of scripts/convert_to_g3d.py

import argparse
from pathlib import Path

import jinja2


def is_file(parser, file_path):
    """Check file exists.

    Parameters
    ----------
    parser : argparse.ArgumentParser
        Command line argument parser.
    file_path : str
        File path
    Returns
    -------
    str
        File path.
    """
    if not Path(file_path).is_file():
        parser.error(f"The file {file_path} does not exist")
    else:
        return file_path


def get_cli_arguments():
    """Command line argument parser.

    Returns
    -------
    argparse.Namespace
        Object containing arguments.
    """
    parser = argparse.ArgumentParser(add_help=False)
    required = parser.add_argument_group("required arguments")
    optional = parser.add_argument_group("optional arguments")
    required.add_argument(
        "--template",
        action="store",
        type=lambda name: is_file(parser, name),
        help="Template file for HiC-Pro configuration.",
        required=True,
    )
    required.add_argument(
        "--chromosome-sizes",
        action="store",
        type=lambda name: is_file(parser, name),
        help="File with chromosome sizes.",
        required=True,
    )
    required.add_argument(
        "--genome-fragment",
        action="store",
        type=lambda name: is_file(parser, name),
        help="File with genome restriction sites.",
        required=True,
    )
    required.add_argument(
        "--genome-index-path",
        action="store",
        type=str,
        help="Path to bowtie2 indexes.",
        required=True,
    )
    required.add_argument(
        "--resolutions",
        action="store",
        type=int,
        nargs="+",
        help="HiC resolutions.",
        required=True,
    )
    required.add_argument(
        "--output",
        action="store",
        type=str,
        help="HiC-Pro configuration file.",
        required=True,
    )
    # Add help.
    optional.add_argument(
        "-h",
        "--help",
        action="help",
        default=argparse.SUPPRESS,
        help="Show this help message and exit.",
    )
    return parser.parse_args()


if __name__ == "__main__":
    ARGS = get_cli_arguments()

    with open(ARGS.template, "r") as template_file, \
        open(ARGS.output, "w") as out_file:
        template = jinja2.Template(template_file.read())
        out_file.write(
            template.render(
                genome_index_path=str(Path(ARGS.genome_index_path).resolve()) + "/",
                chromosome_sizes=Path(ARGS.chromosome_sizes).resolve(),
                genome_fragment=Path(ARGS.genome_fragment).resolve(),
                resolutions=" ".join([str(res) for res in ARGS.resolutions])
            )
        )

Python Bowtie 2 HiC-Pro Jinja2 From line 3 of scripts/create_HiC_Pro_config.py

import argparse
import numpy as np
import pandas as pd

from biopandas.pdb import PandasPdb

# Allow to print wide dataframes.
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 200)


def get_cli_arguments():
    """Command line argument parser.

    Returns
    -------
    argparse.Namespace
        Object containing arguments.
    """
    parser = argparse.ArgumentParser(add_help=False)
    required = parser.add_argument_group("required arguments")
    optional = parser.add_argument_group("optional arguments")
    required.add_argument(
        "--input-pdb",
        action="store",
        type=str,
        help="Input PDB file containing the 3D structure of the genome.",
        required=True,
    )
    required.add_argument(
        "--output-pdb",
        action="store",
        type=str,
        help="Output PDB file containing the completed 3D structure of the genome.",
        required=True,
    )
    optional.add_argument(
        "--threshold",
        action="store",
        type=float,
        help="Distance threshold to detect flipped contigs. Default: 3",
        required=False,
        default=2.0,
    )
    # Add help.
    optional.add_argument(
        "-h",
        "--help",
        action="help",
        default=argparse.SUPPRESS,
        help="Show this help message and exit.",
    )
    return parser.parse_args()


def delete_outlier_beads(pdb_name_in, pdb_name_out, threshold):
    """Delete outlier coordinates in a PDB file containing a 3D genome structure.

    Parameters
    ----------
    pdb_name_in : str
        PDB file containing the 3D structure of the genome
    pdb_name_out : str
        Output PDB file containing the annotated 3D structure of the genome
    """
    pdb = PandasPdb().read_pdb(pdb_name_in)
    atoms = pdb.df["ATOM"]
    print(f"Found {atoms.shape[0]} beads in {pdb_name_in}")

    coord_columns = ["x_coord", "y_coord", "z_coord"]
    ATOMS = pd.DataFrame()

    # delete outlier coordinates chromosome by chromosome.
    for residue_number in atoms["residue_number"].unique():
        # Select beads of one chromosome
        chromosome_df = atoms.query(f"residue_number == {residue_number}").reset_index(
            drop=True
        )

        # Compute Euclidean distances between bead n and bead n+1
        chromosome_df["distance"] = np.linalg.norm(
            chromosome_df[coord_columns].to_numpy()
            - chromosome_df[coord_columns].shift(-1).to_numpy(),
            axis=1,
        )
        # The last distance is Nan because there is no bead n+1 for the last bead.
        # We replace it by the distance between bead n and bead n-1.
        column_index = chromosome_df.columns.get_loc("distance")
        chromosome_df.iat[-1, column_index] = chromosome_df.iat[-2, column_index]

        # Compute median distance with possible Nan values
        median_distance = chromosome_df["distance"].median(skipna=True)
        print(f"Median distance between beads: {median_distance:.2f}")

        # Select outlier beads.
        # i.e. the beads with the greater ditances.
        # Output beads coordinates with distances
        outlier_beads = chromosome_df["distance"] >= threshold * median_distance

        if not outlier_beads.sum():
            print(f"Chromosome {residue_number} has no outlier beads")
            chromosome_df = chromosome_df.drop("distance", axis=1)
            # Save chromosome into the full stucture.
            ATOMS = pd.concat([ATOMS, chromosome_df])

        else:
            # Print number of outlier beads that will be later removed.
            print(
                f"Chromosome {residue_number} has {outlier_beads.sum()} outlier beads"
            )
            deleted_atoms = chromosome_df[
                chromosome_df["distance"] >= threshold * median_distance
            ]
            print(
                deleted_atoms[
                    [
                        "record_name",
                        "atom_number",
                        "residue_name",
                        "residue_number",
                        "x_coord",
                        "y_coord",
                        "z_coord",
                        "distance",
                    ]
                ]
            )
            kept_atoms = chromosome_df[
                chromosome_df["distance"] < threshold * median_distance
            ]
            kept_atoms = kept_atoms.drop("distance", axis=1)
            # Save chromosome into the full stucture.
            ATOMS = pd.concat([ATOMS, kept_atoms])
    ATOMS.reset_index(inplace=True, drop=True)
    ATOMS["line_idx"] = ATOMS.index
    print(f"Found {ATOMS.shape[0]} beads after removal of outlier beads.")
    pdb.df["ATOM"] = ATOMS
    pdb.to_pdb(pdb_name_out)
    print(f"Wrote {pdb_name_out}")


if __name__ == "__main__":
    ARGS = get_cli_arguments()

    delete_outlier_beads(ARGS.input_pdb, ARGS.output_pdb, ARGS.threshold)

Python Pandas numpy biopandas From line 6 of scripts/delete_outlier_beads.py

from watermark import watermark


def list_versions():
    """List package versions and describe environment."""
    print(
        watermark(
            machine=True,
            python=True,
            packages="numpy,pandas,scipy,biopandas,matplotlib,Bio,jinja2,sklearn,iced,pastis",
            watermark=True,
            conda=True
        )
    )


if __name__ == "__main__":
    list_versions()
    print("-" * 40)

Python watermark From line 3 of scripts/describe_conda_env.py

import argparse
import numpy as np
import pandas as pd
from pathlib import Path

import iced
from iced import io
from pastis.optimization import mds
from pastis import dispersion
from pastis.optimization import negative_binomial_structure
from pastis.io.write_struct.pdb import writePDB
from scipy import sparse

def is_file(parser, file_path):
    """Check file exists.

    Parameters
    ----------
    parser : argparse.ArgumentParser
        Command line argument parser
    file_path : str
        File path
    Returns
    -------
    str
        File path    
    """
    if not Path(file_path).is_file():
        parser.error(f"The file {file_path} does not exist")
    else:
        return file_path


def get_cli_arguments():
    """Command line argument parser.

    Returns
    -------
    argparse.Namespace
        Object containing arguments
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--matrix",
        action="store",
        type=lambda name: is_file(parser, name),
        help="Name of Matrix file",
        required=True,
    )
    parser.add_argument(
        "--bed",
        action="store",
        type=lambda name: is_file(parser, name),
        help="Name of Bed file",
        required=True,
    )
    parser.add_argument(
        "--output",
        action="store",
        type=str,
        help="Name of output PDB file containing the 3D structure of the genome",
        required=True,
    )
    return parser.parse_args()


def run_pastis_nb(matrix_filename, bed_filename, output_filename, seed=0, percentage_to_filter=0.02):
    """Build 3D structure of genome with Pastis NB algorithme.

    Parameters
    ----------
    matrix_filename : str
        Name of Matrix file
    bed_filename : str
        Name of Bed file
    output_filename : str
        Name of output PDB file with 3D structure of the genome
    seed : int (default: 0)
        Random generator seed
    percentage_to_filter : float (default: 0.02)
        Percentage to filter out data 
    """
    ###############################################################################
    #Load chromosome lengths and count data
    lengths = io.load_lengths(bed_filename)

    ###############################################################################
    # The sparse matrix generated by HiC-Pro seems to randomly miss a row.
    # The dense matrix generated by HiC-Pro_3.1.0/bin/utils/sparseToDense.py is always complete.
    # The 3D model is thus inferred from it.
    #counts = io.load_counts(matrix_filename, base=1)
    counts_maps = pd.read_csv(matrix_filename, sep='\t', header=None)
    counts_maps = counts_maps.values
    # to solve : File "iced/_filter_.pyx", line 15, in iced._filter_._filter_csr ValueError: Buffer dtype mismatch, expected 'DOUBLE' but got 'long'
    counts_maps = counts_maps.astype("double")
    #counts_maps = np.load(matrix_filename)
    counts_maps[np.isnan(counts_maps)] = 0
    counts = sparse.coo_matrix(counts_maps)
    counts.setdiag(0)
    counts.eliminate_zeros()
    counts = counts.tocoo()

    ###############################################################################
    # Normalize the data
    counts = iced.filter.filter_low_counts(
        counts, percentage=percentage_to_filter,
        sparsity=False)
    normed, bias = iced.normalization.ICE_normalization(
        counts, max_iter=300,
        output_bias=True)
    # Save the normalization data and biases just for other purposes
    # io.write_counts("./nb_normalized.matrix", normed)
    # np.savetxt("./nb.biases", bias)
    random_state = np.random.RandomState(seed)

    ###############################################################################
    # First estimate MDS for initialization
    X = mds.estimate_X(normed, random_state=random_state)
    ###############################################################################
    # Estimate constant dispersion parameters
    dispersion_ = dispersion.ExponentialDispersion(degree=0)
    _, mean, variance, weights = dispersion.compute_mean_variance(counts, lengths, bias=bias)
    dispersion_.fit(mean, variance, sample_weights=(weights**0.5))
    ###############################################################################
    # Now perform NB 3D inference.
    alpha = -3
    beta = 1
    counts = counts.tocoo()
    print("Estimating structure")
    X = negative_binomial_structure.estimate_X(
        counts, alpha, beta, bias=bias,
        lengths=lengths,
        dispersion=dispersion_,
        use_zero_entries=True,
        ini=X.flatten())
    ###############################################################################
    # Remove beads that were not infered
    mask = (np.array(counts.sum(axis=0)).flatten() + np.array(counts.sum(axis=1)).flatten() == 0)
    mask = mask.flatten()
    X_ = X.copy()
    X_[mask] = np.nan
    ###############################################################################
    np.savetxt(output_filename + ".txt", X_)
    print("Results written to", output_filename)
    writePDB(X_, output_filename)


if __name__ == "__main__":
    ARGS = get_cli_arguments()
    run_pastis_nb(
        matrix_filename=ARGS.matrix,
        bed_filename=ARGS.bed,
        output_filename=ARGS.output,
        seed=0, 
        percentage_to_filter=0.02
    )

Python Pandas numpy scipy HiC-Pro iced pastis From line 8 of scripts/infer_structures_nb.py

import argparse
import math
from operator import invert
import shutil
import sys

from Bio import SeqIO
from Bio.Seq import Seq
from biopandas.pdb import PandasPdb
import numpy as np
import pandas as pd


def get_cli_arguments():
    """Command line argument parser.

    Returns
    -------
    argparse.Namespace
        Object containing arguments
    """
    parser = argparse.ArgumentParser(add_help=False)
    required = parser.add_argument_group("required arguments")
    optional = parser.add_argument_group("optional arguments")
    required.add_argument(
        "--pdb",
        action="store",
        type=str,
        help="PDB file containing the 3D structure of the genome.",
        required=True,
    )
    required.add_argument(
        "--fasta",
        action="store",
        type=str,
        help="Fasta file containing the sequence of the genome.",
        required=True,
    )
    required.add_argument(
        "--resolution",
        action="store",
        type=int,
        help="HiC resolution from which the structure has been generated.",
        required=True,
    )
    required.add_argument(
        "--output-pdb",
        action="store",
        type=str,
        help="Output PDB file containing the fixed 3D structure of the genome.",
        required=True,
    )
    required.add_argument(
        "--output-fasta",
        action="store",
        type=str,
        help="Output FASTA file containing the fixed sequence of the genome.",
        required=True,
    )
    optional.add_argument(
        "--threshold",
        action="store",
        type=float,
        help="Distance threshold to detect flipped contigs. Default: 3",
        required=False,
        default=3.0,
    )
    optional.add_argument(
        "--debug",
        action="store_true",
        help="Debug flag. Output one TSV file per chromosome with bead distances.",
        required=False,
        default=False,
    )
    # Trigger to run the script or not.
    optional.add_argument(
        "--run",
        action="store",
        choices=("True", "False"),
        help="Run the script. Default: False.",
        required=False,
        default="False",
    )
    # Add help.
    optional.add_argument(
        "-h",
        "--help",
        action="help",
        default=argparse.SUPPRESS,
        help="Show this help message and exit.",
    )
    return parser.parse_args()


def extract_chromosome_name_length(fasta_name):
    """Extract chromosome name and length from a FASTA file.

    Parameters
    ----------
    fasta_name : str
        Name of the FASTA file containing the sequence of the genome.

    Returns
    -------
    tuple
        List of chromosome names.
        List of chromosome lengthes.
    """
    chromosome_name_lst = []
    chromosome_length_lst = []
    with open(fasta_name, "r") as fasta_file:
        print(f"Reading genome sequence in {fasta_name}")
        for record in SeqIO.parse(fasta_file, "fasta"):
            name = record.id
            length = len(record.seq)
            print(f"Found chromosome {name} with {length} bases")
            chromosome_name_lst.append(name)
            chromosome_length_lst.append(length)
    return chromosome_name_lst, chromosome_length_lst


def find_inverted_contigs(
    pdb_name_in, chromosome_lengths, HiC_resolution, threshold
):
    """Find inverted contigs.

    Inverted contigs are detected based on the eucledian distance
    between adjacent beads in the 3D structure of the genome.

    Parameters
    ----------
    pdb_name_in : str
        PDB file containing the 3D structure of the genome.
    chromosome_lengths : list
        List with chromosome lengths.
    HiC_resolution : int
        HiC resolution.
    threshold : float
        Threshold to detect flipped contigs.

    Returns
    -------
    inverted_contigs : dict
        Dictionnary with inverted contigs.
    """
    pdb_structure = PandasPdb().read_pdb(pdb_name_in)
    structure_df = pdb_structure.df["ATOM"]
    print(f"Number of beads read from structure: {structure_df.shape[0]}")

    coord_columns = ["x_coord", "y_coord", "z_coord"]
    missing_beads_number = structure_df[coord_columns].isna().sum().sum() // 3
    print(
        f"Found {missing_beads_number} missing beads in structure"
    )

    beads_per_chromosome = [
        math.ceil(length / HiC_resolution) for length in chromosome_lengths
    ]
    print(
        f"Number of expected beads deduced from sequence and HiC resolution: {sum(beads_per_chromosome)}"
    )

    if structure_df.shape[0] != sum(beads_per_chromosome):
        sys.exit(
            f"Cannot process structure {pdb_name_in} because it contains "
            f"{structure_df.shape[0]} beads "
            f"instead of {sum(beads_per_chromosome)}"
        )

    inverted_contigs = {}

    for chrom_num in structure_df["residue_number"].unique():
        print(f"\nLooking for inverted contigs into chromosome {chrom_num}")

        # Select beads of one chromosome
        chromosome_df = structure_df.query(
            f"residue_number == {chrom_num}"
        ).reset_index(drop=True)

        # Compute Euclidean distances between bead n and bead n+1
        chromosome_df["distance"] = np.linalg.norm(
            chromosome_df[coord_columns].to_numpy() - chromosome_df[coord_columns].shift(-1).to_numpy(),
            axis=1
        )
        # Compute median distance with possible Nan values
        median_distance = chromosome_df["distance"].median(skipna=True)
        print(f"Median distance between beads: {median_distance:.2f}")

        # Select extremities of inverted contigs
        # i.e. beads with distance above a given threshold.
        # Output beads coordinates with distances
        if ARGS.debug:
            filename = f"chr_{chrom_num}.tsv"
            target_columns = [
                "record_name", "atom_number", "atom_name",
                "residue_name", "chain_id", "residue_number",
                "x_coord", "y_coord", "z_coord", "line_idx",
                "distance",
            ]
            print(f"DEBUG: writing {filename} with distances.")
            chromosome_df[target_columns].to_csv(
                filename, sep="\t", index=False, na_rep="nan"
            )

        beads_selection = (
            chromosome_df["distance"] > threshold * median_distance
        )
        inversion_limits = chromosome_df.loc[
            beads_selection, "atom_number"
        ].values
        if len(inversion_limits) % 2 != 0:
            print("WARNING: odd number of inversion limits found")
            print(
                "WARNING: this might lead to a wrong detection of inverted contigs"
            )
            print(inversion_limits)
        if len(inversion_limits) != 0:
            for limit_1, limit_2 in zip(
                inversion_limits[0::2], inversion_limits[1::2]
            ):
                print(
                    f"Chromosome {chrom_num}: found inverted contig between bead {limit_1+1} and bead {limit_2}"
                )
                if chrom_num in inverted_contigs:
                    inverted_contigs[chrom_num].append((limit_1 + 1, limit_2))
                else:
                    inverted_contigs[chrom_num] = [(limit_1 + 1, limit_2)]
        else:
            inverted_contigs[chrom_num] = []
    return inverted_contigs


def flip_inverted_contigs_in_structure(
    inverted_contigs, pdb_name_in, pdb_name_out
):
    """Flip inverted contigs in the 3D structure of the genome.

    Parameters
    ----------
    inverted_contigs : dict
        Dictionnary with inverted contigs
    pdb_name_in : str
        PDB file containing the 3D structure of the genome
    pdb_name_out : str
        Output PDB file containing the 3D structure of the genome
    """
    pdb_structure = PandasPdb().read_pdb(pdb_name_in)
    coordinates = pdb_structure.df["ATOM"]
    if sum(map(len, inverted_contigs.values()))== 0:
        pdb_structure.to_pdb(
        path=pdb_name_out, records=None, gz=False, append_newline=True
    )
        print("\nNothing to fix. Structure is fine.")
        return
    print("\nFlipping contigs.")
    for chrom_num in inverted_contigs:
        for contig in inverted_contigs[chrom_num]:
            contig_start, contig_end = contig
            print(
                f"Structure of chromosome {chrom_num}: "
                f"flip contig between beads {contig_start} "
                f"and {contig_end}"
            )
            contig_start_index = coordinates[
                (coordinates["residue_number"] == chrom_num)
                & (coordinates["atom_number"] == contig_start)
            ].index[0]
            contig_end_index = coordinates[
                (coordinates["residue_number"] == chrom_num)
                & (coordinates["atom_number"] == contig_end)
            ].index[0]
            contig_before_df = coordinates.loc[: contig_start_index - 1, :]
            contig_df = coordinates.loc[contig_start_index:contig_end_index, :]
            contig_after_df = coordinates.loc[contig_end_index + 1 :, :]
            # Flip contig.
            contig_df = contig_df[::-1]
            # Assemble genome structure.
            coordinates = pd.concat(
                [contig_before_df, contig_df, contig_after_df]
            )

    coordinates = coordinates.reset_index(drop=True)
    # The 'line_idx' column keeps the real order of atoms in the PDB file.
    coordinates["line_idx"] = coordinates.index
    pdb_structure.df["ATOM"] = coordinates
    pdb_structure.to_pdb(
        path=pdb_name_out, records=None, gz=False, append_newline=True
    )


def flip_inverted_contigs_in_sequence(
    inverted_contigs,
    chromosome_names,
    fasta_name_in,
    HiC_resolution,
    fasta_name_out,
):
    """Flip inverted contigs in the genome 3D structure and sequence.

    Parameters
    ----------
    inverted_contigs : dict
        Dictionnary with inverted contigs.
    chromosome_names : list
        List with chromosome names.
    fasta_name_in : str
        Name of Fasta file containing the sequence of the genome.
    HiC_resolution : int
        HiC resolution.
    fasta_name_out : str
        Output FASTA file containing the fixed sequence (at the 3D structure resolution!).
    """
    # Flip inverted contigs in the genome sequence.
    genome_fasta = SeqIO.to_dict(SeqIO.parse(fasta_name_in, "fasta"))

    if bool([a for a in inverted_contigs.values() if a == []]):
        with open(fasta_name_out, "w") as fasta_file:
            SeqIO.write(genome_fasta.values(), fasta_file, "fasta")
        print("Nothing to fix. Sequence is fine.")
        return

    for chrom_num in inverted_contigs:
        chrom_name = chromosome_names[chrom_num - 1]
        chrom_sequence = str(genome_fasta[chrom_name].seq)
        for contig in inverted_contigs[chrom_num]:
            contig_start = contig[0] * HiC_resolution
            contig_end = contig[1] * HiC_resolution
            print(
                f"Sequence of chromosome {chrom_num}: "
                f"flip inverted contig between base {contig_start} "
                f"and {contig_end}"
            )
            contig_sequence = chrom_sequence[contig_start : contig_end + 1]
            # Flip contig.
            contig_sequence = contig_sequence[::-1]
            # Reassemble chromosome sequence.
            chrom_sequence = (
                chrom_sequence[:contig_start]
                + contig_sequence
                + chrom_sequence[contig_end + 1 :]
            )
        genome_fasta[chrom_name].seq = Seq(chrom_sequence)

    # Write genome sequence.
    with open(fasta_name_out, "w") as fasta_file:
        SeqIO.write(genome_fasta.values(), fasta_file, "fasta")


if __name__ == "__main__":
    # Parse command line arguments.
    ARGS = get_cli_arguments()

    if ARGS.run != "True":
        print("Do not verify inverted contigs.")
        print(f"Copying {ARGS.pdb} to {ARGS.output_pdb}.")
        shutil.copyfile(ARGS.pdb, ARGS.output_pdb)
        print(f"Copying {ARGS.fasta} to {ARGS.output_fasta}.")
        shutil.copyfile(ARGS.fasta, ARGS.output_fasta)
        sys.exit()

    # Read Fasta file and extract chromosome names and lengths.
    CHROMOSOME_NAMES, CHROMOSOME_LENGTHS = extract_chromosome_name_length(
        ARGS.fasta
    )

    # Find inverted contigs.
    INVERTED_CONTIGS = find_inverted_contigs(
        ARGS.pdb, CHROMOSOME_LENGTHS, ARGS.resolution, ARGS.threshold
    )
    # Flip inverted contigs in the genome 3D structure and sequence.
    flip_inverted_contigs_in_structure(
        INVERTED_CONTIGS, ARGS.pdb, ARGS.output_pdb
    )
    flip_inverted_contigs_in_sequence(
        INVERTED_CONTIGS,
        CHROMOSOME_NAMES,
        ARGS.fasta,
        ARGS.resolution,
        ARGS.output_fasta,
    )

Python Pandas numpy Biopython biopandas From line 11 of scripts/verify_inverted_contigs.py

shell:
    "python ../scripts/describe_conda_env.py >{output} &&"
    "fasterq-dump --version >>{output} &&"
    "pigz --version >>{output} "

SnakeMake From line 31 of master/Snakefile

shell:
    "/usr/local/bin/HiC-Pro_3.1.0/bin/utils/digest_genome.py --help >{output} && "
    "printf '\n\n\n' >>{output} && "
    "bowtie2-build --version >>{output} &&"
    "printf '\n\n\n' >>{output} && "
    "LC_ALL=C; HiC-Pro --version >>{output} "

SnakeMake Bowtie 2 HiC-Pro From line 46 of master/Snakefile

shell: 
    "fasterq-dump {wildcards.sra_id} --threads {threads} --log-level {params.loglevel} "
    "{params.progress} --outdir {params.outdir} >{log} 2>&1 "

SnakeMake From line 74 of master/Snakefile

shell: 
    "mv {input} {output}"

SnakeMake From line 87 of master/Snakefile

shell:
    "python ../scripts/calculate_chromosome_sizes.py "
    "--fasta {input} "
    "--output {output} "
    ">{log} 2>&1 "

SnakeMake From line 122 of master/Snakefile

shell:
    "/usr/local/bin/HiC-Pro_3.1.0/bin/utils/digest_genome.py "
    "-r {config[hicpro_restriction_sites]} "
    "-o {output} "
    "{input} "
    ">{log} 2>&1 "

SnakeMake From line 142 of master/Snakefile

shell:
    "bowtie2-build {input} HiC-Pro/bowtie2_index/genome >{log} 2>&1"

SnakeMake Bowtie 2 From line 163 of master/Snakefile

shell:
    "python ../scripts/create_HiC_Pro_config.py "
    "--template {input.template} "
    "--chromosome-sizes {input.chromosome_sizes} "
    "--genome-fragment {input.genome_fragment} "
    "--genome-index-path {params.genome_index_path} "
    "--resolutions {params.resolutions} "
    "--output {output} "
    ">{log} 2>&1"

SnakeMake From line 184 of master/Snakefile

shell:
    "LC_ALL=C; echo 'y' | HiC-Pro "
    "-i fastq_files "
    "-o HiC-Pro/output "
    "-c {input.config} "
    ">{log} 2>&1 "

SnakeMake HiC-Pro From line 212 of master/Snakefile

shell:
    "cp {input} {output}"

SnakeMake From line 233 of master/Snakefile

shell:
    "HiC-Pro "
    "-i HiC-Pro/merged_samples "
    "-o HiC-Pro/merged_output "
    "-c {input.config} "
    "-s merge_persample "
    "-s build_contact_maps "
    "-s ice_norm "
    ">{log} 2>&1 "

SnakeMake HiC-Pro From line 257 of master/Snakefile

shell:
    "/usr/local/bin/HiC-Pro_3.1.0/bin/utils/sparseToDense.py "
    "--bins {input.bed} "
    "--output {output} "
    "{input.matrix} "
    ">{log} 2>&1 "

SnakeMake From line 280 of master/Snakefile

shell:
    "/usr/local/bin/HiC-Pro_3.1.0/bin/utils/sparseToDense.py "
    "--bins {input.bed} "
    "--output {output} "
    "{input.matrix} "
    ">{log} 2>&1 "

SnakeMake From line 300 of master/Snakefile

shell:
    "python ../scripts/build_contact_maps.py "
    "--contacts {input} "
    "--map {output} "
    ">{log} 2>&1 "

SnakeMake From line 319 of master/Snakefile

shell:
    "python ../scripts/infer_structures_nb.py "
    "--matrix {input.matrix} "
    "--bed {input.bed} "
    "--output {output} "
    ">{log} 2>&1 "

SnakeMake From line 338 of master/Snakefile

shell:
    "python ../scripts/assign_chromosomes.py "
    "--pdb {input.structure} "
    "--fasta {input.sequence} "
    "--resolution {wildcards.resolution} "
    "--output {output} "
    ">{log} 2>&1 "

SnakeMake From line 358 of master/Snakefile

shell:
    "python ../scripts/verify_inverted_contigs.py "
    "--run {params.verify} "
    "--pdb {input.structure} "
    "--fasta {input.sequence} "
    "--resolution {wildcards.resolution} "
    "--output-pdb {output.structure} "
    "--output-fasta {output.sequence} "
    ">{log} 2>&1 "

SnakeMake From line 382 of master/Snakefile

shell:
    "python ../scripts/add_missing_beads.py "
    "--input-pdb {input} "
    "--output-pdb {output} "
    ">{log} 2>&1 "

SnakeMake From line 404 of master/Snakefile

shell:
    "python ../scripts/delete_outlier_beads.py "
    "--input-pdb {input} "
    "--output-pdb {output} "
    ">{log} 2>&1 "

SnakeMake From line 422 of master/Snakefile

shell:
    "python ../scripts/convert_to_g3d.py "
    "--pdb {input.structure} "
    "--fasta {input.sequence} "
    "--resolution {wildcards.resolution} "
    "--output {output} "
    ">{log} 2>&1 "