3DGB is a workflow to build 3D models of genomes from HiC data

public public 1yr ago Version: v1.0 0 bookmarks

3D structure of the Neurospora crassa genome at 50 kb resolution

3D genome builder (3DGB) is a workflow to build 3D models of genomes from HiC raw data and to integrate omics data on the produced models for further visual exploration. 3DGB bundles HiC-Pro , PASTIS and custom Python scripts into a unified Snakemake workflow with limited inputs (see Preparing Required Files ). 3DGB produces annotated 3D models of genome in PDB and G3D formats.

Download this repository

git clone https://github.com/data-fun/3d-genome-builder.git
cd 3d-genome-builder

Install dependencies

Singularity

Download the latest version here

Install Singularity:

sudo apt install -y ./singularity-container_3.8.7_amd64.deb

Verify version:

$ singularity --version
singularity version 3.8.7

Conda environment

Install conda .

Install mamba:

conda install mamba -n base -c conda-forge

Create conda environment and install dependendies:

mamba env create -f binder/environment.yml

Load conda environment:

conda activate 3DGB

Download HiC-Pro Singularity image

wget --ciphers=DEFAULT:@SECLEVEL=1 https://zerkalo.curie.fr/partage/HiC-Pro/hicpro_3.1.0_ubuntu.img -P images

Verify HiC-Pro version with:

$ singularity exec images/hicpro_3.1.0_ubuntu.img HiC-Pro --version
[...]
HiC-Pro version 3.1.0

and bowtie2 version:

$ singularity exec images/hicpro_3.1.0_ubuntu.img bowtie2 --version 2>/dev/null | head -n 1
/usr/local/conda/envs/hicpro/bin/bowtie2-align-s version 2.4.4

Prepare required files

Create the config file

Create and edit a configuration file in yaml format. See for instance the template config_template.yml

Add the reference genome

The reference genome fasta file must be located in WORKING_DIR/genome.fasta where WORKING_DIR is the name of the working directory as specified in your config file.

Add FASTQ files (optional)

If you already have fastq files stored locally or some fastq files are not available on GEO or SRA, you can use these files providing they are in the proper directory structure:

3D structure of the chromosome 13 of S. cerevisiae at 5 kb resolution

WORKING_DIR/
├── fastq_files
│ ├── ID1
│ │ ├── ID1_R1.fastq.gz
│ │ └── ID1_R2.fastq.gz
│ ├── ID2
│ │ ├── ID2_R1.fastq.gz
│ │ └── ID2_R2.fastq.gz
│ ├── ID3
│ │ ├── ID3_R1.fastq.gz
│ │ └── ID3_R2.fastq.gz
│ └── ID4
│ ├── ID4_R1.fastq.gz
│ └── ID4_R2.fastq.gz
└── genome.fasta
  • WORKING_DIR is the name of the working directory as specified in your config file.

  • Paired-end fastq files are in the directory WORKING_DIR/fastq_files/IDx with IDx the identifier of the paired fastq files. Fastq identifiers are reported in the config file. Please note fastq files have to follow the pattern <sample ID>_R<1 or 2>.fastq.gz .

Note

Please strictly follow this file organization as it is required by the 3DGB workflow.

Build model

Run 3DGB:

snakemake --profile smk_profile -j 4 --configfile YOUR-CONFIG.yml

Note

  • Adapt YOUR-CONFIG.yml to the exact name of the config file you created.

  • Option -j 4 tells Snakemake to use up to 4 cores. If you are more cores available, you can increase this value ( e.g. -j 16 ).

Or with debugging options:

snakemake --profile smk_profile_debug -j 4 --configfile YOUR-CONFIG.yml --verbose

Depending on the number and size of fastq files, the 3D construction will take a couple of hours to run.

For troubleshooting, have a look to log files in WORKING_DIR/logs , where WORKING_DIR is the name of the working directory as specified in your config file.

Map quantitative values on the 3D model

To map quantitative values on the model run:

python ./scripts/map_parameter.py --pdb path/to/structure.pdb --bedgraph path/to/annotation.bedgraph --output path/to/output.pdb

Quantitative values should be formatted in a 4-column bedgraph file (chromosome/start/stop/value):

chr1	0	50000	116.959
chr1	50000	100000	48.4495
chr1	100000	150000	22.8726
chr1	150000	200000	84.3106
chr1	200000	250000	113.109

Each bead of the model will be assigned a quantitative value. The resolution in the bedgraph file should match the resolution used to build the model.

Get results

Upon completion, the WORKING_DIR should look like this:

WORKING_DIR/
├── contact_maps
├── dense_matrix
├── fastq_files
├── HiC-Pro
├── logs
├── pastis
├── sequence
└── structure

The following paths contain the most interesting results:

  • WORKING_DIR/contact_maps/*.png : contact maps.

  • WORKING_DIR/HiC-Pro/output/hic_results/pic/*/*.pdf : graphical summaries of read alignments produced by Hi-C Pro.

  • WORKING_DIR/pastis/structure_RESOLUTION.pdb : raw 3D models (in PDB format) produced by Pastis.

  • WORKING_DIR/structure/RESOLUTION/structure_cleaned.* : final (annotated) 3D models in PDB and G3D formats.

Note

  • WORKING_DIR is the name of the working directory as specified in your config file.

  • RESOLUTION is the resolution of the Hi-C data specified in the config file.

Examples

Visualize 3D model structures

To visualize 3D model structures (.pdb and .g3d files), follow this quick tutorial .

Build DAG graph

For visualization purpose, you can build the graph of all computational steps involved in the 3D construction of the genome.

snakemake --profile smk_profile --configfile YOUR-CONFIG.yml --rulegraph | dot -Tpdf > rules.pdf

where YOUR-CONFIG.yml should be replaced by the name of the config file you created.

With wildcards:

snakemake --profile smk_profile --configfile YOUR-CONFIG.yml --dag | dot -Tpdf > dag.pdf

Code Snippets

  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
import argparse

from biopandas.pdb import PandasPdb


def get_cli_arguments():
    """Command line argument parser.

    Returns
    -------
    argparse.Namespace
        Object containing arguments.
    """
    parser = argparse.ArgumentParser(add_help=False)
    required = parser.add_argument_group("required arguments")
    optional = parser.add_argument_group("optional arguments")
    required.add_argument(
        "--input-pdb",
        action="store",
        type=str,
        help="Input PDB file containing the 3D structure of the genome.",
        required=True,
    )
    required.add_argument(
        "--output-pdb",
        action="store",
        type=str,
        help="Output PDB file containing the completed 3D structure of the genome.",
        required=True,
    )
    # Add help.
    optional.add_argument(
        "-h",
        "--help",
        action="help",
        default=argparse.SUPPRESS,
        help="Show this help message and exit.",
    )
    return parser.parse_args()


def interpolate_missing_coordinates(pdb_name_in, pdb_name_out):
    """Interpolate missing coordinates in PDB file containing a 3D genome structure.

    3D coordinates of the missing beads are interpolated one by one.

    Parameters
    ----------
    pdb_name_in : str
        PDB file containing the 3D structure of the genome
    pdb_name_out : str
        Output PDB file containing the annotated 3D structure of the genome
    """
    pdb = PandasPdb().read_pdb(pdb_name_in)
    atoms = pdb.df["ATOM"]
    print(f"Found {atoms.shape[0]} beads in {pdb_name_in}")
    # Interpolate missing coordinates chromosome by chromosome.
    coord_columns= ["x_coord", "y_coord", "z_coord"]
    for residue_number in atoms["residue_number"].unique():
        missing_beads = (
            atoms.query(f"residue_number == {residue_number}")
            [coord_columns]
            .isna()
            .sum()   # Sum NAs over columns (coordinates).
            .sum()   # Sum NAs over lines.
        ) // 3       # 1     missing atom has 3 missing coordinates.
        chromosome = atoms.query(f"residue_number == {residue_number}")
        if not missing_beads:
            continue
        print(f"Chromosome {residue_number} has {missing_beads} missing beads")

        # Interpolate coordinates on the three coordinates (x, y, z).
        # Only interpolation between existing beads.
        # No extrapolation at chromosome ends.
        interpolate_chromosome = chromosome.copy()
        interpolate_chromosome[coord_columns] = chromosome[coord_columns].interpolate(
            method="pchip", axis=0, limit_area="inside"
        )
        # Print number of missing beads that will later remove.
        deleted_atoms_number = interpolate_chromosome["x_coord"].isna().sum()
        print(
            f"Removed {deleted_atoms_number} beads from chromosome {residue_number} extremities"
        )
        # Save interpolated chromosome into the full structure.
        atoms.loc[atoms["residue_number"] == residue_number, coord_columns] = interpolate_chromosome[coord_columns]
    # Remove in one step all beads with missing coordinates
    # located at chromosome extremities.
    atoms = atoms.dropna(subset=coord_columns)
    print(
        f"Found {atoms.shape[0]} beads after interpolation and removal of extremities missing beads."
    )

    pdb.df["ATOM"] = atoms
    pdb.to_pdb(pdb_name_out)
    print(f"Wrote {pdb_name_out}")


if __name__ == "__main__":
    ARGS = get_cli_arguments()

    interpolate_missing_coordinates(ARGS.input_pdb, ARGS.output_pdb)
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
import argparse
import math
from pathlib import Path

from Bio import SeqIO
from biopandas.pdb import PandasPdb


def is_file(parser, file_path):
    """Check file exists.

    Parameters
    ----------
    parser : argparse.ArgumentParser
        Command line argument parser.
    file_path : str
        File path
    Returns
    -------
    str
        File path.
    """
    if not Path(file_path).is_file():
        parser.error(f"The file {file_path} does not exist")
    else:
        return file_path


def get_cli_arguments():
    """Command line argument parser.

    Returns
    -------
    argparse.Namespace
        Object containing arguments.
    """
    parser = argparse.ArgumentParser(add_help=False)
    required = parser.add_argument_group("required arguments")
    optional = parser.add_argument_group("optional arguments")
    required.add_argument(
        "--pdb",
        action="store",
        type=lambda name: is_file(parser, name),
        help="PDB file containing the 3D structure of the genome.",
        required=True,
    )
    required.add_argument(
        "--fasta",
        action="store",
        type=lambda name: is_file(parser, name),
        help="FASTA file containing the sequence of the genome.",
        required=True,
    )
    required.add_argument(
        "--resolution",
        action="store",
        type=int,
        help="HiC resolution from which the structure has been generated.",
        required=True,
    )
    required.add_argument(
        "--output",
        action="store",
        type=str,
        help="Output PDB file containing the annotated 3D structure of the genome.",
        required=True,
    )
    # Add help.
    optional.add_argument(
        "-h",
        "--help",
        action="help",
        default=argparse.SUPPRESS,
        help="Show this help message and exit.",
    )
    return parser.parse_args()


def extract_chromosome_length(fasta_name):
    """Extract chromosome length from a FASTA file.

    Parameters
    ----------
    fasta_name : str
        Name of Fasta file containing the sequence of the genome.

    Returns
    -------
    list
        List of chromosome lengths.
    """
    chromosome_length_lst = []
    with open(fasta_name, "r") as fasta_file:
        print(f"Reading {fasta_name}")
        for record in SeqIO.parse(fasta_file, "fasta"):
            length = len(record.seq)
            print(f"Found chromosome {record.id} with {length} bases")
            chromosome_length_lst.append(length)
    return chromosome_length_lst


def assign_chromosome_number(
    pdb_name_in, chromosome_length, HiC_resolution, pdb_name_out
):
    """Assign chromosome numbers to the whole genome 3D structure.

    Note:
    - The PDB file produced by Pastis is not readable by Biopython
    because the residue number column is missing.
    - We use instead the biopandas library: http://rasbt.github.io/biopandas/

    Parameters
    ----------
    pdb_name_in : str
        PDB file containing the 3D structure of the genome.
    chromosome_length : list
        List with chromosome lengths.
    HiC_resolution : int
        HiC resolution.
    pdb_name_out : str
        Output PDB file containing the annotated 3D structure of the genome.
    """
    coordinates = PandasPdb().read_pdb(pdb_name_in)
    coordinates_df = coordinates.df["ATOM"]
    # Verify the number of beads in the structure
    # and the number of beads deduced from the sequence and resolution
    # are the same.
    print(f"Number of beads read from structure: {coordinates_df.shape[0]}")
    beads_per_chromosome = [
        math.ceil(length / HiC_resolution) for length in chromosome_length
    ]
    print(
        f"Number of beads deduced from sequence and HiC resolution: {sum(beads_per_chromosome)}"
    )
    if coordinates_df.shape[0] != sum(beads_per_chromosome):
        raise ValueError(
            "Number of beads in structure and "
            "number of beads deduced from sequence and resolution "
            "are different."
        )
    # Add residue (chromosome) number.
    residue_number = []
    for index, bead_number in enumerate(beads_per_chromosome):
        residue_number = residue_number + [index + 1] * bead_number
    coordinates_df["residue_number"] = residue_number

    # Add residue name based on residue number.
    coordinates_df["residue_name"] = coordinates_df["residue_number"].apply(
        lambda x: f"C{x:02d}"
    )

    # Add chain (chromosome) name.
    # Chromosome 1 -> Chain A
    # Chromosome 2 -> Chain B
    # ...
    # Chromosome 26 -> Chain Z
    # Chromosome 27 -> Chain A
    # Chromosome 28 -> Chain B
    # ...
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    # Handle case where there are more than 26 chromosomes.
    repeat_factor = math.ceil(len(chromosome_length) / len(letters))
    residue_to_chain = dict(
        zip(coordinates_df["residue_number"].unique(), letters * repeat_factor)
    )
    # Map residue number to chain id.
    coordinates_df["chain_id"] = coordinates_df["residue_number"].map(residue_to_chain)

    # Fix atom number starting at 1 (instead of 0).
    coordinates_df["atom_number"] = coordinates_df["atom_number"] + 1

    # Write new PDB file to disk.
    coordinates.df["ATOM"] = coordinates_df
    coordinates.to_pdb(path=pdb_name_out, records=None, gz=False, append_newline=True)
    print(f"Wrote {pdb_name_out}")


if __name__ == "__main__":
    ARGS = get_cli_arguments()

    # Read FASTA file and get chromosome length.
    CHROMOSOME_LENGTH = extract_chromosome_length(ARGS.fasta)

    # Assign chromosome number.
    assign_chromosome_number(ARGS.pdb, CHROMOSOME_LENGTH, ARGS.resolution, ARGS.output)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
import argparse

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np


def get_cli_arguments():
    """Parse command line.

    Returns
    -------
    argparse.Namespace
        Object containing arguments
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--contacts",
        action="store",
        type=str,
        help="Input contact file",
        required=True,
    )
    parser.add_argument(
        "--map",
        action="store",
        type=str,
        help="Output contact map image",
        required=True,
    )
    return parser.parse_args()


def create_contact_map(contacts_file, map_file):
    """Create contact map image.

    Parameters
    ----------
    contacts_file : str
        Contact file
    map_file : str
        Contact map image
    """
    contacts = np.loadtxt(contacts_file)
    plt.imshow(np.log2(contacts+1), cmap="hot")
    plt.colorbar()
    plt.savefig(map_file, dpi=300)


if __name__ == "__main__":
    ARGS = get_cli_arguments()
    create_contact_map(ARGS.contacts, ARGS.map)
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
import argparse

from Bio import SeqIO

def get_cli_arguments():
    """Command line argument parser.

    Returns
    -------
    argparse.Namespace
        Object containing arguments
    """
    parser = argparse.ArgumentParser(add_help=False)
    required = parser.add_argument_group("required arguments")
    optional = parser.add_argument_group("optional arguments")
    required.add_argument(
        "--fasta",
        action="store",
        type=str,
        help="Fasta file containing the sequence of the genome.",
        required=True,
    )
    required.add_argument(
        "--output",
        action="store",
        type=str,
        help="Output text file containing chromosome sizes.",
        required=True,
    )
    # Add help.
    optional.add_argument(
        "-h",
        "--help",
        action="help",
        default=argparse.SUPPRESS,
        help="Show this help message and exit.",
    )
    return parser.parse_args()


if __name__ == "__main__":
    # Parse command line arguments.
    ARGS = get_cli_arguments()

    print(f"Opening {ARGS.fasta}")
    with open(ARGS.fasta, "r") as fasta_file, open(ARGS.output, "w") as size_file:
        for record in SeqIO.parse(fasta_file, "fasta"):
            print(f"{record.id}: {len(record.seq)} bases")
            size_file.write(f"{record.id}\t{len(record.seq)}\n")
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
import argparse
import math
import pandas as pd
from pathlib import Path

from Bio import SeqIO
from biopandas.pdb import PandasPdb


def is_file(parser, file_path):
    """Check file exists.

    Parameters
    ----------
    parser : argparse.ArgumentParser
        Command line argument parser
    file_path : str
        File path
    Returns
    -------
    str
        File path    
    """
    if not Path(file_path).is_file():
        parser.error(f"The file {file_path} does not exist")
    else:
        return file_path


def get_cli_arguments():
    """Command line argument parser.

    Returns
    -------
    argparse.Namespace
        Object containing arguments
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--pdb",
        action="store",
        type=lambda name: is_file(parser, name),
        help="PDB file containing the 3D structure of the genome",
        required=True,
    )
    parser.add_argument(
        "--fasta",
        action="store",
        type=lambda name: is_file(parser, name),
        help="Fasta file containing the sequence of the genome",
        required=True,
    )
    parser.add_argument(
        "--resolution", action="store", type=int, help="HiC resolution", required=True,
    )
    parser.add_argument(
        "--output",
        action="store",
        type=str,
        help="Output g3d file containing the annotated 3D structure of the genome",
        required=True,
    )
    return parser.parse_args()


def extract_chromosome_length(fasta_name):
    """Extract chromosome length from a FASTA file.

    Parameters
    ----------
    fasta_name : str
        Name of Fasta file containing the sequence of the genome

    Returns
    -------
    list
        List of chromosome lengthes
    """
    chromosome_length_lst = []
    with open(fasta_name, "r") as fasta_file:
        print(f"Reading {fasta_name}")
        for record in SeqIO.parse(fasta_file, "fasta"):
            length = len(record.seq)
            print(f"Found chromosome {record.id} with {length} bases")
            chromosome_length_lst.append(length)
    return chromosome_length_lst


def convert_to_g3d(pdb_name_in, chromosome_length, HiC_resolution, g3d_name_out):
    """Assign chromosome numbers to the whole genome 3D structure.

    Note:
    - The PDB file produced by Pastis is not readable by Biopython
    because the residue number column is missing.
    - We use instead the biopandas library. http://rasbt.github.io/biopandas/

    Parameters
    ----------
    pdb_name_in : str
        PDB file containing the 3D structure of the genome
    chromosome_length : list
        List with chromosome lengths
    HiC_resolution : int
        HiC resolution
    g3d_name_out : str
        Output g3d file containing the annotated 3D structure of the genome
    """
    pdb = PandasPdb().read_pdb(pdb_name_in)
    atoms = pdb.df["ATOM"]
    print(f"Number of beads read from structure: {pdb.df['ATOM'].shape[0]}")

    # Deduce chromosomal position of each atoms
    bp_coordinates = list()
    for i in range(len(chromosome_length)):
        bp_coordinates_chrom_x = list(range(0, chromosome_length[i], HiC_resolution))
        bp_coordinates = bp_coordinates + bp_coordinates_chrom_x

    # Get the index of the beads without the ones already removed from the pdb file (by interpolate_missing_coordinates.py )
    atom_number = list(atoms["atom_number"]-1)

    # Delete the bp coordinates values corresponding to removed beads
    bp_coordinates = pd.Series(bp_coordinates)
    bp_coordinates = bp_coordinates.iloc[atom_number]
    bp_coordinates.reset_index(drop=True, inplace=True)

    # Extract chromosomes and 3d coordinates information
    g3d_data = { 'chrom': atoms["residue_number"],
                'locus':bp_coordinates,
                'X3D_x': atoms["x_coord"],
                'X3D_y': atoms["y_coord"],
                'X3D_z': atoms["z_coord"] }
    result = pd.DataFrame(g3d_data)

    #save as g3d file
    result.to_csv(g3d_name_out, sep="\t", index=False)

if __name__ == "__main__":
    ARGS = get_cli_arguments()

    # Read Fasta file and get chromosome length
    CHROMOSOME_LENGTH = extract_chromosome_length(ARGS.fasta)

    # Assign chromosome number
    convert_to_g3d(ARGS.pdb, CHROMOSOME_LENGTH, ARGS.resolution, ARGS.output)
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
import argparse
from pathlib import Path

import jinja2


def is_file(parser, file_path):
    """Check file exists.

    Parameters
    ----------
    parser : argparse.ArgumentParser
        Command line argument parser.
    file_path : str
        File path
    Returns
    -------
    str
        File path.
    """
    if not Path(file_path).is_file():
        parser.error(f"The file {file_path} does not exist")
    else:
        return file_path


def get_cli_arguments():
    """Command line argument parser.

    Returns
    -------
    argparse.Namespace
        Object containing arguments.
    """
    parser = argparse.ArgumentParser(add_help=False)
    required = parser.add_argument_group("required arguments")
    optional = parser.add_argument_group("optional arguments")
    required.add_argument(
        "--template",
        action="store",
        type=lambda name: is_file(parser, name),
        help="Template file for HiC-Pro configuration.",
        required=True,
    )
    required.add_argument(
        "--chromosome-sizes",
        action="store",
        type=lambda name: is_file(parser, name),
        help="File with chromosome sizes.",
        required=True,
    )
    required.add_argument(
        "--genome-fragment",
        action="store",
        type=lambda name: is_file(parser, name),
        help="File with genome restriction sites.",
        required=True,
    )
    required.add_argument(
        "--genome-index-path",
        action="store",
        type=str,
        help="Path to bowtie2 indexes.",
        required=True,
    )
    required.add_argument(
        "--resolutions",
        action="store",
        type=int,
        nargs="+",
        help="HiC resolutions.",
        required=True,
    )
    required.add_argument(
        "--output",
        action="store",
        type=str,
        help="HiC-Pro configuration file.",
        required=True,
    )
    # Add help.
    optional.add_argument(
        "-h",
        "--help",
        action="help",
        default=argparse.SUPPRESS,
        help="Show this help message and exit.",
    )
    return parser.parse_args()


if __name__ == "__main__":
    ARGS = get_cli_arguments()

    with open(ARGS.template, "r") as template_file, \
        open(ARGS.output, "w") as out_file:
        template = jinja2.Template(template_file.read())
        out_file.write(
            template.render(
                genome_index_path=str(Path(ARGS.genome_index_path).resolve()) + "/",
                chromosome_sizes=Path(ARGS.chromosome_sizes).resolve(),
                genome_fragment=Path(ARGS.genome_fragment).resolve(),
                resolutions=" ".join([str(res) for res in ARGS.resolutions])
            )
        )
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
import argparse
import numpy as np
import pandas as pd

from biopandas.pdb import PandasPdb

# Allow to print wide dataframes.
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 200)


def get_cli_arguments():
    """Command line argument parser.

    Returns
    -------
    argparse.Namespace
        Object containing arguments.
    """
    parser = argparse.ArgumentParser(add_help=False)
    required = parser.add_argument_group("required arguments")
    optional = parser.add_argument_group("optional arguments")
    required.add_argument(
        "--input-pdb",
        action="store",
        type=str,
        help="Input PDB file containing the 3D structure of the genome.",
        required=True,
    )
    required.add_argument(
        "--output-pdb",
        action="store",
        type=str,
        help="Output PDB file containing the completed 3D structure of the genome.",
        required=True,
    )
    optional.add_argument(
        "--threshold",
        action="store",
        type=float,
        help="Distance threshold to detect flipped contigs. Default: 3",
        required=False,
        default=2.0,
    )
    # Add help.
    optional.add_argument(
        "-h",
        "--help",
        action="help",
        default=argparse.SUPPRESS,
        help="Show this help message and exit.",
    )
    return parser.parse_args()


def delete_outlier_beads(pdb_name_in, pdb_name_out, threshold):
    """Delete outlier coordinates in a PDB file containing a 3D genome structure.

    Parameters
    ----------
    pdb_name_in : str
        PDB file containing the 3D structure of the genome
    pdb_name_out : str
        Output PDB file containing the annotated 3D structure of the genome
    """
    pdb = PandasPdb().read_pdb(pdb_name_in)
    atoms = pdb.df["ATOM"]
    print(f"Found {atoms.shape[0]} beads in {pdb_name_in}")

    coord_columns = ["x_coord", "y_coord", "z_coord"]
    ATOMS = pd.DataFrame()

    # delete outlier coordinates chromosome by chromosome.
    for residue_number in atoms["residue_number"].unique():
        # Select beads of one chromosome
        chromosome_df = atoms.query(f"residue_number == {residue_number}").reset_index(
            drop=True
        )

        # Compute Euclidean distances between bead n and bead n+1
        chromosome_df["distance"] = np.linalg.norm(
            chromosome_df[coord_columns].to_numpy()
            - chromosome_df[coord_columns].shift(-1).to_numpy(),
            axis=1,
        )
        # The last distance is Nan because there is no bead n+1 for the last bead.
        # We replace it by the distance between bead n and bead n-1.
        column_index = chromosome_df.columns.get_loc("distance")
        chromosome_df.iat[-1, column_index] = chromosome_df.iat[-2, column_index]

        # Compute median distance with possible Nan values
        median_distance = chromosome_df["distance"].median(skipna=True)
        print(f"Median distance between beads: {median_distance:.2f}")

        # Select outlier beads.
        # i.e. the beads with the greater ditances.
        # Output beads coordinates with distances
        outlier_beads = chromosome_df["distance"] >= threshold * median_distance

        if not outlier_beads.sum():
            print(f"Chromosome {residue_number} has no outlier beads")
            chromosome_df = chromosome_df.drop("distance", axis=1)
            # Save chromosome into the full stucture.
            ATOMS = pd.concat([ATOMS, chromosome_df])

        else:
            # Print number of outlier beads that will be later removed.
            print(
                f"Chromosome {residue_number} has {outlier_beads.sum()} outlier beads"
            )
            deleted_atoms = chromosome_df[
                chromosome_df["distance"] >= threshold * median_distance
            ]
            print(
                deleted_atoms[
                    [
                        "record_name",
                        "atom_number",
                        "residue_name",
                        "residue_number",
                        "x_coord",
                        "y_coord",
                        "z_coord",
                        "distance",
                    ]
                ]
            )
            kept_atoms = chromosome_df[
                chromosome_df["distance"] < threshold * median_distance
            ]
            kept_atoms = kept_atoms.drop("distance", axis=1)
            # Save chromosome into the full stucture.
            ATOMS = pd.concat([ATOMS, kept_atoms])
    ATOMS.reset_index(inplace=True, drop=True)
    ATOMS["line_idx"] = ATOMS.index
    print(f"Found {ATOMS.shape[0]} beads after removal of outlier beads.")
    pdb.df["ATOM"] = ATOMS
    pdb.to_pdb(pdb_name_out)
    print(f"Wrote {pdb_name_out}")


if __name__ == "__main__":
    ARGS = get_cli_arguments()

    delete_outlier_beads(ARGS.input_pdb, ARGS.output_pdb, ARGS.threshold)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
from watermark import watermark


def list_versions():
    """List package versions and describe environment."""
    print(
        watermark(
            machine=True,
            python=True,
            packages="numpy,pandas,scipy,biopandas,matplotlib,Bio,jinja2,sklearn,iced,pastis",
            watermark=True,
            conda=True
        )
    )


if __name__ == "__main__":
    list_versions()
    print("-" * 40)
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
import argparse
import numpy as np
import pandas as pd
from pathlib import Path

import iced
from iced import io
from pastis.optimization import mds
from pastis import dispersion
from pastis.optimization import negative_binomial_structure
from pastis.io.write_struct.pdb import writePDB
from scipy import sparse

def is_file(parser, file_path):
    """Check file exists.

    Parameters
    ----------
    parser : argparse.ArgumentParser
        Command line argument parser
    file_path : str
        File path
    Returns
    -------
    str
        File path    
    """
    if not Path(file_path).is_file():
        parser.error(f"The file {file_path} does not exist")
    else:
        return file_path


def get_cli_arguments():
    """Command line argument parser.

    Returns
    -------
    argparse.Namespace
        Object containing arguments
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--matrix",
        action="store",
        type=lambda name: is_file(parser, name),
        help="Name of Matrix file",
        required=True,
    )
    parser.add_argument(
        "--bed",
        action="store",
        type=lambda name: is_file(parser, name),
        help="Name of Bed file",
        required=True,
    )
    parser.add_argument(
        "--output",
        action="store",
        type=str,
        help="Name of output PDB file containing the 3D structure of the genome",
        required=True,
    )
    return parser.parse_args()


def run_pastis_nb(matrix_filename, bed_filename, output_filename, seed=0, percentage_to_filter=0.02):
    """Build 3D structure of genome with Pastis NB algorithme.

    Parameters
    ----------
    matrix_filename : str
        Name of Matrix file
    bed_filename : str
        Name of Bed file
    output_filename : str
        Name of output PDB file with 3D structure of the genome
    seed : int (default: 0)
        Random generator seed
    percentage_to_filter : float (default: 0.02)
        Percentage to filter out data 
    """
    ###############################################################################
    #Load chromosome lengths and count data
    lengths = io.load_lengths(bed_filename)

    ###############################################################################
    # The sparse matrix generated by HiC-Pro seems to randomly miss a row.
    # The dense matrix generated by HiC-Pro_3.1.0/bin/utils/sparseToDense.py is always complete.
    # The 3D model is thus inferred from it.
    #counts = io.load_counts(matrix_filename, base=1)
    counts_maps = pd.read_csv(matrix_filename, sep='\t', header=None)
    counts_maps = counts_maps.values
    # to solve : File "iced/_filter_.pyx", line 15, in iced._filter_._filter_csr ValueError: Buffer dtype mismatch, expected 'DOUBLE' but got 'long'
    counts_maps = counts_maps.astype("double")
    #counts_maps = np.load(matrix_filename)
    counts_maps[np.isnan(counts_maps)] = 0
    counts = sparse.coo_matrix(counts_maps)
    counts.setdiag(0)
    counts.eliminate_zeros()
    counts = counts.tocoo()

    ###############################################################################
    # Normalize the data
    counts = iced.filter.filter_low_counts(
        counts, percentage=percentage_to_filter,
        sparsity=False)
    normed, bias = iced.normalization.ICE_normalization(
        counts, max_iter=300,
        output_bias=True)
    # Save the normalization data and biases just for other purposes
    # io.write_counts("./nb_normalized.matrix", normed)
    # np.savetxt("./nb.biases", bias)
    random_state = np.random.RandomState(seed)

    ###############################################################################
    # First estimate MDS for initialization
    X = mds.estimate_X(normed, random_state=random_state)
    ###############################################################################
    # Estimate constant dispersion parameters
    dispersion_ = dispersion.ExponentialDispersion(degree=0)
    _, mean, variance, weights = dispersion.compute_mean_variance(counts, lengths, bias=bias)
    dispersion_.fit(mean, variance, sample_weights=(weights**0.5))
    ###############################################################################
    # Now perform NB 3D inference.
    alpha = -3
    beta = 1
    counts = counts.tocoo()
    print("Estimating structure")
    X = negative_binomial_structure.estimate_X(
        counts, alpha, beta, bias=bias,
        lengths=lengths,
        dispersion=dispersion_,
        use_zero_entries=True,
        ini=X.flatten())
    ###############################################################################
    # Remove beads that were not infered
    mask = (np.array(counts.sum(axis=0)).flatten() + np.array(counts.sum(axis=1)).flatten() == 0)
    mask = mask.flatten()
    X_ = X.copy()
    X_[mask] = np.nan
    ###############################################################################
    np.savetxt(output_filename + ".txt", X_)
    print("Results written to", output_filename)
    writePDB(X_, output_filename)


if __name__ == "__main__":
    ARGS = get_cli_arguments()
    run_pastis_nb(
        matrix_filename=ARGS.matrix,
        bed_filename=ARGS.bed,
        output_filename=ARGS.output,
        seed=0, 
        percentage_to_filter=0.02
    )
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
import argparse
import math
from operator import invert
import shutil
import sys

from Bio import SeqIO
from Bio.Seq import Seq
from biopandas.pdb import PandasPdb
import numpy as np
import pandas as pd


def get_cli_arguments():
    """Command line argument parser.

    Returns
    -------
    argparse.Namespace
        Object containing arguments
    """
    parser = argparse.ArgumentParser(add_help=False)
    required = parser.add_argument_group("required arguments")
    optional = parser.add_argument_group("optional arguments")
    required.add_argument(
        "--pdb",
        action="store",
        type=str,
        help="PDB file containing the 3D structure of the genome.",
        required=True,
    )
    required.add_argument(
        "--fasta",
        action="store",
        type=str,
        help="Fasta file containing the sequence of the genome.",
        required=True,
    )
    required.add_argument(
        "--resolution",
        action="store",
        type=int,
        help="HiC resolution from which the structure has been generated.",
        required=True,
    )
    required.add_argument(
        "--output-pdb",
        action="store",
        type=str,
        help="Output PDB file containing the fixed 3D structure of the genome.",
        required=True,
    )
    required.add_argument(
        "--output-fasta",
        action="store",
        type=str,
        help="Output FASTA file containing the fixed sequence of the genome.",
        required=True,
    )
    optional.add_argument(
        "--threshold",
        action="store",
        type=float,
        help="Distance threshold to detect flipped contigs. Default: 3",
        required=False,
        default=3.0,
    )
    optional.add_argument(
        "--debug",
        action="store_true",
        help="Debug flag. Output one TSV file per chromosome with bead distances.",
        required=False,
        default=False,
    )
    # Trigger to run the script or not.
    optional.add_argument(
        "--run",
        action="store",
        choices=("True", "False"),
        help="Run the script. Default: False.",
        required=False,
        default="False",
    )
    # Add help.
    optional.add_argument(
        "-h",
        "--help",
        action="help",
        default=argparse.SUPPRESS,
        help="Show this help message and exit.",
    )
    return parser.parse_args()


def extract_chromosome_name_length(fasta_name):
    """Extract chromosome name and length from a FASTA file.

    Parameters
    ----------
    fasta_name : str
        Name of the FASTA file containing the sequence of the genome.

    Returns
    -------
    tuple
        List of chromosome names.
        List of chromosome lengthes.
    """
    chromosome_name_lst = []
    chromosome_length_lst = []
    with open(fasta_name, "r") as fasta_file:
        print(f"Reading genome sequence in {fasta_name}")
        for record in SeqIO.parse(fasta_file, "fasta"):
            name = record.id
            length = len(record.seq)
            print(f"Found chromosome {name} with {length} bases")
            chromosome_name_lst.append(name)
            chromosome_length_lst.append(length)
    return chromosome_name_lst, chromosome_length_lst


def find_inverted_contigs(
    pdb_name_in, chromosome_lengths, HiC_resolution, threshold
):
    """Find inverted contigs.

    Inverted contigs are detected based on the eucledian distance
    between adjacent beads in the 3D structure of the genome.

    Parameters
    ----------
    pdb_name_in : str
        PDB file containing the 3D structure of the genome.
    chromosome_lengths : list
        List with chromosome lengths.
    HiC_resolution : int
        HiC resolution.
    threshold : float
        Threshold to detect flipped contigs.

    Returns
    -------
    inverted_contigs : dict
        Dictionnary with inverted contigs.
    """
    pdb_structure = PandasPdb().read_pdb(pdb_name_in)
    structure_df = pdb_structure.df["ATOM"]
    print(f"Number of beads read from structure: {structure_df.shape[0]}")

    coord_columns = ["x_coord", "y_coord", "z_coord"]
    missing_beads_number = structure_df[coord_columns].isna().sum().sum() // 3
    print(
        f"Found {missing_beads_number} missing beads in structure"
    )

    beads_per_chromosome = [
        math.ceil(length / HiC_resolution) for length in chromosome_lengths
    ]
    print(
        f"Number of expected beads deduced from sequence and HiC resolution: {sum(beads_per_chromosome)}"
    )

    if structure_df.shape[0] != sum(beads_per_chromosome):
        sys.exit(
            f"Cannot process structure {pdb_name_in} because it contains "
            f"{structure_df.shape[0]} beads "
            f"instead of {sum(beads_per_chromosome)}"
        )

    inverted_contigs = {}

    for chrom_num in structure_df["residue_number"].unique():
        print(f"\nLooking for inverted contigs into chromosome {chrom_num}")

        # Select beads of one chromosome
        chromosome_df = structure_df.query(
            f"residue_number == {chrom_num}"
        ).reset_index(drop=True)

        # Compute Euclidean distances between bead n and bead n+1
        chromosome_df["distance"] = np.linalg.norm(
            chromosome_df[coord_columns].to_numpy() - chromosome_df[coord_columns].shift(-1).to_numpy(),
            axis=1
        )
        # Compute median distance with possible Nan values
        median_distance = chromosome_df["distance"].median(skipna=True)
        print(f"Median distance between beads: {median_distance:.2f}")

        # Select extremities of inverted contigs
        # i.e. beads with distance above a given threshold.
        # Output beads coordinates with distances
        if ARGS.debug:
            filename = f"chr_{chrom_num}.tsv"
            target_columns = [
                "record_name", "atom_number", "atom_name",
                "residue_name", "chain_id", "residue_number",
                "x_coord", "y_coord", "z_coord", "line_idx",
                "distance",
            ]
            print(f"DEBUG: writing {filename} with distances.")
            chromosome_df[target_columns].to_csv(
                filename, sep="\t", index=False, na_rep="nan"
            )

        beads_selection = (
            chromosome_df["distance"] > threshold * median_distance
        )
        inversion_limits = chromosome_df.loc[
            beads_selection, "atom_number"
        ].values
        if len(inversion_limits) % 2 != 0:
            print("WARNING: odd number of inversion limits found")
            print(
                "WARNING: this might lead to a wrong detection of inverted contigs"
            )
            print(inversion_limits)
        if len(inversion_limits) != 0:
            for limit_1, limit_2 in zip(
                inversion_limits[0::2], inversion_limits[1::2]
            ):
                print(
                    f"Chromosome {chrom_num}: found inverted contig between bead {limit_1+1} and bead {limit_2}"
                )
                if chrom_num in inverted_contigs:
                    inverted_contigs[chrom_num].append((limit_1 + 1, limit_2))
                else:
                    inverted_contigs[chrom_num] = [(limit_1 + 1, limit_2)]
        else:
            inverted_contigs[chrom_num] = []
    return inverted_contigs


def flip_inverted_contigs_in_structure(
    inverted_contigs, pdb_name_in, pdb_name_out
):
    """Flip inverted contigs in the 3D structure of the genome.

    Parameters
    ----------
    inverted_contigs : dict
        Dictionnary with inverted contigs
    pdb_name_in : str
        PDB file containing the 3D structure of the genome
    pdb_name_out : str
        Output PDB file containing the 3D structure of the genome
    """
    pdb_structure = PandasPdb().read_pdb(pdb_name_in)
    coordinates = pdb_structure.df["ATOM"]
    if sum(map(len, inverted_contigs.values()))== 0:
        pdb_structure.to_pdb(
        path=pdb_name_out, records=None, gz=False, append_newline=True
    )
        print("\nNothing to fix. Structure is fine.")
        return
    print("\nFlipping contigs.")
    for chrom_num in inverted_contigs:
        for contig in inverted_contigs[chrom_num]:
            contig_start, contig_end = contig
            print(
                f"Structure of chromosome {chrom_num}: "
                f"flip contig between beads {contig_start} "
                f"and {contig_end}"
            )
            contig_start_index = coordinates[
                (coordinates["residue_number"] == chrom_num)
                & (coordinates["atom_number"] == contig_start)
            ].index[0]
            contig_end_index = coordinates[
                (coordinates["residue_number"] == chrom_num)
                & (coordinates["atom_number"] == contig_end)
            ].index[0]
            contig_before_df = coordinates.loc[: contig_start_index - 1, :]
            contig_df = coordinates.loc[contig_start_index:contig_end_index, :]
            contig_after_df = coordinates.loc[contig_end_index + 1 :, :]
            # Flip contig.
            contig_df = contig_df[::-1]
            # Assemble genome structure.
            coordinates = pd.concat(
                [contig_before_df, contig_df, contig_after_df]
            )

    coordinates = coordinates.reset_index(drop=True)
    # The 'line_idx' column keeps the real order of atoms in the PDB file.
    coordinates["line_idx"] = coordinates.index
    pdb_structure.df["ATOM"] = coordinates
    pdb_structure.to_pdb(
        path=pdb_name_out, records=None, gz=False, append_newline=True
    )


def flip_inverted_contigs_in_sequence(
    inverted_contigs,
    chromosome_names,
    fasta_name_in,
    HiC_resolution,
    fasta_name_out,
):
    """Flip inverted contigs in the genome 3D structure and sequence.

    Parameters
    ----------
    inverted_contigs : dict
        Dictionnary with inverted contigs.
    chromosome_names : list
        List with chromosome names.
    fasta_name_in : str
        Name of Fasta file containing the sequence of the genome.
    HiC_resolution : int
        HiC resolution.
    fasta_name_out : str
        Output FASTA file containing the fixed sequence (at the 3D structure resolution!).
    """
    # Flip inverted contigs in the genome sequence.
    genome_fasta = SeqIO.to_dict(SeqIO.parse(fasta_name_in, "fasta"))

    if bool([a for a in inverted_contigs.values() if a == []]):
        with open(fasta_name_out, "w") as fasta_file:
            SeqIO.write(genome_fasta.values(), fasta_file, "fasta")
        print("Nothing to fix. Sequence is fine.")
        return

    for chrom_num in inverted_contigs:
        chrom_name = chromosome_names[chrom_num - 1]
        chrom_sequence = str(genome_fasta[chrom_name].seq)
        for contig in inverted_contigs[chrom_num]:
            contig_start = contig[0] * HiC_resolution
            contig_end = contig[1] * HiC_resolution
            print(
                f"Sequence of chromosome {chrom_num}: "
                f"flip inverted contig between base {contig_start} "
                f"and {contig_end}"
            )
            contig_sequence = chrom_sequence[contig_start : contig_end + 1]
            # Flip contig.
            contig_sequence = contig_sequence[::-1]
            # Reassemble chromosome sequence.
            chrom_sequence = (
                chrom_sequence[:contig_start]
                + contig_sequence
                + chrom_sequence[contig_end + 1 :]
            )
        genome_fasta[chrom_name].seq = Seq(chrom_sequence)

    # Write genome sequence.
    with open(fasta_name_out, "w") as fasta_file:
        SeqIO.write(genome_fasta.values(), fasta_file, "fasta")


if __name__ == "__main__":
    # Parse command line arguments.
    ARGS = get_cli_arguments()

    if ARGS.run != "True":
        print("Do not verify inverted contigs.")
        print(f"Copying {ARGS.pdb} to {ARGS.output_pdb}.")
        shutil.copyfile(ARGS.pdb, ARGS.output_pdb)
        print(f"Copying {ARGS.fasta} to {ARGS.output_fasta}.")
        shutil.copyfile(ARGS.fasta, ARGS.output_fasta)
        sys.exit()

    # Read Fasta file and extract chromosome names and lengths.
    CHROMOSOME_NAMES, CHROMOSOME_LENGTHS = extract_chromosome_name_length(
        ARGS.fasta
    )

    # Find inverted contigs.
    INVERTED_CONTIGS = find_inverted_contigs(
        ARGS.pdb, CHROMOSOME_LENGTHS, ARGS.resolution, ARGS.threshold
    )
    # Flip inverted contigs in the genome 3D structure and sequence.
    flip_inverted_contigs_in_structure(
        INVERTED_CONTIGS, ARGS.pdb, ARGS.output_pdb
    )
    flip_inverted_contigs_in_sequence(
        INVERTED_CONTIGS,
        CHROMOSOME_NAMES,
        ARGS.fasta,
        ARGS.resolution,
        ARGS.output_fasta,
    )
31
32
33
34
shell:
    "python ../scripts/describe_conda_env.py >{output} &&"
    "fasterq-dump --version >>{output} &&"
    "pigz --version >>{output} "
46
47
48
49
50
51
shell:
    "/usr/local/bin/HiC-Pro_3.1.0/bin/utils/digest_genome.py --help >{output} && "
    "printf '\n\n\n' >>{output} && "
    "bowtie2-build --version >>{output} &&"
    "printf '\n\n\n' >>{output} && "
    "LC_ALL=C; HiC-Pro --version >>{output} "
74
75
76
shell: 
    "fasterq-dump {wildcards.sra_id} --threads {threads} --log-level {params.loglevel} "
    "{params.progress} --outdir {params.outdir} >{log} 2>&1 "
87
88
shell: 
    "mv {input} {output}"
122
123
124
125
126
shell:
    "python ../scripts/calculate_chromosome_sizes.py "
    "--fasta {input} "
    "--output {output} "
    ">{log} 2>&1 "
SnakeMake From line 122 of master/Snakefile
142
143
144
145
146
147
shell:
    "/usr/local/bin/HiC-Pro_3.1.0/bin/utils/digest_genome.py "
    "-r {config[hicpro_restriction_sites]} "
    "-o {output} "
    "{input} "
    ">{log} 2>&1 "
SnakeMake From line 142 of master/Snakefile
163
164
shell:
    "bowtie2-build {input} HiC-Pro/bowtie2_index/genome >{log} 2>&1"
184
185
186
187
188
189
190
191
192
shell:
    "python ../scripts/create_HiC_Pro_config.py "
    "--template {input.template} "
    "--chromosome-sizes {input.chromosome_sizes} "
    "--genome-fragment {input.genome_fragment} "
    "--genome-index-path {params.genome_index_path} "
    "--resolutions {params.resolutions} "
    "--output {output} "
    ">{log} 2>&1"
SnakeMake From line 184 of master/Snakefile
212
213
214
215
216
217
shell:
    "LC_ALL=C; echo 'y' | HiC-Pro "
    "-i fastq_files "
    "-o HiC-Pro/output "
    "-c {input.config} "
    ">{log} 2>&1 "
233
234
shell:
    "cp {input} {output}"
SnakeMake From line 233 of master/Snakefile
257
258
259
260
261
262
263
264
265
shell:
    "HiC-Pro "
    "-i HiC-Pro/merged_samples "
    "-o HiC-Pro/merged_output "
    "-c {input.config} "
    "-s merge_persample "
    "-s build_contact_maps "
    "-s ice_norm "
    ">{log} 2>&1 "
280
281
282
283
284
285
shell:
    "/usr/local/bin/HiC-Pro_3.1.0/bin/utils/sparseToDense.py "
    "--bins {input.bed} "
    "--output {output} "
    "{input.matrix} "
    ">{log} 2>&1 "
SnakeMake From line 280 of master/Snakefile
300
301
302
303
304
305
shell:
    "/usr/local/bin/HiC-Pro_3.1.0/bin/utils/sparseToDense.py "
    "--bins {input.bed} "
    "--output {output} "
    "{input.matrix} "
    ">{log} 2>&1 "
SnakeMake From line 300 of master/Snakefile
319
320
321
322
323
shell:
    "python ../scripts/build_contact_maps.py "
    "--contacts {input} "
    "--map {output} "
    ">{log} 2>&1 "
SnakeMake From line 319 of master/Snakefile
338
339
340
341
342
343
shell:
    "python ../scripts/infer_structures_nb.py "
    "--matrix {input.matrix} "
    "--bed {input.bed} "
    "--output {output} "
    ">{log} 2>&1 "
SnakeMake From line 338 of master/Snakefile
358
359
360
361
362
363
364
shell:
    "python ../scripts/assign_chromosomes.py "
    "--pdb {input.structure} "
    "--fasta {input.sequence} "
    "--resolution {wildcards.resolution} "
    "--output {output} "
    ">{log} 2>&1 "
SnakeMake From line 358 of master/Snakefile
382
383
384
385
386
387
388
389
390
shell:
    "python ../scripts/verify_inverted_contigs.py "
    "--run {params.verify} "
    "--pdb {input.structure} "
    "--fasta {input.sequence} "
    "--resolution {wildcards.resolution} "
    "--output-pdb {output.structure} "
    "--output-fasta {output.sequence} "
    ">{log} 2>&1 "
SnakeMake From line 382 of master/Snakefile
404
405
406
407
408
shell:
    "python ../scripts/add_missing_beads.py "
    "--input-pdb {input} "
    "--output-pdb {output} "
    ">{log} 2>&1 "
SnakeMake From line 404 of master/Snakefile
422
423
424
425
426
shell:
    "python ../scripts/delete_outlier_beads.py "
    "--input-pdb {input} "
    "--output-pdb {output} "
    ">{log} 2>&1 "
SnakeMake From line 422 of master/Snakefile
441
442
443
444
445
446
447
shell:
    "python ../scripts/convert_to_g3d.py "
    "--pdb {input.structure} "
    "--fasta {input.sequence} "
    "--resolution {wildcards.resolution} "
    "--output {output} "
    ">{log} 2>&1 "
SnakeMake From line 441 of master/Snakefile
ShowHide 23 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/data-fun/3d-genome-builder
Name: 3d-genome-builder
Version: v1.0
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: BSD 3-Clause "New" or "Revised" License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...