SigMa is a probabilistic model for the sequential dependencies of mutation signatures


This repository contains the source code for SigMa (Signature Markov model) and related experiments. SigMa is a probabilistic model of the sequential dependencies among mutation signatures.

Below, we provide an overview of the SigMa model from the corresponding paper: "The input data consists of (A) a set of predefined signatures that form an emission matrix E (here, for simplicity, represented over six mutation types), and (B) a sequence of mutation categories from a single sample and a distance threshold separating sky and cloud mutation segments. (C) The SigMa model has two components: (top) a multinomial mixture model (MMM) for isolated sky mutations and (bottom) an extension of a Hidden Markov Model (HMM) capturing sequential dependencies between close-by cloud mutations; all model parameters are learned from the input data in an unsupervised manner. (D) SigMa finds the most likely sequence of signatures that explains the observed mutations in sky and clouds."
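
As a concrete illustration of the sky/cloud segmentation, here is a minimal sketch, not part of the repository, of how one chromosome's mutation sequence might be split into isolated sky mutations and cloud segments by the distance threshold; the repository's actual implementation is get_split_sequences_by_threshold in the data_utils snippet under Code Snippets below.

import numpy as np

def split_sky_and_clouds(categories, dists, threshold):
    """Split one chromosome's mutations into sky (isolated) and cloud segments.
    categories: per-mutation category indices; dists: distance (bp) to the
    previous mutation; threshold: maximal intra-cloud distance in bp."""
    categories = np.asarray(categories)
    dists = np.asarray(dists)
    # A new segment starts wherever the gap to the previous mutation is at
    # least the threshold (chromosome starts carry a huge sentinel distance).
    starts = np.where(dists >= threshold)[0]
    segments = [s for s in np.split(categories, starts) if len(s) > 0]
    # Length-1 segments are isolated sky mutations; longer ones are clouds.
    sky = [s[0] for s in segments if len(s) == 1]
    clouds = [s for s in segments if len(s) > 1]
    return sky, clouds

# Example: with a 2,000 bp threshold the middle three mutations form a cloud.
sky, clouds = split_sky_and_clouds(
    categories=[5, 17, 17, 42, 5],
    dists=[1e100, 50000, 300, 800, 70000],
    threshold=2000)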

Setup

Dependencies

SigMa is written in Python 3. We recommend using Conda to manage dependencies, which you can do directly using the provided environment.yml file:

conda env create -f environment.yml
source activate sigma-env

On Windows, replace the last command with:

activate sigma-env

Usage

We use Snakemake to manage the workflow of running SigMa on hundreds of tumor samples.

Reproducing the experiments from the SigMa paper

First, download and preprocess the ICGC breast cancer whole genomes and the COSMIC mutation signatures. To do so, run:

cd data && snakemake all

Second, run SigMa and a multinomial mixture model (MMM) on each sample, and perform leave-one-out cross-validation (LOOCV):

snakemake all

This will create an output/ directory with two subdirectories: models/ and loocv/. models/ contains a SigMa model trained on each sample; loocv/ contains the results of LOOCV with SigMa using different cloud thresholds.
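
After both steps complete, the output tree should look roughly as follows (judging from the Snakefile snippets shown under Code Snippets below; thresholds and sample names are illustrative):

output/
    models/
        sigma<threshold>/           # one trained SigMa model per sample
            sigma-<sample>.json
    loocv/
        mmm/                        # LOOCV results for the MMM baseline
            mmm-<sample>.json
        sigma<threshold>/           # LOOCV results per cloud threshold
            sigma-<sample>.json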

Configuration

To run the entire SigMa workflow on different mutation signatures or data, see the Snakefile for configuration options.

To train SigMa or MMM on individual mutation sequences, use the src/train_and_predict.py script. To get a list of command-line arguments, run:

python src/train_and_predict.py -h
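
For example, to train SigMa on a single sample with a 2,000 bp cloud threshold (the file paths and sample name here are illustrative; substitute your own processed mutations JSON and signatures file):

python src/train_and_predict.py \
    -mf data/mutations.json -sf data/signatures.tsv \
    -od output/models/sigma2000 -mn sigma -sn PD4120a \
    -ct 2000 -mi 500 -tol 1e-3 -rs 5733

Adding --cross-validation-mode switches from Viterbi decoding to leave-one-out cross-validation over chromosomes.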

Support

Please report bugs and feature requests in the Issues tab of this GitHub repository.

For further questions, please email Max Leiserson and Itay Sason directly.

References

Xiaoqing Huang*, Itay Sason*, Damian Wojtowicz*, Yoo-Ah Kim, Mark Leiserson^, Teresa M. Przytycka^, Roded Sharan^. Hidden Markov Models Lead to Higher Resolution Maps of Mutation Signature Activity in Cancer. Genome Medicine (2019). doi: 10.1186/s13073-019-0659-1.

* equal first author contribution; ^ equal senior author contribution

Code Snippets

shell:
    'python {TRAIN_AND_PREDICT_PY} -mf {input.mutations} -sf {input.signatures} '\
    '-od {TRAINED_MODEL_DIR}/sigma{wildcards.threshold} -mn {wildcards.model} '\
    '{params.active_signatures} -sn {wildcards.sample} -mi {params.max_iter} '\
    '-ct {wildcards.threshold} -rs {params.random_seed} -tol {params.tolerance}'
Snakemake From line 76 of master/Snakefile

shell:
    'python {TRAIN_AND_PREDICT_PY} -mf {input.mutations} -sf {input.signatures} '\
    '-od {LOOCV_DIR}/sigma{wildcards.threshold} -mn {SIGMA_NAME} -sn {wildcards.sample} '\
    '{params.active_signatures} -mi {params.max_iter} -ct {wildcards.threshold} '\
    '-rs {params.random_seed} -tol {params.tolerance} --cross-validation-mode'
Snakemake From line 99 of master/Snakefile

shell:
    'python {TRAIN_AND_PREDICT_PY} -mf {input.mutations} -sf {input.signatures} '\
    '-od {LOOCV_DIR}/mmm -mn {MMM_NAME} -sn {wildcards.sample} '\
    '{params.active_signatures} -mi {params.max_iter} -ct 0 '\
    '-rs {params.random_seed} -tol {params.tolerance} --cross-validation-mode'
Snakemake From line 116 of master/Snakefile

shell:
    'python {CREATE_FIGURE_PY} -ld {LOOCV_DIR} -of {output}'
Snakemake From line 128 of master/Snakefile

MMM_NAME = 'mmm'
SIGMA_NAME = 'sigma'
MODEL_NAMES = [MMM_NAME, SIGMA_NAME]

# Columns in mutation files
PATIENT = 'Patient'
CATEGORY = 'SBS_96'
CATEGORY_IDX = 'Category index'
MUT_DIST = 'Distance to Previous Mutation'
CHROMOSOME = 'Chromosome'
START_POS = 'Start Position'
Python From line 4 of src/constants.py
import sys, os, argparse, logging, pandas as pd
from data_utils import get_logger
from constants import PATIENT

# Parse command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument('-mf', '--mutations_file', type=str, required=True)
parser.add_argument('-sf', '--samples_file', type=str, required=True)
parser.add_argument('-of', '--output_file', type=str, required=True)
parser.add_argument('-v', '--verbosity', type=int, required=False, default=logging.INFO)
args = parser.parse_args(sys.argv[1:])

# Load the mutations file
mut_df = pd.read_csv(args.mutations_file, sep='\t', low_memory=False)
cols = mut_df.columns

# Load the sample name mapping
sample_df = pd.read_csv(args.samples_file, sep='\t')
sample_ids = [ s.split('a')[0] if s.endswith('a') or s.endswith('a2') else s.split('b')[0]
               for s in sample_df['submitted_sample_id'] ]
patient_map = dict(zip(sample_df['icgc_donor_id'], sample_ids))

# Map the patients and save
mut_df[PATIENT] = [ patient_map[p] for p in mut_df[PATIENT] ]
mut_df.to_csv(args.output_file, sep='\t', index=False)
import matplotlib
matplotlib.use('Agg')
import sys, numpy as np, matplotlib.pyplot as plt, seaborn as sns
from os import listdir
from os.path import isfile, join
sns.set_style('whitegrid')

# Load our modules
from data_utils import load_json
from constants import MODEL_NAMES, SIGMA_NAME, MMM_NAME

def across_all_dir(path, files):
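    """Collect each sample's per-chromosome LOOCV scores into a (samples x 24) matrix."""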
    scores = np.zeros((len(files), 24))
    for i, f in enumerate(files):
        sample_dict = load_json(join(path, f))
        for j, c in enumerate(sample_dict['chromosomesToResults'].keys()):
            scores[i][j] = sample_dict['chromosomesToResults'][c]['score']
    return scores


def fix_nans(mat):
    """
    returns the matrix with average over models if a model, sample, chromosome had nan in it.
    :param mat: ndarray (model, sample, chromosome)
    :return: mat ndarray (model, sample, chromosome)
    """
    mat = np.nan_to_num(mat)
    idx, idy, idz = np.where(mat == 0)
    for x, y, z in zip(idx, idy, idz):
        mat[x, y, z] = mat[:, y, z].mean()
    return mat


def analyze(par_path, out_fig):
    dirs = [d for d in listdir(par_path)]
    dirs = sorted(dirs)

    thresholds = []
    tmp = []
    names = []
    if MMM_NAME in dirs:
        tmp.append(MMM_NAME)
        names.append(MMM_NAME.upper())
    for d in dirs:
        if SIGMA_NAME in d:
            thresholds.append(int(d.split('a')[1]))

    thresholds = sorted(thresholds)
    print(thresholds)
    for i in thresholds:
        tmp.append(SIGMA_NAME + str(i))
        names.append(str(i // 1000) + 'K')

    dirs = tmp
    files_dict = {}
    for d in dirs:
        curr_path = join(par_path, d)
        files_dict[d] = sorted([f.split('-')[1] for f in listdir(curr_path) if isfile(join(curr_path, f))])

    files_intersection = files_dict[dirs[0]]
    for d in dirs:
        files_intersection = np.intersect1d(files_dict[d], files_intersection)

    print('number of samples in common: {}'.format(len(files_intersection)))

    results_mat = np.zeros((len(dirs), len(files_intersection), 24))
    for i, d in enumerate(dirs):
        path = join(par_path, d)
        if SIGMA_NAME in d:
            prefix = SIGMA_NAME + '-'
        else:
            prefix = MMM_NAME + '-'

        scores = across_all_dir(path, [prefix + f for f in files_intersection])
        results_mat[i] = scores

    results_mat = fix_nans(results_mat)
    sum_results = np.sum(results_mat, axis=(1, 2)).tolist()
    for i in range(len(dirs)):
        print('{}:   {}'.format(dirs[i], sum_results[i]))

    # Create the plot, highlighting the one with maximum log-likelihood
    max_ind = np.argmax(sum_results)
    ind = list(range(0, max_ind)) + list(range(max_ind+1, len(names)))
    plt.bar(ind, sum_results[:max_ind] + sum_results[max_ind+1:])

    plt.bar([max_ind], [sum_results[max_ind]], color=(181./255, 85./255, 85./255))

    plt.xticks(range(len(names)), names)
    plt.ylim((min(sum_results) - 1000, max(sum_results) + 1000))
    plt.xlabel('Model')
    plt.ylabel('Held-out log-likelihood')
    plt.title('Comparing MMM and SigMa with various cloud thresholds')

    plt.tight_layout()
    plt.savefig(out_fig)


def get_parser():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('-ld', '--loocv_dir', type=str, required=True)
    parser.add_argument('-of', '--output_file', type=str, required=True)
    return parser


def main(args):
    analyze(args.loocv_dir, args.output_file)


if __name__ == '__main__': main( get_parser().parse_args(sys.argv[1:]) )
import numpy as np
from pomegranate import *
import json

################################################################################
# LOGGING
################################################################################
import logging

# Logging format
FORMAT = '%(asctime)s SigMa %(levelname)-10s: %(message)s'
logging.basicConfig(format=FORMAT)

def get_logger(verbosity=logging.INFO):
    '''
    Returns logger object
    '''
    logger = logging.getLogger(__name__)
    logger.setLevel(verbosity)
    return logger

################################################################################
# UTILS
################################################################################
def sample_and_noise(model, noise_dist, n_seqs, seqs_len):
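    """Sample n_seqs sequences of length seqs_len from the model; wherever
    noise_dist samples 0, replace that position with a uniformly random
    category, returning both the clean and the noised sequences."""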
    noise_change_dist = DiscreteDistribution(dict(zip(range(96), [1.0 / 96] * 96)))
    seqs = []
    noised_seqs = []
    for i in range(n_seqs):
        seq = np.array(model.sample(seqs_len))
        seqs.append(seq)
        noised_seq = seq.copy()
        hits = noise_dist.sample(seqs_len)
        for j, hit in enumerate(hits):
            if hit == 0:
                noised_seq[j] = noise_change_dist.sample()
        noised_seqs.append(noised_seq)
    return seqs, noised_seqs


def get_emissions(file='data/emissions_for_breast_cancer'):
    return np.load(file + '.npy')


def sample_uniform_between_a_b(n_states, a=0.0, b=1.0):
    return (b - a) * np.random.sample(n_states) + a


def random_seqs_from_json(file_name, n_seqs=10):
    seqs = []
    seqs_names = []
    json_file = json.load(open(file_name))
    samples = json_file[u'samples']
    samples_to_seq = json_file[u'sampleToSequence']
    samples = np.random.permutation(samples)
    for i in range(n_seqs):
        seqs.append(samples_to_seq[samples[i]])
        seqs_names.append(samples[i])
    return seqs, seqs_names


def to_json(file_name, dict_to_save):
    with open(file_name + '.json', 'w') as fp:
        json.dump(dict_to_save, fp)


def full_sample_to_chromosomes_seqs(sample, dists_sample):
    np_sample = np.array(sample)
    starting_chromosome_idxs = np.where(np.array(dists_sample) >= 1e100)[0]
    return np.split(np_sample, starting_chromosome_idxs)[1:]


def load_json(file_name):
    return json.load(open(file_name))


def get_split_sequences(file_name, sample_numbers=None):
    json_file = json.load(open(file_name))
    samples = json_file[u'samples']
    samples_to_seq = json_file[u'sampleToSequence']
    samples_dists = json_file[u'sampleToPrevMutDists']
    out_seqs = []
    out_names = []

    if sample_numbers is None:
        sample_numbers = range(len(samples))

    for i in sample_numbers:
        n = samples[i]
        out_names.append(n)
        out_seqs.append(full_sample_to_chromosomes_seqs(samples_to_seq[n], samples_dists[n]))
    return zip(out_names, out_seqs)


def get_full_sequences(file_name='data/nik-zainal2016-wgs-brca-mutations-for-hmm.json'):
    json_file = json.load(open(file_name))
    samples = json_file[u'samples']
    samples_to_seq = json_file[u'sampleToSequence']
    out_seqs = []
    out_names = []
    for n in samples:
        out_names.append(n)
        out_seqs.append(samples_to_seq[n])
    return zip(out_names, out_seqs)


def get_count_sequences_as_mat(file_name='data/nik-zainal2016-wgs-brca-mutations-for-hmm.json'):
    json_file = json.load(open(file_name))
    samples = json_file[u'samples']
    samples_to_seq = json_file[u'sampleToSequence']

    # finding num_object + counting
    num_objects = 0
    samples_objects = []
    samples_counts = []
    for sample in samples:
        objects, counts = np.unique(samples_to_seq[sample], return_counts=True)
        samples_objects.append(objects)
        samples_counts.append(counts)
        num_objects = max(num_objects, np.max(objects))
    num_objects += 1

    count_mat = np.zeros((len(samples), num_objects))
    for i in range(len(samples)):
        count_mat[i, samples_objects[i]] = samples_counts[i]
    return count_mat


def get_samples_names(file_name='data/nik-zainal2016-wgs-brca-mutations-for-hmm.json'):
    json_file = json.load(open(file_name))
    samples = json_file[u'samples']
    return samples


def get_split_sequences_by_threshold(file_name, threshold, sample_numbers=None):
    json_file = json.load(open(file_name))
    samples = json_file[u'samples']
    samples_to_seq = json_file[u'sampleToSequence']
    samples_dists = json_file[u'sampleToPrevMutDists']
    out_seqs = []
    out_names = []

    if sample_numbers is None:
        sample_numbers = range(len(samples))

    for i in sample_numbers:
        n = samples[i]
        out_names.append(n)
        out_seqs.append(full_sample_to_chromosomes_seqs_by_threshold(samples_to_seq[n], samples_dists[n], threshold))
    return zip(out_names, out_seqs)


def full_sample_to_chromosomes_seqs_by_threshold(sample, dists_sample, threshold):
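    """Split a sample's mutation sequence into chromosomes (marked by sentinel
    distances >= 1e100), then split each chromosome into sky/cloud segments
    wherever the distance to the previous mutation is at least the threshold."""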
    np_sample = np.array(sample)
    np_dists = np.array(dists_sample)

    starting_chromosome_idxs = np.where(np_dists >= 1e100)[0]

    chromosomes = np.split(np_sample, starting_chromosome_idxs)[1:]
    chromosomes_dists = np.split(np_dists, starting_chromosome_idxs)[1:]

    out = []
    for i in range(len(chromosomes)):
        chromosome = chromosomes[i]
        chromosome_dists = chromosomes_dists[i]

        starting_seqs_idxs = np.where(chromosome_dists >= threshold)[0]
        seqs = np.split(chromosome, starting_seqs_idxs)[1:]
        out.append(seqs)
    return out


def seqs_to_seq(seqs):
    out = []
    for seq in seqs:
        out.extend(seq)
    return np.array(out)


def seqs_to_seq_of_prefix(seqs):
    out = []
    for seq in seqs:
        out.append(seq[0])
    return np.array(out)


def sample_indices_not_in_dir(dir_path):
    import os
    samples_in_dir = [f[:-5] for f in os.listdir(dir_path)]
    samples = get_samples_names()
    missing_indices = []
    for i in range(len(samples)):
        if samples[i] not in samples_in_dir:
            missing_indices.append(i)
    return missing_indices
import sys, os, argparse, pandas as pd, logging, numpy as np

# Helpers for parsing categories into substitution, left flanking,
# and right flanking
def sub(c): return c.split('[')[1].split(']')[0]
def lf(c): return c.split('[')[0]
def rf(c): return c.split(']')[-1]

if __name__ == '__main__':
    # Parse command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input_file', type=str, required=True)
    parser.add_argument('-o', '--output_file', type=str, required=True)
    parser.add_argument('-v', '--verbosity', type=int, required=False, default=logging.INFO)
    args = parser.parse_args(sys.argv[1:])

    # Set up logger
    logger = logging.getLogger(__name__)
    logger.setLevel(args.verbosity)

    # Load the signatures
    logger.info('[Loading the signatures]')
    with open(args.input_file, 'r') as IN:
        arrs = [ l.rstrip('\n\t').split('\t') for l in IN ]
        header = arrs.pop(0)

        # Get categories and sort according to standard
        categories = [ arr[2] for arr in arrs ]
        categories.sort(key=lambda c: (sub(c), lf(c), rf(c)))

        # Create a container for the signatures
        sig_names = header[3:]
        K = len(sig_names)
        L = len(categories)
        sigs = np.zeros((K, L))

        # Parse the lines in the file
        for arr in arrs:
            j = categories.index(arr[2])
            for i, sig_name in enumerate(sig_names):
                sigs[i,j] += float(arr[3+i])

        logger.info('- Loaded %s x %s signature matrix' % sigs.shape)

    # Create dataframe and output to file
    logger.info('[Creating dataframe]')
    df = pd.DataFrame(index=sig_names, columns=categories, data=sigs)

    logger.info('- Saving to %s' % args.output_file)
    df.to_csv(args.output_file, sep='\t')
import sys, os, argparse, json, pandas as pd, logging, numpy as np
from collections import defaultdict
from itertools import permutations
from pandas import read_csv, Categorical

# Load our modules
from data_utils import get_logger
from constants import PATIENT, CATEGORY, CATEGORY_IDX, MUT_DIST, CHROMOSOME, START_POS

# Command-line argument parser
def get_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('-mf', '--mutations_file', type=str, required=True)
    parser.add_argument('-mbf', '--mappability_blacklist_file', type=str,
                        required=False, default=None)
    parser.add_argument('-sf', '--signatures_file', type=str, required=True)
    parser.add_argument('-of', '--output_file', type=str, required=True)
    parser.add_argument('-v', '--verbosity', type=int, required=False,
                        default=logging.INFO)
    return parser

# Main
def run( args ):
    # Set up logger
    logger = get_logger(args.verbosity)
    logger.info('[Loading input data]')

    # Load the signatures
    sig_df = pd.read_csv(args.signatures_file, sep='\t', index_col=0)
    categories = list(sig_df.columns)
    category_index = dict(zip(categories, range(len(categories))))

    logger.info('- Loaded %s x %s signature matrix' % sig_df.values.shape)

    # Load the mutations
    mut_df = pd.read_csv(args.mutations_file, sep='\t',
                         usecols=[PATIENT, CATEGORY, MUT_DIST, CHROMOSOME, START_POS],
                         dtype={PATIENT: str, CATEGORY: str, MUT_DIST: float,
                                CHROMOSOME: str, START_POS: int })
    samples = sorted(set(mut_df[PATIENT]))

    logger.info('- Loaded %s mutations in %s samples' % (len(mut_df), len(samples)))

    # If a mappability blacklist is provided, use it to remove mutations
    if args.mappability_blacklist_file is not None:
        # Load the dataframe and process into a dictionary
        mappability_df = pd.read_csv(args.mappability_blacklist_file, sep=',')
        chrom_idx, start_idx, stop_idx = mappability_df.columns[:3]

        map_blacklist = defaultdict(list)
        unmappable_bases = 0
        for _, r in mappability_df.iterrows():
            chrom = r[chrom_idx][3:]
            map_blacklist[chrom].append(( int(r[start_idx]), int(r[stop_idx]) ))
            unmappable_bases += map_blacklist[chrom][-1][1] - map_blacklist[chrom][-1][0]

        logger.info('- Loaded unmappable regions spanning %s bases in %s chromosomes' % (unmappable_bases, len(map_blacklist)))

        # Remove mutations that fall in a blacklisted region
        logger.info('[Removing unmappable mutations]')
        def mappable(r):
            for start, stop in map_blacklist[r[CHROMOSOME]]:
                if start <= r[START_POS] <= stop:
                    return False
            return True

        n_muts = len(mut_df)
        mut_df = mut_df[mut_df.apply(mappable, axis='columns')]

        n_unmappable = n_muts-len(mut_df)
        logger.info('\t-> Removed %s mutations that were not mappable' % n_unmappable)

    # Add the category index and create sequences of mutations
    logger.info('[Processing data into SigMa format]')
    mut_df[CATEGORY_IDX] = mut_df.apply(lambda r: category_index[r[CATEGORY]],
                                    axis='columns')

    sampleToSequence = dict( (s, list(map(int, s_df[CATEGORY_IDX])))
                                 for s, s_df in mut_df.groupby([PATIENT]) )
    sampleToPrevMutDists = dict( (s, list(map(float, s_df[MUT_DIST])))
                                     for s, s_df in mut_df.groupby([PATIENT]) )

    # Save to JSON
    logger.info('- Saving to file %s' % args.output_file)
    with open(args.output_file, 'w') as OUT:
        output = dict(sampleToSequence=sampleToSequence,
                      sampleToPrevMutDists=sampleToPrevMutDists,
                      samples=samples, categories=categories,
                      params=vars(args))
        json.dump( output, OUT )

if __name__ == '__main__': run( get_parser().parse_args(sys.argv[1:]) )
import sys, os, numpy as np, json, logging, pandas as pd
from time import time
from tqdm import tqdm

# Load our modules
from constants import MODEL_NAMES, SIGMA_NAME, MMM_NAME
from models.MMMFrozenEmissions import MMMFrozenEmissions
from models.SigMa import SigMa
from data_utils import get_split_sequences_by_threshold, to_json, get_logger

################################################################################
# HELPERS
################################################################################
def leave_one_out(sample_seqs_tuple, model_name, emissions, max_iterations, epsilon):
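    """Leave-one-chromosome-out cross-validation: train on all but one
    chromosome, score the held-out chromosome, and repeat for each."""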
    seqs = sample_seqs_tuple[1]
    n_seq = len(seqs)

    chromosomes_names = ['chromosome%s' % str(i).zfill(2) for i in range(n_seq)]

    chromosome_to_experiment_dict = {}

    total_score = 0
    total_time = 0
    total_test_length = 0
    total_train_length = 0
    total_num_iterations = 0
    for i in range(n_seq):

        train_data = []
        train_length = 0
        test_length = 0

        test_data = seqs[i]
        for k in range(n_seq):
            if k != i:
                train_data.extend(seqs[k])

        for seq in train_data:
            train_length += len(seq)
        for seq in test_data:
            test_length += len(seq)

        tic = time()
        model, num_iterations = get_trained_model(model_name, emissions, train_data, epsilon, max_iterations)
        train_time = time() - tic

        score = model.log_probability(test_data)

        current_dict = {'score': score, 'time': train_time, 'trainLength': train_length, 'testLength': test_length,
                        'numIterations': num_iterations}
        chromosome_to_experiment_dict[chromosomes_names[i]] = current_dict

        total_time += train_time
        total_score += score
        total_test_length += test_length
        total_train_length += train_length
        total_num_iterations += num_iterations

    summary_dict = {'time': total_time, 'score': total_score,
                    'testLength': total_test_length, 'trainLength': total_train_length,
                    'numIterations': total_num_iterations}
    output_dict = {'results': summary_dict, 'chromosomes': chromosomes_names,
                   'chromosomesToResults': chromosome_to_experiment_dict, 'numberChromosomes': n_seq}
    return output_dict


def get_viterbi(sample_seqs_tuple, model_name, emissions, max_iterations, epsilon):
    train_data = []
    train_length = 0
    for s in sample_seqs_tuple[1]:
        train_data.extend(s)
        train_length += len(s[0])

    tic = time()
    model, num_iterations = get_trained_model(model_name, emissions, train_data, epsilon, max_iterations)
    train_time = time() - tic

    score = model.log_probability(train_data)
    out_dict = {'score': score, 'numIterations': num_iterations, 'time': train_time, 'trainLength': train_length}

    viterbi = model.predict(train_data)
    if model_name == MMM_NAME:
        out_dict['viterbi'] = {'path': viterbi}
    elif model_name == SIGMA_NAME:
        viterbi_dict = {'path': viterbi[1], 'cloud_indicator': viterbi[0]}
        map_prediction = model.predict(train_data)
        map_dict = {'path': map_prediction[1], 'cloud_indicator': map_prediction[0]}
        out_dict['viterbi'] = viterbi_dict
        out_dict['map'] = map_dict

    return out_dict


def get_trained_model(model_name, emissions, train_data, epsilon, max_iterations):
    model = get_model(model_name, emissions)
    num_iterations = model.fit(train_data, stop_threshold=epsilon, max_iterations=max_iterations)
    return model, num_iterations


def get_model(model_name, emissions):
    if model_name == SIGMA_NAME:
        return SigMa(emissions)
    elif model_name == MMM_NAME:
        return MMMFrozenEmissions(emissions)
    else:
        raise NotImplementedError('Model "%s" not implemented' % model_name)

################################################################################
# MAIN
################################################################################
# Parser for command-line arguments
def get_parser():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('-sf', '--signatures_file', type=str, required=True)
    parser.add_argument('-mf', '--mutations_file', type=str, required=True)
    parser.add_argument('-mn', '--model_name', type=str, required=True, choices=MODEL_NAMES)
    parser.add_argument('-ct', '--cloud_threshold', type=int, required=True)
    parser.add_argument('-sn', '--sample_names', type=str, required=False, nargs='*',
                        default=[])
    parser.add_argument('-mi', '--max_iter', type=int, required=False, default=500)
    parser.add_argument('-tol', '--tolerance', type=float, required=False, default=1e-3)
    parser.add_argument('-rs', '--random_state', type=int, required=False, default=5733)
    parser.add_argument('-od', '--output_directory', type=str, required=True)
    parser.add_argument('-as', '--active_signatures', type=int, required=False,
                        default=[], nargs='*', help='1-based indices of signatures')
    parser.add_argument('--cross-validation-mode', action='store_true', default=False,
                        required=False)
    parser.add_argument('-v', '--verbosity', type=int, required=False, default=logging.INFO)
    return parser

# Main
def main(args):
    """
    The main function to reproducing the paper's results
    :param command: 'loo' for leave one out. 'viterbi' for viterbi
    :param model: 'sigma' or 'mmm'
    :param batch: Takes indices from batch * batch_size to (batch + 1) * batch_size
    :param batch_size: see batch
    :param threshold: Define maximal distance (in bp) in clouds. Use 0 to not split (i.e use whole chromosomes)
    :param max_iterations: Maximal number of iterations for training the model
    :param epsilon: Minimum improvement in every iteration of the model, if improvement is lower stop training
    :param random_state: Random state to initialize the models
    :param out_dir: where to save all the files
    :return:
    """
    # Create simple logger
    logger = get_logger(args.verbosity)
    logger.info('[Loading input data]')

    # Get the list of samples we're going to run on
    with open(args.mutations_file, 'r') as IN:
        obj = json.load(IN)
        samples = obj.get('samples')
        categories = obj.get('categories')

    if len(args.sample_names) == 0:
        sample_indices = range(len(samples))
    else:
        sample_indices = [ samples.index(s) for s in args.sample_names ]

    logger.info('- Loading data for %s samples' % len(sample_indices))

    # Load the emissions matrix
    sig_df = pd.read_csv(args.signatures_file, sep='\t', index_col=0)
    emissions = sig_df.values
    if len(args.active_signatures) > 0:
        emissions = emissions[np.array(args.active_signatures)-1]

    assert( list(sig_df.columns) == categories )

    logger.info('- Loaded %s x %s emissions matrix' % emissions.shape)
    # if threshold <= 0:         
    #     out_dir_for_file = os.path.join(out_dir, model)
    #     threshold = 1e99
    # else:
    #     out_dir_for_file = os.path.join(out_dir, model + '_' + str(threshold))

    experiment_tuples = get_split_sequences_by_threshold(args.mutations_file, args.cloud_threshold, sample_indices)

    # Perform the experiments
    logger.info('[Performing experiments]')
    if args.cross_validation_mode:
        logger.info('- Cross-validation mode')
        func = leave_one_out
    else:
        logger.info('- Viterbi mode')
        func = get_viterbi

    for experiment_tuple in tqdm(experiment_tuples, total=len(sample_indices), ncols=80):
        np.random.seed(args.random_state)  # setting the random state before each experiment
        sample = experiment_tuple[0]
        out_file = '%s/%s-%s' % (args.output_directory, args.model_name, sample)
        dict_to_save = func(experiment_tuple, args.model_name, emissions,
                            args.max_iter, args.tolerance)
        to_json(out_file, dict_to_save)

    logger.info('- Done')

if __name__ == '__main__': main( get_parser().parse_args(sys.argv[1:]) )

URL: https://github.com/lrgr/sigma
Name: sigma
Version: v1.0
License: MIT License