Obtain unbiased SNPs from FASTQ files barcoded with UMIs


Snakemake for SNPs is a flexible and user-friendly SNP analysis workflow.

Snakemake for SNPs can be applied to both model and non-model organisms. It maps RNA-Seq raw reads to a reference genome (downloaded from a public database or built by the user) and can perform both allele-specific expression (ASE) analysis of SNPs and differential gene expression analysis (DEGs), whose results can then be cross-referenced. Basic Python programming skills are helpful, but if you are a beginner at programming, just jump to the config file and adapt it to your experiments!

If you use our pipeline, you need to cite us:

WARNING: adapt the citation to our link:

NOTE: This pipeline was developed on Linux; other platforms may not work correctly.

Workflow

The usage of this workflow is described in the Snakemake Workflow Catalog.

Quick start

Clone the repository:

git clone https://github.com/AylaScientist/Snakemake_for_SNPs.git

Create the environment:

conda create -n pipeline python=3.7

Activate the environment:

conda activate pipeline

Installation

Install the packages including the bio tools:

pip install git+https://github.com/snakemake/snakemake

conda install -c bioconda snakemake-wrapper-utils

conda install -c bioconda trimmomatic=0.39

conda install -c bioconda fastqc=0.11.9

conda install -c bioconda star=2.7.10a

conda install -c bioconda htseq=0.11.3

conda install -c bioconda picard=2.26

conda install -c bioconda gatk4=4.2.5.0

conda install -c bioconda samtools=1.16

conda install -c bioconda bcftools=1.16

conda install -c bioconda vcftools=0.1.16

conda install -c bioconda htslib=1.16

conda install -c anaconda perl=5.26.2

conda install -c anaconda pandas

conda install -c anaconda scipy

conda install -c anaconda statsmodels

conda install -c anaconda seaborn

conda install -c conda-forge matplotlib

conda install -c conda-forge py-bgzip

First test

Set your system's resources in the config file:

gedit ~/Snakemake_for_SNPs/config/config_main.yaml

Now that the resources are adapted to your computer, run a dry run of the pipeline with the example data to build a DAG of jobs:

cd ~/Snakemake_for_SNPs/workflow/

snakemake -n

If this point doesn't work, please contact me: [email protected]

Run the pipeline with the desired resources.

This is an example for 4 threads and 4 GB of memory.

snakemake --use-conda --cores 4
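If memory also needs to be capped, Snakemake's generic --resources flag can be added; a hedged sketch (the resource name mem_mb is an assumption and must match what the workflow rules declare):

snakemake --use-conda --cores 4 --resources mem_mb=4000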

Set up the configuration for your personal project

Customize the workflow to your needs in the following file:

./config/config_main.yaml

In this file you should also set the species and the different databases used for correct gene/transcript/protein/GO_function/KEGG annotation and mining of the data.
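For orientation, a minimal sketch of the kind of settings this file gathers; the key names below are illustrative assumptions only, so follow the actual schema inside config_main.yaml:

THREADS: 4
MEM: 4G
SPECIES: Oreochromis_niloticus
GENOME: genome/genome.fa
ANNOTATION: genome/annotation.gtf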

Modify the metafiles describing your data and the experiment:

config/Experimental_design.csv

config/Experimental_groups.csv

config/Sample_names.csv

config/Samples_MAE.csv

config/samples.csv

Please note that the column names in the file "Experimental_groups.csv" must be "Group_1" and "Group_2" for the Chi-square test to be applied.
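For example, a minimal Experimental_groups.csv could look like the sketch below (all sample codes other than the example samples GF6 and KS4 are illustrative):

Group_1,Group_2
GF6,KS4
GF7,KS5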

For configuring your pseudogenomes

You need to choose two samples from different groups, preferably one sample from the control group and one sample from a treatment group. The SNPs from these samples will be used to construct the pseudogenomes. The codes of these two samples in the example are GF6 and KS4. In order to create the pseudogenomes of your experiment, these codes should be substituted in the following files, including the file names of the *colnames.csv files (see the sketch after this list):

config/Pseudogenome_codes.csv

config/tbGF6_colnames.csv

config/tbKS4_colnames.csv
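As a sketch of Pseudogenome_codes.csv for the example data: the workflow scripts read the pseudogenome codes from a column named PSGs (see the Tables.py and pseudogenomes.py snippets below), so the file can be as small as:

PSGs
GF6
KS4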

Very important: add the genome or transcriptome of your species! In the example, the genome of the Nile tilapia is provided in the genome folder at the root of the repository.

Evaluation

The pipeline for SNPs has been evaluated on 4 datasets, including 2 non-model organisms (Nile and Mozambique tilapia). WARNING: Put here the link to the article

Code Snippets

shell:
    "perl ./rules/scripts/convert2annovar.pl {input.gvcf1} -format vcf4 -allsample -withfreq -withfilter -context -out {output.o1}"

    shell:
        "touch {output.o1} "
"""

shell:
    "touch {output.o1} "

shell:
    "perl ./rules/scripts/annotate_variation.pl -geneanno {input.i1} -buildver {input.buildver} ./ -outfile {params.i2}"

shell:
    "touch {output}"

shell:
    "perl ./rules/scripts/table_annovar.pl {input.gvcf1} {input.path} -buildver {input.buildver} -out {input.i3} -remove -protocol refGene -operation g -nastring . -vcfinput"

script:
    "scripts/combinegvcfs.py"

script:
    "scripts/combinegvcfs.py"

script:
    "scripts/combinegvcfs.py"

script:
    "scripts/selectvariants.py"

script:
    "scripts/variantstotable.py"

script:
    "scripts/genotypegvcfs.py"

script:
    "scripts/genotypegvcfs.py"

script:
    "scripts/haplotypecaller.py"

script:
    "scripts/haplotypecaller.py"

script:
    "scripts/htseq.py"

script:
    "scripts/mark_duplicates.py"

script:
    "scripts/mark_duplicates.py"

script:
    "scripts/pseudogenomes.py"

script:
    "scripts/star_gi.py"

script:
    "scripts/create_genome_dictionary.py"

shell:
    "samtools faidx {input}"

script:
    "scripts/readgroups.py"

script:
    "scripts/readgroups.py"

script:
    "scripts/Tables.py"

script:
    "scripts/Tables.py"
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2021, Aurora Campo"
__email__ = "[email protected]"
__license__ = "MIT"


import os
import pandas as pd
import numpy as np
from os import path
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
shell.executable("bash")

extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

"""
DELETE SNPs WITH AD < 10
------------------------
This function removes SNPs that do not have enough counts and are considered possible poor-quality reads.
SNPs with AD < 3 for one of the alleles are also removed.
"""


def AD10(df, samples):
    # Collect the indexes where the summed AD of both alleles is below 10 and
    # the AD of one of the alleles is below 3, pooling the flagged rows over
    # all the samples.
    indexes_ad10 = np.array([], dtype=int)
    # There is only one tissue
    for sample in samples:
        print("AD10 filter in sample ", sample)
        sample_r_ad = str(sample + "_R_.AD")
        sample_a_ad = str(sample + "_A_.AD")
        # Element-wise & and | are required here: the Python keyword `and`
        # raises ValueError when applied to pandas Series.
        index_sample = df[((df[sample_a_ad] + df[sample_r_ad]) < 10) & ((df[sample_a_ad] < 3) | (df[sample_r_ad] < 3))].index
        indexes_ad10 = np.union1d(indexes_ad10, index_sample.values)
        # If the genotypes can be verified independently, dropping only the rows
        # flagged in every sample (np.intersect1d instead of np.union1d) is an
        # alternative.
    df.drop(index=indexes_ad10, inplace=True)
    return df


def main():
    # Add the files
    av_df=snakemake.input.get("csv")
    ad10_df=snakemake.output[0]

    df_average = pd.read_csv(av_df, low_memory=False)
    df_average = pd.DataFrame(df_average)
    sample_names = pd.read_csv(snakemake.input.get("sn1"))
    # Create arrays of the sample names
    samples = sample_names['Sample_name'].values

    # Mine the data
    df_AD10 = AD10(df_average, samples)
    # Write the temporary file
    df_AD10.to_csv(ad10_df)



if __name__ == '__main__':
    main()


shell(
    "{log}"
)
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2022, Aurora Campo"
__email__ = "[email protected]"
__license__ = "MIT"


import os
import pandas as pd
import numpy as np
from os import path
from typing import List, Any, Generator
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

shell.executable("bash")

extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


"""
ELIMINATE MONOALLELIC EXPRESSION FROM THE ALLELIC IMBALANCE
------------------------------------------------------------
The alleles that are expressed in only one of the tissues analysed
will be sent to a different path for studying the monoallelic expression.
Then, the alleles expressed in both tissues can be studied separately.
"""




def frequencies (df, samples):
    # Calculate the allele frequencies
    for sample in samples:
        sample_r_ad = str(sample + "_R_.AD")
        sample_a_ad = str(sample + "_A_.AD")
        af_sample = str("AF_" + sample)
        df[af_sample] = df[sample_r_ad]/(df[sample_r_ad] + df[sample_a_ad])
    df = df.fillna(0)
    print ("Allele frequencies calculated")

    return df



def MAE (df, samples, samples_mae):
    print(os.getcwd())  # print the working path
    print(str(path.exists(samples_mae)))
    # Collect the indexes whose allele frequencies are 0 or 1 in both tissues of the same sample.
    # These indexes correspond to the SNPs that are MAE for at least one sample.
    if path.exists(samples_mae):
        print("File ", samples_mae, " was found")
        mae_ind = np.array([], dtype=int)  # integer dtype so the indexes can be used with .iloc
        tissues = pd.read_csv(samples_mae)
        tissues = pd.DataFrame(tissues)
        first_samples = tissues.iloc[:, 0]
        second_samples = tissues.iloc[:, 1]
        i = 0
        for first_sample in first_samples:
            print("MAE loop in sample ", first_sample, " and sample ", second_samples[i])
            af1 = str("AF_" + first_sample)
            af2 = str("AF_" + second_samples[i])
            index_af = df[((df[af1] == 0) & (df[af2] == 0)) | ((df[af1] == 1) & (df[af2] == 1))].index.values
            mae_ind = np.append(mae_ind, index_af)  # accumulate the indexes of the monoallelic SNPs over all the samples
            i = i + 1
        mae_ind = np.unique(mae_ind).astype(int)  # deduplicate SNPs flagged in more than one sample
        mae = df.iloc[mae_ind]  # select the rows that show monoallelic expression
        df_no_mae = df.drop(index=mae_ind)  # drop the rows with monoallelic expression
    else:
        print("No file ", samples_mae, " was found. Please verify that the path to Samples_MAE.csv in the config.yaml file is correct")
        mae_ind = np.array([], dtype=int)
        for sample in samples:
            af = str("AF_" + sample)
            index_af = df[(df[af] == 0) | (df[af] == 1)].index.values  # numpy array with the indexes to drop for one sample
            mae_ind = np.append(mae_ind, index_af)  # accumulate the indexes of the monoallelic SNPs over all the samples
        mae_ind = np.unique(mae_ind).astype(int)  # deduplicate SNPs flagged in more than one sample
        mae = df.iloc[mae_ind]  # select the rows that show monoallelic expression
        df_no_mae = df.drop(index=mae_ind)  # drop the rows with monoallelic expression

    return df_no_mae, mae




def main():

    #PATH = os.getcwd()
    #os.chdir("temp")
    uniform = snakemake.input.get("result1")
    df_uni = pd.read_csv(uniform, low_memory=False)
    df_uni = pd.DataFrame(df_uni)
    #os.chdir(PATH)
    names = snakemake.input.get("names")
    sample_names = pd.read_csv(names)
    samples_mae = snakemake.input.get("mae")

    result_mae = snakemake.output.get("result_mae")
    result_no_mae = snakemake.output.get("result_no_mae")

    # Create arrays of the sample names and the pseudogenome codes
    samples = sample_names["Sample_name"].values

    # Mine the data
    df_af = frequencies(df_uni, samples)
    df_no_mae, df_mae = MAE(df_af, samples, samples_mae)
    df_no_mae.to_csv(result_no_mae)
    df_mae.to_csv(result_mae)


if __name__ == '__main__':
    main()


shell(
    "{log}"
)
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2021, Aurora Campo"
__email__ = "[email protected]"
__license__ = "MIT"


import os
import pandas as pd
import numpy as np
from os import path
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
shell.executable("bash")

extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)



"""
COMPARE GENOTYPES AND MAKE THE AVERAGE OF COUNTS
------------------------------------------------

Now let's make the average of the counts for the reference and the alternative alleles. For each sample, select the
genotype of the reference allele established by the reads mapped against PSG1 and compare it to both the alternative
and reference genotypes obtained from the mapping against the PSG2 genome. Then proceed to make the average of the
counts. The possible cases are the following:

1.- Both homozygous for the reference allele: in that case, set the same genotype for both alleles and sum all the
counts from both sites, making the average for the two individuals.

2.- The reference and alternative alleles of the SNP from the PSG1 pseudogenome are placed in the same positions as in
the SNP from the PSG2 pseudogenome. Set the reference and alternative genotypes to the ones from the SNP in PSG1
and make the average of the counts for the reference and alternative alleles.

3.- The reference allele in the reference pseudogenome is called as the alternative allele in the alternative
pseudogenome. In that case, set the genotype to the reference allele and make the average of the counts for the
corresponding variant.

4.- While the mapping against the reference genome resulted in a homozygote for the reference allele, the mapping
against the alternative genome resulted in a heterozygote with the alternative allele in the alternative position.
The genotype of the reference allele is set from the genotype of the reference allele in the reference mapping, and
the alternative allele is set from the genotype of the alternative allele in the alternative mapping. The counts
of each allele are assigned according to this distribution, dividing the counts of the reference allele by the two
individuals.

5.- The mapping against the reference genome resulted in a homozygote for the reference allele and the mapping
against the alternative genome is heterozygous. In that case, the genotype of the alternative allele is the one called
as the reference allele in the mapping against the alternative genome. The genotypes must be set accordingly,
the reference allele being the one found in the reference genome and the alternative allele the one found in the
alternative genome. The average of the counts follows the same pattern, the average falling on the reference allele.

6.- The mapping against the reference genome produced a heterozygote with reference and alternative alleles. The
mapping against the alternative pseudogenome produced a homozygote for the reference allele. In that case, the
reference and alternative alleles must be set according to the reference pseudogenome and the counts must be averaged
over both individuals for the alternative allele.

7.- The mapping against the reference genome produced a heterozygote with reference and alternative alleles. The
mapping against the alternative pseudogenome produced a homozygote for the alternative allele. In that case, the
reference and alternative alleles must be set according to the reference pseudogenome and the counts of the
alternative allele must be averaged over both individuals.

8.- The mapping against the reference genome produced a homozygote for the reference allele. The mapping against the
alternative pseudogenome produced a homozygote for the alternative allele. In that case, the reference allele is set
according to the first pseudogenome, summing the counts of this reference allele, and the alternative allele is set
following the second pseudogenome, summing the counts of this alternative allele.

This analysis is repeated for each sample and assigns the reference and alternative alleles independently per sample.
The uniform assignment of reference and alternative alleles across all the samples is developed in a further step of
the workflow.
"""



# Definitions:
def evaluation(df, sample, PSGs):
    # Evaluation of genotype and average. This is an iterator
    r = "_R_"
    a = "_A_"
    gt = ".GT"
    ad = ".AD"
    PSG1_code: str = PSGs[0]
    PSG2_code: str = PSGs[1]
    rPSG1_GT = int(df.columns.get_loc(str(sample + r + PSG1_code + gt)))
    aPSG1_GT = int(df.columns.get_loc(str(sample + a + PSG1_code + gt)))
    rPSG2_GT = int(df.columns.get_loc(str(sample + r + PSG2_code + gt)))
    aPSG2_GT = int(df.columns.get_loc(str(sample + a + PSG2_code + gt)))
    rPSG1_AD = int(df.columns.get_loc(str(sample + r + PSG1_code + ad)))
    aPSG1_AD = int(df.columns.get_loc(str(sample + a + PSG1_code + ad)))
    rPSG2_AD = int(df.columns.get_loc(str(sample + r + PSG2_code + ad)))
    aPSG2_AD = int(df.columns.get_loc(str(sample + a + PSG2_code + ad)))

    # Create the empty lists to collect the average variables and genotypes
    r_AD = []
    a_AD = []
    r_GT = []
    a_GT = []

    print("Evaluation loop in sample: ", sample)

    # Start iterator
    for i in range(len(df)):
        # Case 1: homozygous for the two pseudogenomes
        if (df.iloc[i, rPSG1_GT] == df.iloc[i, aPSG1_GT]) and (df.iloc[i, rPSG2_GT] == df.iloc[i, aPSG2_GT]) and (
                df.iloc[i, rPSG1_GT] == df.iloc[i, rPSG2_GT]) and (df.iloc[i, aPSG1_GT] == df.iloc[i, aPSG2_GT]):
            r_GT.append(df.iloc[i, rPSG1_GT])
            # Evaluate the all-cells-nonzero case first so that it is not
            # shadowed by the two-cell branches below
            if (df.iloc[i, rPSG1_AD] != 0 and df.iloc[i, aPSG1_AD] != 0 and df.iloc[i, rPSG2_AD] != 0 and df.iloc[i, aPSG2_AD] != 0):
                x = ((((df.iloc[i, rPSG2_AD] + df.iloc[i, aPSG2_AD]) / 2) + ((df.iloc[i, rPSG1_AD] + df.iloc[i, aPSG1_AD]) / 2)) / 2)
            elif (df.iloc[i, rPSG1_AD] != 0 and df.iloc[i, aPSG1_AD] != 0):
                x = ((((df.iloc[i, rPSG1_AD] + df.iloc[i, aPSG1_AD]) / 2) + df.iloc[i, rPSG2_AD] + df.iloc[i, aPSG2_AD]) / 2)
            elif (df.iloc[i, rPSG2_AD] != 0 and df.iloc[i, aPSG2_AD] != 0):
                x = ((((df.iloc[i, rPSG2_AD] + df.iloc[i, aPSG2_AD]) / 2) + df.iloc[i, rPSG1_AD] + df.iloc[i, aPSG1_AD]) / 2)
            else:
                x = ((df.iloc[i, rPSG1_AD] + df.iloc[i, rPSG2_AD] + df.iloc[i, aPSG1_AD] + df.iloc[i, aPSG2_AD]) / 2)
            r_AD.append(x)
            a_GT.append(df.iloc[i, rPSG2_GT])
            y = 0
            a_AD.append(y)

        # Case 2 Same situation Ref and Alt in both for df
        elif (df.iloc[i, rPSG1_GT] == df.iloc[i, rPSG2_GT]) and (df.iloc[i, aPSG1_GT] == df.iloc[i, aPSG2_GT]):
            r_GT.append(df.iloc[i, rPSG1_GT])
            x = (df.iloc[i, rPSG1_AD] + df.iloc[i, rPSG2_AD]) / 2
            r_AD.append(x)
            a_GT.append(df.iloc[i, aPSG1_GT])
            y = (df.iloc[i, aPSG1_AD] + df.iloc[i, aPSG2_AD]) / 2
            a_AD.append(y)

        # Case 3 Inverse situation Ref and Alt in both for df
        elif (df.iloc[i, rPSG1_GT] == df.iloc[i, aPSG2_GT]) and (df.iloc[i, aPSG1_GT] == df.iloc[i, rPSG2_GT]):
            r_GT.append(df.iloc[i, rPSG1_GT])
            x = (df.iloc[i, rPSG1_AD] + df.iloc[i, aPSG2_AD]) / 2  # averaged by 2 (for each mapping method)
            r_AD.append(x)
            a_GT.append(df.iloc[i, aPSG1_GT])
            y = (df.iloc[i, aPSG1_AD] + df.iloc[i, rPSG2_AD]) / 2  # averaged by 2 (for each mapping method)
            a_AD.append(y)

        # Case 4 First homozygous, second with the alternative allele in second position (as alternative)
        elif (df.iloc[i, rPSG1_GT] == df.iloc[i, aPSG1_GT]) and (df.iloc[i, rPSG2_GT] != df.iloc[i, aPSG2_GT]) and (
                df.iloc[i, rPSG1_GT] == df.iloc[i, rPSG2_GT]):
            # print("Ref GT: ", df.iloc[i, rPSG1_GT]," Counts: ",df.iloc[i, rPSG1_AD]," ",df.iloc[i, aPSG1_AD]," ",df.iloc[i, rPSG2_AD]," averaged by 2")
            r_GT.append(df.iloc[i, rPSG1_GT])
            if (df.iloc[i, rPSG1_AD] != 0 and df.iloc[i, aPSG1_AD] != 0 and df.iloc[i, rPSG2_AD] != 0):
                #print ("WARNING: Line ",i," in sample ",sample," has three counts for the same allele. It is corrected by averaging by 3")
                x = (df.iloc[i, rPSG1_AD] + df.iloc[i, aPSG1_AD] + df.iloc[i, rPSG2_AD]) / 3
            else:
                x = (df.iloc[i, rPSG1_AD] + df.iloc[i, aPSG1_AD] + df.iloc[i, rPSG2_AD]) / 2  # counts from 3 cells from
                # which one is 0, from 2 mapping methods (averaged by 2)
            r_AD.append(x)
            a_GT.append(df.iloc[i, aPSG2_GT])
            y = df.iloc[i, aPSG2_AD] / 2
            a_AD.append(y)

        # Case 5 First homozygous, second with the alternative allele in first position (as reference)
        elif ((df.iloc[i, rPSG1_GT] == df.iloc[i, aPSG1_GT]) and (df.iloc[i, rPSG2_GT] != df.iloc[i, aPSG2_GT]) and (
                df.iloc[i, rPSG1_GT] == df.iloc[i, aPSG2_GT])):
            #print("Ref GT: ", df.iloc[i, rPSG1_GT], " Counts: ", df.iloc[i, rPSG1_AD], " ", df.iloc[i, aPSG1_AD], " ",df.iloc[i, aPSG2_AD], " averaged by 2")
            r_GT.append(df.iloc[i, rPSG1_GT])
            if (df.iloc[i, rPSG1_AD] != 0 and df.iloc[i, aPSG1_AD] != 0 and df.iloc[i, aPSG2_AD]  != 0) :
                #print("WARNING: Line ", i, " in sample ", sample, " has three counts for the same allele. It is corrected by averaging by 3")
                x = (df.iloc[i, rPSG1_AD] + df.iloc[i, aPSG1_AD] + df.iloc[i, aPSG2_AD]) / 3
            else:
                x = (df.iloc[i, rPSG1_AD] + df.iloc[i, aPSG1_AD] + df.iloc[i, aPSG2_AD]) / 2  # counts from 3 cells from which one is 0, from 2 mapping methods (averaged by 2)
            r_AD.append(x)
            a_GT.append(df.iloc[i, rPSG2_GT])
            y = df.iloc[i, rPSG2_AD] / 2
            a_AD.append(y)

        # Case 6 Second homozygous, first heterozygous with the alternative allele in second position (as alternative)
        elif (df.iloc[i, rPSG1_GT] != df.iloc[i, aPSG1_GT]) and (df.iloc[i, rPSG2_GT] == df.iloc[i, aPSG2_GT]) and (
                df.iloc[i, rPSG1_GT] == df.iloc[i, rPSG2_GT]):
            #print("Ref GT: ", df.iloc[i, rPSG1_GT], " Counts: ", df.iloc[i, rPSG1_AD], " ", df.iloc[i, rPSG2_AD], " ",df.iloc[i, aPSG2_AD], " averaged by 2")
            r_GT.append(df.iloc[i, rPSG1_GT])
            if (df.iloc[i, rPSG1_AD] != 0 and df.iloc[i, rPSG2_AD] != 0 and df.iloc[i, aPSG2_AD] !=0 ):
                #print("WARNING: Line ", i, " in sample ", sample, " has three counts for the same allele. It is corrected by averaging by 3")
                x = (df.iloc[i, rPSG1_AD] + df.iloc[i, rPSG2_AD] + df.iloc[i, aPSG2_AD]) / 3
            else:
                x = (df.iloc[i, rPSG1_AD] + df.iloc[i, rPSG2_AD] + df.iloc[i, aPSG2_AD]) / 2  # counts from 3 cells from which one is 0, from 2 mapping methods (averaged by 2)
            r_AD.append(x)
            a_GT.append(df.iloc[i, aPSG1_GT])
            y = df.iloc[i, aPSG1_AD] / 2
            a_AD.append(y)

        # Case 7 Second homozygous, first heterozygous with the alternative allele in first position (as reference)
        elif (df.iloc[i, rPSG1_GT] != df.iloc[i, aPSG1_GT]) and (df.iloc[i, rPSG2_GT] == df.iloc[i, aPSG2_GT]) and (
                df.iloc[i, aPSG1_GT] == df.iloc[i, aPSG2_GT]):
            #print("Alt GT: ", df.iloc[i, rPSG2_GT], " Counts: ", df.iloc[i, aPSG1_AD], " ", df.iloc[i, rPSG2_AD], " ",df.iloc[i, aPSG2_AD], " averaged by 2")
            r_GT.append(df.iloc[i, rPSG1_GT])
            x = df.iloc[i, rPSG1_AD] / 2
            r_AD.append(x)
            a_GT.append(df.iloc[i, rPSG2_GT])
            if (df.iloc[i, aPSG1_AD] != 0 and df.iloc[i, rPSG2_AD] != 0 and df.iloc[i, aPSG2_AD] !=0):
                #print("WARNING: Line ", i, " in sample ", sample, " has three counts for the same allele. It is corrected by averaging by 3")
                y = (df.iloc[i, aPSG1_AD] + df.iloc[i, rPSG2_AD] + df.iloc[i, aPSG2_AD]) / 3
            else:
                y = (df.iloc[i, aPSG1_AD] + df.iloc[i, rPSG2_AD] + df.iloc[i, aPSG2_AD]) / 2  # counts from 3 cells from which one is 0, from 2 mapping methods (averaged by 2)
            a_AD.append(y)

        # Case 8 Both homozygous, first for the reference allele and second for the alternative allele
        elif (df.iloc[i, rPSG1_GT] == df.iloc[i, aPSG1_GT]) and (df.iloc[i, rPSG2_GT] == df.iloc[i, aPSG2_GT]) and (
                df.iloc[i, aPSG1_GT] != df.iloc[i, aPSG2_GT]):
            #print("Ref GT: ", df.iloc[i, rPSG1_GT], " Counts: ", df.iloc[i, rPSG1_AD], " ", df.iloc[i, aPSG1_AD], "averaged by 2")

            r_GT.append(df.iloc[i, rPSG1_GT])
            x = df.iloc[i, rPSG1_AD] + df.iloc[i, aPSG1_AD]  # one of these counts will be 0 so no need to average
            r_AD.append(x)
            a_GT.append(df.iloc[i, rPSG2_GT])
            y = df.iloc[i, rPSG2_AD] + df.iloc[i, aPSG2_AD]  # one of these counts will be 0 so no need to average
            a_AD.append(y)

        else:
            # Use label-based access for the coordinate columns (.iloc does not accept column names)
            print("Error in the SNP on the coordinates ", df["CHROM"].iloc[i], ", ", df["POS"].iloc[i],
                    ", sample ",
                    sample)

    return (r_GT, r_AD, a_GT, a_AD)


def sample_average(df, samples, PSGs):
    cols = []  # Index of columns to be drop in the end
    # Create 4 columns for each averaged sample:
    for sample in samples:
        print("Sample average in sample ", sample, "\n")
        # Average each sample according to genotype:
        (r_GT, r_AD, a_GT, a_AD) = evaluation(df, sample, PSGs)

        # Add the new columns
        sample_r_GT = str(sample + "_R_.GT")
        sample_a_GT = str(sample + "_A_.GT")
        sample_r_AD = str(sample + "_R_.AD")
        sample_a_AD = str(sample + "_A_.AD")
        df[sample_r_GT] = r_GT
        df[sample_a_GT] = a_GT
        df[sample_r_AD] = r_AD
        df[sample_a_AD] = a_AD

        # Drop the columns that won't be used
        r = "_R_"
        a = "_A_"
        gt = ".GT"
        ad = ".AD"
        PSG1_code: str = PSGs[0]
        PSG2_code: str = PSGs[1]
        rPSG1_GT = int(df.columns.get_loc(str(sample + r + PSG1_code + gt)))
        aPSG1_GT = int(df.columns.get_loc(str(sample + a + PSG1_code + gt)))
        rPSG2_GT = int(df.columns.get_loc(str(sample + r + PSG2_code + gt)))
        aPSG2_GT = int(df.columns.get_loc(str(sample + a + PSG2_code + gt)))
        rPSG1_AD = int(df.columns.get_loc(str(sample + r + PSG1_code + ad)))
        aPSG1_AD = int(df.columns.get_loc(str(sample + a + PSG1_code + ad)))
        rPSG2_AD = int(df.columns.get_loc(str(sample + r + PSG2_code + ad)))
        aPSG2_AD = int(df.columns.get_loc(str(sample + a + PSG2_code + ad)))

        cols.append(rPSG1_GT)
        cols.append(aPSG1_GT)
        cols.append(rPSG2_GT)
        cols.append(aPSG2_GT)
        cols.append(rPSG1_AD)
        cols.append(aPSG1_AD)
        cols.append(rPSG2_AD)
        cols.append(aPSG2_AD)

    return df, cols

def main():
    # Add files
    extra = snakemake.params.get("extra")
    java_opts = snakemake.params.get("java_opts")
    log = snakemake.log_fmt_shell(stdout=False, stderr=True)


    df_bi = snakemake.input.get("csv")
    df_bi = pd.read_csv(df_bi, low_memory=False)
    df_bi = pd.DataFrame(df_bi)


    sample_names = pd.read_csv(snakemake.input.get("sn1"))
    PSG_codes = pd.read_csv(snakemake.input.get("psc"))
    # Create arrays of the sample names and the pseudogenome codes
    samples = sample_names['Sample_name'].values
    PSGs = PSG_codes['PSGs'].values

    df_av = snakemake.output[0]

    # Mine the data
    df_average, cols = sample_average(df_bi, samples, PSGs)
    # Drop the columns that are not needed
    df_average.drop(df_average.columns[cols], axis=1, inplace=True)
    df_average.to_csv(df_av)


if __name__ == '__main__':
    main()


shell(
    "{log}"
)
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2021, Aurora Campo"
__email__ = "[email protected]"
__license__ = "MIT"


import os
import pandas as pd
import numpy as np
from os import path
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
shell.executable("bash")

extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


"""
DEL MULTIALLELIC SITES
----------------------

Compare the genotypes for assessing the counts on each allele
Set the data for the mapping against the pseudogenome of sample PSG1 because it
has the variants of the reference genotype. The SNPs called by this sample are
set as reference and alternative alleles for all the other samples.

The tables containing the SNPs from mapping to PSG1 and PSG2 contain only
biallelic sites. However, multiallelic possibilities can appear if one SNP was
called with a different genotype in each of the pseudogenomes. The multiallelic
sites will be deleted from this dataframe to continue the analysis with
biallelic sites only.

Delete multiallelic sites:
Let's start by dropping the multiallelic sites. The criterion compares whether
the reference allele of the SNPs resulting from mapping against the PSG1
pseudogenome is different from both the reference and the alternative alleles
from the mapping against the PSG2 pseudogenome. This is then repeated for the
alternative allele. Note that the SNPs mapped against the PSG1 pseudogenome
have been pre-filtered for biallelic sites for all the samples.

Collect the indexes where the reference allele in PSG1 is different from both
alleles in the other mappings or the alternative allele in PSG1 is different
from both alleles in the PSG2 mapping.
"""



# Make an iterator for the collection of multiallelic sites:
def multiallelic_sample(df, sample, PSGs):
    # For obtaining the column names:
    r = "_R_"
    a = "_A_"
    gt = ".GT"
    PSG1_code: str = PSGs[0]
    PSG2_code: str = PSGs[1]
    rPSG1_GT = int(df.columns.get_loc(str(sample + r + PSG1_code + gt)))
    aPSG1_GT = int(df.columns.get_loc(str(sample + a + PSG1_code + gt)))
    rPSG2_GT = int(df.columns.get_loc(str(sample + r + PSG2_code + gt)))
    aPSG2_GT = int(df.columns.get_loc(str(sample + a + PSG2_code + gt)))
    print("Multiallelic loop in sample: ", sample)

    sample_index = []
    for i in range(len(df)):
        if (((df.iloc[i, rPSG1_GT]) != df.iloc[i, rPSG2_GT]) & (df.iloc[i, rPSG1_GT] != df.iloc[i, aPSG2_GT])) | (
                (df.iloc[i, aPSG1_GT] != df.iloc[i, rPSG2_GT]) & (df.iloc[i, aPSG1_GT] != df.iloc[i, aPSG2_GT])):
            sample_index.append(i)
    return sample_index


# Eliminate the multiallelic sites
def multiallelic(df, samples, PSGs, multi_index):
    for sample in samples:
        multiallelic_sample_index = multiallelic_sample(df, sample, PSGs)
        multi_index = multi_index + multiallelic_sample_index
    df_bi = df.drop(index=multi_index)
    return df_bi

def main():

    # Add files
    mdf = pd.read_csv(snakemake.input.get("csv"), low_memory=False)
    mdf = pd.DataFrame(mdf)
    sample_names = pd.read_csv(snakemake.input.get("sn1"))
    PSG_codes = pd.read_csv(snakemake.input.get("psc"))

    # Create arrays of the sample names and the pseudogenome codes
    samples = sample_names['Sample_name'].values
    PSGs = PSG_codes['PSGs'].values

    # Create an empty array for collect the indexes with multiallelic sites:
    multi_index = []

    # Mine the datafiles
    df_bi = multiallelic(mdf, samples, PSGs, multi_index)
    df_bi.to_csv(snakemake.output[0])


if __name__ == '__main__':
    main()


shell(
    "{log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts


extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")

files=snakemake.input.gvcfs
print(files)
gvcfs = list(map("-V {}".format, snakemake.input.gvcfs))
print(gvcfs)


log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "gatk --java-options '{java_opts}' CombineGVCFs {extra} "
    "{gvcfs} "
    "-R {snakemake.input.ref} "
    "-O {snakemake.output.gvcf} {log}"
)
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2021, Aurora Campo"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
#from snakemake_wrapper_utils.java import get_java_opts


extra = snakemake.params.get("extra", "")
#sample = snakemake.params.get("sample")
#java_opts = get_java_opts(snakemake)


genome = snakemake.input[0]
dictionary = snakemake.output[0]
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "picard CreateSequenceDictionary "
    "REFERENCE={genome} "
    "OUTPUT= {dictionary} "
    "{log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts


extra = snakemake.params.get("extra", "")
java_opts = snakemake.params.get("java_opts")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "picard CreateSequenceDictionary"
        " {java_opts} {extra}"
        " --REFERENCE {snakemake.input[0]}"
        " --TMP_DIR {tmpdir}"
        " --OUTPUT {snakemake.output[0]}"
        " --CREATE_INDEX true"
        " {log}"
    )
__author__ = "Michael Chambers"
__copyright__ = "Copyright 2019, Michael Chambers"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import get_samtools_opts

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell("samtools faidx {snakemake.input[0]} {log}")
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os

from snakemake.shell import shell
#from snakemake_wrapper_utils.java import get_java_opts


extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
    "gatk --java-options '{java_opts}' GenotypeGVCFs {extra} "
    "-V {snakemake.input.gvcf} "
    "-R {snakemake.input.ref} "
    "-O {snakemake.output.vcf} {log}"
)
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2021, Aurora Campo"
__email__ = "[email protected]"
__license__ = "MIT"


import os
import pandas as pd
import numpy as np
from os import path
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
shell.executable("bash")

extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


"""
ASSIGN REFERENCE AND ALTERNATIVE ALLELES UNIFORMLY IN ALL SAMPLES
-----------------------------------------------------------------
Reference and alternative alleles have been assigned by sample. This function will correct it and will assign the
reference and alternative alleles according to the pattern established in sample 1.
"""



def compare(df, sample, R_models, A_models,SNP_panel):
    # This iterator will construct the new columns for the samples according to the reference and alternative alleles
    # assigned in sample1

    # Position of the sample columns
    sample_r_gt = int(df.columns.get_loc(str(sample + "_R_.GT")))
    sample_a_gt = int(df.columns.get_loc(str(sample + "_A_.GT")))
    sample_r_ad = int(df.columns.get_loc(str(sample + "_R_.AD")))
    sample_a_ad = int(df.columns.get_loc(str(sample + "_A_.AD")))

    # Create the empty lists to collect the average variables and genotypes
    r_AD = []
    a_AD = []
    r_GT = []
    a_GT = []

    print("Compare genotypes loop in sample: ", sample)

    # Start iterator for model genotypes
    for k in range(len(df)):
        # Set in the model the new genotypes in case the samples didn't express this allele.
        if R_models[k] == A_models[k] and R_models[k] == ".":
            # print("Previous: ", R_models[k], A_models[k], ", new: ", df.iloc[k, sample_r_gt], df.iloc[k, sample_a_gt])
            R_models[k] = df.iloc[k, sample_r_gt]
            A_models[k] = df.iloc[k, sample_a_gt]
        elif R_models[k] == A_models[k] and R_models[k] == df.iloc[k, sample_r_gt]:
            A_models[k] = df.iloc[k, sample_a_gt]
        elif R_models[k] == A_models[k] and R_models[k] == df.iloc[k, sample_a_gt]:
            A_models[k] = df.iloc[k, sample_r_gt]


    # Start iterator for comparison with the new models
    for i in range(len(df)):
        if R_models[i] == df.iloc[i, sample_r_gt] and A_models[i] == df.iloc[i, sample_a_gt]:
            # 1 Reference and alternative are set in the same position for sample and sample1 (reference)
            # It applies to both samples heterozygous or both homozygous with the same genotype
            r_AD.append(df.iloc[i, sample_r_ad])
            a_AD.append(df.iloc[i, sample_a_ad])
            r_GT.append(df.iloc[i, sample_r_gt])
            a_GT.append(df.iloc[i, sample_a_gt])
        elif A_models[i] == df.iloc[i, sample_r_gt] and R_models[i] == df.iloc[i, sample_a_gt]:
            # 2 Reference and alternative are set in the inverse position for sample and sample1 (reference)
            # It applies to both samples heterozygous or both homozygous with the same genotype
            r_AD.append(df.iloc[i, sample_a_ad])
            a_AD.append(df.iloc[i, sample_r_ad])
            r_GT.append(df.iloc[i, sample_a_gt])
            a_GT.append(df.iloc[i, sample_r_gt])
        elif R_models[i] == df.iloc[i, sample_r_gt] and A_models[i] == df.iloc[i, sample_r_gt] and A_models[i] != \
                df.iloc[i, sample_a_gt]:
            # 3 It applies to the first sample homozygous and the second sample heterozygous with the alternative allele in second position
            r_AD.append(df.iloc[i, sample_r_ad])
            a_AD.append(df.iloc[i, sample_a_ad])
            r_GT.append(df.iloc[i, sample_r_gt])
            a_GT.append(df.iloc[i, sample_a_gt])
        elif R_models[i] != df.iloc[i, sample_r_gt] and A_models[i] != df.iloc[i, sample_r_gt] and A_models[i] == \
                df.iloc[i, sample_a_gt]:
            # 4 It applies to the first sample homozygous and the second sample heterozygous with the alternative allele in first position
            r_AD.append(df.iloc[i, sample_a_ad])
            a_AD.append(df.iloc[i, sample_r_ad])
            r_GT.append(df.iloc[i, sample_a_gt])
            a_GT.append(df.iloc[i, sample_r_gt])
        elif R_models[i] == A_models[i] and R_models[i] != df.iloc[i, sample_r_gt] and df.iloc[i, sample_r_gt] == \
                df.iloc[i, sample_a_gt]:
            # 5 It applies to the first sample homozygous for the reference allele and the second sample homozygous for the
            # alternative allele
            a_AD.append(df.iloc[i, sample_r_ad] + df.iloc[i, sample_a_ad])  # One of them will be 0
            r_GT.append(df.iloc[i, sample_a_gt])
            a_GT.append(df.iloc[i, sample_a_gt])
            r_AD.append(0)
        elif R_models[i] != A_models[i] and R_models[i] == df.iloc[i, sample_r_gt] and df.iloc[i, sample_r_gt] == \
                df.iloc[i, sample_a_gt]:
            # 6 It applies to the first sample heterozygous and the second sample homozygous for the reference allele
            r_GT.append(df.iloc[i, sample_r_gt])
            a_GT.append(df.iloc[i, sample_r_gt])
            r_AD.append(df.iloc[i, sample_r_ad] + df.iloc[i, sample_a_ad])  # One of them will be 0
            a_AD.append(0)
        elif R_models[i] != A_models[i] and A_models[i] == df.iloc[i, sample_a_gt] and df.iloc[i, sample_r_gt] == \
                df.iloc[i, sample_a_gt]:
            # 7 It applies to the first sample heterozygous and the second sample homozygous for the alternative allele
            a_AD.append(df.iloc[i, sample_r_ad] + df.iloc[i, sample_a_ad])  # One of them will be 0
            r_GT.append(df.iloc[i, sample_a_gt])
            a_GT.append(df.iloc[i, sample_a_gt])
            r_AD.append(0)
        elif R_models[i] == df.iloc[i, sample_r_gt] and A_models[i] == df.iloc[i, sample_r_gt] and df.iloc[
            i, sample_r_gt] == df.iloc[i, sample_a_gt]:
            # 8 It applies to the first sample homozygous for the reference allele and the second sample homozygous for the
            # reference allele too
            r_AD.append(df.iloc[i, sample_r_ad] + df.iloc[i, sample_a_ad])  # One of them will be 0
            r_GT.append(df.iloc[i, sample_r_gt])
            a_GT.append(df.iloc[i, sample_r_gt])
            a_AD.append(0)
        elif R_models[i] == A_models[i] and R_models[i] == ".":
            # 9 SNP not expressed in sample1 as the reference genotype
            r_AD.append(df.iloc[i, sample_r_ad])
            a_AD.append(df.iloc[i, sample_a_ad])
            r_GT.append(df.iloc[i, sample_r_gt])
            a_GT.append(df.iloc[i, sample_a_gt])
            print(i)


        elif df.iloc[i, sample_r_gt] == df.iloc[i, sample_a_gt] and df.iloc[i, sample_r_gt] == ".":
            # 10 SNP not expressed in the sample to evaluate
            r_AD.append(0)
            a_AD.append(0)
            r_GT.append(".")
            a_GT.append(".")
        else:
            print("ERROR: Comparison of genotypes of sample ",
                    sample, " provided an error for the SNP in the row ", i, " with genotypes ",
                    df.iloc[i, sample_a_gt], ", ",
                    df.iloc[i, sample_r_gt], ". You must check if the evaluation of this case is correct")


    # Make a dataframe with the model genotypes
    SNP_panel["Reference allele"] = R_models
    SNP_panel["Alternative allele"] = A_models
    SNP_panel.to_csv(snakemake.output.get("result2"))
    print("Result 2, SNP panel, written in disk")


    genotype_models = pd.DataFrame(data=[R_models, A_models]).T
    genotype_models.to_csv(snakemake.output.get("result3"))
    print("Results 3, Genotype models, written in disk")

    return r_AD, a_AD, r_GT, a_GT, R_models, A_models



def genotype(df, samples):
    # Make the new df and prepare the first columns of the SNP_panel:
    df_uni = df.iloc[:, 1:10]
    SNP_panel = df.iloc[:, 1:10]
    # Determines the genotypes for sample1 and calls the function to compare genotypes
    j = 0
    for sample in samples:
        j = j + 1
        if j == 1:  # If we are in the first sample we set the reference
            R_models = df[str(sample + "_R_.GT")].to_numpy()
            A_models = df[str(sample + "_A_.GT")].to_numpy()
            # Collection the name of the samples
            sample_r_gt = str(sample + "_R_.GT")
            sample_a_gt = str(sample + "_A_.GT")
            sample_r_ad = str(sample + "_R_.AD")
            sample_a_ad = str(sample + "_A_.AD")
            # Assign the list to the column name
            df_uni[sample_r_gt] = df[str(sample + "_R_.GT")]
            df_uni[sample_a_gt] = df[str(sample + "_A_.GT")]
            df_uni[sample_r_ad] = df[str(sample + "_R_.AD")]
            df_uni[sample_a_ad] = df[str(sample + "_A_.AD")]
        else:  # If we already passed the first sample
            (r_ad, a_ad, r_gt, a_gt, R_models, A_models) = compare(df, sample, R_models, A_models, SNP_panel)
            # Collection the name of the samples
            sample_r_gt = str(sample + "_R_.GT")
            sample_a_gt = str(sample + "_A_.GT")
            sample_r_ad = str(sample + "_R_.AD")
            sample_a_ad = str(sample + "_A_.AD")
            # Assign the list to the column name
            df_uni[sample_r_gt] = r_gt
            df_uni[sample_a_gt] = a_gt
            df_uni[sample_r_ad] = r_ad
            df_uni[sample_a_ad] = a_ad
    #df_uni.drop(['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1', 'Unnamed: 0_x'], axis=1)
    return df_uni

def main():
    # Add the files
    ad10_df=snakemake.input.get("csv")
    geno_df=snakemake.output.get("result1")
    result2=snakemake.output.get("result2")
    result3=snakemake.output.get("result3")

    df_ad10 = pd.read_csv(ad10_df, low_memory=False)
    df_ad10 = pd.DataFrame(df_ad10)
    sample_names = pd.read_csv(snakemake.input.get("sn1"))
    # Create arrays of the sample names
    samples = sample_names['Sample_name'].values


    # Mine the datafiles
    df_uni = genotype(df_ad10, samples)
    #drop the useless columns:

    df_uni.to_csv(geno_df)
    print("Final file, Uniform SNPs, written in disk")

if __name__ == '__main__':
    main()


shell(
    "{log}"
)
__author__ = "Aurora Campo"
__copyright__ = "Cpoyright 2020, Aurora Campo, modified from Copyright 2018, Johannes Köster"
__email__ = "[email protected]; [email protected]"
__license__ = "MIT"


import os

from snakemake.shell import shell
#from snakemake_wrapper_utils.java import get_java_opts

extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")

known = snakemake.input.get("known", "")
if known:
    known = "--dbsnp " + known

bams = snakemake.input.bam
if isinstance(bams, str):
    bams = [bams]
bams = list(map("-I {}".format, bams))

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
    "gatk --java-options '{java_opts}' HaplotypeCaller {extra} "
    "-R {snakemake.input.ref} {bams} "
    "-ERC GVCF "
    "-O {snakemake.output.gvcf} {known} {log}"
)
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2021, Aurora Campo"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
#from snakemake_wrapper_utils.java import get_java_opts

input = snakemake.input.get("bam")
ref = snakemake.input.get("annotation")
output = snakemake.output[0]
extra = snakemake.params.get("extra")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


shell(
     "htseq-count "
     "{extra} "
     "{input} "
     "{ref} "
     "> {output} "
     "{log}"
)
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2020, Aurora Campo modified from Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "CC"


from snakemake.shell import shell
#from snakemake_wrapper_utils.java import get_java_opts

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")


shell(
    "picard MarkDuplicates "  # Tool and its subcommand
    "{java_opts} "  # Automatic java option
    "{extra} "  # User defined parmeters
    "INPUT={snakemake.input} "  # Input file
    "OUTPUT={snakemake.output.bam} "  # Output bam
    "METRICS_FILE={snakemake.output.metrics} "  # Output metrics
    "{log}"  # Logging
)
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2021, Aurora Campo"
__email__ = "[email protected]"
__license__ = "MIT"


import os
import pandas as pd
import numpy as np
from os import path
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
shell.executable("bash")

extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


"""
 MERGE THE DATAFRAMES
-----------------------

The format of the table is the same for both dataframes. Columns from 1 to 6 are:
    CHROM: Chromosome,
    POS: position,
    Gene.refGene: Gene_ID
    Func.refGene: Annotation of the function for this SNP
    ExonicFunc.refGene: Annotation of the function if this SNP is exonic
    AF: Allele frequency estimated for all the samples

From the 7th column to the end there are 4 columns for each sample. The structure of the column name is
as follows:
    NAME (determined by the sample name)
    _R or _A (reference or alternative allele)
    _??? (The pseudogenome code)
    .AD (allele depth)
    .GT (Genotype)

The next step is to merge the two dataframes on the common chromosome and position:
"""

# Read the files
csv1=snakemake.input.get("csv1")
csv2=snakemake.input.get("csv2")
merged_df=snakemake.output[0]

PSG1 = pd.read_csv(csv1, low_memory=False)
PSG2 = pd.read_csv(csv2, low_memory=False)
PSG1 = pd.DataFrame(PSG1)
PSG2 = pd.DataFrame(PSG2)

# Merge
df = pd.merge(PSG1, PSG2, on=('CHROM', 'POS'))
df.to_csv(merged_df, sep=',')
print('Files merged')

shell(
    "{log}"
)
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2021, Aurora Campo, modified from git"
__git__= "https://github.com/johanzi/make_pseudogenome"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
#from snakemake_wrapper_utils.java import get_java_opts

input = snakemake.input.get("vcf")
extra = snakemake.params.get("extra")
ref = snakemake.input.get("ref")
pseudo = snakemake.output.get("pseudo")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


shell(
    "bcftools consensus "
    "{input} "
    "--sample {extra} "
    "--fasta-ref {ref} "
    "> {pseudo} "
    "{log}"
)
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2021, Aurora Campo"
__email__ = "[email protected]"
__license__ = "MIT"


"""
python 3.7

    @author : Ayla
    @version : 1.0
    @date : September 2021

This script is a quality control of the data filtered by the script "ASE_workflow.py".
These SNPs are unbiased and biallelic, and the monoallelic expression (MAE) has been discarded.
This script will plot the distribution of the data in general and for each experimental group.
It needs a file with the name of the experimental group as the acronyms used in the sample name. For example:
    The sample number 1 corresponds to the gills exposed to freshwater
    The two experimental factors are "gills" (G) and "freshwater" (F)
    Then the sample is called "1GF" and the group name will be called "GF"
This script also needs a csv file with the first column naming the experimental groups and the next columns with the
sample names of each sample in a group
"""

import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
shell.executable("bash")

extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


"""
    ALLELE FREQUENCIES
    ------------------
    Compute allele frequencies for each allele and later for each experimental group.
    """

def frequencies(df, samples):
    """
    :param df: dataframe with unbiased SNP counts without allele frequencies
    :param samples: list of strings including the sample names
    :return: dataframe with the allele frequencies by sample
    """
    # Calculate the allele frequencies
    for sample in samples:
        sample_r_ad = str(sample + "_R_.AD")
        sample_a_ad = str(sample + "_A_.AD")
        af_sample = str("AF_" + sample)
        df[af_sample] = df[sample_r_ad] / (df[sample_r_ad] + df[sample_a_ad])
    df = df.fillna(0)  # To fill cases where there are no counts and therefore AF is divided by 0
    print("Allele frequencies by sample calculated")

    return df

def allele_freqs(group_names, df):
    # Calculate the allele frequency for each experimental group in order to make a plot of each group:

    for group_name in group_names:
        af_group = "AF_" + group_name
        x_name = af_group + "_x"
        y_name = af_group + "_y"
        print(x_name)
        print(y_name)
        x = df.loc[:, df.columns.str.contains("([0-9])+" + group_name + "_R_.AD", regex=True)]
        y = df.loc[:, df.columns.str.contains("([0-9])+" + group_name + "_A_.AD", regex=True)]
        df[x_name] = x.sum(axis=1)
        df[y_name] = y.sum(axis=1)
        df[af_group] = df[x_name]/(df[x_name] + df[y_name])
        #df.drop(columns=x_name, inplace=True)
        #df.drop(columns=y_name, inplace=True)

    df = df.fillna(0)
    print("Allele frequencies by experimental group calculated")
    return df


def plot(df, group_names):
    # Plot all the histograms with the allele frequencies

    for group_name in group_names:
        af_group = "AF_" + group_name

        # Alternative: df.hist(column=af_group, bins=100, grid=False, figsize=(8,10), layout=(1,1), sharex=True, color='#86bf91', zorder=2, rwidth=0.9)
        hist_data = df[af_group].values
        plt.hist(hist_data, 100, density=False, facecolor='#86bf91')
        plt.xlabel('Allele frequency')
        plt.ylabel('Frequency')
        plt.title(af_group)
        plt.grid(True)
        axes = plt.gca()
        axes.set_ylim([0, 15000])
        # plt.show()

        # Save the figure into the results folder
        PATH = os.getcwd()
        os.chdir("results")
        plt.savefig(group_name + "_AF_mapping_bias.svg")
        os.chdir(PATH)

        plt.clf()


def main():
    # Import the files
    df = pd.read_csv(snakemake.input.get("i1"))
    sample_names = pd.read_csv(snakemake.input.get("sn1"))
    # Create arrays of the sample names
    samples = sample_names['Sample_name'].values
    groups_df = pd.read_csv(snakemake.input.get("i2"))
    # Create a list with the name of each group (values of the "Group" column, as in Stats.py;
    # the original list(groups_df.columns[0]) split the first column *name* into characters)
    group_names = list(groups_df["Group"])

    # Mine the data
    df_af = frequencies(df, samples)
    df_qc = allele_freqs(group_names, df_af)

    # Copy the quality control file for further calculation of the stats
    df_qc.to_csv(snakemake.output.get("results"))

    # Plot the histograms
    plot(df_qc, group_names)


if __name__ == '__main__':
    main()


shell(
    "{log}"
)
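To make the column conventions concrete, here is a small, self-contained sketch of what the `frequencies` function above computes on a toy table (sample names and counts are made up; columns follow the `<sample>_R_.AD` / `<sample>_A_.AD` pattern):

import pandas as pd

# Toy allele-depth table for two hypothetical samples, 1GF and 2GF
df = pd.DataFrame({
    "1GF_R_.AD": [8, 0, 5],   # reference-allele read counts
    "1GF_A_.AD": [2, 0, 5],   # alternative-allele read counts
    "2GF_R_.AD": [3, 1, 0],
    "2GF_A_.AD": [7, 3, 0],
})
for sample in ["1GF", "2GF"]:
    df["AF_" + sample] = df[sample + "_R_.AD"] / (df[sample + "_R_.AD"] + df[sample + "_A_.AD"])
df = df.fillna(0)  # rows with no reads would otherwise be 0/0 = NaN
print(df[["AF_1GF", "AF_2GF"]])  # 0.8/0.3, 0.0/0.25, 0.5/0.0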
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2020, Aurora Campo modified from Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
#from snakemake_wrapper_utils.java import get_java_opts

extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")


shell(
    "picard AddOrReplaceReadGroups {java_opts} {snakemake.params} "
    "I={snakemake.input} "
    "O={snakemake.output} &> {snakemake.log}"
)
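Picard's AddOrReplaceReadGroups expects the read-group fields as KEY=value arguments, so the `extra` string an invoking rule passes could look like this (a hedged sketch; all values are illustrative):

# Hypothetical params a rule might define for this wrapper:
extra = "RGID=1 RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=1GF"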
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
#from snakemake_wrapper_utils.java import get_java_opts

extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
    "gatk --java-options '{java_opts}' SelectVariants "
    "-R {snakemake.input.ref} "
    "-V {snakemake.input.vcf} "
    "{extra} "
    "-O {snakemake.output.vcf} {log}"
)
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2021, Aurora Campo"
__email__ = "[email protected]"
__license__ = "MIT"

import os

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")
input = snakemake.input.get("bam")
ref = snakemake.input.get("ref")
output = snakemake.output.get("bam")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "gatk --java-options '{java_opts}' SplitNCigarReads {extra} "
    "-R {ref} "
    "-I {input} "
    "-O {output} "
    "{log}"
)
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2021, Aurora Campo"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell
#from snakemake_wrapper_utils.java import get_java_opts

input = snakemake.input[0]
threads = snakemake.params.get("threads")
annotation = snakemake.params.get("annotation")
dir = snakemake.params.get("dir")
read_length = int(snakemake.params.get("read_length"))-1
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "STAR "
    "--runThreadN {threads} "
    "--runMode genomeGenerate "
    "--genomeDir {dir} "
    "--genomeFastaFiles {input} "
    "--sjdbGTFfile {annotation} "
    "--sjdbOverhang {read_length} " #ReadLength-1
    "{log}"
)
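STAR's --sjdbOverhang should be the read length minus one, which is exactly what the wrapper computes from the read_length param; for example:

read_length = "100"                   # hypothetical config value (the param arrives as a string)
sjdb_overhang = int(read_length) - 1  # STAR's recommended ReadLength-1
print(sjdb_overhang)                  # 99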
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2020, Aurora Campo, modified from Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
threads = snakemake.params.get("threads")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
index = snakemake.input.get("index")
outprefix = snakemake.params.get("filename")
fq1 = snakemake.input.get("fastq1")
fq2 = snakemake.input.get("fastq2")
index_dir = index.rsplit('/', 1)[0]  # keep everything before the last "/" (the index directory)

assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = (
    [snakemake.input.fastq1]
    if isinstance(snakemake.input.fastq1, str)
    else snakemake.input.fastq1
)

if fq2:
    fq2 = (
        [snakemake.input.fastq2]
        if isinstance(snakemake.input.fastq2, str)
        else snakemake.input.fastq2
    )
    assert len(fq1) == len(
        fq2
    ), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str = " ".join([input_str_fq1, input_str_fq2])


if fq1[0].endswith(".gz"):
    readcmd = "--readFilesCommand zcat"
else:
    readcmd = ""



shell(
    "STAR "
    "{extra} "
    "--runThreadN {threads} "
    "--genomeDir {index_dir} "
    "--readFilesIn {fq1} {fq2} " #{input_str} for single ends
    "{readcmd} "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "--twopassMode Basic "
    "--quantMode GeneCounts "
    "--outSAMattrIHstart 0 "
    "--alignSoftClipAtReferenceEnds No "
    "--limitBAMsortRAM 300647556788 "
    "--outBAMsortingThreadN 2 "
    "--outSAMstrandField intronMotif "
    "--outFilterIntronMotifs RemoveNoncanonical "
    "--outSAMattributes All "
    "--outFilterScoreMinOverLread 0 "
    "--outFilterMatchNminOverLread 0 "
    "--outFilterMatchNmin 0 "
    "--outFilterMismatchNmax 2 "
    "--outSAMtype BAM SortedByCoordinate "
    "{log}"
)
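When a sample has several FASTQ files per mate, STAR expects them comma-separated within a mate and space-separated between mates, which is what the joins above produce. A toy check (file names are made up):

fq1 = ["S1_L001_R1.fastq.gz", "S1_L002_R1.fastq.gz"]  # hypothetical lane files, mate 1
fq2 = ["S1_L001_R2.fastq.gz", "S1_L002_R2.fastq.gz"]  # hypothetical lane files, mate 2
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2)
print(input_str_fq1, input_str_fq2)
# S1_L001_R1.fastq.gz,S1_L002_R1.fastq.gz S1_L001_R2.fastq.gz,S1_L002_R2.fastq.gz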
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2020, Aurora Campo, modified from Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
threads = snakemake.params.get("threads")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
index = snakemake.input.get("index")
outprefix = snakemake.params.get("filename")
fq1 = snakemake.input.get("fastq1")
fq2 = snakemake.input.get("fastq2")
index_dir = index.rsplit('/', 1)[0]  # keep everything before the last "/" (the index directory)

assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = (
    [snakemake.input.fastq1]
    if isinstance(snakemake.input.fastq1, str)
    else snakemake.input.fastq1
)

input_str_fq1 = ",".join(fq1)  # STAR expects a comma-separated list when a sample has several FASTQ files

if fq1[0].endswith(".gz"):
    readcmd = "--readFilesCommand zcat"
else:
    readcmd = ""



shell(
    "STAR "
    "{extra} "
    "--runThreadN {threads} "
    "--genomeDir {index_dir} "
    "--readFilesIn {fq1} " 
    "{readcmd} "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "--twopassMode Basic "
    "--quantMode GeneCounts "
    "--outSAMattrIHstart 0 "
    "--alignSoftClipAtReferenceEnds No "
    "--limitBAMsortRAM 300647556788 "
    "--outBAMsortingThreadN 2 "
    "--outSAMstrandField intronMotif "
    "--outFilterIntronMotifs RemoveNoncanonical "
    "--outSAMattributes All "
    "--outFilterScoreMinOverLread 0 "
    "--outFilterMatchNminOverLread 0 "
    "--outFilterMatchNmin 0 "
    "--outFilterMismatchNmax 2 "
    "--outSAMtype BAM SortedByCoordinate "
    "{log}"
)
import pandas as pd
from scipy import stats
from scipy.stats import chisquare
import scipy as sc
import matplotlib.pyplot as plt

import os
import statsmodels.stats as smt
from statsmodels.stats import multitest
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
shell.executable("bash")

extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


def chi_test(df, af_samples, experiment, exp):
    # Collect the row of the experiment where the design is described
    design = exp[exp["Test_ID"].str.match(experiment)].iloc[0]
    group1 = design["Group_1"]
    group2 = design["Group_2"]
    # Collect the columns of the dataframe that will go into the test:
    group1_samples = [i for i in af_samples if group1 in i]
    group2_samples = [i for i in af_samples if group2 in i]
    if len(group1_samples) > len(group2_samples):
        print("WARNING: CHI2 Test for ", experiment,
              " not possible with the group distribution.\nThe number of samples will be adjusted to perform the test")

        # Trim the larger group so both groups have the same number of samples
        group1_samples = group1_samples[: len(group2_samples)]

        cols_control = df[df.columns[df.columns.isin(group1_samples)]]
        cols_exp = df[df.columns[df.columns.isin(group2_samples)]]
        # Initialize the lists for collecting the statistics
        chi_group2 = []
        p_chi_group2 = []
        stat_chi_group2 = []

        # Perform the test
        for i in range(len(df)):
            chi_group2.append(chisquare(cols_exp.iloc[i,], f_exp=cols_control.iloc[i,]))
            p_chi_group2.append(chi_group2[i].pvalue)
            stat_chi_group2.append(chi_group2[i].statistic)

        # add the multitest also
        multi = smt.multitest.multipletests(p_chi_group2, alpha=0.05, method='bonferroni', is_sorted=False,
                                            returnsorted=False)
        # Append the p-values
        pvalue = experiment + "_CHI_p-val"
        pfdr = experiment + "_CHI_p-fdr"
        df[pvalue] = p_chi_group2
        df[pfdr] = multi[1]

    elif len(group1_samples) < len(group2_samples):
        print("WARNING: CHI2 Test for ", experiment,
              " not possible with the group distribution.\nThe number of samples will be adjusted to perform the test")

        # Trim the larger group so both groups have the same number of samples
        group2_samples = group2_samples[: len(group1_samples)]
        cols_control = df[df.columns[df.columns.isin(group1_samples)]]
        cols_exp = df[df.columns[df.columns.isin(group2_samples)]]
        # Initialize the lists for collecting the statistics
        chi_group2 = []
        p_chi_group2 = []
        stat_chi_group2 = []

        # Perform the test
        for i in range(len(df)):
            chi_group2.append(chisquare(cols_exp.iloc[i,], f_exp=cols_control.iloc[i,]))
            p_chi_group2.append(chi_group2[i].pvalue)
            stat_chi_group2.append(chi_group2[i].statistic)

        # add the multitest also
        multi = smt.multitest.multipletests(p_chi_group2, alpha=0.05, method='bonferroni', is_sorted=False,
                                            returnsorted=False)

        # Append the p-values
        pvalue = experiment + "_CHI_p-val"
        pfdr = experiment + "_CHI_p-fdr"
        df[pvalue] = p_chi_group2
        df[pfdr] = multi[1]

    elif (len(group1_samples) < 5) | (len(group2_samples) < 5):
        print("The number of counts for ", experiment, " is too low and the chi-square test cannot be performed")
        print("The number of samples in each group must be at least 5")

    else:
        cols_control = df[df.columns[df.columns.isin(group1_samples)]]
        cols_exp = df[df.columns[df.columns.isin(group2_samples)]]
        # Initialize the lists for collecting the statistics
        chi_group2 = []
        p_chi_group2 = []
        stat_chi_group2 = []

        # Perform the test
        for i in range(len(df)):
            chi_group2.append(chisquare(cols_exp.iloc[i,], f_exp=cols_control.iloc[i,]))
            p_chi_group2.append(chi_group2[i].pvalue)
            stat_chi_group2.append(chi_group2[i].statistic)

        # add the multitest also
        multi = smt.multitest.multipletests(p_chi_group2, alpha=0.05, method='bonferroni', is_sorted=False,
                                            returnsorted=False)
        # Append the p-values
        pvalue = experiment + "_CHI_p-val"
        pfdr = experiment + "_CHI_p-fdr"
        df[pvalue] = p_chi_group2
        df[pfdr] = multi[1]

    return df


def chi_square(df, samples, exp, experiments):
    # Create the strings corresponding to the allele frequency in each sample
    df_chi = df  # fall back to the unchanged dataframe if no experiment qualifies
    af_samples = []
    for sample in samples:
        af_sample = "AF_" + sample
        af_samples.append(af_sample)

    # Elements for the test
    for experiment in experiments:
        df_chi = chi_test(df, af_samples, experiment, exp)
    print("Chi^2 test performed if possible on all the experiments\nOtherwise, Fisher test will be used for the final report of the results.")

    return df_chi




def fisher_test(df, samples, exp, experiment):
    # Prepare the strings for the column names
    design = exp[exp["Test_ID"].str.match(experiment)].iloc[0]
    group1 = design["Group_1"]
    group2 = design["Group_2"]

    # Collect the columns of the dataframe that will go into the test:
    group1_samples = [i for i in samples if group1 in i]
    group2_samples = [i for i in samples if group2 in i]
    r = "_R_.AD"
    a = "_A_.AD"


    if len(group1_samples) == len(group2_samples):

        # Perform the test
        k = 0  # Iterator for the samples in group 2
        for group1_sample in group1_samples:
            p_fisher = []
            stat_fisher = []
            for i in range(len(df)):
                # sample names and positions
                control_r = df.iloc[i, int(df.columns.get_loc(group1_sample + r))]
                control_a = df.iloc[i, int(df.columns.get_loc(group1_sample + a))]
                exp_r = df.iloc[i, int(df.columns.get_loc(group2_samples[k] + r))]
                exp_a = df.iloc[i, int(df.columns.get_loc(group2_samples[k] + a))]

                # test
                oddsratio, pvalue = stats.fisher_exact([[exp_r, exp_a], [control_r, control_a]])
                stat_fisher.append(oddsratio)
                p_fisher.append(pvalue)

            # add the multitest also
            multi = smt.multitest.multipletests(p_fisher, alpha=0.05, method='fdr_bh', is_sorted=False,
                                                returnsorted=False)

            # Make columns of the dataframe with the results
            # NOTE: every sample pair writes to the same two columns, so after the loop they hold the last pair's values
            fisher_pval = experiment + "_Fisher_p-val"
            fisher_fdr = experiment + "_Fisher_p-fdr"
            df[fisher_pval] = pd.Series(p_fisher, index = df.index)
            df[fisher_fdr] = pd.Series(multi[1], index = df.index)
            k = (k + 1)


    elif len(group1_samples) > len(group2_samples):
        print("WARNING: Fisher Test for ", experiment,
              " not possible with the group distribution.\nThe number of samples will be adjusted to perform the test")

        # Trim the larger group so both groups have the same number of samples
        group1_samples = group1_samples[: len(group2_samples)]

        # Perform the test
        k = 0  # Iterator for the samples in group 2
        for group1_sample in group1_samples:
            p_fisher = []
            stat_fisher = []
            for i in range(len(df)):
                # sample names and positions
                control_r = df.iloc[i, int(df.columns.get_loc(group1_sample + r))]
                control_a = df.iloc[i, int(df.columns.get_loc(group1_sample + a))]
                exp_r = df.iloc[i, int(df.columns.get_loc(group2_samples[k] + r))]
                exp_a = df.iloc[i, int(df.columns.get_loc(group2_samples[k] + a))]

                # test
                oddsratio, pvalue = stats.fisher_exact([[exp_r, exp_a], [control_r, control_a]])
                stat_fisher.append(oddsratio)
                p_fisher.append(pvalue)


            # add the multitest also
            multi = smt.multitest.multipletests(p_fisher, alpha=0.05, method='fdr_bh', is_sorted=False,
                                                returnsorted=False)

            # Make columns of the dataframe with the results
            fisher_pval = experiment + "_" + group1_sample + group2_samples[k] + "_Fisher_p-val"
            fisher_fdr = experiment + "_" + group1_sample + group2_samples[k] + "_Fisher_p-fdr"
            df[fisher_pval] = pd.Series(p_fisher, index=df.index)
            df[fisher_fdr] = pd.Series(multi[1], index=df.index)
            k = k + 1

    else:
        print("WARNING: Fisher Test for ", experiment,
              " not possible with the group distribution.\nThe number of samples will be adjusted to perform the test")

        # Trim the larger group so both groups have the same number of samples
        group2_samples = group2_samples[: len(group1_samples)]

        # Perform the test
        k = 0  # Iterator for the samples in group 2
        for group1_sample in group1_samples:
            p_fisher = []
            stat_fisher = []
            for i in range(len(df)):
                # sample names and positions
                control_r = df.iloc[i, int(df.columns.get_loc(group1_sample + r))]
                control_a = df.iloc[i, int(df.columns.get_loc(group1_sample + a))]
                exp_r = df.iloc[i, int(df.columns.get_loc(group2_samples[k] + r))]
                exp_a = df.iloc[i, int(df.columns.get_loc(group2_samples[k] + a))]

                # test
                oddsratio, pvalue = stats.fisher_exact([[exp_r, exp_a], [control_r, control_a]])
                stat_fisher.append(oddsratio)
                p_fisher.append(pvalue)

            # add the multitest also
            multi = smt.multitest.multipletests(p_fisher, alpha=0.05, method='fdr_bh', is_sorted=False,
                                                returnsorted=False)

            # Make columns of the dataframe with the results
            fisher_pval = experiment + "_" + group1_sample + group2_samples[k] + "_Fisher_p-val"
            fisher_fdr = experiment + "_" + group1_sample + group2_samples[k] + "_Fisher_p-fdr"
            df[fisher_pval] = pd.Series(p_fisher, index=df.index)
            df[fisher_fdr] = pd.Series(multi[1], index=df.index)
            k = k + 1



    return df


def Fisher(df, samples, exp, experiments):
    df_fisher = df  # fall back to the unchanged dataframe if there are no experiments
    for experiment in experiments:
        df_fisher = fisher_test(df, samples, exp, experiment)
    print("Fisher exact test successfully performed on all the experiments")
    return df_fisher


def binomial(df,samples):
    r = "_R_.AD"
    a = "_A_.AD"
    for sample in samples:
        pval = []
        fdr_pval = []

        for i in range(len(df)):
            # Get the values of the rows
            ref = df.iloc[i, int(df.columns.get_loc(sample + r))]
            alt = df.iloc[i, int(df.columns.get_loc(sample + a))]
            total_reads = ref + alt

            # Perform the test
            binom = (sc.stats.binom_test(ref, total_reads, 0.5, alternative='two-sided'))
            pval.append(binom)

        # add the multitest also
        multi = smt.multitest.multipletests(pval, alpha=0.05, method='fdr_bh', is_sorted=False,
                                            returnsorted=False)
        col_name_binom = sample + "_Binomial_pvalue"
        col_name_multi = sample + "_Binomial_fdr_pvalue"
        df[col_name_binom] = pd.Series(pval, index=df.index)
        df[col_name_multi] = pd.Series(multi[1], index=df.index)
    print("Binomial test successfully performed on all the experiments")
    return df


def main ():

    # Import the files
    exp = pd.read_csv(snakemake.input.get("i1"))
    sample_names = pd.read_csv(snakemake.input.get("i3"))
    df_qc = pd.read_csv(snakemake.input.get("i4"))

    groups_df = pd.read_csv(snakemake.input.get("i2"))
    # Create a numpy array of arrays with the samples of each group and the total samples of the experiment
    groups = groups_df.values
    # Create a numpy array with the name of each group
    group_names = list(groups_df["Group"])

    # Create arrays of the sample names
    samples = sample_names["Sample_name"].values

    # Create an array with the name of each experimental test
    experiments = list(exp["Test_ID"])


    """
    STATISTICAL TESTS
    -----------------
    We implement:
        Chi² test for allele specific expression in each experimental group, based on allele frequencies
        Binomial test for allelic imbalance in each sample, based on allele counts
        Fisher exact test for allele specific expression in samples from the same individual, based on allele counts
    A Bonferroni one-step correction is applied to the p-values of the chi-square test, and a Benjamini-Hochberg
    (FDR) correction is applied to the Fisher and Binomial tests respectively.


    Chi-square test for conditions based on allele frequencies
    ----------------------------------------------------------
    In this test we compare the allele frequencies in allelic imbalance according to the experimental design described in
    the file "Experimental_design.csv". It is possible to test more than one factor, but factors are compared in pairs,
    "Group_1" being the control group and "Group_2" the experimental one.

    The function used is "chisquare" from the package scipy:
        scipy.stats.chisquare(f_obs, f_exp=None, ddof=0, axis=0)

        Calculate a one-way chi-square test.

        The chi-square test tests the null hypothesis that the categorical data has the given expected frequencies.

        Parameters

            f_obs : array_like
                Observed frequencies in each category.

            f_exp : array_like, optional
                Expected frequencies in each category. By default the categories are assumed to be equally likely.

            ddof : int, optional
                "Delta degrees of freedom": adjustment to the degrees of freedom for the p-value. The p-value is computed
                using a chi-squared distribution with k - 1 - ddof degrees of freedom, where k is the number of observed
                frequencies. The default value of ddof is 0.

            axis : int or None, optional
                The axis of the broadcast result of f_obs and f_exp along which to apply the test. If axis is None, all
                values in f_obs are treated as a single data set. Default is 0.

        Returns

            chisq : float or ndarray
                The chi-squared test statistic. The value is a float if axis is None or f_obs and f_exp are 1-D.

            p : float or ndarray
                The p-value of the test. The value is a float if ddof and the return value chisq are scalars.

    Chi-square for each described test
    ----------------------------------
    This test compares the differences between the two experimental groups described in the test. For this, the
    dataframe can't have allele frequencies equal to 0 for all the members of an experimental group. The formula needs
    the following variables:
        sampling size
        expected probability (reference allele frequency from the control group)
        observed probability (reference allele frequency from the experimental group)

    NOTE: The number of samples in the control and experimental groups has to be the same; otherwise the script will
    drop the extra samples from the larger group. The number of samples in each group must also be at least 5.

    Now let's perform the test.
    """

    #os.chdir("temp")
    df_chi = chi_square(df_qc, samples, exp, experiments)

    df_chi.to_csv(snakemake.output.get("o1"))

    """
    Fisher exact test for ASE
    -------------------------
    This test is based on the read counts of each allele, including alternative and reference alleles.

    """

    df_fisher = Fisher(df_chi, samples, exp, experiments)

    df_fisher.to_csv(snakemake.output.get("o2"))

    """
    Binomial test
    -------------
    The binomial test determines the allelic imbalance in each sample by comparing the counts of the reference allele
    with the total counts of the SNP. This test does not compare experimental groups; it evaluates each sample separately.

    Perform the binomial test on the read counts of the reference allele (by experimental group or by individual) (x)
    over the total number of reads in that group (n). For this analysis, the expected frequency is 0.5 and the observed
    allele frequency is determined for the reference allele in each individual or for the control group (p). The test
    provides the p-values for the significant differences between groups. In python the call is
    scipy.stats.binom_test(x, n, p, alternative="two-sided").
    """

    df_binomial = binomial(df_fisher, samples)

    #os.chdir(cwd)
    df_binomial.to_csv(snakemake.output.get("o3"))
    print("Analyzed data available in " + snakemake.output.get("o3"))

if __name__ == '__main__' :
    main()
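For readers who want to try the three tests outside the workflow, here is a minimal, self-contained sketch of the same scipy/statsmodels calls on toy counts (all numbers are made up; note that recent scipy versions rename binom_test to binomtest, and that chisquare is shown here on counts so the observed and expected totals match):

from scipy import stats
from statsmodels.stats import multitest

# Toy reference/alternative read counts for one SNP in two samples
control_r, control_a = 18, 22   # control sample
exp_r, exp_a = 30, 10           # treatment sample

# Fisher exact test on the 2x2 count table (as in fisher_test above)
oddsratio, p_fisher = stats.fisher_exact([[exp_r, exp_a], [control_r, control_a]])

# Binomial test for allelic imbalance within one sample (as in binomial above)
p_binom = stats.binom_test(exp_r, exp_r + exp_a, 0.5, alternative="two-sided")

# One-way chi-square: observed counts against an expected 50/50 split (as in chi_test)
chi = stats.chisquare([exp_r, exp_a], f_exp=[20, 20])

# Multiple-testing correction over a list of p-values
reject, p_adj, _, _ = multitest.multipletests(
    [p_fisher, p_binom, chi.pvalue], alpha=0.05, method="fdr_bh"
)
print(p_fisher, p_binom, chi.pvalue, p_adj)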
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2021, Aurora Campo, modified from git"
__git__= "https://github.com/johanzi/make_pseudogenome"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
#from snakemake_wrapper_utils.java import get_java_opts

vcf = snakemake.input.get("vcf")
out = snakemake.output.get("vcf")
outvcf = out.split(".")[0]  # vcftools --out takes a prefix and appends .recode.vcf
outvcfgz = outvcf + ".gz"
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    # Subset by keeping positions with GQ >= 30 and DP >=5
    # Not keeping indels
    "vcftools "
    "--vcf {vcf} "
    "--remove-indels "
    "--minGQ 30 "
    "--minDP 5 "
    "--recode "
    "--recode-INFO-all "
    "--out {outvcf} "
    "{log}"
)
import pandas as pd
import numpy as np
import os
from os import path
from snakemake import shell


log = snakemake.log_fmt_shell(stdout=False, stderr=True)

def main():
    """
    Input the files and convert them into data frames
    """

    # Read the dataframe and convert it into pandas dataframe. tb stands for table:

    tb1_colnames = pd.read_csv ( snakemake.input.get("tb_colnames") )
    tb1_cols = tb1_colnames['Col_names'].values
    print("Cols read")

    #tb1 = pd.read_csv(snakemake.input.get("table"), sep = r",|/|\t", header=None, engine = "python")
    tb1 = pd.read_csv(snakemake.input.get("table"), sep = r",|/|\t", names = tb1_cols, engine = "python")  # split on commas, slashes and tabs at once
    print("Table read")

    tb1 = pd.DataFrame ( tb1 )

    tb1.drop ( tb1.index[:1], inplace=True )

    tb1.to_csv ( snakemake.output.get("csv") )
    print("Table with correct column names is created for each pseudogenome in the folder called \"variants\".")
    print("Please verify the order of the samples on the new .csv table is the same as in the header of the .table file")

if __name__ == '__main__':
    main()

shell(
    "{log}"
)
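Because the separator is a regular expression, each GATK table row is split on commas, slashes and tabs at once, so an allele-depth field like 8,2 and a genotype like 0/1 spread across their own columns. A toy illustration (column names and row content are made up):

import io
import pandas as pd

cols = ["CHROM", "POS", "S1_GT_a", "S1_GT_b", "S1_R_.AD", "S1_A_.AD"]  # hypothetical Col_names
raw = "NC_1\t100\t0/1\t8,2\n"  # one tab-separated row of a GATK VariantsToTable output
tb = pd.read_csv(io.StringIO(raw), sep=r",|/|\t", names=cols, engine="python")
print(tb)  # the genotype 0/1 lands in S1_GT_a/S1_GT_b, the depths 8 and 2 in the .AD columns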
import pandas as pd
import numpy as np
import os
from os import path
from matplotlib import pyplot as plt
from math import pi
from scipy.stats import uniform
from scipy.stats import randint
import seaborn as sns
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
shell.executable("bash")

extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


def GO_graph (df_stats,experiments):
    # This function plots the GO terms that show SNPs in ASE. It also produces a table with the genes and SNPs in ASE
    #tests=["test1","test2","test3","test4"]
    tests=experiments
    for test in tests :
        test_name = str(test) + "_CHI_p-fdr"
        if test_name in df_stats:
            # Table 7
            index_exp = df_stats[(df_stats[test_name] >= 0.05)].index
            index_nan = df_stats[df_stats[test_name].isnull ()].index
            df_sig = df_stats.drop ( index=index_exp )  # Collect the significant SNPs in a table for this experiment
            df_exp = df_sig.drop ( index=index_nan )  # Drop the empty values for the tests
            genes_sig = df_exp["Gene_ID"].unique ()
            results = { 'SNPs in ASE':  [len(df_exp)],
                        'Genes containing ASE': [len(genes_sig)]}
            df_results = pd.DataFrame (results, columns = ['SNPs in ASE','Genes containing ASE'])
            PATH = os.getcwd()
            os.chdir("results")
            df_results.to_csv(test_name + "_basic_results_chi_square.csv")
            df_exp.to_csv(test_name + "_ASE_description_chi_square.csv")
            os.chdir(PATH)

            # GO terms analysis
            mega_dict = pd.read_csv ( snakemake.input.get("i1"), low_memory=False )
            df_func = pd.merge ( df_stats, mega_dict, on='Gene_ID', how='right' )
            df_func.to_csv ( snakemake.output.get("o1") )

            GO_dict = pd.read_csv ( snakemake.input.get("i2"), low_memory=False )

            df_GO = pd.merge(df_exp,GO_dict, on = "Genebank", how = "left")
            GO_counts = df_GO['GO term accession'].value_counts ()
            GO_counts = pd.DataFrame(GO_counts)
            GO_counts.columns = ['Frequency']
            GO_most = GO_counts.head(15)

            # Pie chart
            sizes = []
            labels = GO_most.index.values
            sizes.append (GO_most['Frequency'].tolist() )
            fig1, ax1 = plt.subplots ()
            ax1.pie ( sizes[0], labels=labels, autopct='%1.1f%%', shadow=True, startangle=90 )
            ax1.axis ( 'equal' )  # Equal aspect ratio ensures that pie is drawn as a circle.
            PATH = os.getcwd ()
            os.chdir("results")
            filename = test_name + "_GO_pie_chart_chi_square.svg"
            plt.savefig ( filename )
            plt.clf()
            os.chdir(PATH)

        else:
            test_name = str(test) + "_Fisher_p-fdr"
            # Table 7
            index_exp = df_stats[(df_stats[test_name] >= 0.05)].index
            index_nan = df_stats[df_stats[test_name].isnull ()].index
            df_sig = df_stats.drop ( index=index_exp )  # Collect the significant SNPs in a table for this experiment
            df_exp = df_sig.drop ( index=index_nan )  # Drop the empty values for the tests
            genes_sig = df_exp["Gene_ID"].unique ()
            results = { 'SNPs in ASE':  [len(df_exp)],
                        'Genes containing ASE': [len(genes_sig)]}
            df_results = pd.DataFrame (results, columns = ['SNPs in ASE','Genes containing ASE'])
            PATH = os.getcwd()
            os.chdir("results")
            df_results.to_csv(test_name + "_basic_results_fisher_test.csv")
            df_exp.to_csv(test_name + "_ASE_description_fisher_test.csv")
            os.chdir(PATH)

            # GO terms analysis
            mega_dict = pd.read_csv ( snakemake.input.get("i1"), low_memory=False )
            df_func = pd.merge ( df_stats, mega_dict, on='Gene_ID', how='right' )
            df_func.to_csv ( snakemake.output.get("o1") )

            GO_dict = pd.read_csv ( snakemake.input.get("i2"), low_memory=False )

            df_GO = pd.merge(df_exp,GO_dict, on = "Genebank", how = "left")
            GO_counts = df_GO['GO term accession'].value_counts ()
            GO_counts = pd.DataFrame(GO_counts)
            GO_counts.columns = ['Frequency']
            GO_most = GO_counts.head(15)

            # Pie chart
            sizes = []
            labels = GO_most.index.values
            sizes.append (GO_most['Frequency'].tolist() )
            fig1, ax1 = plt.subplots ()
            ax1.pie ( sizes[0], labels=labels, autopct='%1.1f%%', shadow=True, startangle=90 )
            ax1.axis ( 'equal' )  # Equal aspect ratio ensures that pie is drawn as a circle.
            PATH = os.getcwd ()
            os.chdir("results")
            filename = test_name + "_GO_pie_chart_fisher_test.svg"
            plt.savefig ( filename )
            plt.clf()
            os.chdir(PATH)       



def tables_1_2(df_stats, experiments):
    # The columns will be the type of polymorphism, the GO terms and the relative values of these annotations. One row
    # corresponds to a different test.
    # Initialize the tables and the columns that will be in table 1 and table 2:
    all_treatment_snp = pd.DataFrame ()
    df_polymorphism = pd.DataFrame ()
    df_stats = df_stats.fillna ( 1 )  # to drop rows with empty values

    for experiment in experiments:
        # Initalize the intermediate table:
        col_name = str ( experiment ) + "_CHI_p-fdr"
        if col_name in df_stats:
            index_exp = df_stats[(df_stats[col_name] >= 0.05)].index
            df_exp = df_stats.drop ( index=index_exp )  # Collect the significant SNPs in a table for this experiment
            df_new = pd.DataFrame ()
            df_new[col_name] = df_exp["SNP_ID"]

            # Table 1
            all_treatment_snp = pd.concat ( [all_treatment_snp, df_new], axis=1 )

            # Table 2
            func_refgene = df_exp['Func.refGene_x'].value_counts ()
            exonicfunc_refgene = df_exp['ExonicFunc.refGene_x'].value_counts ()
            polymorphism_exp = pd.Series ( func_refgene.append ( exonicfunc_refgene ) )
            df_polymorphism = pd.concat ( [df_polymorphism, polymorphism_exp.to_frame ().T] )
            df_polymorphism.rename ( index={0: col_name}, inplace=True )  # add row name that is the column of the origin
        else:
            col_name = str ( experiment ) + "_Fisher_p-fdr"
            index_exp = df_stats[(df_stats[col_name] >= 0.05)].index
            df_exp = df_stats.drop ( index=index_exp )  # Collect the significant SNPs in a table for this experiment
            df_new = pd.DataFrame ()
            df_new[col_name] = df_exp["SNP_ID"]

            # Table 1
            all_treatment_snp = pd.concat ( [all_treatment_snp, df_new], axis=1 )

            # Table 2
            func_refgene = df_exp['Func.refGene_x'].value_counts ()
            exonicfunc_refgene = df_exp['ExonicFunc.refGene_x'].value_counts ()
            polymorphism_exp = pd.Series ( func_refgene.append ( exonicfunc_refgene ) )
            df_polymorphism = pd.concat ( [df_polymorphism, polymorphism_exp.to_frame ().T] )
            df_polymorphism.rename ( index={0: col_name}, inplace=True )  # add row name that is the column of the origin


    # Table 1:
    all_treatment_snp.to_csv ( snakemake.output.get("o3") )

    # Table 2:
    df_polymorphism = df_polymorphism.fillna ( 0 )
    df_polymorphism.to_csv ( snakemake.output.get("o4") )

    # Radar chart for all treatments:
    # subset SNP Function and Exonic SNP Function
    categories_exo = list ( df_polymorphism )[11:20]  # It avoids the category "." that is non-informative
    print("Categories exo: ", categories_exo)
    df_pol_fun = df_polymorphism[2:9]
    df_pol_exo = df_polymorphism[categories_exo]


    # Polymorphism function by treatment:
    categories = list ( df_pol_fun )[0:]  # For function of SNP 1:9. Exonic function is from 10:20
    print(categories)
    values_list = []
    angles_list = []

    treatments = df_pol_fun.index.values
    fig, ax = plt.subplots ( nrows=1, ncols=1, figsize=(8, 8), subplot_kw=dict ( polar=True ) )
    for i in range ( len ( df_pol_fun ) ):
        values = df_pol_fun[i:(i + 1)].values.flatten ().tolist ()
        values += values[:1]  # repeat the first value to close the circular graph
        values_list.append ( values )
        angles = [n / float ( len ( categories ) ) * 2 * pi for n in range ( len ( categories ) )]
        angles += angles[:1]
        angles_list.append ( angles )
        ax.plot ( angles_list[i], values_list[i], linewidth=1, linestyle='solid', label=treatments[i] )
        ax.fill ( angles_list[i], values_list[i], alpha=0.4 )
    plt.xticks ( angles[:-1], categories, color='grey', size=12 )
    plt.yticks ( np.arange ( 1, 6 ), ['1', '2', '3', '4', '5'], color='grey', size=12 )
    plt.ylim ( 0, 800 )
    ax.set_rlabel_position ( 30 )
    plt.legend ( loc='upper right', bbox_to_anchor=(0.1, 0.1) )
    #plt.show ()
    PATH = os.getcwd ()
    os.chdir("results")
    plt.savefig("Polymorphism_function_by_treatment_radar_chart.svg")
    os.chdir(PATH)

    plt.clf ()


    # Polymorphism exonic function:
    categories = list ( df_pol_exo )[0:]  # For function of SNP 1:9. Exonic function is from 10:20
    values_list = []
    angles_list = []
    treatments = df_pol_exo.index.values
    fig, ax = plt.subplots ( nrows=1, ncols=1, figsize=(8, 8), subplot_kw=dict ( polar=True ) )
    for i in range ( len ( df_pol_exo ) ):
        values = df_pol_exo[i:(i + 1)].values.flatten ().tolist ()
        values += values[:1]  # repeat the first value to close the circular graph
        values_list.append ( values )
        angles = [n / float ( len ( categories ) ) * 2 * pi for n in range ( len ( categories ) )]
        angles += angles[:1]
        angles_list.append ( angles )
        ax.plot ( angles_list[i], values_list[i], linewidth=1, linestyle='solid', label=treatments[i] )
        ax.fill ( angles_list[i], values_list[i], alpha=0.4 )
    plt.xticks ( angles[:-1], categories, color='grey', size=12 )
    plt.yticks ( np.arange ( 1, 6 ), ['1', '2', '3', '4', '5'], color='grey', size=12 )
    plt.ylim ( 0, 400 )
    ax.set_rlabel_position ( 30 )
    plt.legend ( loc='upper right', bbox_to_anchor=(0.1, 0.1) )
    PATH = os.getcwd ()
    os.chdir("results")
    plt.savefig("Polymorphism_exonic_function_by_treatment.svg")
    os.chdir(PATH)

    plt.clf ()

    # Pie charts :
    labels = df_polymorphism.columns
    styles = list ( plt.style.available )
    for i in range ( (df_polymorphism.shape[0]) ):
        sizes = []
        # Using iloc to access the values of
        # the current row denoted by "i"
        plt.style.use ( styles[5] )
        sizes.append ( list ( df_polymorphism.iloc[i, :] ) )
        fig1, ax1 = plt.subplots ()
        ax1.pie ( sizes[0], labels=labels, autopct='%1.1f%%', shadow=True, startangle=90 )
        ax1.axis ( 'equal' )  # Equal aspect ratio ensures that pie is drawn as a circle.
        #plt.show ()
        PATH = os.getcwd ()
        os.chdir("results")
        filename = df_polymorphism.index.values[i] + "_pie_chart.svg"
        plt.savefig(filename)
        os.chdir(PATH)

        plt.clf ()


def heatmap(df, group_names):
    # This heatmap illustrates the allele frequencies for the genomic positions for identifying regions in AI
    af_groups = []
    for group_name in group_names:
        af_group = "AF_" + group_name
        af_groups.append ( af_group )

    htmap = df[af_groups]
    sns.heatmap ( htmap, cmap='YlGnBu' )
    #plt.show ()
    PATH = os.getcwd ()
    os.chdir("results")
    plt.savefig("Heatmap.svg")
    os.chdir(PATH)

    plt.clf ()


def table5(df_stats):
    # This part of the analysis has to be adapted to the strings identifying chromosomes and scaffolds in each genome
    index_scf = df_stats[df_stats['CHROM'].str.match ( r'^NC_' )].index # Replace the pattern NC_ with the one used in the genome of your organism
    df_scf = df_stats.drop ( index_scf )
    intronic_scf = df_scf["Func.refGene_x"].value_counts ()
    exonic_scf = df_scf["ExonicFunc.refGene_x"].value_counts ()

    index_chrom = df_stats[df_stats['CHROM'].str.match ( r'^NW_' )].index # Replace the pattern NW_ with the one used in the genome of your organism
    df_chrom = df_stats.drop ( index_chrom )
    intronic_chrom = df_chrom["Func.refGene_x"].value_counts ()
    exonic_chrom = df_chrom["ExonicFunc.refGene_x"].value_counts ()
    print("Intronic chrom: ",intronic_chrom)
    print("Exonic chrom: ", exonic_chrom)


    # Create pandas DataFrame.
    data = {
        'Intronic': [sum ( intronic_chrom ) - intronic_chrom["exonic"], sum ( intronic_scf ) - intronic_scf["exonic"]],
        'Exonic': [sum ( exonic_chrom ) - exonic_chrom["."], sum ( exonic_scf ) - exonic_scf["."]]}
    table5 = pd.DataFrame ( data, index=["Chromosome", "Scaffold"] )
    table5["Intronic %"] = table5["Intronic"] * 100 / (table5["Intronic"] + table5["Exonic"])
    table5["Exonic %"] = table5["Exonic"] * 100 / (table5["Intronic"] + table5["Exonic"])
    table5.to_csv ( snakemake.output.get("o5") )


def Manhattan_plot(df_stats, experiments):
    # This function builds a Manhattan plot
    #plt.rcParams.update ( {'font.size': 10} )
    for experiment in experiments:
        index_chrom = df_stats[df_stats['CHROM'].str.match ( r'^NW_' )].index
        df_chrom = df_stats.drop ( index_chrom )

        df = pd.DataFrame ( {"SNP": df_chrom["SNP_ID"],
                             "pvalue": df_chrom[str ( experiment ) + "_CHI_p-val"],
                             "Chromosome": df_chrom["CHROM"]} )
        # -log_10(pvalue)
        df['minuslog10pvalue'] = -np.log10 ( df.pvalue )
        df.Chromosome = df.Chromosome.astype ( 'category' )
        df = df.sort_values ( 'Chromosome' )

        # How to plot gene vs. -log10(pvalue) and colour it by chromosome?
        df['ind'] = range ( len ( df ) )
        df_grouped = df.groupby ( 'Chromosome' )

        fig = plt.figure ()
        ax = fig.add_subplot ( 111 )
        colors = (
        "#a7414a", "#696464", "#00743f", "#563838", "#6a8a82", "#a37c27", "#5edfff", "#282726", "#c0334d", "#c9753d")
        # colors = ['red', 'green', 'blue', 'yellow']
        x_labels = []
        x_labels_pos = []
        for num, (name, group) in enumerate ( df_grouped ):
            group.plot ( kind='scatter', x='ind', y='minuslog10pvalue', color=colors[num % len ( colors )], ax=ax )
            x_labels.append ( name )
            x_labels_pos.append ( (group['ind'].iloc[-1] - (group['ind'].iloc[-1] - group['ind'].iloc[0]) / 2) )
        ax.set_xticks ( x_labels_pos )
        ax.set_xticklabels ( x_labels )
        ax.set_xlim ( [0, len ( df )] )
        ax.set_ylim ( [0, 2] )
        ax.set_xlabel ( 'Chromosome' )
        #plt.show ()
        PATH = os.getcwd ()
        os.chdir("results")
        plt.savefig(experiment + "_Manhattan_plot.svg")
        os.chdir(PATH)

        plt.clf ()



def main():
    """
    Read the files
    """

    df_stats = pd.DataFrame ( pd.read_csv ( snakemake.input.get("i3") ) )
    sample_names = pd.DataFrame ( pd.read_csv ( snakemake.input.get("i4")) )
    groups_df = pd.DataFrame ( pd.read_csv ( snakemake.input.get("i5") ) )
    exp = pd.DataFrame ( pd.read_csv ( snakemake.input.get("i6") ) )

    # Rename the key Gene ID and add the Genebank accessions for further analysis
    df_stats.rename ( columns={"Gene.refGene_x": "Gene_ID"}, inplace=True )
    # Read the dictionaries and merge with the df:
    gene_dict = pd.read_csv ( snakemake.input.get("i7") )  # renamed from `dict` to avoid shadowing the builtin
    df_stats = df_stats.merge ( gene_dict, on='Gene_ID', how='left' )


    # Create a numpy array of arrays with the samples of each group and the total samples of the experiment
    groups = groups_df.values

    # Create a list with the name of each group (values of the "Group" column, as in Stats.py)
    group_names = list ( groups_df["Group"] )

    # Create arrays of the sample names
    samples = sample_names["Sample_name"].values

    # Create an array with the name of each experimental test
    experiments = list ( exp["Test_ID"] )



    """
    SNP_ID and dictionary
    """
    # Create a column for SNP_ID
    df_stats["SNP_ID"] = df_stats.index
    SNP_dictionary = df_stats[['SNP_ID', 'CHROM', 'POS', 'Gene_ID', 'Func.refGene_x', 'ExonicFunc.refGene_x']]
    SNP_dictionary.to_csv ( snakemake.output.get("o2") )

    """
        Physiological function: This table will include a column for each treatment (CHI²) and a row for each GO function.
        The cell content will be the % of SNPs in each GO function for treatment.
        Table 7
        """
    GO_graph ( df_stats, experiments )


    """
    1 Treatment table: one column for the SNP_IDs from each treatment (significant for each CHI²). The number of columns
        will vary with each experimental setup
            This table will be used for a Venn diagram

    2 Types of polymorphism by test: the columns will be the type of polymorphism, the GO terms and the relative values of
    these annotations. One row correspond to a different Chi² test.
    """

    tables_1_2 ( df_stats, experiments )


    """
    5 Compare the % of intronic and exonic SNPs (from the total SNPs) in Scaffolds and in Chromosomes. The columns will
    be intronic and exonic and there will be two rows: Chromosomes and Scaffolds.

    PLOT HEATMAP
    ------------
    Use the csv file produced in the control quality for plotting the allele frequencies of each experimental group.
    The matrix has to be specific for allele frequencies per group and Chromosome position.
    Now make a general comparison between treatments. Most of the colors in these graphs are or 0 or 1, since they
    correspond to the allelic imbalance with significance.
    """

    heatmap ( df_stats, group_names )
    table5 ( df_stats )

    """
    MANHATTAN PLOT
    --------------
    A Manhattan plot will be produced for each test on ASE SNPs.
    """
    Manhattan_plot ( df_stats, experiments )




if __name__ == '__main__':
    main ()
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2021, Aurora Campo"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts


# Get the variables
#extra = snakemake.params.get("extra")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
trimmer = snakemake.params.get("trimmer")
threads = snakemake.threads
input_r1 = snakemake.input.get("r1")
input_r2 = snakemake.input.get("r2")
output_r1 = snakemake.output.get("r1")
output_r2 = snakemake.output.get("r2")
output_r1_unp = snakemake.output.get("r1_unpaired")
output_r2_unp = snakemake.output.get("r2_unpaired")
path = snakemake.params.get("path")


shell(
    "java -XX:ParallelGCThreads={threads} {java_opts} -jar {path} PE "#-threads {threads} {java_opts} "#"{extra} "
    "-phred64 "
    "-validatePairs "
	"./{input_r1} "
    "./{input_r2} "
    "./{output_r1} "
    "./{output_r1_unp} "
    "./{output_r2} "
    "./{output_r2_unp} "
    "{trimmer} "#"ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:8:TRUE"
    "{log}"
)
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2022, Aurora Campo"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts


# Get the variables
#extra = snakemake.params.get("extra")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
trimmer = snakemake.params.get("trimmer")
threads = snakemake.params.get("threads")
input_r1 = snakemake.input.get("r1")
print(input_r1)
output_r1 = snakemake.output.get("r1")
path = snakemake.params.get("path")


shell(
    "java -XX:ParallelGCThreads={threads} {java_opts} -jar {path} SE "#-threads {threads} {java_opts} "#"{extra} "
    "-phred64 "
	"./{input_r1} "
    "./{output_r1} "
    "{trimmer} "#"ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:8:TRUE"
    "{log}"
)
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2021, Aurora Campo"
__email__ = "[email protected]"
__license__ = "MIT"


import os
import pandas as pd
import numpy as np
from os import path
from os.path import exists
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

shell.executable("bash")

"""
FILTER SNPs that are wrongly called by using the SNPs from the genome call.
"""

extra = snakemake.params.get("extra")
java_opts = snakemake.params.get("java_opts")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)




def main():
    # Input the files and convert them into data frames
    SNPs2val = snakemake.input.get("result1")
    valid = snakemake.output.get("result4")
    error = snakemake.output.get("error")

    SNPs2val = pd.read_csv(SNPs2val, low_memory=False)
    SNPs2val = pd.DataFrame(SNPs2val)

    # Validate only if the DNA-based SNP call is available
    if os.path.exists("config/AD_GT_counts_bi_DNA.csv"):
        geno_df = pd.read_csv("config/AD_GT_counts_bi_DNA.csv", low_memory=False)
        geno_df = pd.DataFrame(geno_df)
        coincident = pd.merge(SNPs2val, geno_df, how="inner", on=["CHROM", "POS"])
        print("SNPs retrieved from the pipeline: ", len(SNPs2val), "; Valid SNPs: ", len(coincident))
        coincident.to_csv(valid)
        # Use a separate name for the percentage so the `error` output path is not overwritten
        error_pct = ((len(SNPs2val) - len(coincident)) * 100) / len(SNPs2val)
        print("Error: ", error_pct)
        data = np.array([["SNPs retrieved from the pipeline", len(SNPs2val)],
                        ["Valid SNPs", len(coincident)],
                        ["Error in SNP calling", error_pct]])
        df = pd.DataFrame(data)
        df.to_csv(error)
    else:
        # Without the DNA call, pass the SNPs through unvalidated and report the problem
        SNPs2val.to_csv(valid)
        print("Error: You need to call the SNPs from DNA in order to get the data for validation.")
        text_file = open(error, "w")
        text_file.write("Error: You need to call the SNPs from DNA in order to get the data for validation.")
        text_file.close()


if __name__ == '__main__':
    main()


shell(
    "{log}"
)
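The validation itself is an inner merge on CHROM and POS: only RNA-called SNPs that also appear in the DNA call survive, and the discarded fraction is reported as the error. A toy sketch (coordinates are made up):

import pandas as pd

rna = pd.DataFrame({"CHROM": ["NC_1", "NC_1", "NC_2"], "POS": [100, 200, 50]})  # SNPs from RNA
dna = pd.DataFrame({"CHROM": ["NC_1", "NC_2"], "POS": [100, 50]})               # SNPs from DNA
valid = pd.merge(rna, dna, how="inner", on=["CHROM", "POS"])
error_pct = (len(rna) - len(valid)) * 100 / len(rna)
print(len(valid), error_pct)  # 2 validated SNPs, 33.3% error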
__author__ = "Aurora Campo"
__copyright__ = "Copyright 2020, Aurora Campo"
__email__ = "[email protected]"
__license__ = "CC"


from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
    "gatk --java-options '{java_opts}' "
    "VariantsToTable "
    "-R {snakemake.input.ref} "
    "-V {snakemake.input.vcf} "
    "{extra} "
    "-O {snakemake.output.vcf} {log}"
)
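A hedged example of the field selection a rule might pass through `extra` so the exported table contains the CHROM/POS/GT/AD columns the downstream scripts expect (-F selects site-level fields, -GF per-genotype fields; the exact list is up to your rule):

# Hypothetical params for the wrapper above:
extra = "-F CHROM -F POS -GF GT -GF AD"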
The catalog truncates the remaining rule snippets; their closing `script:` directives point at the wrapper scripts documented above: scripts/selectvariants.py, scripts/splitncigarreads.py, scripts/star_gi.py, scripts/createsequencedictionary.py, scripts/faidx.py, scripts/star.py, scripts/star_se.py, scripts/qc.py, scripts/Stats.py and scripts/subset_ref_vcf.py.

One rule removes multiallelic sites inline with a `run:` block:

run:
    shell("sed 's/|/\//g' {input.table1} > {params.char} ")
    shell("awk 'seen[$1,$2]++' {params.char} > {params.multi}")  # Collect SNPs sharing chromosome and position (columns 1 and 2), i.e. multiallelic duplicates
    shell("awk -F'\t' 'NR==FNR{{a[$1,$2]++;next}} !(a[$1,$2])' {params.multi} {params.char} > {output.table1} ")  # Drop the rows collected as multiallelic
    shell("rm {params.char} {params.multi}") # Eliminate intermediary files
The remaining truncated snippets invoke: scripts/table2df_step2.py, scripts/trimmomatic_pe.py, scripts/trimmomatic_se.py, validatesamfile.py (via a `shell:` directive), scripts/variantstotable.py, scripts/mergedataframes.py, scripts/biallelicsites.py, scripts/average.py, scripts/ad10.py, scripts/genotype.py, scripts/ASE_workflow_MAE.py and scripts/valid.py.

Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/AylaScientist/Snakemake_for_SNPs
Name: snakemake_for_snps
Version: 1
Copyright: Public Domain
License: MIT License