kGWASflow is a Snakemake workflow for performing k-mers-based GWAS.

public public 1yr ago Version: v1.2.3 0 bookmarks

A modular, flexible, and reproducible Snakemake workflow to perform k-mers-based GWAS.

Table of Contents

Summary

kGWASflow is a Snakemake pipeline developed for performing k-mers-based genome-wide association study (GWAS) based on the method developed by Voichek et al. (2020) . It performs several pre-GWAS analyses, including read trimming, quality control, and k-mer counting. It implements the kmersGWAS method into an easy to use and accessible workflow. The pipeline also contains post-GWAS analyses, such as mapping k-mers to a reference genome, finding and mapping the source reads of k-mers, assembling source reads into contigs, and mapping them to a reference genome. kGWASflow is also highly customizable and offers users multiple options to choose from depending on their needs.

More information and explanations on how to install, configure and run kGWASflow are provided in the kGWASflow Wiki .

kGWASflow preprint is out on bioRxiv:

My project-1-3

Installation

Installing via Bioconda

‼️This is the preferred method .‼️

In order to use this workflow, you need conda to be installed (to install conda , please follow the instructions here ).

# Create a new conda environment with kgwasflow 
# and its dependencies
conda create -c bioconda -n kgwasflow kgwasflow
# Activate kGWASflow conda environment
conda activate kgwasflow
# test kGWASflow
kgwasflow --help

Installing via GitHub (Alternative Method)

Alternatively, kGWASflow can be installed by cloning the GitHub repository .

# Clone this repository to your local machin
git clone https://github.com/akcorut/kGWASflow.git
# Change into the kGWASflow directory
cd kGWASflow

After cloning the GitHub repo, you can install snakemake and the other dependencies using conda as below:

# This assumes conda is installed
conda env create -f environment.yaml
# Activate kGWASflow conda environment
conda activate kGWASflow

Finally, you can install kGWASflow using the setup script as below:

# Install kgwasflow
python setup.py install
# Test kgwasflow
kgwasflow --help

Other Options:

The other options on how to deploy this workflow can be found in the Snakemake Workflow Catalog .

Configuration

Initializing kGWASflow

To configure kGWASflow, you first need to initialize a new kGWASflow working directory by following the below steps:

# Activating the conda environment
conda activate kgwasflow
# Initializing a new kgwasflow working dir
kgwasflow init --work-dir path/to/your/work_dir 

or

# Activating the conda environment
conda activate kgwasflow
# Change into your preferred working directory
cd path/to/your/work_dir 
# Initializing a new kgwasflow working dir
kgwasflow init

kGWASflow Working Directory Structure

This command will initialize a new kGWASflow working directory with the default configuration files. Below is the directory structure of the working directory:

path/to/your/work_dir
├── config
│ ├── config.yaml
│ ├── phenos.tsv
│ └── samples.tsv
└── test

kGWASflow Configuration Files

Below are the configuration files generated by kgwasflow init command:

  • config/config.yaml is a YAML file containing the workflow configuration.

  • config/samples.tsv is a TSV file containing the sample information.

  • config/phenos.tsv is a TSV file containing the phenotype information.

For more information about each configuration file, please see kGWASflow Wiki .

Usage

After initializing ( kgwasflow init ) step and modifying the configuration files, kGWASflow can be run as below:

# Activating the conda environment
conda activate kgwasflow
# Change into your preferred working directory
cd path/to/your/work_dir 
# Run kgwasflow
kgwasflow run -t 16

Below are some of the run examples of kGWASflow:

Run examples:
1. Run kGWASflow with the default config file, default arguments and 16 threads:
 kgwasflow run -t 16 --snake-default
2. Run kGWASflow with a custom config file and default settings:
 kgwasflow run -t 16 -c path/to/custom_config.yaml
3. Run kGWASflow with user defined working directory:
 kgwasflow run -t 16 --work-dir path/to/work_dir
4. Run kGWASflow in dryrun mode to see what tasks would be executed:
 kgwasflow run -t 16 -n
5. Run kGWASflow using mamba as the conda frontend:
 kgwasflow run -t 16 --conda-frontend mamba
6. Run kGWASflow and generate an HTML report:
 kgwasflow run -t 16 --generate-report

Information about how to use kGWASflow with Snakemake commands can be found in the Snakemake Workflow Catalog .

Testing

In order to test kGWASflow using an E.coli dataset ( Earle et al. 2016 , Rahman et al. 2018 )

# Activating the conda environment
conda activate kgwasflow

After activating kGWASflow, you can perform a test run as axplained below:

Test examples:
1. Run the kGWASflow test in dryrun mode to see what tasks would be executed:
 kgwasflow test -t 16 -n
2. Run the kGWASflow test using the test config file with 16 threads:
 kgwasflow test -t 16
3. Run the kGWASflow test and define the test working directory:
 kgwasflow test -t 16 --work-dir path/to/test_work_dir

Authors

kGWASflow was developed by Adnan Kivanc Corut .

Issues

For Issues: https://github.com/akcorut/kGWASflow/issues

Contributions to the development of kGWASflow are welcome! Create Pull Requests to fix bugs or recommend new features!

Maintainers

Citation

‼️ kGWASflow preprint is out now on bioRxiv: https://doi.org/10.1101/2023.07.10.548365 ‼️

If you use kGWASflow in your research, please cite using the DOI: https://doi.org/10.1101/2023.07.10.548365 and the original method paper by Voichek et al. (2020) :

  • Adnan Kivanc Corut, Jason G. Wallace. kGWASflow: a modular, flexible, and reproducible Snakemake workflow for k-mers-based GWAS (2023). bioRxiv. https://doi.org/10.1101/2023.07.10.548365

  • Voichek, Y., Weigel, D. Identifying genetic variants underlying phenotypic variation in plants without complete genomes. Nat Genet 52, 534–540 (2020). https://doi.org/10.1038/s41588-020-0612-7

License

kGWASflow is licensed under the MIT license.

Code Snippets

19
20
21
22
shell:
    """
    minimap2 -t {threads} -a {input.ref_gen} {input.contigs} > {output} 2> {log}
    """
39
40
41
42
shell:
    """
    awk -v s={params.min_mapping_score} '$5 > s || $1 ~ /^@/' {input} > {output} 2> {log}
    """
61
62
63
64
shell:
    """
    samtools view -@ {threads} -Sbh {input} > {output} 2> {log}
    """
83
84
85
86
shell:
    """
    samtools sort -@ {threads} {input} -o {output} 2> {log}
    """
105
106
107
108
shell:
    """
    samtools index -@ {threads} {input} 2> {log}
    """
128
129
130
131
shell:
    """
    bedtools bamtobed -i {input.bam} > {output} 2> {log}
    """
146
147
148
149
shell:
    """
    touch {output} 2> {log}
    """
23
24
25
26
27
shell:
    """
    bowtie -p {threads} -a --best --strata {params.extra} \
    -x {params.index} -f {input.kmers_list} --sam {output} 2> {log}
    """
48
49
50
51
52
shell:
    """
    bowtie2 -p {threads} {params.extra} \
    -x {params.index} -f {input.kmers_list} -S {output} 2> {log}
    """
71
72
73
74
shell:
    """
    samtools view -@ {threads} -Sbh {input} > {output} 2> {log}
    """
93
94
95
96
shell:
    """
    samtools sort -@ {threads} {input} -o {output} 2> {log}
    """
115
116
117
118
shell:
    """
    samtools index -@ {threads} {input} 2> {log}
    """
150
151
script:
    "../scripts/plot_manhattan.py"
166
167
168
169
shell:
    """
    touch {output} 2> {log}
    """
16
17
18
19
20
21
22
23
24
shell:
    """
    echo Merging r1 files:
    echo "$(ls {input.dir}/*_reads_with_kmers_R1.fastq)"
    cat {input.dir}/*_reads_with_kmers_R1.fastq > {output.merged_r1} 2> {log}
    echo Merging r2 files:
    echo "$(ls {input.dir}/*_reads_with_kmers_R2.fastq)"
    cat {input.dir}/*_reads_with_kmers_R2.fastq > {output.merged_r2} 2> {log}
    """
44
45
46
47
48
shell:
    """
    seqkit sort -n {input.r1} > {output.sorted_r1} 2> {log}
    seqkit sort -n {input.r2} > {output.sorted_r2} 2> {log}
    """
73
74
75
76
shell:
    """
    bowtie2 -p {threads} --very-sensitive-local {params.extra} -x {params.index} -1 {input.r1} -2 {input.r2} -S {output.out_sam} 2> {log}
    """
93
94
95
96
shell:
    """
    awk -v s={params.min_mapping_score} '$5 > s || $1 ~ /^@/' {input.sam} > {output.filtered_sam} 2> {log}
    """
115
116
117
118
shell:
    """
    samtools view -@ {threads} -Sbh {input.sam} > {output.bam} 2> {log}
    """
138
139
140
141
shell:
    """
    samtools sort -@ {threads} {input.bam} -o {output.sorted_bam} 2> {log}
    """
162
163
164
165
shell:
    """
    samtools index -@ {threads} {input.bam} 2> {log}
    """
187
188
189
190
shell:
    """
    bedtools bamtobed -i {input.bam} > {output.bed} 2> {log}
    """
209
210
211
212
shell:
    """
    bedtools merge -i {input.bed} -s -c 6 -o distinct > {output.merged_bed} 2> {log}
    """
231
232
233
234
235
shell:
    """
    sort -k1,1 -k2,2n {input.bed} | bgzip > {input.bed}.gz 2> {log}
    tabix -p bed {input.bed}.gz 2> {log}
    """
250
251
252
253
shell:
    """
    touch {output} 2> {log}
    """
22
23
24
25
shell:
    """
    spades.py --careful --only-assembler -t {threads} {params.extra} --pe1-1 {input.R1} --pe1-2 {input.R2} -o {params.out_dir} 2> {log}
    """
22
23
wrapper:
    "v1.25.0/bio/blast/makeblastdb"
38
39
40
41
shell:
    """
    seqkit head -n 1 {input} > {output} 2> {log}
    """
68
69
wrapper:
    "v1.25.0/bio/blast/blastn"
82
83
84
85
shell:
    """
    touch {output}
    """
17
18
shell:
    "wget https://github.com/voichek/kmersGWAS/releases/download/{params.version}/{params.prefix}.zip -O {output} 2> {log}"
38
39
shell:
    "unzip {input} -d {params.dir} 2> {log}"
21
22
script:
    "../scripts/generate_kmers_list_paths.py"
47
48
49
50
51
52
53
shell:
    """
    export LD_LIBRARY_PATH=$CONDA_PREFIX/lib

    {input.kmersGWAS_bin}/list_kmers_found_in_multiple_samples -l {input.kmers_list} -k {params.kmer_len} \
    --mac {params.mac} -p {params.min_app} -o {output} 2> {log}
    """
75
76
script:
    "../scripts/plot_kmer_allele_counts.py"
30
31
32
33
34
35
36
37
shell:
    """
    echo Working on fastq files: {input.fastqs}
    echo Symlink -fastq1: {input.fastqs[0]} to {output.r1}
    ln -rs {input.fastqs[0]} {output.r1} 2> {log.log1}
    echo Symlink -fastq2: {input.fastqs[1]} to {output.r2}
    ln -rs {input.fastqs[1]} {output.r2} 2> {log.log2}
    """
62
63
script:
    "../scripts/generate_input_lists.py"
84
85
script:
    "../scripts/generate_input_lists.py"
114
115
116
117
118
119
120
shell:
    """
    kmc -t{threads} {params.extra} -v -k{params.kmer_len} -ci{params.count_threshold} \
    @{input} {params.outfile_prefix} {params.outdir_prefix} \
    1> {params.outdir_prefix}/kmc_canon.1 2> {params.outdir_prefix}/kmc_canon.2 \
    > {log}
    """
148
149
150
151
152
153
154
shell:
    """
    kmc -t{threads} -v -k{params.kmer_len} -ci0 -b \
    @{input} {params.outfile_prefix} {params.outdir_prefix} \
    1> {params.outdir_prefix}/kmc_all.1 2> {params.outdir_prefix}/kmc_all.2 \
    > {log}
    """
186
187
188
189
190
191
shell:
    """
    export LD_LIBRARY_PATH=$CONDA_PREFIX/lib

    {input.kmersGWAS_bin}/kmers_add_strand_information -c {params.prefix_kmc_canon} -n {params.prefix_kmc_all} -k {params.kmer_len} -o {output.strand} > {log}
    """
258
259
script:
    "../scripts/plot_kmer_stats.py"
18
19
script:
    "../scripts/fetch_significant_kmers.py"
12
13
14
15
16
shell:
    """
    git clone --recurse-submodules https://github.com/voichek/fetch_reads_with_kmers.git 2> {log.log1}
    mv fetch_reads_with_kmers scripts/external 2> {log.log2}
    """
27
28
29
30
31
shell:
    """
    cd {params.prefix}
    make
    """
65
66
script:
    "../scripts/fetch_source_reads.py"
96
97
script:
    "../scripts/fetch_source_reads.py"
113
114
115
116
shell:
    """
    touch {output} 2> {log}
    """
21
22
23
24
25
26
shell:
    """
    export LD_LIBRARY_PATH=$CONDA_PREFIX/lib

    {input.kmersGWAS_bin}/filter_kmers -t {params.prefix} -k {input.lists} -o {output} 2> {log}
    """
41
42
43
44
shell:
    """
    touch {output} 2> {log}
    """
24
25
26
27
28
29
30
shell:
    """
    export LD_LIBRARY_PATH=$CONDA_PREFIX/lib

    {input.kmersGWAS_bin}/emma_kinship_kmers -t {params.prefix} -k {params.kmer_len} \
    --maf {params.maf} > {output} 2> {log}
    """
52
53
54
55
56
57
shell:
    """
    export LD_LIBRARY_PATH=$CONDA_PREFIX/lib

    {input.kmersGWAS_bin}/emma_kinship {params.prefix} > {output} 2> {log}
    """
23
24
25
26
27
28
29
shell:
    """
    export LD_LIBRARY_PATH=$CONDA_PREFIX/lib

    {input.kmersGWAS_bin}/build_kmers_table -l {input.list_paths} -k {params.kmer_len} \
    -a {input.kmers_to_use} -o {params.prefix} 2> {log}
    """
57
58
59
60
61
62
63
64
65
66
67
68
69
70
shell:
    """
    export LD_LIBRARY_PATH=$CONDA_PREFIX/lib

    {input.kmersGWAS_bin}/kmers_table_to_bed \
    -t {params.input_prefix} \
    -p {input.phenotype} \
    -o {params.output_prefix} \
    -k {params.kmer_len} \
    --maf {params.maf} --mac {params.mac} \
    -b {params.max_num_var} \
    {params.only_unique} \
    >>{log} 2>&1
    """
25
26
wrapper:
    "v1.25.0/bio/igv-reports"
41
42
43
44
shell:
    """
    touch {output} 2> {log}
    """
74
75
wrapper:
    "v1.25.0/bio/igv-reports"
90
91
92
93
shell:
    """
    touch {output} 2> {log}
    """
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
shell:
    """
    export LD_LIBRARY_PATH=$CONDA_PREFIX/lib

    rm -r {params.out_prefix}

    python2 {input.kmers_gwas_py} {params.extra} \
    --min_data_points {params.min_data_points} \
    --pheno {input.phenotype} \
    --kmers_table {params.kmers_tab_prefix} \
    --kmers_number {params.kmers_number} \
    --permutations {params.n_permutations} \
    --maf {params.maf} --mac {params.mac} \
    -l {params.kmer_len} -p {threads} \
    --outdir {params.out_prefix} >>{log} 2>&1
    """
17
18
wrapper:
    "v1.12.2/bio/fastqc"
39
40
wrapper:
    "v1.12.2/bio/multiqc"
15
16
17
18
shell:
    """
    ln -rs {input.reference} {output} 2> {log}
    """
SnakeMake From line 15 of rules/ref.smk
37
38
39
40
shell:
    """
    samtools faidx {input.reference} 2> {log}
    """
62
63
wrapper:
    "v1.25.0/bio/bowtie2/build"
SnakeMake From line 62 of rules/ref.smk
90
91
shell:
    "bowtie-build --threads {threads} {input.reference} {params.prefix} 2> {log}"
108
109
wrapper:
    "v1.23.5/bio/sra-tools/fasterq-dump"
SnakeMake From line 108 of rules/ref.smk
120
121
wrapper:
    "v1.23.5/bio/sra-tools/fasterq-dump"
SnakeMake From line 120 of rules/ref.smk
37
38
script:
    "../scripts/generate_kmers_gwas_summary.py"
21
22
wrapper:
    "v1.18.3/bio/cutadapt/pe"
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
import csv
import os, glob, shutil
import pandas as pd
import numpy as np
import sys

# Logging
with open(snakemake.log[0], "w") as f:
    sys.stderr = sys.stdout = f

    ## Set float presicion for the p-values
    pd.set_option("display.precision", 8)

    # Input directory path for kmersGWAS results
    res_dir_path = os.path.dirname(snakemake.params.in_prefix)

    ## Set the threshold for k-mers GWAS results (%5 or %10 family-wise error rate)
    ## 5 is the default
    threshold = snakemake.params.threshold

    # If threshold is not 5 or 10, print an error message and exit
    if threshold != 5 and threshold != 10:
        print("ERROR: threshold value must be 5 or 10. \nPlease enter a valid threshold (5 or 10). \nExiting...")
        sys.exit(1)

    # Define the output directory
    outdir = snakemake.output.dir
    print("Output directory: " + outdir)

    # Define a function to fetch significant k-mers
    def fetch_kmers(input_dir, pheno, out_dir, threshold):
        """
        Fetch significant k-mers for a given phenotype from the k-mers GWAS results.
        """

        kmers_pass_threshold = pd.read_csv(input_dir + '/' + pheno +  '/kmers/pass_threshold_{threshold}per'.
                                           format(threshold=threshold),
                               sep="\t") # read the k-mers that pass the threshold

        if not os.path.exists(out_dir):
                os.makedirs(out_dir) # create the output directory if it does not exist

        if kmers_pass_threshold.empty: # if no k-mers pass the 5% threshold:
            # create a file to indicate that no k-mers pass the 5% threshold
            with open(out_dir + '/{pheno}_NO_KMERS_PASS_5PER_THRESHOLD_FOUND'.
                      format(pheno=pheno), 'a'): 
                print("No k-mers pass the {threshold}% threshold for the phenotype {pheno}.".
                      format(threshold=threshold, pheno=pheno)) # print a message to the user and exit the script
                pass

        else: # if k-mers pass thethreshold

        ## Get significant k-mers without k-mer number
            kmers_pass_threshold['rs']= kmers_pass_threshold['rs'].apply(lambda x: x.split('_')[0])

            # print(kmers_pass_threshold) # For debugging only

            ## Write significant k-mers list in TEXT format 
            print("Writing significant k-mers list in TEXT format...")
            kmers_pass_threshold['rs'].to_csv(out_dir + "/" + pheno + "_kmers_list.txt", 
                                             index=False, 
                                             sep="\t", 
                                             header=None) # write the k-mers list in a text file
            print ("Done.")

            ## Make a new column with k-mer number and its p_value (e.g. kmer1_8.543334e-11)
            kmers_pass_threshold['kmer_name']= 'kmer' + (kmers_pass_threshold.index+1).astype(str) + '_' + kmers_pass_threshold['p_lrt'].astype(str)

            ### Write significant k-mers list in FASTA format (based on: https://www.biostars.org/p/271977/)
            ## Fasta file output
            fileOutput = open(out_dir + "/" + pheno + "_kmers_list.fa", "w")

            ## Seq count
            count = 1

            ## Loop through each line in the input file
            print("Writing significant k-mers list in FASTA format...")
            for index, row in kmers_pass_threshold.iterrows():

                #Output the header
                fileOutput.write(">" + row['kmer_name'] + "\n")
                fileOutput.write(row['rs'] + "\n")

                count = count + 1
            print ("Done.")
            print("")

            #Close the input and output file
            fileOutput.close()

    # Get the list of phenotypes
    phenos = [os.path.basename(x) for x in glob.glob(res_dir_path + '/*')]
    print("Phenotypes: " + str(phenos)) # print the list of phenotypes

    for pheno in phenos: # for each phenotype

        print("Fetching significant k-mers for phenotype: " + pheno)
        print("Threshold: " + str(threshold) + "%")

        # Fetch significant k-mers for the current phenotype
        fetch_kmers(input_dir=res_dir_path, pheno=pheno, out_dir=outdir , threshold=threshold)

    print("Done!.")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
import csv
import sys
import os, argparse, glob, shutil
import pandas as pd
import numpy as np
import subprocess
from pathlib import Path

# logging
with open(snakemake.log[0], "w") as f:
    sys.stderr = sys.stdout = f

                #################################################
                ####### Load Snakemake Inputs and Params ########
                #################################################

    fq1_list= list(snakemake.input.r1) # list of fastq r1 files
    fq2_list= list(snakemake.input.r2) # list of fastq r2 files
    samp_list = list(snakemake.params.samples) # list of sample names
    lib_list = list(snakemake.params.library) # list of library names

    kmers_tab = snakemake.input.kmers_tab # kmers table

    kmers_list_fa = snakemake.input.kmers_list # kmers list fasta

    pheno = snakemake.params.pheno # phenotype name

    out_dir = snakemake.output.out_dir # output directory

    kmer_len = snakemake.params.kmer_len # kmer length

    fetch_reads = snakemake.input.fetch_reads # fetch_reads script


                    #########################################
                    ####### Generate a Samples Table ########
                    #########################################

    ## Make a samples info table w/ sample names, library names and corresponding fastq file paths
    reads_path_tab = {'sample_name': samp_list, 'library_name': lib_list, 'fq1': fq1_list, 'fq2': fq2_list}
    reads_path_tab = pd.DataFrame(data=reads_path_tab, dtype=object)
    print(reads_path_tab.head())

    # reads_path_tab.to_csv(snakemake.params.out_prefix + "test.txt", index=False, sep="\t")

            ###############################################################
            ####### Identify the samples having the k-mers present ########
            ###############################################################

    ## Read "kmers_table.txt" for a given phenotype
    kmers_table = pd.read_csv(kmers_tab, 
                            sep="\t")

    ## Filter out accessions/samples that doesn't have the kmers in them
    kmers_table_filt = kmers_table.loc[:, (kmers_table != 0).any(axis=0)]

    ## Get accesions with k-mers
    acc_list = list(kmers_table_filt.iloc[: , 1:].columns.values)

    print("Working on the phenotype: " + pheno)
    print("")
    print("################## Identifying accessions that has the k-mers... ##################")
    print("")
    print(str(len(acc_list)) + " accessions identified: " + ', \n'.join(acc_list))
    print("")
    print("######################### Fetching reads with k-mers... #########################")
    print("")

    ## Filter samples table to get only samples with k-mers
    reads_path_tab_filt = reads_path_tab[reads_path_tab['sample_name'].isin(acc_list)]
    reads_path_tab_filt = reads_path_tab_filt.reset_index(drop=True)

    ## Check if output directory is exist. If not create one
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)

    ## Filter the reads that have one of the k-mers in them
    for index, row in reads_path_tab_filt.iterrows():
        if row["sample_name"] == row["library_name"]:

            ## Source code: https://github.com/voichek/fetch_reads_with_kmers
            subprocess.run(" {fetch_reads} {r1} {r2} {fa} {kmer_len} {out}/{acc}_reads_with_kmers".format(
                fetch_reads= fetch_reads, out=out_dir, 
                r1= row["fq1"], r2= row["fq2"],
                acc= row["sample_name"], lib = row["library_name"], 
                fa= kmers_list_fa, pheno= pheno, kmer_len= kmer_len), shell=True)

            # make sure that the output file is generated and if not exit the script with error
            if not os.path.exists("{out}/{acc}_reads_with_kmers_R1.fastq".format(out=out_dir, acc=row["sample_name"])):
                print("Error: {acc}_reads_with_kmers_R1.fastq is not generated!".format(acc=row["sample_name"]))
                sys.exit(1)
            if not os.path.exists("{out}/{acc}_reads_with_kmers_R2.fastq".format(out=out_dir, acc=row["sample_name"])):
                print("Error: {acc}_reads_with_kmers_R2.fastq is not generated!".format(acc=row["sample_name"]))
                sys.exit(1)

        else:
            ## Source code: https://github.com/voichek/fetch_reads_with_kmers
            subprocess.run(" {fetch_reads} {r1} {r2} {fa} {kmer_len} {out}/{acc}_{lib}_reads_with_kmers".format(
                fetch_reads= fetch_reads, out=out_dir, 
                r1= row["fq1"], r2= row["fq2"],
                acc= row["sample_name"], lib = row["library_name"], 
                fa= kmers_list_fa, pheno= pheno, kmer_len= kmer_len), shell=True)

             # make sure that the output file is generated and if not exit the script with error
            if not os.path.exists("{out}/{acc}_{lib}_reads_with_kmers_R1.fastq".format(out=out_dir, acc=row["sample_name"], lib=row["library_name"])): 
                print("Error: {acc}_{lib}_reads_with_kmers_R1.fastq is not generated!".format(acc=row["sample_name"], lib=row["library_name"]))
                sys.exit(1)
            if not os.path.exists("{out}/{acc}_{lib}_reads_with_kmers_R2.fastq".format(out=out_dir, acc=row["sample_name"], lib=row["library_name"])):
                print("Error: {acc}_{lib}_reads_with_kmers_R2.fastq is not generated!".format(acc=row["sample_name"], lib=row["library_name"]))
                sys.exit(1)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
import csv
import os, argparse, glob, shutil
import pandas as pd
import numpy as np
import sys

# logging
with open(snakemake.log[0], "w") as f:
    sys.stderr = sys.stdout = f

    dir_path = snakemake.params.prefix # Directory path of input reads

    # Get dir names (each dir name shoud be same as sample names)
    dir_names = os.listdir(dir_path)

    # For each dir/sample create a list 
    # that contains full path of each 
    # fastq/fq/.gz files exist in that folder
    for dir in dir_names:
        input_list=[]
        df=[]
        types = ["*.gz", "*.fq", "*.fastq"]
        for type in types:
            temp = glob.glob(dir_path + '/' + dir + '/' + type)
            input_list += temp
        df = pd.DataFrame(sorted(input_list),columns=['input_files'])

        # Write out list of fastq files to input_files.txt
        # and save it in current dir/sample
        df.to_csv(dir_path + '/' + dir + '/' + "input_files.txt", index=False, header=False)

        # Check if input_files.txt is created and not empty
        try:
            if os.path.getsize(dir_path + '/' + dir + '/' + "input_files.txt") > 0:
                print("input_files.txt is successfully created for the sample: " + dir)
            else:
                print("Warning: input_files.txt is empty for the sample: "  + dir)

        except OSError as e:
            print("input_files.txt cannot be created for the sample: "  + dir)       
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
import csv
import os, glob, shutil
import pandas as pd
import numpy as np
import sys
import matplotlib.pyplot as plt

# Logging
with open(snakemake.log[0], "w") as f:
    sys.stderr = sys.stdout = f

    # Read the kmersGWAS results file (pass_threshold_5per)
    pass_threshold_5per = pd.read_csv(snakemake.input.kmers_gwas_res,
                                      sep="\t") 

    # Input directory path for kmersGWAS results
    res_dir_path = os.path.dirname(snakemake.input.kmers_gwas_res)

    # Read the pass_threshold_10per file
    pass_threshold_10per = pd.read_csv(res_dir_path + '/pass_threshold_10per',
                                       sep="\t")

    # Get the permutation based p-value thresholds values for 5% and 10% 
    threshold_5per = pd.read_csv(res_dir_path + '/threshold_5per',
                                 sep="\t", 
                                 names=['threshold'])
    threshold_10per = pd.read_csv(res_dir_path + '/threshold_10per', 
                                 header=None, 
                                 sep="\t", 
                                 names=['threshold'])

    # read the phenotype_value.assoc.txt.gz file
    pheno_val_assoc = pd.read_csv(res_dir_path + '/output/phenotype_value.assoc.txt.gz', 
                              compression='gzip',
                              sep="\t")

    # Phenotype name
    pheno= snakemake.params.pheno

    # Number of k-mers filtered
    kmers_number=snakemake.params.kmers_number

    # Define a function to generate a table with the k-mers, p-values and p-value thresholds
    def generate_kmers_gwas_results_table(pass_threshold, threshold, pheno, output_file):

        # Generate an empty dataframe with the kmer, p-value and threshold
        res_table = pd.DataFrame(columns=['kmers_pass_threshold', 'pval', 'pval_threshold'])

        # Fill the dataframe with the kmer, p-value, threshold, phenotype, -log10(p-value) and allele frequency
        res_table['kmers_pass_threshold'] = pass_threshold['rs'].apply(lambda x: x.split('_')[0]) # get the kmer from the rs column

        # Get the p-value from the p_lrt column and round it to 6 decimal places
        res_table['pval'] = pass_threshold['p_lrt']

        # Get the -log10(p-value) from the p-value column
        res_table['-log10(pval)'] = -np.log10(res_table['pval'])

        res_table['pval'] = res_table['pval'].map('{:.6e}'.format) # round to 6 decimal places
        res_table['-log10(pval)'] = res_table['-log10(pval)'].map('{:.5f}'.format) # round to 5 decimal places

        # Get the allele frequency from the af column 
        res_table['allele_freq'] = pass_threshold['af'] # add the allele frequency column

        # Get the p-value threshold
        res_table['pval_threshold'] = threshold['threshold'][0] # add the p-value threshold column
        res_table['phenotype'] = pheno # add the phenotype column

        print(res_table.head())
        # Save the table to a tsv file
        print('Saving the results summary table to a tsv file...')
        res_table.to_csv(output_file, index=False, 
                     sep='\t', 
                     columns=['kmers_pass_threshold', 'phenotype', 'allele_freq', 'pval', '-log10(pval)', 'pval_threshold'])    
        print('Generating the results summary table is done!')

    # Define a function to plot the histogram of the -log10(p-value) values
    def plot_pval_histogram(pheno_val_assoc, threshold_5per, threshold_10per, kmers_number, pheno, out_dir): 

        # If output directory does not exist, create it
        if not os.path.exists(out_dir):
            os.makedirs(out_dir)

        # Plot the histogram of the -log10(p-value) values
        f, ax = plt.subplots(figsize=(16, 8), facecolor='w', edgecolor='k')
        plt.hist(-np.log10(pheno_val_assoc['p_lrt']), bins=50, color='g', alpha=0.5)

        # Add labels to the axes
        plt.xlabel('-log10(pval)', fontsize=24)
        plt.ylabel('Frequency', fontsize=24)

        # Add a vertical line to the histogram based on the 5% threshold
        plt.axvline(x=threshold_5per['threshold'][0], color='r', linestyle='--')
         # Add a caption to the right bottom corner of the outside of the plot
        ax.text(1.0, -0.1, "*Red dashed line indicates 5% family-wise error-rate threshold", color='r', transform=ax.transAxes, ha='right', va='bottom', fontsize=12)

        # Add label to the top of the vertical line
        ax.text(threshold_5per['threshold'][0]+0.3, 1, '5% threshold', 
                rotation=0, color='r', va='top' , ha='left', 
                bbox=dict(facecolor='w', edgecolor='r', boxstyle='round,pad=0.5'),
                transform=ax.get_xaxis_transform(),
                fontsize=12)

        # Add a vertical line to the histogram based on the 10% threshold
        plt.axvline(x=threshold_10per['threshold'][0], color='b', linestyle='--')

        # Add a caption to the right bottom corner of the outside of the plot
        ax.text(1.0, -0.13, "*Blue dashed line indicates 10% family-wise error-rate threshold", color='b', transform=ax.transAxes, ha='right', va='bottom', fontsize=12)
        # Add label to the top of the vertical line
        ax.text(threshold_10per['threshold'][0]-0.3, 1, '10% threshold', 
                rotation=0, color='b', va='top' , ha='right', 
                bbox=dict(facecolor='w', edgecolor='b', boxstyle='round,pad=0.5'),
                transform=ax.get_xaxis_transform(),
                fontsize=12)

        # Add title to the histogram
        plt.title('Histogram of (-log10) p-values - k-mers passing the first filtering step (n={kmers_num})'.format(kmers_num=kmers_number), 
                  fontsize=16, y=1.02)
        plt.xticks(fontsize = 18)
        plt.yticks(fontsize = 18)

        # Adjust the layout of the plot
        plt.tight_layout() 
        print('Saving the histogram plot to a pdf file...')

        # Save the plot to a PDF file
        plt.savefig(out_dir + "/{pheno}.kmers_pass_first_filter.pval_hist.pdf".format(pheno=pheno))
        print("Generating the p-value histogram plot is done!")

    outdir = os.path.dirname(snakemake.output.res_sum_5per)

    # Check if there are k-mers that pass the 5% threshold
    if pass_threshold_5per.empty: # if no k-mers pass the 5% threshold
        with open(outdir + "/NO_KMERS_PASS_5PER_THRESHOLD_FOUND", 'a'): # create a file to indicate that no k-mers pass the 5% threshold
            print("No k-mers pass the 5% threshold.") # print a message to the user

    # generate the table with the k-mers that pass the 5% threshold
    print("Generating kmersGWAS summary table with k-mers that pass the 5% threshold...")
    generate_kmers_gwas_results_table(pass_threshold=pass_threshold_5per, threshold=threshold_5per, pheno=pheno, output_file=snakemake.output.res_sum_5per) 

    # Check if there are k-mers that pass the 10% threshold
    if pass_threshold_10per.empty: # if no k-mers pass the 10% threshold
        with open(outdir + "/NO_KMERS_PASS_10PER_THRESHOLD_FOUND", 'a'): # create a file to indicate that no k-mers pass the 10% threshold
            print("No k-mers pass the 10% threshold.") # print a message to the user and exit the script

    # generate the table with the k-mers that pass the 10% threshold
    print("Generating kmersGWAS summary table with k-mers that pass the 10% threshold...")
    generate_kmers_gwas_results_table(pass_threshold=pass_threshold_10per, 
                                      threshold=threshold_10per, 
                                      pheno=pheno, 
                                      output_file=snakemake.output.res_sum_10per) 

    # if k-mers pass the 5% threshold or the 10% threshold 
    if not pass_threshold_5per.empty or not pass_threshold_10per.empty:
    # Plot the histogram of the -log10(p-value) values
        print("Plotting the histogram of the -log10(p-value) values...")
        plot_pval_histogram(pheno_val_assoc=pheno_val_assoc, 
                            threshold_5per=threshold_5per, 
                            threshold_10per=threshold_10per,
                            kmers_number=kmers_number, 
                            pheno=pheno,
                            out_dir=snakemake.output.pval_hist_dir)
    # if no k-mers pass the 5% threshold and no k-mers pass the 10% threshold
    if pass_threshold_5per.empty and pass_threshold_10per.empty:
        # If output directory does not exist, create it
        if not os.path.exists(snakemake.output.pval_hist_dir):
            os.makedirs(snakemake.output.pval_hist_dir)

        # Create a file to indicate that no k-mers pass the 5% threshold
        with open(snakemake.output.pval_hist_dir + "/NO_KMERS_PASS_10PER_THRESHOLD_FOUND", 'a'):
            print("No k-mers pass the 5% or the 10% threshold. No plots will be generated.") # print a message to the user and exit the script

    print("Done!")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
import csv
import os, argparse, glob, shutil
import pandas as pd
import numpy as np
import sys

# logging
with open(snakemake.log[0], "w") as f:
    sys.stderr = sys.stdout = f

    in_dir = snakemake.params.input_dir # Path for k-mers counts folder
    samp_tab = snakemake.input.samples_tab # Path for sampels sheet (samples.tsv)

    dir_names = os.listdir(in_dir)

    ## Read samples.tsv (provided by the user) file 
    # to get the order of samples
    samples_tab = pd.read_csv(samp_tab, 
                                    sep="\t",
                                    usecols=["sample_name"],
                                    dtype='object')

    samples_tab = samples_tab.drop_duplicates()
    ## Loop through kmers_count results for each
    # accession/sample directory and make a list of 
    # kmers_with_strand file paths for each accession/sample
    input_list=[]
    dirs=[]
    for dir in dir_names:
        if os.path.isfile(in_dir + '/' + dir + '/' + "kmers_with_strand"):
            input_list.append(in_dir + '/' + dir + '/' + "kmers_with_strand")
            dirs.append(dir)

    ## Make a dataframe with columns accession name and 
    # kmers_with_strand file path for that acession
    list_paths = pd.DataFrame(input_list,columns=['input_files'])
    list_paths["accession"] = dirs

    ## Make sure the output file has same sample order as in samples.tsv file
    list_paths_sorted = samples_tab.merge(list_paths, left_on='sample_name',
              right_on = 'accession').drop('sample_name',axis=1)

    ## Write out output file
    list_paths_sorted.to_csv(snakemake.params.out_dir + '/' + "kmers_list_paths.txt", index=False, header=False, sep="\t")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
import csv
import os, glob, shutil
import pandas as pd
import numpy as np
import seaborn as sns
import scipy as sp
from scipy import stats
import matplotlib.ticker as ticker
import matplotlib.pyplot as plt
from matplotlib.pyplot import hist

# Logging
with open(snakemake.log[0], "w") as f:
    sys.stderr = sys.stdout = f

    ## Read kmers_to_use.shareness file.
    # This file contains the information about
    # how many k-mers appeared in each individual/sample. 
    # More info: https://github.com/voichek/kmersGWAS/blob/master/manual.pdf
    kmers_shareness = pd.read_csv(snakemake.input.shareness, 
                                    sep="\t", header=None, skiprows=1)
    print(kmers_shareness)

    ## Plot the stats from kmers_to_use.shareness file
    # x-axis is allele count and y-axis is no. of k-mers
    # For every n accession/sample, the number of k-mers 
    # appeared in exactly that number of accessions.
    fig_dims = (10, 10)
    fig, ax = plt.subplots(figsize=fig_dims)
    sns_plot = sns.barplot(x=kmers_shareness[0], y=kmers_shareness[1], 
                           color="darksalmon", 
                           saturation=1)
    sns_plot.set_xlabel("No. of samples",fontsize=24)
    sns_plot.set_ylabel("No. of k-mers",fontsize=24)

    sns_plot.set_yscale("log")

    # set y-axis ticks
    yticks = [10**i for i in range(4, 10)]
    sns_plot.yaxis.set_major_locator(ticker.FixedLocator(yticks))

    # set y-axis tick labels
    ytick_labels = [f"$10^{i}$" for i in range(4, 10)]
    sns_plot.set_yticklabels(ytick_labels)

    # set x-axis ticks
    sns_plot.xaxis.set_major_formatter(ticker.FormatStrFormatter('%d'))
    sns_plot.xaxis.set_major_locator(ticker.MultipleLocator(base=30))
    plt.xticks(fontsize = 18)
    plt.yticks(fontsize = 18)
    # rotate x axis labels
    ax.tick_params(axis='x', rotation=45)
    fig.tight_layout()

    ## Save the plot
    sns_plot.figure.savefig(snakemake.output.kmer_allele_counts_plot, dpi=300)
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
import csv
import os, glob, shutil
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
from scipy import stats
import matplotlib.ticker as ticker

# Logging
with open(snakemake.log[0], "w") as f:
    sys.stderr = sys.stdout = f

    ## Gather dir (sample names as directory) names in a list
    kmc_logs_path= snakemake.params.input_path
    dir_names= os.listdir(kmc_logs_path)

    ## Define a function to generate k-mers count stats table 
    # for each accession/sample. Accession name, no. total reads and no. of 
    # unique k-mers as columns, samples are rows.
    def generate_kmers_stats_tab(dir_path, file_name, dir_names):
        """
        Generate k-mers count stats table for each accession/sample.
        """
        # Define lists to store values
        unique_kmers=[] # list to store unique k-mers
        total_reads=[] # list to store total reads
        accessions=[] # list to store accession names

        # For each accession/sample gather accesssion name,
        # from dir names and no. total reads and no. of unique 
        # k-mers from kmc log files.
        for dir in dir_names: # for each accession/sample
            if os.path.isfile(dir_path + '/' + dir + '/' + file_name): # if kmc log file exists
                df= pd.read_csv(dir_path + '/' + dir + '/' + file_name, delimiter = ":", header=None) # read kmc log file
                df= df.dropna() # drop rows with NaN values
                unique_kmers.append(df.iloc[6][1]) # append unique k-mers to list
                total_reads.append(df.iloc[9][1])  # append total reads to list
                accessions.append(dir) # append accession name to list

        ## Create k-mers count stats table.
        target_df = pd.DataFrame(
            {'accessions': accessions,
             'unique_kmers': unique_kmers,
             'total_reads': total_reads
            })

        ## Convert unique k-mers and total reads values to integer.
        target_df["unique_kmers"] = pd.to_numeric(target_df["unique_kmers"]) #
        target_df["total_reads"] = pd.to_numeric(target_df["total_reads"])
        return target_df # return k-mers count stats table

    # Define functions to generete plots from k-mers count stats table (above). 
    def plot_kmers_stats_scatter(target_df, out_file, plot_name):
        """
        Plot the scatter plot with regression line.
        """
        # Define x-axis and y-axis
        X=target_df["unique_kmers"] # Unique k-mers as x-axis
        Y=target_df["total_reads"] # Total reads as y-axis
        # Plot scatter plot
        plt.figure(figsize=(10, 10)) # set figure size
        plt.scatter(X, Y, alpha=0.5, s=200) 
        slope, intercept = np.polyfit(X, Y, 1) # calculate regression line
        plt.plot(X, slope*X + intercept,color="red") # plot regression line
        plt.xlabel("No. of unique kmers", fontsize=24) # add x-axis label
        plt.ylabel("No. of total reads", fontsize=24) # add y-axis label
        plt.xticks(fontsize = 18)
        plt.yticks(fontsize = 18)
        plt.title(plot_name) # add plot title
        plt.tight_layout() # tight layout
        plt.savefig(out_file, dpi=300) # save plot
        plt.close() # close plot

    # Define a function to plot the joint plot with regression line and pearson correlation coefficient.
    def plot_kmers_stats_joint(target_df, out_file, plot_name):
        """
        Plot the joint plot with regression line and pearson correlation coefficient.
        """
        # plot joint plot
        joint_plot= sns.jointplot(x="unique_kmers", y="total_reads", 
                                  data=target_df, # data
                                  kind="reg", # regression line
                                  height=10,
                                  joint_kws={'line_kws':{'color':'red'},
                                             'scatter_kws': {'s': 150}})

        # pearson correlation coefficient
        r, p = stats.pearsonr(target_df["unique_kmers"], target_df["total_reads"]) # calculate pearson correlation coefficient
        phantom, = joint_plot.ax_joint.plot([], [], linestyle="", alpha=0) # add pearson correlation coefficient to the legend
        legend = joint_plot.ax_joint.legend([phantom],['r={:f}, p={:f}'.format(r,p)]) # add pearson correlation coefficient to the legend

        # add x-axis label, y-axis label and plot title
        joint_plot.ax_joint.set_xlabel('No. of unique kmers', fontsize=24) # add x-axis label
        joint_plot.ax_joint.set_ylabel('No. of total reads', fontsize=24) # add y-axis label
        plt.figure(figsize=(10, 10)) # set figure size
        joint_plot.ax_joint.xaxis.set_tick_params(labelsize=18)
        joint_plot.ax_joint.yaxis.set_tick_params(labelsize=18)
        plt.setp(legend.get_texts(), fontsize='14') # for legend text
        joint_plot.fig.suptitle(plot_name, y = 1) # add plot title
        joint_plot.fig.tight_layout() # tight layout
        joint_plot.savefig(out_file, dpi=300) # save plot
        plt.close() # close plot

    # Define a function to plot the distribution of k-mer counts for canonical and non-canonical k-mers.
    def plot_kmer_dist(target_df, out_file, plot_name):
        """
        Plot the distribution of k-mer counts for canonical and non-canonical k-mers.
        """
        # Plot the histogram of k-mer counts
        plt.figure(figsize=(10, 10)) # set figure size
        plt.hist(target_df["unique_kmers"], 
                 bins=30, 
                 color="#86bf91", 
                 alpha=0.5, 
                 edgecolor="black",
                 linewidth=1.2) #
        plt.xlabel("No. of unique kmers", fontsize=24) # add x-axis label
        plt.ylabel("No. of samples", fontsize=24) # add y-axis label
        plt.xticks(fontsize = 18)
        plt.yticks(fontsize = 18)
        plt.title(plot_name, fontsize=24) # add plot title
        plt.ticklabel_format(style='sci', axis='x', scilimits=(0,0)) # set x-axis to scientific notation
        plt.gca().xaxis.set_major_formatter(plt.ScalarFormatter())

        # Add a boxed text indicating the total number of k-mers to the top right corner of the plot
        textstr = 'Total no. of k-mers:\n{:,}'.format(target_df["unique_kmers"].sum())
        # add box around the text
        plt.gca().text(0.95, 0.95, textstr, 
                       transform=plt.gca().transAxes, 
                       fontsize=16,
                       color="black",
                       verticalalignment='top', 
                       horizontalalignment='right',
                       bbox=dict(facecolor='wheat', pad=10.0),  
                       alpha=0.5)    

        plt.tight_layout() # tight layout
        plt.savefig(out_file, dpi=300) # save plot
        plt.close() # close plot    

    ## Generate stats tables for both canonized and non-canonized k-mers
    kmc_canon_stats= generate_kmers_stats_tab(dir_path=kmc_logs_path, 
                                              file_name="kmc_canon.log", 
                                              dir_names=dir_names)
    kmc_non_canon_stats= generate_kmers_stats_tab(dir_path=kmc_logs_path, 
                                                  file_name="kmc_all.log", 
                                                  dir_names=dir_names)

    ## Writing out KMC stats as a tsv table
    print("Writing out KMC stats as a tsv table...")
    kmc_canon_stats.to_csv(snakemake.output.kmc_canon_stats, 
                           index=False, sep="\t")
    kmc_non_canon_stats.to_csv(snakemake.output.kmc_all_stats, 
                               index=False, sep="\t")

    ## Plot the stats
    print("Plotting total reads vs. number of unique k-mers for canonical k-mers...")
    plot_kmers_stats_scatter(target_df=kmc_canon_stats, 
                             out_file=snakemake.output.kmc_canon_plot, 
                             plot_name="Total Reads vs. Number of Unique k-mers (Canonical)")
    print("Plotting total reads vs. number of unique k-mers for non-canonical k-mers...")
    plot_kmers_stats_scatter(target_df=kmc_non_canon_stats, 
                             out_file=snakemake.output.kmc_all_plot, 
                             plot_name="Total Reads vs. Number of Unique k-mers (Non-canonical)")

    print("Plotting total reads vs. number of unique k-mers for canonical k-mers (joint plot)...")
    plot_kmers_stats_joint(target_df=kmc_canon_stats, 
                           out_file=snakemake.output.kmc_canon_joint_plot, 
                           plot_name="Total Reads vs. Number of Unique k-mers (Canonical)")
    print("Plotting total reads vs. number of unique k-mers for non-canonical k-mers (joint plot)...")
    plot_kmers_stats_joint(target_df=kmc_non_canon_stats, 
                           out_file=snakemake.output.kmc_all_joint_plot, 
                           plot_name="Total Reads vs. Number of Unique k-mers (Non-canonical)")

    print("Plotting the distribution of k-mer counts for canonical k-mers...")
    plot_kmer_dist(target_df=kmc_canon_stats, 
                   out_file=snakemake.output.kmc_canon_kmer_dist_plot,
                   plot_name="Distribution of k-mer counts (Canonical)")
    print("Plotting the distribution of k-mer counts for non-canonical k-mers...")
    plot_kmer_dist(target_df=kmc_non_canon_stats, 
                   out_file=snakemake.output.kmc_all_kmer_dist_plot,
                   plot_name="Distribution of k-mer counts (Non-canonical)")

    print("Done!")
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
import csv
import os, glob, shutil
import re
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from qmplot import manhattanplot
import natsort
from natsort import natsorted
import seaborn as sns
import pysam

if __name__ == "__main__":

    ## Logging
    with open(snakemake.log[0], "w") as f:
      sys.stderr = sys.stdout = f

      ## Read the alignment data 
      align_kmers_sam = pd.read_table(snakemake.input[0], 
                                      sep='\t', comment='@', header=None,
                                      usecols=[0,2,3], names=['kmer_id', 'chr', 'bp'])

      ## Preparing the data for plotting
      align_kmers_sam['kmer'] = align_kmers_sam['kmer_id'].str.split('_').str[0]
      align_kmers_sam['p_value'] = align_kmers_sam['kmer_id'].str.split('_').str[1]
      align_kmers_sam['p_value'] = align_kmers_sam['p_value'].astype(float)
      align_kmers_sam['bp'] = align_kmers_sam['bp'].astype(int)

      # Sort the data by chromosome and chromosome position
      align_kmers_sam_sorted = align_kmers_sam.sort_values(by=["chr", "bp"], key=natsort.natsort_keygen())

      # Get colors for manhattan plot
      colors = sns.color_palette("colorblind").as_hex()
      colors_2 = sns.color_palette("husl").as_hex()

      # Make a column of minus log10 p-values
      align_kmers_sam_sorted['minuslog10pvalue'] = -np.log10(align_kmers_sam_sorted.p_value)

      ## Get min & max minus log10 p-values for y axis limits
      y_max = align_kmers_sam_sorted['minuslog10pvalue'].max()
      y_min = align_kmers_sam_sorted['minuslog10pvalue'].min()
      print("y_max: " + str(y_max))
      print("y_min: " + str(y_min))
      ## Check if only one chromosome is provided for the manhattan plot
      num_of_chrs = len(pd.unique(align_kmers_sam_sorted['chr']))

      # Define a function to extract chromosome names and lengths from a SAM file header
      def extract_chromosome_info(sam_file):
        """
        Extract chromosome names and lengths from a SAM file header and return as a Pandas DataFrame with columns "chr" and "bp".
        """
        chromosome_info = {} # dictionary to store chromosome names and lengths
        pattern = re.compile(r'^([Cc][Hh][Rr])?\d*[XYM]?$') # chromosome names pattern

        with pysam.AlignmentFile(sam_file, "r") as sam: # open SAM file
            for header_line in sam.header["SQ"]: # iterate over SQ header lines
                chromosome_name = header_line["SN"] # get chromosome name
                length = header_line["LN"] # get chromosome length
                if chromosome_name.startswith("chr"): # if chromosome name starts with "chr"
                    name = chromosome_name # use name as is
                else: # if chromosome name does not start with "chr"
                    match = pattern.match(chromosome_name) # 
                    if match: # if chromosome name matches pattern
                        # Convert name to "chrX" format
                        name = "chr" + match.group(1)
                    else: # if chromosome name does not match pattern
                        continue # skip chromosome
                chromosome_info[name] = length # add chromosome name and length to dictionary

        # Convert dictionary to DataFrame
        df = pd.DataFrame(chromosome_info.items(), columns=["chr", "bp"])

        return df # return the DataFrame

      # Plotting the manhattan plot
      print("Plotting...")

      # Set font sizes
      tick_fontsize = snakemake.params["tick_fontsize"]
      label_fontsize = snakemake.params["label_fontsize"]
      title_fontsize = snakemake.params["title_fontsize"]

      # Set the figure dpi
      dpi = snakemake.params["dpi"]

      ## If only one chromosome is provided, plot the k-mer's position on
      ## that chromosome on the x axis
      if num_of_chrs == 1:
        f, ax = plt.subplots(figsize=(18, 9), facecolor='w', edgecolor='k')
        manhattanplot(data=align_kmers_sam_sorted,
                      snp="kmer_id",
                      chrom="chr",
                      CHR= pd.unique(align_kmers_sam_sorted['chr']),
                      color=colors_2,
                      pos="bp",
                      pv="p_value",
                      suggestiveline=None,  # Turn off suggestiveline
                      genomewideline=None,  # Turn off genomewideline
                      xticklabel_kws={"rotation": "vertical"},
                      ax=ax,
                      s = snakemake.params["point_size"],
                      clip_on=False)
        ax.set_ylim([y_min-0.5, y_max+1]) # Set y axis limits

        # Set x axis tick interval
        xtick_interval = snakemake.params["xtick_interval"]

        # Calculate the minimum and maximum of your data, rounded to the nearest multiple of 5000
        min_val = align_kmers_sam_sorted['bp'].min() // xtick_interval * xtick_interval
        max_val = (align_kmers_sam_sorted['bp'].max() // xtick_interval + 1) * xtick_interval

        # Generate the tick locations
        xtick_locs = np.arange(min_val, max_val, xtick_interval)

        f.suptitle('k-mer Based GWAS Manhattan Plot for ' + snakemake.params["pheno"], fontsize=title_fontsize)
        plt.xlabel('Chromosome: ' + pd.unique(align_kmers_sam_sorted['chr'])[0], fontsize=label_fontsize)
        plt.ylabel(r"$-log_{10}{(P)}$", fontsize=label_fontsize) 
        plt.xticks(xtick_locs, fontsize = tick_fontsize)
        plt.yticks(fontsize = tick_fontsize)
        plt.tight_layout()

      ## If more than one chromosome is provided, use all chromosomes
      if num_of_chrs > 1:

        # Extract chromosome names and lengths from the SAM file header
        chrom_info_tab = extract_chromosome_info(snakemake.input[0])
        chrom_names = natsorted(chrom_info_tab['chr'].tolist())

        # Add extra chromosome names and lengths to the data frame
        align_kmers_sam_with_all_chrom = pd.concat([align_kmers_sam, chrom_info_tab], ignore_index=True)
        align_kmers_sam_with_all_chrom = align_kmers_sam_with_all_chrom[align_kmers_sam_with_all_chrom['chr'].isin(chrom_names)]

        # Sort the data by chromosome and chromosome position
        align_kmers_sam_with_all_chrom_sorted = align_kmers_sam_with_all_chrom.sort_values(by=["chr", "bp"], key=natsort.natsort_keygen()) 
        # Fill NaN values with 1
        align_kmers_sam_with_all_chrom_sorted['p_value'] = align_kmers_sam_with_all_chrom_sorted['p_value'].fillna(1) 

        # Plot the dots of chromosome length rows wiht the lowest opacity 
        extra_rows = align_kmers_sam_with_all_chrom_sorted[align_kmers_sam_with_all_chrom_sorted['p_value'] == 1]

        f, ax = plt.subplots(figsize=(18, 9), facecolor='w', edgecolor='k')
        manhattanplot(data=align_kmers_sam_with_all_chrom_sorted,
                      snp="kmer_id",
                      chrom="chr",
                      color=colors,
                      pos="bp",
                      pv="p_value",
                      suggestiveline=None,  # Turn off suggestiveline
                      genomewideline=None,  # Turn off genomewideline
                      xticklabel_kws={"rotation": "vertical"},
                      ax=ax,
                      s = snakemake.params["point_size"],
                      clip_on=True)
        ax.set_ylim([y_min-2, y_max+1]) # Set y axis limits
        plt.scatter(extra_rows['bp'], -np.log10(extra_rows['p_value']), alpha=0)

        # Calculate the cumulative distances from the start of each chromosome and store them in a list
        chrom_ends = chrom_info_tab['bp'].cumsum().tolist()

        # Plot the vertical lines for the end of each chromosome
        for end_position in chrom_ends:
          plt.axvline(x=end_position, color='grey', linestyle='--', alpha=0.2)

        # Add a caption to the right bottom corner of the outside of the plot
        caption = '*Vertical dashed lines indicate chromosome boundaries.'
        ax.text(1.0, -0.2, caption, transform=ax.transAxes, ha='right', va='bottom', fontsize=12)

        # Set the title of the plot
        f.suptitle('k-mer Based GWAS Manhattan Plot for ' + snakemake.params["pheno"], fontsize=title_fontsize) 

        # Set the x and y axis labels
        plt.xlabel('Chromosome', fontsize=label_fontsize) # Set the x axis label
        plt.ylabel(r"$-log_{10}{(P)}$", fontsize=label_fontsize) # Set the y axis label
        plt.xticks(fontsize = tick_fontsize)
        plt.yticks(fontsize = tick_fontsize)

        # Adjust the plot layout
        plt.tight_layout() 

      ## Saving the plot as pdf
      print("Plotting is done. Saving the plot...")
      plt.savefig(snakemake.output["manhattan_plot"], dpi=dpi)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)

    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
# Set this to False if multiqc should use the actual input directly
# instead of parsing the folders where the provided files are located
use_input_files_only = snakemake.params.get("use_input_files_only", False)

if not use_input_files_only:
    input_data = set(path.dirname(fp) for fp in snakemake.input)
else:
    input_data = set(snakemake.input)

output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {extra}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_data}"
    " {log}"
)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " --cores {snakemake.threads}"
    " {adapters}"
    " {extra}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}"
)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
__author__ = "Johannes Köster, Derek Croote"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

import os
import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.snakemake import get_mem


log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")


# Parse memory
mem_mb = get_mem(snakemake, "MiB")


# Outdir
outdir = os.path.dirname(snakemake.output[0])
if outdir:
    outdir = f"--outdir {outdir}"


# Output compression
compress = ""
mem = f"-m{mem_mb}" if mem_mb else ""

for output in snakemake.output:
    out_name, out_ext = os.path.splitext(output)
    if out_ext == ".gz":
        compress += f"pigz -p {snakemake.threads} {out_name}; "
    elif out_ext == ".bz2":
        compress += f"pbzip2 -p{snakemake.threads} {mem} {out_name}; "


with tempfile.TemporaryDirectory() as tmpdir:
    mem = f"--mem {mem_mb}M" if mem_mb else ""

    shell(
        "(fasterq-dump --temp {tmpdir} --threads {snakemake.threads} {mem} "
        "{extra} {outdir} {snakemake.wildcards.accession}; "
        "{compress}"
        ") {log}"
    )
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2021, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell
from os import path

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

format = snakemake.params.get("format", "")
blastdb = snakemake.input.get("blastdb", "")[0]
db_name = path.splitext(blastdb)[0]

if format:
    out_format = " -outfmt '{}'".format(format)

shell(
    "blastn"
    " -query {snakemake.input.query}"
    " {out_format}"
    " {snakemake.params.extra}"
    " -db {db_name}"
    " -num_threads {snakemake.threads}"
    " -out {snakemake.output[0]}"
)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2021, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell
from os import path

log = snakemake.log
out = snakemake.output[0]

db_type = ""
(out_name, ext) = path.splitext(out)

if ext.startswith(".n"):
    db_type = "nucl"
elif ext.startswith(".p"):
    db_type = "prot"

shell(
    "makeblastdb"
    " -in {snakemake.input.fasta}"
    " -dbtype {db_type}"
    " {snakemake.params}"
    " -logfile {log}"
    " -out {out_name}"
)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
__author__ = "Daniel Standage"
__copyright__ = "Copyright 2020, Daniel Standage"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)


index = os.path.commonprefix(snakemake.output).rstrip(".")


shell(
    "bowtie2-build"
    " --threads {snakemake.threads}"
    " {extra}"
    " {snakemake.input.ref}"
    " {index}"
    " {log}"
)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2019, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

tracks = snakemake.input.get("tracks", [])
if tracks:
    if isinstance(tracks, str):
        tracks = [tracks]
    tracks = "--tracks {}".format(" ".join(tracks))

shell(
    "create_report {extra} --standalone --output {snakemake.output[0]} {snakemake.input.vcf} {snakemake.input.fasta} {tracks} {log}"
)
ShowHide 67 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/akcorut/kGWASflow/wiki
Name: kgwasflow
Version: v1.2.3
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...