Novel insertion detection with 10X reads


Novel-X: Novel sequence insertion detection using Linked-Reads

Novel-X detects and genotypes novel sequence insertions in 10X sequencing datasets using non-trivial read alignment signatures and barcode information.

  1. Installation

  2. Command Options

  3. Output Formats

  4. Example Commands

  5. Publications

  6. Contact & Support

Installation

To start working with Novel-X, please clone this repository recursively:

git clone --recursive git@github.com:1dayac/Novel-X.git

If you clone the repository non-recursively, Novel-X will not work. To fix this, run the following from the Novel-X folder:

git submodule update --init --recursive

Novel-X is a pipeline built on the popular Snakemake workflow management system; it consists of multiple steps and requires a number of external tools.

First, the following software should be installed (version numbers used for testing are shown in brackets, but other versions should also work):

  • Long Ranger (version 2.15) - Download Page

  • Velvet (commit 9adf09f) - GitHub Page - an older but still useful assembler that makes minimal assumptions about the data. Note that we use a k-mer length of 63 when assembling 10X data, so Velvet should be compiled using the

make 'MAXKMERLENGTH=63'

command. For more information, refer to the Velvet manual.

All of these programs (except Long Ranger) can be installed with conda. We provide a conda-env.yml file that allows you to install them using the following command:

conda env create -f conda-env.yml

Paths to executables (if they are not in $PATH) should be provided in the path_to_executables_config.json file.
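
For illustration only, the file maps tool names to absolute paths. The key names below are hypothetical; check the path_to_executables_config.json shipped with the repository for the real ones:

# hypothetical example - key names are illustrative, not authoritative
{
    "velveth": "/opt/velvet/velveth",
    "velvetg": "/opt/velvet/velvetg",
    "longranger": "/opt/longranger/longranger"
}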

Python dependencies are listed in the requirements.txt file. They can be downloaded and installed with the following command:

pip install -r requirements.txt

Inside the bxtools folder, run the following commands (estimated execution time is around 2 minutes):

./configure
make
make install
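
To sanity-check the build, you can verify that the binary the Snakefile rules expect (bxtools/bin/bxtools) is now present:

# run from the Novel-X folder
test -x bxtools/bin/bxtools && echo "bxtools built OK"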

We tested our tool on CentOS Linux 7, but we expect it to work on any modern Unix-like system.

Then you are ready to go.

Command Options

Novel-X is run via the novel-x.py script in one of two modes:

  • run - run the pipeline from scratch

  • restart - if a previous run did not finish for some reason, you can try to catch up with the novel-x.py restart command.

A typical command to start Novel-X is

python novel-x.py run --bam my_bam.bam --genome my_genome.fasta --outdir my_dir

Optional arguments are:

  • --lr20 - needed if you run the pipeline on a bam file produced by the Long Ranger 2.0 pipeline

  • --nt - optional filtering of non-human sequences from the orphan contigs (takes the path to a BLAST database, as in the example command below)

We added two option groups to handle different data types and their properties (molecule length, intra-molecule coverage, etc.); an example combining them follows the two lists below.

Data option group:

  • --10x - for 10X Genomics data [Default]

  • --tellseq - for Tell-Seq data

  • --stlfr - for stLFR data

Tell-Seq and stLFR data should first be converted to a LongRanger-compatible bam. For stLFR data, use this pipeline. For Tell-Seq data, refer to the Tell-Seq paper.

Coverage group:

  • --high-coverage - best for 60X coverage and higher [Default]

  • --low-coverage - best for 20X-40X coverage
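
For example, a hypothetical stLFR sample sequenced at ~30X coverage would combine the two groups like this:

python novel-x.py run --stlfr --low-coverage --bam my_stlfr.bam --genome my_genome.fasta --outdir my_dir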

You can invoke the help message by typing:

python novel-x.py run --help

or

python novel-x.py restart --help

Output Formats

Novel-X writes its results to a VCF file. If your bam file was named HM2KYBBXX_NA18509.bam, the resulting VCF file will be named HM2KYBBXX_NA18509.vcf and will be stored inside the outdir folder.
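
Each record follows the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), with extra tab-separated columns carrying the source contig name, anchor lengths, and contig positions (see minimap_to_vcf.py in the Code Snippets below). As a minimal sketch, insertion positions and lengths can be summarized with awk:

awk -F'\t' '!/^#/ {print $1, $2, length($5) - length($4)}' HM2KYBBXX_NA18509.vcf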

Example Commands

Run from the start:

python ~/Novel-X/novel-x.py run --bam /athena/ihlab/scratch/dmm2017/70_samples_data/HLF3WBBXX_NA12006_longranger.bam -t 8 -m 200 --nt /athena/ihlab/scratch/dmm2017/blast_database/ --genome /athena/ihlab/scratch/dmm2017/hg38/hg38.fa --outdir /athena/ihlab/scratch/dmm2017/70_samples/novelx_NA12006

Restart from the last stage:

python ~/Novel-X/novel-x.py restart --outdir novelx_NA12006

There is currently a problem at the filter_target_contigs stage: it can exit with a non-zero exit code. We recommend commenting out the following line (prefix it with #) before using the restart option:

parallel --jobs {THREADS} filter_target_contigs ::: {input.contigs}/*

Demo command

We placed a toy dataset in the demo folder to check that the software is installed correctly. You can run the command:

python ~/Novel-X/novel-x.py run --bam ~/Novel-X/demo/demo.bam -t 1 -m 20 --genome ~/Novel-X/demo/demo.fasta --outdir out

This command takes about 15 minutes to finish on our hardware. It produces a VCF file with a single record:

chr1_25500000_25535000 29503 . T TGTATTGTGTGTATGAGGGTTGTGTGCTGTGTGTTGTGTATATATTGTATGTGTTATGTGTATGTATGTCGTATGAGTGTATTCTGTATATGTGTTTTGTGTGGTCTATTATGTATGTGGCATGTGTTGTGTATGTGTGTTGTGTGTGATGTGTTGTATGTGTGTTGTGCATATATGTTGTTTCTGTGTATGTATGTTATGTGTATGTGTATGTTGTGTTGTATGTATGGGTTGTGCCTATGTGCTGTGTTGTGTGCTGCATGCATGTTTGTGTGGTGTGTGTATTTAGGTTGTGTGCTATTTATGTGTCTATATTGTATGTGTTGTATGTGTGTTGTATGTATGTGTAGTGTATGTGTGTTGTGTGTGATGTGTATATGTGGTGTGTGTATGTCTGTTATGTGTATGTATGAGTGTATGTGTGTTGTGTGTGTTGTGTATATGTGTTGTGTGTGTTGTGTATGTGTGTTGTGAGTTGTGTATATGTGGTGTGAGTTGTGTTGTGTCATGTATGTGTGCATTGTGTATAGGTGTTGCATGTGTGTTGTGTTGTGTGTATGTGTTGTGTTGTGTATATGTGGTATGTGAATGTGTATGTTGTATGTTGTGTTGTATGTATATGTGTTATGTATATGTGATGTGTGTGTTGTGTATATGCTGGGTGTGTGTGTACATGTGTGTATGTGTGTTGTATGTATGTGTGTATGCATGTGTGTTGCGTATATGTGGTATGTGTGCATGTGTGTTGTCATGTGTATGTGTGTTGTGTATATGTGTGTGTTGTGTATATGTGTTGTGTGTATGTGTATCATGTTGTGTGTATGTGTTATGTTGTGTATATGTGGTGTGTGAATGTGTGTTGTGTGTATGTGTATGTTGTCTGTTTTGTGTGTGTATACGTGGTGTGTGTGTGTTGTGTTGTGTATATGTGTTGTGTGTGTTGCGTGTATGTGTTGTGTGTT . PASS DP=100 NODE_1_length_4180_cov_43.887399 2776 276 347 1306

The output may differ slightly depending on your software versions.
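
As a quick sanity check (assuming the naming convention above, i.e. out/demo.vcf), the file should contain exactly one record:

grep -vc '^#' out/demo.vcf    # expected output: 1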

Publications

"Novel sequence insertion detection using Linked-Reads" preprint is available at https://www.biorxiv.org/content/10.1101/551028v1 .

Contact & Support

Feel free to send any inquiries to [email protected]

Code Snippets

# Likely contig_length_filter.py: keep only FASTA records longer than a given
# length threshold.
import sys
from Bio import SeqIO

if len(sys.argv) < 4:
    print("Usage: %s <length threshold> <contigs_file> <output>" % sys.argv[0])
    sys.exit(1)

threshold = int(sys.argv[1])
input_seq_iterator = SeqIO.parse(sys.argv[2], "fasta")
# Lazily filter records that exceed the length threshold
filtered_iterator = (record for record in input_seq_iterator
                     if len(record.seq) > threshold)

with open(sys.argv[3], "w") as output_handle:
    SeqIO.write(filtered_iterator, output_handle, "fasta")
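
In the pipeline this filter is applied with 200 bp and 500 bp thresholds (see the Snakefile rules below); a standalone invocation looks like:

python contig_length_filter.py 200 contigs.fasta filtered.fasta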

# Likely filter_correct_record.py: given a QUAST/nucmer alignment table, write
# out only those contigs that appear in the table's contig-name column.
import sys
from Bio import SeqIO

with open(sys.argv[2], "r") as nucmer_file:
    lines = nucmer_file.readlines()
    record_dict = SeqIO.to_dict(SeqIO.parse(sys.argv[1], "fasta"))
    if len(lines) == 1:
        sys.exit(0)  # header line only - nothing aligned

    records_to_output = []
    with open(sys.argv[3], "w") as output_handle:
        for line in lines:
            split = line.split("\t")
            # Skip malformed lines and the header row
            if len(split) < 5 or split[0] == "S1":
                continue
            if split[4].strip() not in records_to_output:
                records_to_output.append(split[4].strip())
        for record in records_to_output:
            SeqIO.write([record_dict[record]], output_handle, "fasta")
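
In the Snakefile this script receives the pre-filtered contigs of one candidate insertion, the matching QUAST all_alignments_*.tsv report, and an output path; an illustrative invocation (file names are placeholders):

python filter_correct_record.py prefilter_contigs/1.fasta quast_res/1/contigs_reports/all_alignments_1.tsv filtered_contigs/1.fasta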

# Likely minimap_to_vcf.py: parse a QUAST/nucmer alignment report
# (all_alignments_*.tsv), classify contig breakpoints into insertions and
# deletions, and write the insertion calls to a VCF-like file.
import sys
import re
from Bio import SeqIO


current_set = []
current_node = ""
current_chrom = ""


unique_insertions = []
non_unique_insertions = []
unique_deletions = []
non_unique_deletions = []


class Deletion(object):
    def __init__(self, node, contig_pos, rc, ref_id, ref_pos1, ref_pos2, anchor1, anchor2):
        self.node = node
        self.contig_pos = contig_pos
        self.rc = rc
        self.ref_id = ref_id
        self.ref_pos1 = ref_pos1
        self.ref_pos2 = ref_pos2
        self.anchor1 = anchor1
        self.anchor2 = anchor2

    def __eq__(self, other):
        # Deletions are considered equal if any pair of breakpoints is within 100 bp
        return (abs(self.ref_pos1 - other.ref_pos1) < 100 or abs(self.ref_pos2 - other.ref_pos2) < 100 or abs(self.ref_pos2 - other.ref_pos1) < 100 or abs(self.ref_pos1 - other.ref_pos2) < 100)

    def size(self):
        return  abs(self.ref_pos2 - self.ref_pos1) - 1

class Insertion(object):
    def __init__(self, node, pos1, pos2, rc, ref_id, ref_pos, anchor1, anchor2, overlap):
        self.node = node
        self.pos1 = pos1
        self.pos2 = pos2
        self.rc = rc
        self.ref_id = ref_id
        self.ref_pos = ref_pos
        self.anchor1 = anchor1
        self.anchor2 = anchor2
        self.overlap = overlap

    def __eq__(self, other):
        # Insertions are considered equal if their reference positions are within 100 bp
        return abs(self.ref_pos - other.ref_pos) < 100

    def size(self):
        return abs(self.pos2 - self.pos1) + self.overlap - 1

def Overlaps(al1, al2):
    # True if the contig-coordinate intervals of the two alignments overlap.
    cont11 = al1.cont1
    cont12 = al1.cont2

    cont21 = al2.cont1
    cont22 = al2.cont2

    if cont11 > cont12:
        cont11, cont12 = cont12, cont11

    if cont21 > cont22:
        cont21, cont22 = cont22, cont21

    # Standard interval-overlap test (the original second clause compared in
    # the wrong direction and could almost never be true).
    return cont11 <= cont22 and cont21 <= cont12


def Near(sv1, sv2):
    return sv2 - sv1 < 50 and sv2 - sv1 > -10

def BigAnchors(al1, al2):
    return abs(al1.cont1 - al1.cont2) > 300 or abs(al2.cont1 - al2.cont2) > 300



def IsDeletion(al1, al2):
    # NOTE: besides its arguments, this function also reads the module-level
    # cont11/cont12/cont21/cont22 globals set by the parsing loop below.
    ref11 = al1.ref1
    ref12 = al1.ref2

    is_rc = ref12 - ref11 < 0
    ref21 = al2.ref1
    ref22 = al2.ref2
    is_rc2 = ref22 - ref21 < 0

    if is_rc != is_rc2:
        return False
    if ref11 > ref12:
        ref11, ref12 = ref12, ref11

    if ref21 > ref22:
        ref21, ref22 = ref22, ref21

    if is_rc:
        contig_breakpoint = al1.cont1
    else:
        contig_breakpoint = al1.cont2

    if cont21 - cont12 < 5:
        if ref21 - ref12 > 5:
            deletion = Deletion(al1.node, contig_breakpoint, is_rc, al1.chrom, ref12, ref21, abs(cont22 - cont21), abs(cont12 - cont11))
            if deletion.size() > 1000:
                print(deletion.size())
                print("Deletion: " + str(al1) + " \t " + str(al2))
            if deletion not in unique_deletions:
                unique_deletions.append(deletion)
            else:
                non_unique_deletions.append(deletion)
            return True

    if cont22 < cont11:
        deletion = Deletion(al1.node, contig_breakpoint, is_rc, al1.chrom, ref12, ref21, abs(cont22 - cont21), abs(cont12 - cont11))
        if deletion.size() > 1000:
            print(deletion.size())
            print("Deletion: " + str(al1) + " \t " + str(al2))

        if deletion in non_unique_deletions:
            return False
        if cont11 - cont22 > 5:
            if deletion not in unique_deletions:
                unique_deletions.append(deletion)
            else:
                non_unique_deletions.append(deletion)
            return True
    return False

def GetOverlap(ref11, ref12, ref21, ref22):
    # Length of the reference-coordinate overlap between two alignment
    # intervals (0 if they are disjoint).

    if ref11 > ref12:
        ref11, ref12 = ref12, ref11
    if ref21 > ref22:
        ref21, ref22 = ref22, ref21


    if (min(ref21, ref22) > max(ref11, ref12)) or (max(ref21, ref22) < min(ref11, ref12)):
        return 0

    if ref11 < ref21 < ref12 and ref22 > ref12:
        return ref12 - ref21

    if ref11 < ref22 < ref12 and ref21 < ref11:
        return ref22 - ref11



    return 0

def IsInsertion(al1, al2):
    # Call an insertion when two alignments of the same contig leave a gap in
    # contig coordinates while being (nearly) adjacent on the reference.
    cont11 = al1.cont1
    cont12 = al1.cont2
    ref_breakpoint11 = al1.ref1
    ref_breakpoint12 = al1.ref2
    is_rc = cont12 - cont11 < 0

    cont21 = al2.cont1
    cont22 = al2.cont2
    ref_breakpoint21 = al2.ref1
    ref_breakpoint22 = al2.ref2
    is_rc2 = cont22 - cont21 < 0

    if cont21 > cont22:
        cont21, cont22 = cont22, cont21
        ref_breakpoint21, ref_breakpoint22 = ref_breakpoint22, ref_breakpoint21

    if is_rc != is_rc2:
        return False


    if cont11 > cont12:
        cont11, cont12 = cont12, cont11
        ref_breakpoint11, ref_breakpoint12 = ref_breakpoint12, ref_breakpoint11

    if (ref_breakpoint21 < ref_breakpoint11 and ref_breakpoint11 < ref_breakpoint12):
        return False

    overlap = GetOverlap(ref_breakpoint11, ref_breakpoint12, ref_breakpoint21, ref_breakpoint22)
    print(overlap)
    if abs(ref_breakpoint12 - ref_breakpoint21) > 50 and overlap == 0:
        print("Not an exact insertion")
        return False

    if cont12 < cont21:
        if cont21 - cont12 > 5:
            ins = Insertion(al1.node, cont12, cont21, is_rc, al1.chrom, ref_breakpoint12, abs(cont22 - cont21), abs(cont12 - cont11), overlap)
            if ins.size() > 500:
                print(ins.size())
                print("Insertion: " + str(al1) + " \t " + str(al2))
            if ins not in unique_insertions:
                unique_insertions.append(ins)
            else:
                non_unique_insertions.append(ins)
            return True


    if cont22 < cont11:
        ins = Insertion(al1.node, cont22, cont11, is_rc, al1.chrom, ref_breakpoint12, abs(cont22 - cont21), abs(cont12 - cont11), overlap)
        if ins.size() > 500:
            print(ins.size())
            print("Insertion: " + str(al1) + " \t " + str(al2))
        if cont11 - cont22 > 5:
            if ins not in unique_insertions:
                unique_insertions.append(ins)
            else:
                non_unique_insertions.append(ins)
            return True
    return False


def AnalyzeCurrentSet(al1, al2):
    if BigAnchors(al1, al2):
        if IsInsertion(al1, al2):
            pass
            #print(str(al1) + " \t " +  str(al2))
        if IsDeletion(al1, al2):
            pass
            #print(str(al1) + " \t " +  str(al2))


class ContigFilter(object):

    def __init__(self):
        self.coverage_cutoff = 0.0
        self.length_cutoff = 0

    def ParseCoverage(self, contig_name):
        # Extract the coverage value from a Velvet-style contig name (..._cov_<float>)
        m = re.search(r"cov_([\d\.]+)", contig_name)
        return float(m.group(1))

    def ParseLength(self, contig_name):
        # Extract the length value from the contig name (..._length_<int>_...)
        m = re.search(r"length_([\d\.]+)_", contig_name)
        return int(m.group(1))


    def PassFilter(self, contig_name):
        return self.ParseCoverage(contig_name) > self.coverage_cutoff \
               and self.ParseLength(contig_name) > self.length_cutoff




class Alignment(object):
    def __init__(self, ref1, ref2, cont1, cont2, chrom, node):
        self.ref1 = ref1
        self.ref2 = ref2
        self.cont1 = cont1
        self.cont2 = cont2
        self.chrom = chrom
        self.node = node

    def __str__(self):
        return str(self.ref1) + "\t" + str(self.ref2) + "\t" + str(self.cont1) + "\t" + str(self.cont2) + "\t" + self.chrom + "\t" + self.node
contig_filter = ContigFilter()  # renamed from "filter" to avoid shadowing the builtin

# First pass: scan the alignment report for breakpoint pairs flagged by QUAST
# (misassembly/relocation/indel markers) and classify each pair.
with open(sys.argv[1], "r") as nucmer_file:
    lines = nucmer_file.readlines()
    for i in range(1, len(lines) - 1):
        line = lines[i]
        if line.startswith("local misassembly") or line.startswith("transloc") or line.startswith("relocation") or line.startswith("indel") or line.startswith("unknown"):
            line1 = lines[i-1]
            line2 = lines[i+1]
            if line2.startswith("CON"):
                continue

            chrom = line1.split("\t")[4].strip()

            ref11 = int(line1.split("\t")[0].strip())
            ref12 = int(line1.split("\t")[1].strip())

            ref21 = int(line2.split("\t")[0].strip())
            ref22 = int(line2.split("\t")[1].strip())

            cont11 = int(line1.split("\t")[2].strip())
            cont12 = int(line1.split("\t")[3].strip())

            cont21 = int(line2.split("\t")[2].strip())
            cont22 = int(line2.split("\t")[3].strip())

            contig_name = line1.split("\t")[5].strip()

            if not contig_filter.PassFilter(contig_name):
                continue


            al1 = Alignment(ref11, ref12, cont11, cont12, chrom, contig_name)
            al2 = Alignment(ref21, ref22, cont21, cont22, chrom, contig_name)

            AnalyzeCurrentSet(al1, al2)

def get_insertion_pos(cont11, cont12, contig_length):
    # Return the unaligned flank of the contig (the candidate insertion):
    # whichever end of the contig extends further beyond the alignment.
    if contig_length - max(cont11, cont12) > min(cont11, cont12):
        return max(cont11, cont12), contig_length
    else:
        return 0, min(cont11, cont12)



# Second pass: contigs with a single long alignment, where the unaligned
# flank of the contig is itself the candidate insertion.
with open(sys.argv[1], "r") as nucmer_file:
    lines = nucmer_file.readlines()
    for i in range(2, len(lines)):
        line = lines[i]
        if line.startswith("CONTIG") and (lines[i-2].startswith("CONTIG") or lines[i-2].startswith("S1")):
            alignment_line = lines[i-1]
            if not alignment_line[0].isdigit():
                continue
            chrom = alignment_line.split("\t")[4].strip()
            ref11 = int(alignment_line.split("\t")[0].strip())
            ref12 = int(alignment_line.split("\t")[1].strip())
            cont11 = int(alignment_line.split("\t")[2].strip())
            cont12 = int(alignment_line.split("\t")[3].strip())
            contig_name = alignment_line.split("\t")[5].strip()
            if not contig_filter.PassFilter(contig_name):
                continue
            contig_length = contig_filter.ParseLength(contig_name)
            if abs(cont12 - cont11) < 600:
                continue

            ins_start, ins_end = get_insertion_pos(cont11, cont12, contig_length)

            if abs(ins_start - ins_end) < 50:
                continue

            if ins_start == 0:
                if cont12 < cont11:
                    ref_breakpoint = ref12
                else:
                    ref_breakpoint = ref11
            else:
                if cont12 < cont11:
                    ref_breakpoint = ref11
                else:
                    ref_breakpoint = ref12

            is_rc = cont11 > cont12
            if cont12 < cont11:
                cont12, cont11 = cont11, cont12



            ins = Insertion(contig_name, ins_start, ins_end, is_rc, chrom, ref_breakpoint, abs(cont12 - cont11),
                            0, 0)

            if ins.size() > 500:
                print(ins.size())
                print("Insertion: " + str(chrom) + " \t " + str(ref_breakpoint))
            if ins not in unique_insertions:
                unique_insertions.append(ins)
            else:
                non_unique_insertions.append(ins)

records = list(SeqIO.parse(sys.argv[2], "fasta"))
record_dict = {}
for record in records:
    if record.id not in record_dict.keys():
        record_dict[record.id] = record
# Write the VCF-like insertion records, plus a FASTA of insertions >= 300 bp
# (with flanking anchors) for downstream filtering.
with open(sys.argv[3], "w") as vcf, open("insertions_with_anchors.fasta", "w") as fasta:
    for insertion in unique_insertions:
        if insertion.size() >= 300:
            fasta.write(">" + insertion.node + "\n")
            fasta.write(str(record_dict[insertion.node].seq))
            fasta.write("\n")
        try:
            ins_seq = str(record_dict[insertion.node].seq[insertion.pos1 : insertion.pos2 + insertion.overlap] if not insertion.rc else record_dict[insertion.node].seq[insertion.pos1 : insertion.pos2 + insertion.overlap].reverse_complement())
            vcf.write(str(insertion.ref_id) + "\t" + str(insertion.ref_pos) + "\t" + "." + "\t" + str(record_dict[insertion.node].seq[insertion.pos1]) + "\t" + ins_seq + "\t" + "." + "\t" + "PASS" + "\t" + "DP=100" + "\t" + insertion.node + "\t" + str(insertion.anchor1) + "\t" + str(insertion.anchor2) + "\t" + str(insertion.pos1) + "\t" + str(insertion.pos2) + "\n")
        except (KeyError, IndexError):
            # Contig missing from the FASTA or coordinates out of range - skip the record
            pass
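
In the final pipeline step this script turns the QUAST alignments of the merged contig set into the output VCF; it also writes an insertions_with_anchors.fasta side file with insertions of 300 bp and longer. An illustrative invocation (file names are placeholders):

python minimap_to_vcf.py final_alignments.tsv contigs.fasta sample.vcf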

# Likely remove_contaminations.py (the usage string calls it extractContaminants.py):
# remove contigs listed in the contamination file from the orphan-contigs FASTA.
# Both files are assumed to list contigs in the same order.
import sys
import re


def usage():
    print("\nUsage: python extractContaminants.py contaminationFile orphan.fq.contigs.fa orphan.fq.contigs.wocontamination.fa")
    sys.exit(-1)


def main():
    args = sys.argv[1:]
    if len(args) != 3:
        usage()

    fcontamination = open(sys.argv[1], 'r')
    fcontig = open(sys.argv[2], 'r')
    fout = open(sys.argv[3], 'w')

    # Walk the contamination list and the contig FASTA in lockstep
    contamination = fcontamination.readline()
    while contamination != '':
        contamination = contamination.split()[0]
        contig = fcontig.readline()[:-1]
        if (contig != ''):
            contig = re.split(">| ", contig)[1]
            # Copy contigs through to the output until the contaminated one is reached
            while contamination != contig and contig != '':
                fout.write(">" + contig + "\n")
                while True:
                    temp_string = fcontig.readline()
                    if temp_string != '' and temp_string[0] != '>':
                        fout.write(temp_string)
                    else:
                        break
                contig = temp_string
                if (contig != ''):
                    contig = re.split(">| ", contig)[1][:-1]
            # Reached the contaminated contig: skip over its sequence lines
            if (contamination == contig and contig != ''):
                last_pos = fcontig.tell()
                fcontig.readline()
                while contig != '':
                    if contig[0] == '>':
                        fcontig.seek(last_pos)
                        break
                    last_pos = fcontig.tell()
                    contig = fcontig.readline()
            contamination = fcontamination.readline()
    # Copy any contigs remaining after the last contamination entry
    contig = fcontig.readline()
    while contig != '':
        fout.write(contig)
        fout.write(fcontig.readline())
        contig = fcontig.readline()
    fout.close()


if __name__ == "__main__":
    sys.exit(main())
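
Matching the usage string above and the Snakefile call, an illustrative invocation is:

python remove_contaminations.py blast/sample.contaminants fasta/sample.filtered.fasta fasta/sample.wocontamination.fasta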

From lines 36-40 of master/Snakefile:
shell:
    """
    {SAMTOOLS} sort -@ {THREADS} -n {input} -o {output.sorted}
    {GIT_ROOT}/bxtools/bin/bxtools filter {output.sorted} -b -s 0.2 -q 10 >{output.unmapped_bam}
    """

From lines 48-61 of master/Snakefile:
run:
    shell("rm -rf temp_reads")
    shell("mkdir temp_reads")
    shell("mkdir -p fasta")

    shell("{GIT_ROOT}/bxtools/bin/bxtools bamtofastq {input.bam} temp_reads/")
    if DATA == "other":
        shell("{SPADES} -t {THREADS} -k {VELVET_K} -1 temp_reads/{wildcards.sample}_R1.fastq -2 temp_reads/{wildcards.sample}_R2.fastq --only-assembler -o spades_{wildcards.sample}")
        shell("cp spades_{wildcards.sample}/scaffolds.fasta fasta/{wildcards.sample}.fasta")
    else:
        shell("{VELVETH} velvet_{wildcards.sample} {VELVET_K} -shortPaired -fastq -separate temp_reads/{wildcards.sample}_R1.fastq temp_reads/{wildcards.sample}_R2.fastq")
        shell("{VELVETG} velvet_{wildcards.sample} -exp_cov auto -cov_cutoff {VELVET_COVERAGE} -max_coverage 100 -scaffolding no")
        shell("cp velvet_{wildcards.sample}/contigs.fa fasta/{wildcards.sample}.fasta")
    shell("rm -rf temp_reads/")

From lines 69-72 of master/Snakefile:
shell:
     """
     {GIT_ROOT}/contig_length_filter.py 200 {input.fasta} {output.filtered_fasta}
     """

From lines 79-82 of master/Snakefile:
shell:
    """
    {GIT_ROOT}/bamtofastq {ADDITIONAL_BAMTOFASTQ_FLAGS} {input} {output.temp_dir}
    """

From lines 89-97 of master/Snakefile:
run:
    if BLAST_DB != 'None':
        shell("mkdir -p blast")
        shell("{BLASTN} -task megablast -query {input.filtered_fasta} -db {BLAST_DB} -num_threads {THREADS} > blast/{wildcards.sample}.megablast")
        shell("{GIT_ROOT}/cleanmega blast/{wildcards.sample}.megablast blast/{wildcards.sample}.cleanmega")
        shell("{GIT_ROOT}/find_contaminations.py blast/{wildcards.sample}.cleanmega blast/{wildcards.sample}.contaminants")
        shell("python {GIT_ROOT}/remove_contaminations.py blast/{wildcards.sample}.contaminants {input.filtered_fasta} {output.filtered_fasta}")
    else:
        shell("cp {input.filtered_fasta} {output.filtered_fasta}")

From lines 109-114 of master/Snakefile:
shell:
    """
    {LONGRANGER} mkref {input.filtered_fasta}
    {LONGRANGER} align --localcores={THREADS} --localmem={MEMORY} --id=temp_{wildcards.sample} --reference={output.refdata} --fastqs={input.temp_dir}/
    {SAMTOOLS} view -b -F 12 temp_{wildcards.sample}/outs/possorted_bam.bam >{output.mapped_bam}
    """

From lines 122-126 of master/Snakefile:
shell:
    """
    {GIT_ROOT}/bxtools/bin/bxtools filter {input.mapped_bam} -s 0.2 -c 0.2 -q 50 >mapped/{wildcards.sample}.filtered.bam
    {GIT_ROOT}/bxtools/bin/bxtools split-by-ref mapped/{wildcards.sample}.filtered.bam -o {output.barcode_folder}
    """

From lines 135-139 of master/Snakefile:
shell:
    """
    mkdir small_bams_{wildcards.sample}
    {GIT_ROOT}/bxtools/bin/bxtools extract {input.sample} {input.barcode_folder} {output.small_bams}/
    """

From lines 148-163 of master/Snakefile:
shell:
     """
     mkdir -p {output.small_reads}
     function prepare_reads {{

     a="$(basename $1 | sed "s/\..*//")"
     if [ -d "{output.small_reads}_2/$a" ]; then
     mv {output.small_reads}_2/$a {output.small_reads}/$a
     return
     fi
     mkdir -p {output.small_reads}/$a
     {GIT_ROOT}/bxtools/bin/bxtools bamtofastq {input.small_bams}/$a.bam {output.small_reads}/$a/
     }}
     export -f prepare_reads
     parallel --jobs {THREADS} prepare_reads ::: {input.small_bams}/*
     """

From lines 171-187 of master/Snakefile:
shell:
    """
    mkdir -p {output.contigs}
    mkdir -p {output.assemblies_folder}
    function local_assembly {{
    a="$(basename $1 | sed "s/\..*q//")"
    if [ -f "{output.assemblies_folder}/$a/scaffolds.fasta" ]; then
    cp {output.assemblies_folder}/$a/scaffolds.fasta {output.contigs}/$a.fasta
    return
    fi
    {SPADES} --only-assembler -t 1 -m {MEMORY_PER_THREAD} -k {SPADES_K} --cov-cutoff 3 --pe1-1 {input.small_reads}/$a/$a\_R1.fastq --pe1-2 {input.small_reads}/$a/$a\_R2.fastq -o {output.assemblies_folder}/$a
    cp {output.assemblies_folder}/$a/scaffolds.fasta {output.contigs}/$a.fasta
    rm -rf {output.assemblies_folder}/$a/K*
    }}
    export -f local_assembly
    parallel --jobs {THREADS} local_assembly ::: {input.small_reads}/*
    """

From lines 199-218 of master/Snakefile:
shell:
    """
    mkdir -p {output.splitted_insertions}
    mkdir -p {output.prefilter_contigs}
    python {GIT_ROOT}/split_fasta.py {input.insertions} {output.splitted_insertions}
    mkdir -p quast_res
    mkdir -p {output.filtered_contigs}
    function filter_target_contigs {{
    a="$(basename $1 | sed "s/\..*//")"
    {GIT_ROOT}/contig_length_filter.py 500 {input.contigs}/$a.fasta {output.prefilter_contigs}/$a.fasta
    {QUAST} -t 1 --fast --min-contig 199 -R {output.prefilter_contigs}/$a.fasta {output.splitted_insertions}/$a.fasta -o quast_res/$a
    python {GIT_ROOT}/filter_correct_record.py {output.prefilter_contigs}/$a.fasta quast_res/$a/contigs_reports/all_alignments_$a.tsv {output.filtered_contigs}/$a.fasta
    }}
    export -f filter_target_contigs
    set +oe pipefail
    parallel --jobs {THREADS} filter_target_contigs ::: {input.contigs}/*
    cat {output.filtered_contigs}/*.fasta >{output.contigs}
    rm -rf quast_res
    rm -f core*
    """

From lines 225-229 of master/Snakefile:
shell:
    """
    {QUAST} --fast -R {GENOME} {input.contigs} -o quast_res
    cp quast_res/contigs_reports/all_alignments_final_set_*.tsv {output.final_alignments}
    """

From lines 237-240 of master/Snakefile:
shell:
    """
    python {GIT_ROOT}/minimap_to_vcf.py {input.final_alignments} {input.contigs} {output.final_vcf}
    """

# Likely split_fasta.py: write each record of a multi-FASTA into its own
# numbered file (1.fasta, 2.fasta, ...) inside the output directory.
import sys
import os

from Bio import SeqIO

directory = sys.argv[2]
os.makedirs(directory, exist_ok=True)  # create the output directory if needed


current_index = 1
for record in SeqIO.parse(sys.argv[1], "fasta"):
    with open(directory + os.sep + str(current_index) + ".fasta", "w") as output_handle:
        SeqIO.write([record], output_handle, "fasta")
        current_index += 1
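
The pipeline uses this to split the candidate insertion set into one file per record (1.fasta, 2.fasta, ...):

python split_fasta.py insertions.fasta splitted_insertions/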