Novel insertion detection with 10X reads


Novel-X: Novel sequence insertion detection using Linked-Reads

Novel-X detects and genotypes novel sequence insertions in 10X sequencing datasets using non-trivial read alignment signatures and barcode information.

  1. Installation

  2. Command Options

  3. Output Formats

  4. Example Commands

  5. Publications

  6. Contact & Support

Installation

To start working with Novel-X, please clone this repository recursively:

git clone --recursive git@github.com:1dayac/Novel-X.git

If you clone the repository non-recursively, Novel-X will not work. To fix this, run the following from the Novel-X folder:

git submodule update --init --recursive

Novel-X is a pipeline built on the popular Snakemake workflow management system; it consists of multiple steps and requires a number of external tools.

First, the following software should be installed (version numbers used for testing are shown in brackets, but other versions should also work):

  • Long Ranger (version 2.15) - Download Page

  • Velvet (commit 9adf09f) - GitHub Page - an older but still useful assembler that makes minimal assumptions about the data. Note that we use a k-mer length of 63 when assembling 10X data, so Velvet should be compiled using the

make 'MAXKMERLENGTH=63'

command. For more information, refer to the Velvet manual.

All of these programs (except Long Ranger) can be installed with conda. We provide a conda-env.yml file that allows you to install them using the following command:

conda env create -f conda-env.yml

Paths to executables (if they are not in $PATH) should be provided in the path_to_executables_config.json file.
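
For illustration only, the file maps tool names to absolute paths. The key names below are hypothetical; check the path_to_executables_config.json shipped with the repository for the real ones:

# hypothetical example - key names are illustrative, not authoritative
{
    "velveth": "/opt/velvet/velveth",
    "velvetg": "/opt/velvet/velvetg",
    "longranger": "/opt/longranger/longranger"
}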

Python dependencies are listed in the requirements.txt file. They can be downloaded and installed with the following command:

pip install -r requirements.txt

Inside the bxtools folder, run the following commands (estimated execution time is around 2 minutes):

./configure
make
make install
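
To sanity-check the build, you can verify that the binary the Snakefile rules expect (bxtools/bin/bxtools) is now present:

# run from the Novel-X folder
test -x bxtools/bin/bxtools && echo "bxtools built OK"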

We tested our tool on CentOS Linux 7, but we expect it to work on any modern Unix-like system.

Then you are ready to go.

Command Options

Novel-X is run via the novel-x.py script in one of two modes:

  • run - run the pipeline from scratch

  • restart - if a previous run did not finish for some reason, you can try to catch up with the novel-x.py restart command.

A typical command to start Novel-X is

python novel-x.py run --bam my_bam.bam --genome my_genome.fasta --outdir my_dir

Optional arguments are:

  • --lr20 - needed if you run the pipeline on a bam file produced by the Long Ranger 2.0 pipeline

  • --nt - optional filtering of non-human sequences from the orphan contigs (takes the path to a BLAST database, as in the example command below)

We added two option groups to handle different data types and their properties (molecule length, intra-molecule coverage, etc.); an example combining them follows the two lists below.

Data option group:

  • --10x - for 10X Genomics data [Default]

  • --tellseq - for Tell-Seq data

  • --stlfr - for stLFR data

Tell-Seq and stLFR data should first be converted to a LongRanger-compatible bam. For stLFR data, use this pipeline. For Tell-Seq data, refer to the Tell-Seq paper.

Coverage group:

  • --high-coverage - best for 60X coverage and higher [Default]

  • --low-coverage - best for 20X-40X coverage
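
For example, a hypothetical stLFR sample sequenced at ~30X coverage would combine the two groups like this:

python novel-x.py run --stlfr --low-coverage --bam my_stlfr.bam --genome my_genome.fasta --outdir my_dir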

You can invoke the help message by typing:

python novel-x.py run --help

or

python novel-x.py restart --help

Output Formats

Novel-X writes its results to a VCF file. If your bam file was named HM2KYBBXX_NA18509.bam, the resulting VCF file will be named HM2KYBBXX_NA18509.vcf and will be stored inside the outdir folder.
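
Each record follows the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), with extra tab-separated columns carrying the source contig name, anchor lengths, and contig positions (see minimap_to_vcf.py in the Code Snippets below). As a minimal sketch, insertion positions and lengths can be summarized with awk:

awk -F'\t' '!/^#/ {print $1, $2, length($5) - length($4)}' HM2KYBBXX_NA18509.vcf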

Example Commands

Run from the start:

python ~/Novel-X/novel-x.py run --bam /athena/ihlab/scratch/dmm2017/70_samples_data/HLF3WBBXX_NA12006_longranger.bam -t 8 -m 200 --nt /athena/ihlab/scratch/dmm2017/blast_database/ --genome /athena/ihlab/scratch/dmm2017/hg38/hg38.fa --outdir /athena/ihlab/scratch/dmm2017/70_samples/novelx_NA12006

Restart from the last stage:

python ~/Novel-X/novel-x.py restart --outdir novelx_NA12006

There is currently a problem at the filter_target_contigs stage: it can exit with a non-zero exit code. We recommend commenting out the following line (prefix it with #) before using the restart option:

parallel --jobs {THREADS} filter_target_contigs ::: {input.contigs}/*

Demo command

We placed a toy dataset in the demo folder to check that the software is installed correctly. You can run the command:

python ~/Novel-X/novel-x.py run --bam ~/Novel-X/demo/demo.bam -t 1 -m 20 --genome ~/Novel-X/demo/demo.fasta --outdir out

This command takes about 15 minutes to finish on our hardware. It produces a VCF file with a single record:

chr1_25500000_25535000 29503 . T TGTATTGTGTGTATGAGGGTTGTGTGCTGTGTGTTGTGTATATATTGTATGTGTTATGTGTATGTATGTCGTATGAGTGTATTCTGTATATGTGTTTTGTGTGGTCTATTATGTATGTGGCATGTGTTGTGTATGTGTGTTGTGTGTGATGTGTTGTATGTGTGTTGTGCATATATGTTGTTTCTGTGTATGTATGTTATGTGTATGTGTATGTTGTGTTGTATGTATGGGTTGTGCCTATGTGCTGTGTTGTGTGCTGCATGCATGTTTGTGTGGTGTGTGTATTTAGGTTGTGTGCTATTTATGTGTCTATATTGTATGTGTTGTATGTGTGTTGTATGTATGTGTAGTGTATGTGTGTTGTGTGTGATGTGTATATGTGGTGTGTGTATGTCTGTTATGTGTATGTATGAGTGTATGTGTGTTGTGTGTGTTGTGTATATGTGTTGTGTGTGTTGTGTATGTGTGTTGTGAGTTGTGTATATGTGGTGTGAGTTGTGTTGTGTCATGTATGTGTGCATTGTGTATAGGTGTTGCATGTGTGTTGTGTTGTGTGTATGTGTTGTGTTGTGTATATGTGGTATGTGAATGTGTATGTTGTATGTTGTGTTGTATGTATATGTGTTATGTATATGTGATGTGTGTGTTGTGTATATGCTGGGTGTGTGTGTACATGTGTGTATGTGTGTTGTATGTATGTGTGTATGCATGTGTGTTGCGTATATGTGGTATGTGTGCATGTGTGTTGTCATGTGTATGTGTGTTGTGTATATGTGTGTGTTGTGTATATGTGTTGTGTGTATGTGTATCATGTTGTGTGTATGTGTTATGTTGTGTATATGTGGTGTGTGAATGTGTGTTGTGTGTATGTGTATGTTGTCTGTTTTGTGTGTGTATACGTGGTGTGTGTGTGTTGTGTTGTGTATATGTGTTGTGTGTGTTGCGTGTATGTGTTGTGTGTT . PASS DP=100 NODE_1_length_4180_cov_43.887399 2776 276 347 1306

The output may differ slightly depending on your software versions.
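
As a quick sanity check (assuming the naming convention above, i.e. out/demo.vcf), the file should contain exactly one record:

grep -vc '^#' out/demo.vcf    # expected output: 1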

Publications

"Novel sequence insertion detection using Linked-Reads" preprint is available at https://www.biorxiv.org/content/10.1101/551028v1 .

Contact & Support

Feel free to send any inquiries to [email protected]

Code Snippets

# Likely contig_length_filter.py: keep only FASTA records longer than a given
# length threshold.
import sys
from Bio import SeqIO

if len(sys.argv) < 4:
    print("Usage: %s <length threshold> <contigs_file> <output>" % sys.argv[0])
    sys.exit(1)

threshold = int(sys.argv[1])
input_seq_iterator = SeqIO.parse(sys.argv[2], "fasta")
# Lazily filter records that exceed the length threshold
filtered_iterator = (record for record in input_seq_iterator
                     if len(record.seq) > threshold)

with open(sys.argv[3], "w") as output_handle:
    SeqIO.write(filtered_iterator, output_handle, "fasta")
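
In the pipeline this filter is applied with 200 bp and 500 bp thresholds (see the Snakefile rules below); a standalone invocation looks like:

python contig_length_filter.py 200 contigs.fasta filtered.fasta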

# Likely filter_correct_record.py: given a QUAST/nucmer alignment table, write
# out only those contigs that appear in the table's contig-name column.
import sys
from Bio import SeqIO

with open(sys.argv[2], "r") as nucmer_file:
    lines = nucmer_file.readlines()
    record_dict = SeqIO.to_dict(SeqIO.parse(sys.argv[1], "fasta"))
    if len(lines) == 1:
        sys.exit(0)  # header line only - nothing aligned

    records_to_output = []
    with open(sys.argv[3], "w") as output_handle:
        for line in lines:
            split = line.split("\t")
            # Skip malformed lines and the header row
            if len(split) < 5 or split[0] == "S1":
                continue
            if split[4].strip() not in records_to_output:
                records_to_output.append(split[4].strip())
        for record in records_to_output:
            SeqIO.write([record_dict[record]], output_handle, "fasta")
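
In the Snakefile this script receives the pre-filtered contigs of one candidate insertion, the matching QUAST all_alignments_*.tsv report, and an output path; an illustrative invocation (file names are placeholders):

python filter_correct_record.py prefilter_contigs/1.fasta quast_res/1/contigs_reports/all_alignments_1.tsv filtered_contigs/1.fasta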

# Likely minimap_to_vcf.py: parse a QUAST/nucmer alignment report
# (all_alignments_*.tsv), classify contig breakpoints into insertions and
# deletions, and write the insertion calls to a VCF-like file.
import sys
import re
from Bio import SeqIO


current_set = []
current_node = ""
current_chrom = ""


unique_insertions = []
non_unique_insertions = []
unique_deletions = []
non_unique_deletions = []


class Deletion(object):
    def __init__(self, node, contig_pos, rc, ref_id, ref_pos1, ref_pos2, anchor1, anchor2):
        self.node = node
        self.contig_pos = contig_pos
        self.rc = rc
        self.ref_id = ref_id
        self.ref_pos1 = ref_pos1
        self.ref_pos2 = ref_pos2
        self.anchor1 = anchor1
        self.anchor2 = anchor2

    def __eq__(self, other):
        # Deletions are considered equal if any pair of breakpoints is within 100 bp
        return (abs(self.ref_pos1 - other.ref_pos1) < 100 or abs(self.ref_pos2 - other.ref_pos2) < 100 or abs(self.ref_pos2 - other.ref_pos1) < 100 or abs(self.ref_pos1 - other.ref_pos2) < 100)

    def size(self):
        return  abs(self.ref_pos2 - self.ref_pos1) - 1

class Insertion(object):
    def __init__(self, node, pos1, pos2, rc, ref_id, ref_pos, anchor1, anchor2, overlap):
        self.node = node
        self.pos1 = pos1
        self.pos2 = pos2
        self.rc = rc
        self.ref_id = ref_id
        self.ref_pos = ref_pos
        self.anchor1 = anchor1
        self.anchor2 = anchor2
        self.overlap = overlap

    def __eq__(self, other):
        # Insertions are considered equal if their reference positions are within 100 bp
        return abs(self.ref_pos - other.ref_pos) < 100

    def size(self):
        return abs(self.pos2 - self.pos1) + self.overlap - 1

def Overlaps(al1, al2):
    # True if the contig-coordinate intervals of the two alignments overlap.
    cont11 = al1.cont1
    cont12 = al1.cont2

    cont21 = al2.cont1
    cont22 = al2.cont2

    if cont11 > cont12:
        cont11, cont12 = cont12, cont11

    if cont21 > cont22:
        cont21, cont22 = cont22, cont21

    # Standard interval-overlap test (the original second clause compared in
    # the wrong direction and could almost never be true).
    return cont11 <= cont22 and cont21 <= cont12


def Near(sv1, sv2):
    return sv2 - sv1 < 50 and sv2 - sv1 > -10

def BigAnchors(al1, al2):
    return abs(al1.cont1 - al1.cont2) > 300 or abs(al2.cont1 - al2.cont2) > 300



def IsDeletion(al1, al2):
    # NOTE: besides its arguments, this function also reads the module-level
    # cont11/cont12/cont21/cont22 globals set by the parsing loop below.
    ref11 = al1.ref1
    ref12 = al1.ref2

    is_rc = ref12 - ref11 < 0
    ref21 = al2.ref1
    ref22 = al2.ref2
    is_rc2 = ref22 - ref21 < 0

    if is_rc != is_rc2:
        return False
    if ref11 > ref12:
        ref11, ref12 = ref12, ref11

    if ref21 > ref22:
        ref21, ref22 = ref22, ref21

    if is_rc:
        contig_breakpoint = al1.cont1
    else:
        contig_breakpoint = al1.cont2

    if cont21 - cont12 < 5:
        if ref21 - ref12 > 5:
            deletion = Deletion(al1.node, contig_breakpoint, is_rc, al1.chrom, ref12, ref21, abs(cont22 - cont21), abs(cont12 - cont11))
            if deletion.size() > 1000:
                print(deletion.size())
                print("Deletion: " + str(al1) + " \t " + str(al2))
            if deletion not in unique_deletions:
                unique_deletions.append(deletion)
            else:
                non_unique_deletions.append(deletion)
            return True

    if cont22 < cont11:
        deletion = Deletion(al1.node, contig_breakpoint, is_rc, al1.chrom, ref12, ref21, abs(cont22 - cont21), abs(cont12 - cont11))
        if deletion.size() > 1000:
            print(deletion.size())
            print("Deletion: " + str(al1) + " \t " + str(al2))

        if deletion in non_unique_deletions:
            return False
        if cont11 - cont22 > 5:
            if deletion not in unique_deletions:
                unique_deletions.append(deletion)
            else:
                non_unique_deletions.append(deletion)
            return True
    return False

def GetOverlap(ref11, ref12, ref21, ref22):
    # Length of the reference-coordinate overlap between two alignment
    # intervals (0 if they are disjoint).

    if ref11 > ref12:
        ref11, ref12 = ref12, ref11
    if ref21 > ref22:
        ref21, ref22 = ref22, ref21


    if (min(ref21, ref22) > max(ref11, ref12)) or (max(ref21, ref22) < min(ref11, ref12)):
        return 0

    if ref11 < ref21 < ref12 and ref22 > ref12:
        return ref12 - ref21

    if ref11 < ref22 < ref12 and ref21 < ref11:
        return ref22 - ref11



    return 0

def IsInsertion(al1, al2):
    # Call an insertion when two alignments of the same contig leave a gap in
    # contig coordinates while being (nearly) adjacent on the reference.
    cont11 = al1.cont1
    cont12 = al1.cont2
    ref_breakpoint11 = al1.ref1
    ref_breakpoint12 = al1.ref2
    is_rc = cont12 - cont11 < 0

    cont21 = al2.cont1
    cont22 = al2.cont2
    ref_breakpoint21 = al2.ref1
    ref_breakpoint22 = al2.ref2
    is_rc2 = cont22 - cont21 < 0

    if cont21 > cont22:
        cont21, cont22 = cont22, cont21
        ref_breakpoint21, ref_breakpoint22 = ref_breakpoint22, ref_breakpoint21

    if is_rc != is_rc2:
        return False


    if cont11 > cont12:
        cont11, cont12 = cont12, cont11
        ref_breakpoint11, ref_breakpoint12 = ref_breakpoint12, ref_breakpoint11

    if (ref_breakpoint21 < ref_breakpoint11 and ref_breakpoint11 < ref_breakpoint12):
        return False

    overlap = GetOverlap(ref_breakpoint11, ref_breakpoint12, ref_breakpoint21, ref_breakpoint22)
    print(overlap)
    if abs(ref_breakpoint12 - ref_breakpoint21) > 50 and overlap == 0:
        print("Not an exact insertion")
        return False

    if cont12 < cont21:
        if cont21 - cont12 > 5:
            ins = Insertion(al1.node, cont12, cont21, is_rc, al1.chrom, ref_breakpoint12, abs(cont22 - cont21), abs(cont12 - cont11), overlap)
            if ins.size() > 500:
                print(ins.size())
                print("Insertion: " + str(al1) + " \t " + str(al2))
            if ins not in unique_insertions:
                unique_insertions.append(ins)
            else:
                non_unique_insertions.append(ins)
            return True


    if cont22 < cont11:
        ins = Insertion(al1.node, cont22, cont11, is_rc, al1.chrom, ref_breakpoint12, abs(cont22 - cont21), abs(cont12 - cont11), overlap)
        if ins.size() > 500:
            print(ins.size())
            print("Insertion: " + str(al1) + " \t " + str(al2))
        if cont11 - cont22 > 5:
            if ins not in unique_insertions:
                unique_insertions.append(ins)
            else:
                non_unique_insertions.append(ins)
            return True
    return False


def AnalyzeCurrentSet(al1, al2):
    if BigAnchors(al1, al2):
        if IsInsertion(al1, al2):
            pass
            #print(str(al1) + " \t " +  str(al2))
        if IsDeletion(al1, al2):
            pass
            #print(str(al1) + " \t " +  str(al2))


class ContigFilter(object):

    def __init__(self):
        self.coverage_cutoff = 0.0
        self.length_cutoff = 0

    def ParseCoverage(self, contig_name):
        # Extract the coverage value from a Velvet-style contig name (..._cov_<float>)
        m = re.search(r"cov_([\d\.]+)", contig_name)
        return float(m.group(1))

    def ParseLength(self, contig_name):
        # Extract the length value from the contig name (..._length_<int>_...)
        m = re.search(r"length_([\d\.]+)_", contig_name)
        return int(m.group(1))


    def PassFilter(self, contig_name):
        return self.ParseCoverage(contig_name) > self.coverage_cutoff \
               and self.ParseLength(contig_name) > self.length_cutoff




class Alignment(object):
    def __init__(self, ref1, ref2, cont1, cont2, chrom, node):
        self.ref1 = ref1
        self.ref2 = ref2
        self.cont1 = cont1
        self.cont2 = cont2
        self.chrom = chrom
        self.node = node

    def __str__(self):
        return str(self.ref1) + "\t" + str(self.ref2) + "\t" + str(self.cont1) + "\t" + str(self.cont2) + "\t" + self.chrom + "\t" + self.node
contig_filter = ContigFilter()  # renamed from "filter" to avoid shadowing the builtin

# First pass: scan the alignment report for breakpoint pairs flagged by QUAST
# (misassembly/relocation/indel markers) and classify each pair.
with open(sys.argv[1], "r") as nucmer_file:
    lines = nucmer_file.readlines()
    for i in range(1, len(lines) - 1):
        line = lines[i]
        if line.startswith("local misassembly") or line.startswith("transloc") or line.startswith("relocation") or line.startswith("indel") or line.startswith("unknown"):
            line1 = lines[i-1]
            line2 = lines[i+1]
            if line2.startswith("CON"):
                continue

            chrom = line1.split("\t")[4].strip()

            ref11 = int(line1.split("\t")[0].strip())
            ref12 = int(line1.split("\t")[1].strip())

            ref21 = int(line2.split("\t")[0].strip())
            ref22 = int(line2.split("\t")[1].strip())

            cont11 = int(line1.split("\t")[2].strip())
            cont12 = int(line1.split("\t")[3].strip())

            cont21 = int(line2.split("\t")[2].strip())
            cont22 = int(line2.split("\t")[3].strip())

            contig_name = line1.split("\t")[5].strip()

            if not contig_filter.PassFilter(contig_name):
                continue


            al1 = Alignment(ref11, ref12, cont11, cont12, chrom, contig_name)
            al2 = Alignment(ref21, ref22, cont21, cont22, chrom, contig_name)

            AnalyzeCurrentSet(al1, al2)

def get_insertion_pos(cont11, cont12, contig_length):
    # Return the unaligned flank of the contig (the candidate insertion):
    # whichever end of the contig extends further beyond the alignment.
    if contig_length - max(cont11, cont12) > min(cont11, cont12):
        return max(cont11, cont12), contig_length
    else:
        return 0, min(cont11, cont12)



# Second pass: contigs with a single long alignment, where the unaligned
# flank of the contig is itself the candidate insertion.
with open(sys.argv[1], "r") as nucmer_file:
    lines = nucmer_file.readlines()
    for i in range(2, len(lines)):
        line = lines[i]
        if line.startswith("CONTIG") and (lines[i-2].startswith("CONTIG") or lines[i-2].startswith("S1")):
            alignment_line = lines[i-1]
            if not alignment_line[0].isdigit():
                continue
            chrom = alignment_line.split("\t")[4].strip()
            ref11 = int(alignment_line.split("\t")[0].strip())
            ref12 = int(alignment_line.split("\t")[1].strip())
            cont11 = int(alignment_line.split("\t")[2].strip())
            cont12 = int(alignment_line.split("\t")[3].strip())
            contig_name = alignment_line.split("\t")[5].strip()
            if not contig_filter.PassFilter(contig_name):
                continue
            contig_length = contig_filter.ParseLength(contig_name)
            if abs(cont12 - cont11) < 600:
                continue

            ins_start, ins_end = get_insertion_pos(cont11, cont12, contig_length)

            if abs(ins_start - ins_end) < 50:
                continue

            if ins_start == 0:
                if cont12 < cont11:
                    ref_breakpoint = ref12
                else:
                    ref_breakpoint = ref11
            else:
                if cont12 < cont11:
                    ref_breakpoint = ref11
                else:
                    ref_breakpoint = ref12

            is_rc = cont11 > cont12
            if cont12 < cont11:
                cont12, cont11 = cont11, cont12



            ins = Insertion(contig_name, ins_start, ins_end, is_rc, chrom, ref_breakpoint, abs(cont12 - cont11),
                            0, 0)

            if ins.size() > 500:
                print(ins.size())
                print("Insertion: " + str(chrom) + " \t " + str(ref_breakpoint))
            if ins not in unique_insertions:
                unique_insertions.append(ins)
            else:
                non_unique_insertions.append(ins)

records = list(SeqIO.parse(sys.argv[2], "fasta"))
record_dict = {}
for record in records:
    if record.id not in record_dict.keys():
        record_dict[record.id] = record
# Write the VCF-like insertion records, plus a FASTA of insertions >= 300 bp
# (with flanking anchors) for downstream filtering.
with open(sys.argv[3], "w") as vcf, open("insertions_with_anchors.fasta", "w") as fasta:
    for insertion in unique_insertions:
        if insertion.size() >= 300:
            fasta.write(">" + insertion.node + "\n")
            fasta.write(str(record_dict[insertion.node].seq))
            fasta.write("\n")
        try:
            ins_seq = str(record_dict[insertion.node].seq[insertion.pos1 : insertion.pos2 + insertion.overlap] if not insertion.rc else record_dict[insertion.node].seq[insertion.pos1 : insertion.pos2 + insertion.overlap].reverse_complement())
            vcf.write(str(insertion.ref_id) + "\t" + str(insertion.ref_pos) + "\t" + "." + "\t" + str(record_dict[insertion.node].seq[insertion.pos1]) + "\t" + ins_seq + "\t" + "." + "\t" + "PASS" + "\t" + "DP=100" + "\t" + insertion.node + "\t" + str(insertion.anchor1) + "\t" + str(insertion.anchor2) + "\t" + str(insertion.pos1) + "\t" + str(insertion.pos2) + "\n")
        except (KeyError, IndexError):
            # Contig missing from the FASTA or coordinates out of range - skip the record
            pass
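
In the final pipeline step this script turns the QUAST alignments of the merged contig set into the output VCF; it also writes an insertions_with_anchors.fasta side file with insertions of 300 bp and longer. An illustrative invocation (file names are placeholders):

python minimap_to_vcf.py final_alignments.tsv contigs.fasta sample.vcf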

# Likely remove_contaminations.py (the usage string calls it extractContaminants.py):
# remove contigs listed in the contamination file from the orphan-contigs FASTA.
# Both files are assumed to list contigs in the same order.
import sys
import re


def usage():
    print("\nUsage: python extractContaminants.py contaminationFile orphan.fq.contigs.fa orphan.fq.contigs.wocontamination.fa")
    sys.exit(-1)


def main():
    args = sys.argv[1:]
    if len(args) != 3:
        usage()

    fcontamination = open(sys.argv[1], 'r')
    fcontig = open(sys.argv[2], 'r')
    fout = open(sys.argv[3], 'w')

    # Walk the contamination list and the contig FASTA in lockstep
    contamination = fcontamination.readline()
    while contamination != '':
        contamination = contamination.split()[0]
        contig = fcontig.readline()[:-1]
        if (contig != ''):
            contig = re.split(">| ", contig)[1]
            # Copy contigs through to the output until the contaminated one is reached
            while contamination != contig and contig != '':
                fout.write(">" + contig + "\n")
                while True:
                    temp_string = fcontig.readline()
                    if temp_string != '' and temp_string[0] != '>':
                        fout.write(temp_string)
                    else:
                        break
                contig = temp_string
                if (contig != ''):
                    contig = re.split(">| ", contig)[1][:-1]
            # Reached the contaminated contig: skip over its sequence lines
            if (contamination == contig and contig != ''):
                last_pos = fcontig.tell()
                fcontig.readline()
                while contig != '':
                    if contig[0] == '>':
                        fcontig.seek(last_pos)
                        break
                    last_pos = fcontig.tell()
                    contig = fcontig.readline()
            contamination = fcontamination.readline()
    # Copy any contigs remaining after the last contamination entry
    contig = fcontig.readline()
    while contig != '':
        fout.write(contig)
        fout.write(fcontig.readline())
        contig = fcontig.readline()
    fout.close()


if __name__ == "__main__":
    sys.exit(main())
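
Matching the usage string above and the Snakefile call, an illustrative invocation is:

python remove_contaminations.py blast/sample.contaminants fasta/sample.filtered.fasta fasta/sample.wocontamination.fasta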

From lines 36-40 of master/Snakefile:
shell:
    """
    {SAMTOOLS} sort -@ {THREADS} -n {input} -o {output.sorted}
    {GIT_ROOT}/bxtools/bin/bxtools filter {output.sorted} -b -s 0.2 -q 10 >{output.unmapped_bam}
    """

From lines 48-61 of master/Snakefile:
run:
    shell("rm -rf temp_reads")
    shell("mkdir temp_reads")
    shell("mkdir -p fasta")

    shell("{GIT_ROOT}/bxtools/bin/bxtools bamtofastq {input.bam} temp_reads/")
    if DATA == "other":
        shell("{SPADES} -t {THREADS} -k {VELVET_K} -1 temp_reads/{wildcards.sample}_R1.fastq -2 temp_reads/{wildcards.sample}_R2.fastq --only-assembler -o spades_{wildcards.sample}")
        shell("cp spades_{wildcards.sample}/scaffolds.fasta fasta/{wildcards.sample}.fasta")
    else:
        shell("{VELVETH} velvet_{wildcards.sample} {VELVET_K} -shortPaired -fastq -separate temp_reads/{wildcards.sample}_R1.fastq temp_reads/{wildcards.sample}_R2.fastq")
        shell("{VELVETG} velvet_{wildcards.sample} -exp_cov auto -cov_cutoff {VELVET_COVERAGE} -max_coverage 100 -scaffolding no")
        shell("cp velvet_{wildcards.sample}/contigs.fa fasta/{wildcards.sample}.fasta")
    shell("rm -rf temp_reads/")

From lines 69-72 of master/Snakefile:
shell:
     """
     {GIT_ROOT}/contig_length_filter.py 200 {input.fasta} {output.filtered_fasta}
     """

From lines 79-82 of master/Snakefile:
shell:
    """
    {GIT_ROOT}/bamtofastq {ADDITIONAL_BAMTOFASTQ_FLAGS} {input} {output.temp_dir}
    """

From lines 89-97 of master/Snakefile:
run:
    if BLAST_DB != 'None':
        shell("mkdir -p blast")
        shell("{BLASTN} -task megablast -query {input.filtered_fasta} -db {BLAST_DB} -num_threads {THREADS} > blast/{wildcards.sample}.megablast")
        shell("{GIT_ROOT}/cleanmega blast/{wildcards.sample}.megablast blast/{wildcards.sample}.cleanmega")
        shell("{GIT_ROOT}/find_contaminations.py blast/{wildcards.sample}.cleanmega blast/{wildcards.sample}.contaminants")
        shell("python {GIT_ROOT}/remove_contaminations.py blast/{wildcards.sample}.contaminants {input.filtered_fasta} {output.filtered_fasta}")
    else:
        shell("cp {input.filtered_fasta} {output.filtered_fasta}")

From lines 109-114 of master/Snakefile:
shell:
    """
    {LONGRANGER} mkref {input.filtered_fasta}
    {LONGRANGER} align --localcores={THREADS} --localmem={MEMORY} --id=temp_{wildcards.sample} --reference={output.refdata} --fastqs={input.temp_dir}/
    {SAMTOOLS} view -b -F 12 temp_{wildcards.sample}/outs/possorted_bam.bam >{output.mapped_bam}
    """

From lines 122-126 of master/Snakefile:
shell:
    """
    {GIT_ROOT}/bxtools/bin/bxtools filter {input.mapped_bam} -s 0.2 -c 0.2 -q 50 >mapped/{wildcards.sample}.filtered.bam
    {GIT_ROOT}/bxtools/bin/bxtools split-by-ref mapped/{wildcards.sample}.filtered.bam -o {output.barcode_folder}
    """

From lines 135-139 of master/Snakefile:
shell:
    """
    mkdir small_bams_{wildcards.sample}
    {GIT_ROOT}/bxtools/bin/bxtools extract {input.sample} {input.barcode_folder} {output.small_bams}/
    """

From lines 148-163 of master/Snakefile:
shell:
     """
     mkdir -p {output.small_reads}
     function prepare_reads {{

     a="$(basename $1 | sed "s/\..*//")"
     if [ -d "{output.small_reads}_2/$a" ]; then
     mv {output.small_reads}_2/$a {output.small_reads}/$a
     return
     fi
     mkdir -p {output.small_reads}/$a
     {GIT_ROOT}/bxtools/bin/bxtools bamtofastq {input.small_bams}/$a.bam {output.small_reads}/$a/
     }}
     export -f prepare_reads
     parallel --jobs {THREADS} prepare_reads ::: {input.small_bams}/*
     """

From lines 171-187 of master/Snakefile:
shell:
    """
    mkdir -p {output.contigs}
    mkdir -p {output.assemblies_folder}
    function local_assembly {{
    a="$(basename $1 | sed "s/\..*q//")"
    if [ -f "{output.assemblies_folder}/$a/scaffolds.fasta" ]; then
    cp {output.assemblies_folder}/$a/scaffolds.fasta {output.contigs}/$a.fasta
    return
    fi
    {SPADES} --only-assembler -t 1 -m {MEMORY_PER_THREAD} -k {SPADES_K} --cov-cutoff 3 --pe1-1 {input.small_reads}/$a/$a\_R1.fastq --pe1-2 {input.small_reads}/$a/$a\_R2.fastq -o {output.assemblies_folder}/$a
    cp {output.assemblies_folder}/$a/scaffolds.fasta {output.contigs}/$a.fasta
    rm -rf {output.assemblies_folder}/$a/K*
    }}
    export -f local_assembly
    parallel --jobs {THREADS} local_assembly ::: {input.small_reads}/*
    """

From lines 199-218 of master/Snakefile:
shell:
    """
    mkdir -p {output.splitted_insertions}
    mkdir -p {output.prefilter_contigs}
    python {GIT_ROOT}/split_fasta.py {input.insertions} {output.splitted_insertions}
    mkdir -p quast_res
    mkdir -p {output.filtered_contigs}
    function filter_target_contigs {{
    a="$(basename $1 | sed "s/\..*//")"
    {GIT_ROOT}/contig_length_filter.py 500 {input.contigs}/$a.fasta {output.prefilter_contigs}/$a.fasta
    {QUAST} -t 1 --fast --min-contig 199 -R {output.prefilter_contigs}/$a.fasta {output.splitted_insertions}/$a.fasta -o quast_res/$a
    python {GIT_ROOT}/filter_correct_record.py {output.prefilter_contigs}/$a.fasta quast_res/$a/contigs_reports/all_alignments_$a.tsv {output.filtered_contigs}/$a.fasta
    }}
    export -f filter_target_contigs
    set +oe pipefail
    parallel --jobs {THREADS} filter_target_contigs ::: {input.contigs}/*
    cat {output.filtered_contigs}/*.fasta >{output.contigs}
    rm -rf quast_res
    rm -f core*
    """

From lines 225-229 of master/Snakefile:
shell:
    """
    {QUAST} --fast -R {GENOME} {input.contigs} -o quast_res
    cp quast_res/contigs_reports/all_alignments_final_set_*.tsv {output.final_alignments}
    """

From lines 237-240 of master/Snakefile:
shell:
    """
    python {GIT_ROOT}/minimap_to_vcf.py {input.final_alignments} {input.contigs} {output.final_vcf}
    """

# Likely split_fasta.py: write each record of a multi-FASTA into its own
# numbered file (1.fasta, 2.fasta, ...) inside the output directory.
import sys
import os

from Bio import SeqIO

directory = sys.argv[2]
os.makedirs(directory, exist_ok=True)  # create the output directory if needed


current_index = 1
for record in SeqIO.parse(sys.argv[1], "fasta"):
    with open(directory + os.sep + str(current_index) + ".fasta", "w") as output_handle:
        SeqIO.write([record], output_handle, "fasta")
        current_index += 1
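
The pipeline uses this to split the candidate insertion set into one file per record (1.fasta, 2.fasta, ...):

python split_fasta.py insertions.fasta splitted_insertions/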