Assembly pipeline for 10X Chromium genomes of Mytilus species


Assembly pipeline for 10X Chromium reads, described in the preprint "Three new genome assemblies of blue mussel lineages: North and South European Mytilus edulis and Mediterranean Mytilus galloprovincialis", bioRxiv (https://doi.org/10.1101/2022.09.02.506387).

snakemake (in a conda environment, for example) and singularity need to be installed.
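
For example, a minimal setup could look like this (the environment name snake_env is only a suggestion, matching the activation command used below):

conda create -n snake_env -c conda-forge -c bioconda snakemake
conda activate snake_env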

Supernova storage workarounds

Supernova uses large amounts of storage for temporary and final results.

The supernova results are stored on a remote NAS, which first needs to be mounted on my system:

sshfs nas4:/share/sea/sea/projects/ref_genomes/assembly_10x/results/supernova_assemblies \
results/supernova_assemblies \
-o idmap=user,compression=no,uid=1000,gid=1000,allow_root
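
The share can be unmounted afterwards with the standard FUSE command:

fusermount -u results/supernova_assemblies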

I also used a 4 TB disk as temporary local storage for the supernova computation:

sudo mount /dev/sd[x]1 /data/ref_genomes/assembly_10x/tmp

How to run

To run, use:

conda activate snake_env
snakemake --use-conda \
--use-singularity --singularity-args "-B /nas_sea:/nas_sea" \
-j {threads} \
[either all_v6, asm_improvement, stats, repeats, annotation, finalize or ncbi_submission (see workflow/Snakefile)]
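
For example, a run targeting all_v6 on 16 cores (the core count is only illustrative) would be:

snakemake --use-conda \
--use-singularity --singularity-args "-B /nas_sea:/nas_sea" \
-j 16 \
all_v6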

Code Snippets

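Decompress the genome and split it into FASTA chunks of at least {params.split_size} bp with splitMfasta.pl (an AUGUSTUS utility):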
shell:
    """
    zcat {input} > {output[0]}
    mkdir {output[1]}
    splitMfasta.pl {output[0]} \
    --outputpath={output[1]} --minsize={params.split_size}
    """
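
Run AUGUSTUS (caenorhabditis model, GFF3 output) on each chunk: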
shell:
    """
    augustus --gff3=on --species=caenorhabditis {input} > {output}
    """
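
Join the per-chunk AUGUSTUS predictions with join_aug_pred.pl: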
shell:
    "cat {input} | join_aug_pred.pl > {output}"
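
Error-correct the paired-end reads with Rcorrector: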
shell:
    """
    run_rcorrector.pl -1 {input[0]} -2 {input[1]} \
    -t {threads} \
    -od {params.outdir} \
    > {log} 2>&1
    """
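
Quality- and adapter-trim the corrected read pairs with Trim Galore: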
shell: 
    """
    trim_galore --cores {threads} \
    --phred33 \
    --quality 20 \
    --stringency 1 \
    -e 0.1 \
    --length 70 \
    --output_dir {params.outdir} \
    --basename {params.basename} \
    --dont_gzip \
    --paired \
    {input} \
    > {log} 2>&1
    """
shell:
    "bwa-mem2 index {input}"
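
Map the read pairs with bwa-mem2 and convert the alignments to BAM: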
shell:
    """
    bwa-mem2 mem -t {threads} {input[2]} {input[0]} {input[1]} 2> {log} \
    | samtools view -b -@ {threads} -o {output}
    """
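
Merge the BAMs and sort the result by read name: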
shell:
    """
    samtools merge -@ {threads} {params.tmp_merge} {input}
    samtools sort -@ {threads} -n -o {output} {params.tmp_merge}
    rm {params.tmp_merge}
    """
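
Scaffold the assembly with AGOUTI, using the name-sorted RNA-seq BAM and the gene predictions, then compress the output: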
shell:
    """
    python /opt/agouti/agouti.py scaffold \
    -assembly {input.fa} \
    -bam {input.bam} \
    -gff {input.gff} \
    -outdir {params.outdir} \
    -minMQ {params.minMQ} -maxFracMM {params.maxFracMM} \
    > {log} 2>&1

    gzip -c {params.outdir}/agouti.agouti.fasta > {output}
    """
From line 184 of rules/agouti.smk.
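
Download the Mollusca (taxid 6447) ortholog clusters from OrthoDB: fetch the cluster IDs, then append each cluster's FASTA to a single file: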
run:
    import os
    import subprocess
    import json
    if os.path.exists(output.fasta):
        os.remove(output.fasta)
    # Get clusters for Mollusca taxid=6447
    url = 'http://www.orthodb.org/search?level=6447&limit=50000'
    cmd = f'wget "{url}" -O {output.clustids}'
    subprocess.run(cmd, shell=True, check=True, stderr=subprocess.DEVNULL)
    # Loop for each cluster
    with open(output.clustids, 'r') as fr:
        clusters = json.load(fr)
    for C_id in clusters['data']:
        url = f'http://www.orthodb.org/fasta?id={C_id}&species=all'
        cmd = f'wget "{url}" -O - >> {output.fasta}'
        subprocess.run(cmd, shell=True, check=True, stderr=subprocess.DEVNULL)
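
Build a HISAT2 index from the decompressed assembly: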
shell:
    """
    zcat {input} > tmp_ref.{wildcards.asm}.fa
    hisat2-build tmp_ref.{wildcards.asm}.fa {params.index_prefix} \
    > {log} 2>&1
    rm tmp_ref.{wildcards.asm}.fa
    """
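
Align the RNA-seq reads with HISAT2, then sort and index the BAM: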
shell:
    """
    hisat2 -p {threads} \
    -x {params.index_prefix} \
    -1 {input.reads[0]} -2 {input.reads[1]} \
    2> {log} | \
    samtools sort -@ {threads} -O BAM -o {output[0]}
    samtools index {output}
    """
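
Annotate genes with BRAKER in ETP mode, combining the protein database and the RNA-seq alignments: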
shell:
    """
    find {params.out_dir} -name '{params.to_rm}' -type d -delete
    cp /opt/gm_key_64 ~/.gm_key
    braker.pl \
    --genome {input.genome} \
    --prot_seq {input.prot_db} \
    --bam {params.list_bams} \
    --workingdir {params.out_dir} \
    --species {params.species} \
    --useexisting \
    --etpmode --softmasking --cores {threads} \
    --gff3 \
    --GENEMARK_PATH {params.genemark_path} \
    --PROTHINT_PATH {params.genemark_path}/ProtHint/bin \
    > {log} 2>&1
    """
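
Extract the predicted protein sequences from the annotation with gff3_to_fasta: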
shell:
    """
    gff3_to_fasta -g {input.gff} -f {input.fa} \
    -st pep -d simple -o {params.prefix} \
    > {log} 2>&1
    """
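
Set up the MANTIS reference databases: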
shell:
    """
    mantis setup -c {threads} > {log} 2>&1
    touch {output}
    """
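
Run MANTIS functional annotation on the predicted proteins: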
shell:
    """
    rm -r {params.outdir}
    mantis run \
    -i {input.pep} \
    -o {params.outdir} \
    -od '{params.org_detail}' \
    -km \
    -gff \
    -c {threads} \
    > {log} 2>&1
    """
shell:
    """
    zcat {input} > {output}
    """
shell:
    "wget -O {output} {params.link}"
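
Map the ONT reads to the draft assembly with minimap2, keeping PAF output for scaffolding: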
shell:
    """
    minimap2 -t {threads} -x map-ont \
    {input.ref} {input.ont_reads} \
    > {output} 2> {log}
    """
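
Scaffold with LRScaf using the ONT alignments: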
shell:
    """
    java -jar {input.lrscaf} \
    -c {input.ref} \
    -a {input.paf} \
    -o {params.res}/res \
    -micl 1000 \
    -t mm \
    -i 0.3 \
    -p {threads} \
    > {log} 2>&1

    cp {params.res}/res/scaffolds.fasta {output}
    """
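
Reference-guided scaffolding with RagTag against the M. coruscus reference: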
shell:
    """
    ragtag.py scaffold \
    {input.ref_coruscus} \
    {input.draft_assembly} \
    -o {params.out_dir} \
    --mm2-params '-x asm10 -t {threads}' -u \
    > {log} 2>&1

    cp {params.out_dir}/ragtag.scaffolds.fasta {output}
    """
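
Map the ONT reads again, this time into a sorted and indexed BAM: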
shell:
    """
    minimap2 -t {threads} -ax map-ont \
    {input.ref} {input.ont_reads} 2> {log} | \
    samtools sort -@ {threads} | \
    samtools view -b -@ {threads} -o {output}
    samtools index {output}
    """
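
Polish each target region with Pilon, combining paired-end and ONT alignments: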
shell:
    """
    java -Xmx{params.java_mem}G -jar {input.pilon} \
    --genome {input.ref} \
    --frags {input.pe_bam} \
    --nanopore {input.ont_bam} \
    --targets {input.target} \
    --output {params.prefix} \
    --outdir {params.out_dir} \
    --changes --vcf --tracks --diploid --fix all \
    > {log} 2>&1
    """
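
Another RagTag scaffolding round, identical to the one above: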
shell:
    """
    ragtag.py scaffold \
    {input.ref_coruscus} \
    {input.draft_assembly} \
    -o {params.out_dir} \
    --mm2-params '-x asm10 -t {threads}' -u \
    > {log} 2>&1

    cp {params.out_dir}/ragtag.scaffolds.fasta {output}
    """
shell:
    "wget -O {output} {params.link}"
shell:
    """
    bwa-mem2 index {input}
    """
shell:
    """
    [ ! -e {input.ref}.0123 ] && bwa-mem2 index {input.ref}
    bwa-mem2 mem -t {threads} {input.ref} {input.R1} {input.R2} 2> {log} | \
    samtools sort -@ {threads} | \
    samtools view -b -@ {threads} -o {output}
    samtools index {output}
    """
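
Write 15 target lists (one per sequence for the 14 largest, plus one for the remaining scaffolds), presumably to parallelize Pilon across targets: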
shell:
    """
    for i in {{1..14}}; do
        grep '>' {input.ref} | sed 's/>//' | awk -v record=$i 'NR==record {{print $0}}' \
        > {params.out_dir}/target_$(printf '%02d' $i).txt
    done
    grep '>' {input.ref} | tail -n +15 | sed 's/>//' > {output[14]}
    """
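
Polish with Pilon using the paired-end alignments only: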
shell:
    """
    java -Xmx{params.java_mem}G -jar {input.pilon} \
    --genome {input.ref} \
    --frags {input.pe_bam} \
    --targets {input.target} \
    --output {params.prefix} \
    --outdir {params.out_dir} \
    --changes --vcf --tracks --diploid --fix all \
    > {log} 2>&1
    """
shell:
    "cat {input} | bgzip -c > {output}"
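
Prepare the BlobToolKit inputs: discard sequences consisting only of Ns and symlink the read files: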
shell:
    """
    seqkit grep -v -s -r -p '^N+$' {input.fa} > {output.fa}
    ln -sr {input[1]} results/blobtoolkit/{wildcards.sample}_{wildcards.version}/GM_1.fastq.gz
    ln -sr {input[2]} results/blobtoolkit/{wildcards.sample}_{wildcards.version}/GM_2.fastq.gz
    """
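
Fill the BlobToolKit config with the assembly's scaffold count, span, and prefix: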
shell:
    """
    cp {input.conf} {output.conf}
    SC=$( grep '>' {input.fa} | wc -l )
    SP=$( grep -v '>' {input.fa} | wc -m )
    sed -i "s/scaffold-count.*/scaffold-count:\ $SC/" {output.conf}
    sed -i "s/span.*/span:\ $SP/" {output.conf}
    sed -i "s/prefix.*/prefix:\ {wildcards.sample}_{wildcards.version}/" {output.conf}
    """
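
Run the BlobToolKit insdc-pipeline, itself a Snakemake workflow: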
shell:
    """
    snakemake -p \
    -s /opt/blobtoolkit/insdc-pipeline/Snakefile \
    --directory {params.dir} \
    --use-conda \
    --conda-prefix .conda \
    --configfile {input.conf} \
    -j {threads} \
    --latency-wait 30 \
    --resources btk=1
    """
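
Add both BUSCO result sets to the BlobDir: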
shell:
    """
    {params.blobtools_bin} add \
    --busco {input[2]} \
    --busco {input[1]} \
    {params.blobdir}
    """
shell:
    """
    cp {input[1]} {output[1]}
    cp -r {params.indir} results/blobtoolkit/blobdirs/ && \
    tar -czf {output[2]} {params.tardir} && \
    rm -r {params.tardir}
    """
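
Download and unpack the BUSCO lineage datasets (metazoa and mollusca):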
shell:
    """
    cd {params.outdir}
    wget -c {params.metazoa} -O - | tar -xz
    wget -c {params.mollusca} -O - | tar -xz
    """
shell:
    "zcat {input} > {output}"
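
Run BUSCO in genome mode and move the results into results/busco/: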
shell:
    """
    busco -f -m genome -i {params.fa} -o {params.outdir} \
    -q -c {threads} \
    -l {params.db} > {log} 2>&1
    cp -r {params.outdir} results/busco/ && rm -r {params.outdir}
    """
shell:
    """
    python workflow/scripts/summarize_buscos.py \
    -o {output} \
    --files {input}
    """
shell:
    "seqkit replace -p '\s.*$' -r '' {input} > {output}"
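
Align the five genomes with Cactus; the seqfile gives the guide tree and the input FASTAs: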
shell:
    """
    echo "((((mgal_02,GCA900618805),medu_02),mtro_02),GCA017311375);" > {params.seqfile}
    echo "*GCA017311375 {input[0]}" >> {params.seqfile}
    echo "mtro_02 {input[1]}" >> {params.seqfile}
    echo "medu_02 {input[2]}" >> {params.seqfile}
    echo "mgal_02 {input[3]}" >> {params.seqfile}
    echo "GCA900618805 {input[4]}" >> {params.seqfile}

    cactus \
    {params.restart} \
    --maxCores {threads} \
    results/cactusJobStore \
    {params.seqfile} \
    {output} \
    > {log} 2>&1
    """
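
Compute genotype likelihoods with ANGSD at the indexed target sites: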
shell:
    """
    echo '{input.bams}' | tr ' ' '\\n' > results/calling/bam.list
    angsd sites index {input.targets} > {log} 2>&1
    angsd -P {threads} \
    -bam results/calling/bam.list \
    -ref {input.ref} \
    -sites {input.targets} \
    -remove_bads 1 -uniqueOnly 1 -minMapQ 20 -minQ 20 \
    -gl 2 -doMajorMinor 3 -doGlf 2 -doBcf 1 \
    -doPost 1 -doMaf 1 -doGeno 1 -doCounts 1 \
    -setMinChunkSize 10000 \
    -out {params.prefix} \
    >> {log} 2>&1
    """
shell:
    """
    paste <(zcat {input[0]}) <(zcat {input[1]} | cut -f 4-) | \
    bgzip -c > {output}
    """
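
Run PCAngsd on the Beagle genotype likelihoods (PCA, SNP weights, admixture):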
shell:
    """
    python /opt/pcangsd/pcangsd.py \
    -beagle {input} \
    -minMaf {params.min_maf} \
    -iter 200 \
    -e {params.eigenvectors} \
    -o {params.prefix} \
    -sites_save \
    -snp_weights \
    -admix \
    -threads {threads} \
    > {log} 2>&1
    """
shell:
    """
    bwa-mem2 index {input} > {log} 2>&1
    """
shell:
    """
    bwa-mem2 mem -t {threads} \
    -R \"{params.rg_string}\" \
    {input.fa} \
    {input.fastqs} 2> {log} | \
    samtools sort -@ {threads} -o {output[0]} -
    samtools index {output[0]}
    """
shell:
    """
    bcftools query -f '%CHROM\\t%POS\\n' {input} | \
    grep -v "ref" > {output}
    """
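
Call genotypes at the target positions with bcftools mpileup and call: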
shell:
    """
    bcftools mpileup \
    -f {input.ref} \
    -R {input.targets} \
    --redo-BAQ -a "FORMAT/AD,FORMAT/DP" \
    -Ou {input.bams} 2> {log} | \
    bcftools call -mG - -Ob --threads {threads} -o {output} \
    >> {log} 2>&1
    """
shell:
    """
    bcftools index -f {input.bcf1}
    bcftools index -f {input.bcf2}
    bcftools merge -m snps -Ob -o {output} -R {input.targets} \
    --threads {threads} {input.bcf1} {input.bcf2} > {log} 2>&1
    """
script:
    "../scripts/btk_conta_extraction.py"
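
Split the BlobDir into kept and excluded subsets based on the kept-scaffolds list: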
shell:
    """
    tail -n +2 {input[1]} | cut -d ',' -f 1 > {params.tmp_kept_list}

    {params.blobtools_bin} filter \
    --list {params.tmp_kept_list} \
    --output {output[0]} {params.blobdir}

    {params.blobtools_bin} filter \
    --list {params.tmp_kept_list} \
    --invert \
    --output {output[1]} {params.blobdir}

    rm {params.tmp_kept_list}
    """
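
Build the decontaminated assembly: keep the listed scaffolds, trim terminal Ns, drop sequences shorter than 1 kb or with at least 90% Ns, sort by decreasing length, and rename with a uniform prefix: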
shell:
    """
    seqkit grep -f <(tail -n +2 {input.kept} | cut -d ',' -f 1) {input[0]} \
    | seqkit replace -is -p "^N+|N+$" -r "" > {input.fa}_tmp

    seqkit fx2tab -n -i --gc --length -B N {input.fa}_tmp \
    | awk '($2 >= 1000 && $4 < 90) {{print $1}}' > {input.fa}_filt_list
    seqkit grep -f {input.fa}_filt_list {input.fa}_tmp > {input.fa}_tmp2

    seqkit sort --by-length -2 --reverse {input.fa}_tmp2 \
    | seqkit replace -p .+ -r "{params.scaff_prefix}_{{nr}}" --nr-width {params.nr_width} \
    | gzip -c > {output}

    rm {input.fa}_tmp* {input.fa}_filt_list
    """
shell:
    """
    wget {params.ftp_0} -O {output[0]}
    wget {params.ftp_1} -O {output[1]}
    wget {params.ftp_2} -O {output[2]}
    wget {params.ftp_3} -O {output[3]}
    """
shell:
    """
    cp {input} {output}
    """
shell:
    """
    agat_sp_manage_IDs.pl --gff {input} \
    --ensembl --prefix {params.prefix} \
    -o {output} > {log} 2>&1
    """
shell:
    """
    agat_convert_sp_gxf2gxf.pl -g {input} -gvi 3 -gvo 3 \
    -o {output} > {log} 2>&1
    """
shell:
    """
    gff3sort.pl {input} > {params.tmp}
    bgzip {params.tmp}
    tabix -p gff {output}
    """
shell:
    "cat {input} | gzip -c > {output}"
shell:
    "cp {input} {output}"
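
Count k-mers across the read files with KMC: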
shell:
    """
    ls {input} > {params.files}
    kmc -k{params.k} -t{threads} -m{params.mem} \
    -ci1 -cs10000 @{params.files} \
    {params.db} /tmp/ \
    > {log} 2>&1
    """
shell:
    """
    kmc_tools -t{threads} transform \
    {params.db} histogram \
    {output} -cx10000 \
    > {log} 2>&1
    """
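
Fit the k-mer histogram with GenomeScope 2 (ploidy 2) to estimate genome size and heterozygosity: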
shell:
    """
    /opt/genomescope2.0/genomescope.R -i {input} \
    -o {params.outdir} -n {params.name_prefix} \
    -k {params.k} -p 2 \
    > {log} 2>&1
    """
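
Compare the read k-mers against the assembly with KAT comp: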
shell:
    """
    kat comp -t {threads} \
    -o {params.outprefix} \
    '{input[0]} {input[1]}' \
    {input[2]} \
    > {log} 2>&1
    """
shell:
    """
    kat plot spectra-cn \
    -o {output} \
    -t {params.title} \
    {input} \
    > {log} 2>&1
    """
shell:
    """
    bwa-mem2 index {input} > {log} 2>&1
    """
shell:
    """
    bwa-mem2 mem -t {threads} {input.fa} \
    {input.fastqs} 2> {log} | \
    samtools sort -@ {threads} -o {output[0]} -
    samtools index {output[0]}
    """
shell:
    """
    samtools stats -@ {threads} {input} > {output[0]}
    """
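
Compute coverage with mosdepth, writing D4 output: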
shell:
    """
    mosdepth \
    -t {threads} \
    --d4 \
    --mapq {params.min_mapq} \
    {params.prefix} \
    {input} \
    > {log} 2>&1
    """
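
Mask the regions listed in the BED file with bedtools maskfasta: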
shell:
    """
    zcat {input[0]} | \
    bedtools maskfasta -fi /dev/stdin -bed {input[1]} -fo /dev/stdout | \
    gzip -c > {output}
    """
shell:
    """
    seqkit rmdup -s -D {params} -i \
    -o {output} {input}
    """
shell:
    """
    seqkit replace -p .+ \
    -r "{params.scaff_prefix}_s{{nr}}" --nr-width {params.nr_width} \
    -o {output} {input}
    """
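
Tag the 14 largest sequences as chromosomes in the FASTA headers (NCBI [location=chromosome] syntax), leaving the remaining scaffolds unchanged: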
shell:
    """
    seqkit head -n 14 {input} | \
    seqkit replace -p "(.+)" -r "\$1 [location=chromosome] [chromosome=LG{{nr}}]" \
    --nr-width 2 \
    -o {output}

    seqkit fx2tab {input} | tail -n +15 | seqkit tab2fx | 
    gzip -c >> {output}
    """
shell:
    """
    paste \
    <(seqkit seq --name --only-id {input[0]}) \
    <(seqkit seq --name --only-id {input[1]}) \
    > {output}
    """
shell:
    """
    awk -F'\\t' 'NR==FNR{{a[$1]=$2; next}}{{id=$1; if(id in a) gsub(id,a[id])}} 1' \
    {input[0]} <(zcat {input[1]}) | \
    gzip -c > {output}
    """
shell:
    """
    cp {input[0]} {output[0]}
    cp {input[1]} {output[1]}
    """
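
Remove duplicate read pairs with nubeam-dedup: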
shell:
    """
    nubeam-dedup -i1 {input.fq1} -i2 {input.fq2} \
    -o1 {output[0]} -o2 {output[1]} \
    > {log} 2>&1
    """
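
Extract the 10X barcodes from the raw reads with proc10xG's process_10xReads.py: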
shell:
    """
    /opt/proc10xG/process_10xReads.py \
    -o {params.out_prefix} \
    -1 {input[0]} -2 {input[1]} \
    > {log} 2>&1
    """
script:
    "../scripts/filter_barcodes.R"
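
Clean the reads with fastp (base correction, poly-G trimming, overrepresentation analysis; length filtering disabled):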
shell:
    """
    fastp -i {input[0]} -I {input[1]} \
    -o {output[0]} -O {output[1]} \
    --disable_length_filtering \
    --correction \
    --trim_poly_g \
    --overrepresentation_analysis \
    --json {params.report}.json \
    --html {params.report}.html \
    -w {threads} \
    > {log} 2>&1
    """
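
Keep only the read pairs whose barcodes passed filtering, with filter_10xReads.py: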
shell:
    """
    /opt/proc10xG/filter_10xReads.py \
    -L {input.barcodes} \
    -1 {input[0]} -2 {input[1]} \
    -o {params.out_prefix} \
    > {log} 2>&1
    """
shell:
    """
    /opt/proc10xG/regen_10xReads.py \
    -1 {input[0]} -2 {input[1]} \
    -o {params.out_prefix} \
    > {log} 2>&1
    """
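
Build a Long Ranger reference from the assembly: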
shell:
    """
    zcat {input.fa} > {params.tmp_fa}
    longranger mkref {params.tmp_fa} > {log} 2>&1
    mv refdata-{wildcards.sample}_v1.pseudohap {output[0]}
    """
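
Align the linked reads with Long Ranger, then move the run directory into place: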
shell:
    """
    longranger align \
    --id={params.run_id} \
    --fastqs={params.input_dir} \
    --sample={params.sample} \
    --reference={input[2]} \
    --localcores={threads} \
    --localmem={params.mem} \
    > {log} 2>&1

    rm -r {output[0]} \
    && cp -al {params.run_id} {output[0]} \
    && rm -r {params.run_id}
    """
shell:
    "samtools sort -@ {threads} -n -O BAM -o {output} {input} > {log} 2>&1"
shell:
    """
    cd {params.workdir}
    ngscstat {params.input} 2> ../../../{log}
    """
shell:
    """
    calcuts {input[0]} > {output} 2> {log}
    """
shell:
    "/opt/purge_dups/scripts/hist_plot.py "
    "-c {input[0]} "
    "{input[1]} "
    "{output}"
shell:
    "split_fa {input} > {output} 2> {log}"
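
Self-align the split assembly with minimap2 for purge_dups: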
shell:
    "(minimap2 -t {threads} "
    "-xasm5 -DP {input} {input} "
    "| gzip -c - > {output}) 2> {log}"
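
Run purge_dups with the computed cutoffs, base coverage, and self-alignments: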
shell:
    "purge_dups -2 -M{params.M} -E{params.E} -T {input.cutoffs} "
    "-c {input.basecov} {input.selfmap} > {output} "
    "2> {log}"
shell:
    """
    get_seqs {input.bed} {input.fa} \
    -p {params.prefix}
    gzip {params.prefix}.*.fa
    """
shell:
    "cp {input} {output}"
shell:
    "cp {input} {output}"
shell:
    "zcat {input} > {output}"
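
Build the RepeatModeler database from the assembly: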
shell:
    """
    export LC_ALL=C
    BuildDatabase -name {params.db_name} {input} \
    > {log} 2>&1
    """
shell:
    """
    export LC_ALL=C
    RepeatModeler -database {params.db_name} -LTRStruct -pa {params.pa} \
    > {log} 2>&1
    """
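
Concatenate the repeat libraries and reduce redundancy with CD-HIT-EST: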
shell:
    """
    cat {input} > {params.tmp_merge}

    cd-hit-est -aS 0.8 -c 0.8 -g 1 -G 0 -A 80 -M 10000 \
    -T {threads} \
    -i {params.tmp_merge} \
    -o {output} \
    > {log} 2>&1

    rm {params.tmp_merge}
    """
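
Soft-mask the assembly with RepeatMasker, using the merged repeat library: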
shell:
    """
    export LC_ALL=C
    RepeatMasker -dir {params.out_dir} \
    -lib {input.families} \
    -xsmall -gff -pa {params.pa} \
    {input.fa} \
    > {log} 2>&1
    """
shell:
    """
    zcat {input} | \
    assembly_stats /dev/stdin \
    > {output}
    """
script:
    "../scripts/merge_stats.py"
shell:
    """
    d4tools stat -s hist {input} > {output}
    """
script:
    "../scripts/plot_coverage_hist.R"
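
Run the Supernova assembler from the local tmp directory, first removing any incomplete previous run (one lacking its .mri.tgz):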
shell:
    """
    cd tmp
    [ ! -e {params.run_id}/{params.run_id}.mri.tgz ] && rm -r {params.run_id}
    supernova run \
    --id={params.run_id} \
    --fastqs=../{params.input_dir} \
    --sample={params.sample} \
    --maxreads='all' \
    --localcores={threads} \
    --localmem={params.mem} \
    --accept-extreme-coverage \
    > ../{log} 2>&1;
    """
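
Export the assembly FASTA with supernova mkoutput: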
shell:
    """
    supernova mkoutput \
    --style={params.style} \
    --asmdir={params.asm_dir} \
    --outprefix={params.outprefix} \
    --headers=full \
    > {log} 2>&1
    """
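
Archive the raw Supernova assembly directory with zstd and move the run off the local disk: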
shell:
    """
    tar -cf - -C {params.input_dir}/outs assembly | zstdmt -T{threads} > {params.tmp_archive} 2> {log} && \
    rm -r {params.input_dir}/outs/assembly && \
    cp -r {params.input_dir} {params.output_dir} && rm -r {params.input_dir}
    """
shell:
    "bash workflow/scripts/collect_assembly_stats.sh"
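
The script workflow/scripts/btk_conta_extraction.py flags bacterial, viral, and non-mollusc eukaryotic scaffolds as contaminants from the BlobDir JSON tables: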
import json
import pandas as pd
import numpy as np

files = {
    'id': 'identifiers.json',
    'gc': 'gc.json',
    'cov': 'GM_cov.json',
    'length': 'length.json',
    'Ncount': 'ncount.json',
    'superkingdom': 'bestsumorder_superkingdom.json',
    'phylum': 'bestsumorder_phylum.json'
}

blobdir = snakemake.params.blobdir

df = pd.DataFrame()
for var, file in files.items():
    with open(f'{blobdir}/{file}') as fr:
        data = json.load(fr)
        if data['keys'] == []:
            data = data['values']
        else:
            data = [data['keys'][x] for x in data['values']]
        df[var] = data

with open(f'{blobdir}/bestsumorder_positions.json') as fr:
    pos = json.load(fr)

df['besthit_length'] = [(x[0][2] - x[0][1] + 1) if x!=[] else np.nan for x in pos['values']]
df['taxid'] = [str(x[0][0]) if x!=[] else np.nan for x in pos['values']]
df['besthit_perc'] = df.besthit_length/df.length
df['N_perc'] = df.Ncount/df.length

with open(snakemake.params.mollusca_taxids) as fr:
    mollusca_taxids = set(int(x.strip()) for x in fr.readlines())
taxids = [[y[0] for y in x] if x!=[] else [] for x in pos['values']]
df['any_Mollusca_hit'] = [(len(set(x) & mollusca_taxids) > 0) for x in taxids]

df_bacteria = df[(df.superkingdom=='Bacteria')]
df_viruses = df[(df.superkingdom=='Viruses') & (df.besthit_perc > 0.1)]
df_eukaryota = df[
    (df.superkingdom=='Eukaryota') & 
    (~df.phylum.isin(['Mollusca', 'no-hit'])) &
    (df.any_Mollusca_hit == False) &
    (df.besthit_perc > 0.1)
]

conta_ids = df_bacteria.id.tolist() \
+ df_viruses.id.tolist() \
+ df_eukaryota.id.tolist()

df_kept = df[~df.id.isin(conta_ids)]

df_kept.to_csv(snakemake.output.kept, index=False)
df_bacteria.to_csv(snakemake.output.bact, index=False)
df_viruses.to_csv(snakemake.output.virus, index=False)
df_eukaryota.to_csv(snakemake.output.euka, index=False)
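
The script workflow/scripts/collect_assembly_stats.sh gathers the per-assembly Supernova summary.csv files into a single table: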
echo "species,version,$(head -n1 results/supernova_assemblies/gallo_v1/outs/summary.csv)" \
> results/supernova_assemblies/supernova_assemblies_stats.csv

for S in 'gallo' 'edu' 'tros'; do
	for V in 'v1' 'v2'; do
		echo "${S},${V},$(tail -n +2 results/supernova_assemblies/${S}_${V}/outs/summary.csv)" \
		>> results/supernova_assemblies/supernova_assemblies_stats.csv
	done
done
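
The script workflow/scripts/filter_barcodes.R picks lower and upper read-count cutoffs per barcode from the empirical histogram and writes the retained barcode list: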
options(warn = -1) # to remove some ggplot warnings

library(ggplot2)
library(stats)
library(scales)

barcode_input <- snakemake@input[["in_barcodes"]]
barcode_output <- snakemake@output[["out_barcodes"]]
fig_out <- snakemake@output[["figure"]]

X <- read.table(barcode_input, header = F)

H <- hist(X$V2, breaks = seq(0, max(X$V2), 1), plot = F)
D <- data.frame(cbind(H$breaks[2:(max(X$V2)+1)], H$counts))
NZ <- D[!(D$X2 %in% 0), ]
TOT <- sum((NZ$X1)*(NZ$X2))
count_empirical <- splinefun(x = NZ$X1, y = NZ$X2)
MIN <- floor(optimize(function(x) count_empirical(x), c(1, 50))$minimum)
MAX <- round((optimize(function(x) count_empirical(x), c(MIN, 500), maximum = T))$maximum)
LOW_REM <- sum((NZ[NZ$X1 < MIN, ]$X1)*(NZ[NZ$X1 < MIN, ]$X2))
UP_REM <- sum((NZ[NZ$X1 > 10*MAX, ]$X1)*(NZ[NZ$X1 > 10*MAX, ]$X2))
NKEPT <- TOT - LOW_REM - UP_REM
PKEPT <- NKEPT/TOT
X_out <- X[X$V2 >= MIN & X$V2 <= 10*MAX, ]

P <- ggplot(NZ, aes(x = X1, y = X2)) + xlab("Read pairs in barcode") + ylab("Count") +
  scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                labels = trans_format("log10", math_format(10^.x))) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                labels = trans_format("log10", math_format(10^.x))) +
  annotate("rect", xmin = MIN, xmax = 10*MAX, ymin = 0, ymax = +Inf, fill = 'green', alpha = 0.2) +
  annotate("rect", xmin = 0, xmax = MIN, ymin = 0, ymax = +Inf, fill = 'red', alpha = 0.2) + 
  annotate("rect", xmin = 10*MAX, xmax = +Inf, ymin = 0, ymax = +Inf, fill='red', alpha = 0.2) + 
  annotate("text", x = MAX, y = 5e4, label = "Maximum") + 
  annotate("text", x = MAX, y = 3e4, label = MAX) +
  annotate("text", x = MAX, y = 1.7e5, label = "Mean") + 
  annotate("text", x = MAX, y = 1e5, label = round(mean(X_out$V2), 1)) +
  annotate("text", x = 3, y = 100, label = "Removed low") + 
  annotate("text", x = MAX, y = 100, label = "Retained") + 
  annotate("text", x = 2000,y = 100, label = "Removed high") +
  annotate("text", x = 3, y = 55, label = LOW_REM) + 
  annotate("text", x = MAX, y = 55, label = NKEPT) + 
  annotate("text", x = 2000, y = 55, label = UP_REM) +
  annotate("text", x = 3, y = 30, label = round(LOW_REM/TOT, 3)) + 
  annotate("text", x = MAX, y = 30, label = round(NKEPT/TOT, 3)) + 
  annotate("text", x = 2000, y = 30, label = round(UP_REM/TOT, 3)) +
  geom_point() + geom_line(color = "blue") + theme_bw() 

ggsave(fig_out, P, width = 8, height = 4)
writeLines(X_out$V1, barcode_output)
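
The script workflow/scripts/merge_stats.py merges the per-assembly assembly_stats JSON files into one CSV table: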
import pandas as pd
import json
import re

out_df = pd.DataFrame()
for stat_file in snakemake.input:
    m = re.search('results/stats/(.+?)_(.+?).stats.json', stat_file)
    sample = m.group(1)
    version = m.group(2)
    with open(stat_file, 'r') as f:
        data = json.loads(f.read())
    data['C'] = data.pop('Contig Stats')
    data['S'] = data.pop('Scaffold Stats')
    tmpdf = pd.json_normalize(data)
    tmpdf.insert(0, "asm", [f"{sample}_{version}"])
    out_df = pd.concat([out_df, tmpdf])

out_df.to_csv(snakemake.output[0], sep=',', index=False)
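
The script workflow/scripts/plot_coverage_hist.R plots the mosdepth coverage histogram with a zoomed panel around the mean: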
library(ggplot2)
library(ggforce)
library(glue)
theme_set(theme_minimal(base_size = 12))

defaultW <- getOption("warn")
options(warn = -1)

# get params
in_hist = snakemake@input[['hist']]
out_pdf = snakemake@output[['pdf']]
sample_name = paste0(snakemake@wildcards[['sample']], "_", snakemake@wildcards[['version']])

df = read.table(in_hist, col.names = c("bin", "count"))[-1,]
df$bin_int = as.numeric(df$bin)
df$bin_int[1001] = 1001

# gross mean
m = sum(df$bin_int*df$count)/sum(df$count)
# take care of too low coverage cases
if (m >= 5){
	max_bin = 2*m
} else {
	max_bin = 5
}

p = ggplot(df, aes(x = bin_int, y = count)) + 
	geom_bar(stat = 'identity') +
	scale_x_continuous(expand = c(0,0)) +
	scale_y_continuous(expand = c(0,0)) +
	labs(
		x = "Coverage",
		y = "Count",
		title = glue("Histogram of coverage for {sample_name} (last bin is >=1000)")
	) +
	theme(
		panel.grid.minor.x = element_blank(),
		panel.grid.major.x = element_blank(),
		plot.margin = margin(10, 25, 10, 25),
		axis.ticks.x = element_line(),
		axis.ticks.length.x = unit(5, "pt"),
		strip.background = element_rect(fill = "grey90", linetype = "blank")
	) +
	facet_zoom(xy = (bin_int > 0 & bin_int <= max_bin), zoom.size = 1, horizontal = F)

ggsave(out_pdf, p, width = 10, height = 8)

options(warn = defaultW)
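
The script workflow/scripts/summarize_buscos.py parses BUSCO short_summary files into a tidy TSV: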
import pathlib
import defopt
import sys
import pandas as pd

def parse_busco(filename):
    """
    Parse one short_summary BUSCO file
    """
    asm = filename.name.split('.')[3]
    db = filename.name.split('.')[2]
    data = {'asm': asm, 'db': db}
    with open(filename, 'r') as fr:
        for line in fr:
            if "Complete and single-copy BUSCOs" in line:
                data['CS'] = int(line.split("\t")[1])
            elif "Complete and duplicated BUSCOs" in line:
                data['CD'] = int(line.split("\t")[1])
            elif "Fragmented BUSCOs" in line:
                data['F'] = int(line.split("\t")[1])
            elif "Missing BUSCOs" in line:
                data['M'] = int(line.split("\t")[1])
    data['C'] = data['CS'] + data['CD']
    data['T'] = data['CS'] + data['CD'] + data['F'] + data['M']
    return data


def format_output(out, datasets):
    df = pd.DataFrame(datasets)
    df = df.melt(id_vars=['asm', 'db'], 
        value_vars=['CS', 'CD', 'F', 'M', 'C', 'T'],
        var_name='cat', value_name='N')
    if out is None:
        df.to_csv(sys.stdout, sep='\t', index=False)
    else:
        df.to_csv(out, sep='\t', index=False)


def main(files: list[pathlib.Path], out: pathlib.Path=None):
    """
    Summarize a list of short summary BUSCO files into a tidy table.

    :param files: BUSCO short_summary files (space separated list)
    :param out: Summary output as tsv, stdout by default
    """
    datasets = []
    for busco_file in files:
        datasets.append(parse_busco(busco_file))

    format_output(out, datasets)        


if __name__ == '__main__':
    defopt.run(main, strict_kwonly=False,
        short={'out': 'o'})
URL: https://github.com/alxsimon/assembly_10x
Name: assembly_10x
Version: v1.0
Copyright: Public Domain
License: None