Bash scripts for running QIIME 1.9 and a Snakemake workflow to automate it using singularity container images.

public public 1yr ago 0 bookmarks
QIIME1-workflow

MOTIVATION: Qiime 1.9 has reached the end of its life ycle and thus no longer maintained by its developers. Moreover, QIIME 1.9 is infamously known to be very difficult to install due its many dependencies. Notwithstanding, I still find OTU picking a lot easier using qiime 1.9 than qiime 2. Moreover, I had developed a set of scripts for pathogen analysis using amplicon sequences by leveraging some outputs generated by QIIME 1.9. Which meant that my pathogen analysis scripts no longer work. To solve these problems, I decided to develop a QIMME 1.9 workflow that uses containers to alleviate the burden of software installation and snakemake for workflow orchestration and reproducibility.

It is currently a work under progress, though. The workflow performs microbiome analysis, pathogen analysis and functional annotation using QIIME 1.9 and PICRUSt2.

Code Snippets

39
40
41
42
43
44
45
46
47
48
shell:
    """
    # Get the stats on the sequences using seqkit
    seqkit stats {input} > temp.txt

     # Sort the sequence statistics
     (sed -n '1p' temp.txt; awk 'NR>1{{print}}' temp.txt | \
       sort -V -k1,1) > {output} \
       && rm temp.txt
    """
64
65
66
shell:
    "fastqc --outdir {params.out_dir}/ "
    "--threads {params.threads} {input.forward} {input.rev}"
80
81
shell:
    "multiqc --interactive -f {params.out_dir} -o {params.out_dir}"
106
107
108
109
110
111
112
shell:
    "trimmomatic PE "
    "-threads {threads} "
    "{input.forward} {input.rev} "
    "{output.r1} {output.r1_unpaired} "
    "{output.r2} {output.r2_unpaired} "
    "{params.trimmer} > {log} 2>&1 "
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
shell:
    """
    # -a TGGAATTCTCGGGTGCCAAGG sequence here is the small RNA 3' adaptor
    # https://support-docs.illumina.com/SHARE/AdapterSeq/Content/SHARE/AdapterSeq/TruSeq/RNA/Small-RNA/TruSeqSmallRNA.htm
    # -A CTGTCTCTTATACAC sequece here is the nextera transposa sequence 
    cutadapt \
          -g '{params.forward_primer}' \
          -G '{params.rev_primer}' \
          -o {output.forward_reads} \
          -p {output.rev_reads} \
          --minimum-length  {params.minimum_length} \
          --quality-cutoff  {params.quality_cutoff} \
         {input.forward_reads} {input.rev_reads} > {log} 2>&1

    """
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
shell:
    """ 
     trim_galore \
       -o {params.out_dir} \
       --fastqc  \
       --paired {input.forward} {input.rev} > {log} 2>&1

     #Rename the files
     #Fastq files
    mv {params.out_dir}/{wildcards.sample}_R1_val_1.fq.gz {params.out_dir}/{wildcards.sample}_R1.fastq.gz
    mv {params.out_dir}/{wildcards.sample}_R2_val_2.fq.gz {params.out_dir}/{wildcards.sample}_R2.fastq.gz
     #HTML files
    mv {params.out_dir}/{wildcards.sample}_R1_val_1_fastqc.html {params.out_dir}/{wildcards.sample}_R1_fastqc.html
    mv {params.out_dir}/{wildcards.sample}_R2_val_2_fastqc.html {params.out_dir}/{wildcards.sample}_R2_fastqc.html
     #Zip files
    mv {params.out_dir}/{wildcards.sample}_R1_val_1_fastqc.zip {params.out_dir}/{wildcards.sample}_R1_fastqc.zip
    mv {params.out_dir}/{wildcards.sample}_R2_val_2_fastqc.zip {params.out_dir}/{wildcards.sample}_R2_fastqc.zip
    """
195
196
197
198
199
200
201
shell:
    """
      multiqc \
          --interactive \
          -f {params.in_dir} \
          -o {params.out_dir}  > {log} 2>&1
    """
210
211
212
213
214
215
216
217
218
219
shell:
    """
    # Get the stats on the sequences using seqkit
    seqkit stats {input} > temp.txt

     # Sort the sequence statistics
     (sed -n '1p' temp.txt; awk 'NR>1{{print}}' temp.txt | \
       sort -V -k1,1) > {output} \
       && rm temp.txt
    """
228
229
shell:
    "cat {input} > {output}"
SnakeMake From line 228 of main/Snakefile
239
240
241
242
243
244
shell:
    """
    validate_mapping_file.py \
       -m {input} \
       -o {params.out_dir}
    """
SnakeMake From line 239 of main/Snakefile
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
shell:
    """    
    [ -d {params.out_dir} ] ||  mkdir -p {params.out_dir}
    # Merge reads then delete unnecessary files
    pear \
        -f {input.forward} \
        -r {input.rev} \
        -j {params.threads} \
        -o {params.out_dir}/{wildcards.sample} \
        -m {params.max} \
        -n {params.min} \
        -t {params.min_trim} > {log} 2>&1

   rm -rf \
        {params.out_dir}/{wildcards.sample}.discarded.fastq \
        {params.out_dir}/{wildcards.sample}.unassembled.forward.fastq \
        {params.out_dir}/{wildcards.sample}.unassembled.reverse.fastq 
        mv {params.out_dir}/{wildcards.sample}.assembled.fastq {params.out_dir}/{wildcards.sample}.fastq

   # gzip to save memory

   #gzip {params.out_dir}/{wildcards.sample}.fastq
    """
291
292
293
294
shell:
    """
    seqkit stats {input} > {output} 2> {log}
    """
311
312
313
314
315
316
shell:
    """
    multiple_split_libraries_fastq.py --sampleid_indicator '.fastq' \
         -i {params.reads_dir} -o {params.out_dir} \
         -p {input.parameter_file}
    """
SnakeMake From line 311 of main/Snakefile
338
339
340
341
342
shell:
    """
    # Count all the sequences after split_labrary's quality filtering and concatenation
    count_seqs.py -i {input} > {output}
    """
SnakeMake From line 338 of main/Snakefile
359
360
361
362
363
364
shell:
    """
    # Split the seqs.fna file into per sample fna files 
    # this step is necessary because usearch fails when you have a lot of samples
    split_sequence_file_on_sample_ids.py  -i {input} -o  {params.out_dir}
    """
SnakeMake From line 359 of main/Snakefile
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
shell:
    """

    [ -d {params.out_dir}/{wildcards.sample}/ ] || mkdir -p {params.out_dir}/{wildcards.sample}/
    PROJECT_DIR=`pwd`
    cd {params.out_dir}/{wildcards.sample}/ 

      identify_chimeric_seqs.py \
         -m usearch61 \
         -i ${{PROJECT_DIR}}/{input.fasta} \
         -r {input.reference_database} \
         -o ${{PROJECT_DIR}}/{params.out_dir}/{wildcards.sample}/

    [ -d ${{PROJECT_DIR}}/{params.chimera_dir}/ ] || mkdir -p ${{PROJECT_DIR}}/{params.chimera_dir}/
    # Rename the chimera.txt file so that it will be sample specific
    mv ${{PROJECT_DIR}}/{params.out_dir}/{wildcards.sample}/chimeras.txt \
      ${{PROJECT_DIR}}/{params.chimera_dir}/{wildcards.sample}_chimera.txt && \
    rm -rf  ${{PROJECT_DIR}}/{params.out_dir}/{wildcards.sample}/
    """
SnakeMake From line 378 of main/Snakefile
404
405
shell:
    "cat {input} > {output}"
SnakeMake From line 404 of main/Snakefile
415
416
417
418
419
420
421
422
423
shell:
    """
    # Filter out the chimeric sequences fom seqs.fna
    filter_fasta.py \
       -s {input.chimeras} \
       -f {input.seqs} \
       -o {output} \
       -n
    """
SnakeMake From line 415 of main/Snakefile
431
432
433
434
435
436
437
438
439
shell:
    """
    set +u
    {params.conda_activate}
    set -u

    # Count all the sequences after chimera filtering
    count_seqs.py -i {input} > {output}
    """
SnakeMake From line 431 of main/Snakefile
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
shell:
    """
    #set +u;source activate qiime1;set -u
    if [ {params.method_class} == "denovo" ];then
    pick_de_novo_otus.py \
        -i {input.seqs} \
        -o {params.out_dir} \
        -p {input.parameter_file} \
        -f

   else

    pick_open_reference_otus.py \
        -i {input.seqs} \
        -r {input.reference_database} \
        -o {params.out_dir} \
        -p {input.parameter_file} \
        -m {params.method} \
        --suppress_align_and_tree \
        -f

    fi
    """   
SnakeMake From line 460 of main/Snakefile
495
496
497
498
499
500
501
shell:
    """
    filter_taxa_from_otu_table.py \
        -i {input} \
        -o {output} \
        -n {params.taxa2filter}
    """
SnakeMake From line 495 of main/Snakefile
511
512
513
514
515
516
517
518
519
shell:
    """
    # Filter-out the really rare otus "The recommended procedure is to discard those OTUs with a
    # number of sequences less than 0.005% of the total number of sequences" (Navas-Molina et al, 2013)
    filter_otus_from_otu_table.py \
        -i {input} \
        -o {output} \
        --min_count_fraction {params.min_freq}
    """
SnakeMake From line 511 of main/Snakefile
531
532
533
534
535
536
shell:
    """
    biom summarize-table \
        -i {input} \
        -o {output}
    """
SnakeMake From line 531 of main/Snakefile
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
    shell:
        """
        #  #--tree_fp
        if [ -d {params.out_dir} ]; then rm -rf {params.out_dir}; fi 

        set +u;source activate qiime1;set -u && \
        core_diversity_analyses.py \
		    -i  {input.otu_table} \
		    -o  {params.out_dir} \
		    --mapping_fp {input.mapping_file} \
		    --sampling_depth {params.depth} \
		    --parameter_fp {params.parameter_file} \
		    --categories {params.category} \
            --nonphylogenetic_diversity
        """
SnakeMake From line 552 of main/Snakefile
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
shell:
    """
    {params.conda_activate}
    # Remove the temporary output directory if it already exists
    [ -d picrust2_out_pipeline/ ] && rm -rf picrust2_out_pipeline/

    # ---- Run picrust2 pipeline for function annotation -------- #
    picrust2_pipeline.py \
        -s {input.rep_seqs} \
        -i {input.feature_table} \
        -o picrust2_out_pipeline/ \
        -p {threads} && \
        mv picrust2_out_pipeline/* {params.out_dir}/ && \
        rmdir picrust2_out_pipeline/
    """
622
623
624
625
626
627
628
629
630
631
632
633
634
635
shell:
    """
    {params.conda_activate}
    # ----- Annotate your enzymes, KOs and pathways by adding a description column ------#
    # EC
    add_descriptions.py -i {input.ec} -m EC -o {output.ec}
    # Metacyc Pathway
    add_descriptions.py -i {input.pathway} -m METACYC -o {output.pathway}
    # KO
    add_descriptions.py -i {input.ko} -m KO -o {output.ko} 

    # Unizip the metagenome contribution files - these files describe the micribes contribution the function profiles
    #find {params.outdir} -type f -name "*contrib.tsv.gz" -exec gunzip {{}} \;
    """
SnakeMake From line 622 of main/Snakefile
648
649
650
651
652
shell:
    """
        # Create an empty file
        mkdir -p {params.outdir} && touch {output.ko}
    """
SnakeMake From line 648 of main/Snakefile
667
668
669
670
671
672
673
674
675
676
677
678
679
680
shell:
    """
    # filter out the OTUs of what you are looking for from the given OTU table e.g. human pathogens
    # And convert biom formattted OTU table to tsv 
    filter_taxa_from_otu_table.py \
        -i {input.biom_table} \
        -o {output.biom} -p {input.list2search} && \
    biom convert     \
           -i  {output.biom}   \
           -o  {output.tsv}     \
           --to-tsv     \
           --header-key taxonomy     \
           --output-metadata-id "Consensus Lineage" > {log} 2>&1
    """
SnakeMake From line 667 of main/Snakefile
692
693
694
695
696
697
698
shell:
    """
    # Get the pathogenic OTU names
    OTUS=($(cut -f1 {input.tsv_table} | sed -e 1,2d))
    parallel -j {threads} "grep -wE '{{}}' {input.otu_map}  >> {output}" ::: ${{OTUS[*]}} \
        > {log} 2>&1
    """
SnakeMake From line 692 of main/Snakefile
708
709
710
711
712
713
714
715
shell:
    """
    filter_fasta.py \
        -f {input.seqs} \
        -o {output} \
        -m  {input.otu_map} \
        > {log} 2>&1
    """
SnakeMake From line 708 of main/Snakefile
ShowHide 22 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/olabiyi/qiime1-scripts
Name: qiime1-scripts
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...