Coprolite host Identification pipeline


A fully reproducible pipeline for COPROlite and paleofeces host IDentification

coproID helps you identify the "true maker" of Illumina-sequenced coprolites and paleofeces by checking the microbiome composition and the endogenous DNA.

It combines the analysis of putative host ancient DNA with a machine learning prediction of the feces source based on microbiome taxonomic composition:

  • (A) First, coproID performs a comparative mapping of all reads against two (or three) target genomes (genome1, genome2, and optionally genome3) and computes a host-DNA species ratio (NormalizedRatio)

  • (B) Then coproID performs metagenomic taxonomic profiling and compares the obtained profiles to modern reference metagenomes of the target species. Using machine learning, coproID estimates the host source from the metagenomic taxonomic composition (prop_microbiome).

  • Finally, coproID combines A and B to predict the likely host of the metagenomic sample (see the sketch below).
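
To make the two quantities concrete, here is a minimal, illustrative Python sketch of how a host-DNA ratio and a microbiome-derived proportion could be combined. The normalization by genome size, the function names, and the multiplicative combination are assumptions made for this sketch only; the exact definitions of NormalizedRatio and prop_microbiome are given in the PeerJ article.

import math

# Illustrative sketch only -- not the published coproID formulas.

def normalized_bp(aligned_bases, genome_size):
    """Step (A): host-DNA signal for one target genome, scaled by genome size."""
    return aligned_bases / genome_size

def normalized_ratio(nbp1, nbp2):
    """log2 ratio of the two host-DNA signals: > 0 favors genome1, < 0 favors genome2."""
    return math.log2(nbp1 / nbp2)

def combined_score(nbp1, nbp2, prop_microbiome1):
    """Toy combination of (A) and (B): host-DNA proportion times microbiome proportion."""
    host_dna_prop1 = nbp1 / (nbp1 + nbp2)
    return host_dna_prop1 * prop_microbiome1

# Hypothetical numbers for one sample mapped against human (genome1) and dog (genome2)
nbp_human = normalized_bp(aligned_bases=2_500_000, genome_size=3_100_000_000)
nbp_dog = normalized_bp(aligned_bases=150_000, genome_size=2_400_000_000)
print("NormalizedRatio (log2):", round(normalized_ratio(nbp_human, nbp_dog), 2))
print("Combined human score:", round(combined_score(nbp_human, nbp_dog, prop_microbiome1=0.8), 2))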

The coproID pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker containers, making installation trivial and results highly reproducible.

A detailed description of coproID can be found in the article published in PeerJ.

Quick Start

i. Install nextflow

ii. Install either Docker or Singularity for full pipeline reproducibility (please only use Conda as a last resort; see docs)

iii. Download the pipeline and test it on a minimal dataset with a single command

nextflow run nf-core/coproid -profile test,<docker/singularity/conda/institute>

Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your institute. If so, you can simply use -profile institute in your command. This will enable either Docker or Singularity and set the appropriate execution settings for your local compute environment.

iv. Start running your own analysis!

nextflow run maxibor/coproid --genome1 'GRCh37' --genome2 'CanFam3.1' --name1 'Homo_sapiens' --name2 'Canis_familiaris' --reads '*_R{1,2}.fastq.gz' --krakendb 'path/to/minikraken_db' -profile docker

This command runs coproID to estimate whether the test samples (--reads '*_R{1,2}.fastq.gz') come from a human (--genome1 'GRCh37' --name1 'Homo_sapiens') or a dog (--genome2 'CanFam3.1' --name2 'Canis_familiaris'), and specifies the path to the minikraken database (--krakendb 'path/to/minikraken_db').

NB: The example above assumes access to iGenomes.

See usage docs for all of the available options when running the pipeline.

Documentation

The nf-core/coproid pipeline comes with documentation about the pipeline, found in the docs/ directory and at the following address: coproid.readthedocs.io

  1. Installation

  2. Pipeline configuration

  3. Running the pipeline

  4. Output and how to interpret the results

  5. Troubleshooting

Credits

nf-core/coproid was written by Maxime Borry.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on Slack (you can join with this invite).

Citing

coproID has been published in PeerJ. The BibTeX citation is available below:

@article{borry_coproid_2020,
 title = {{CoproID} predicts the source of coprolites and paleofeces using microbiome composition and host {DNA} content},
 volume = {8},
 issn = {2167-8359},
 url = {https://peerj.com/articles/9001},
 doi = {10.7717/peerj.9001},
 language = {en},
 urldate = {2020-04-20},
 journal = {PeerJ},
 author = {Borry, Maxime and Cordova, Bryan and Perri, Angela and Wibowo, Marsha and Honap, Tanvi Prasad and Ko, Jada and Yu, Jie and Britton, Kate and Girdland-Flink, Linus and Power, Robert C. and Stuijts, Ingelise and Salazar-García, Domingo C. and Hofman, Courtney and Hagan, Richard and Kagoné, Thérèse Samdapawindé and Meda, Nicolas and Carabin, Helene and Jacobson, David and Reinhard, Karl and Lewis, Cecil and Kostic, Aleksandar and Jeong, Choongwon and Herbig, Alexander and Hübner, Alexander and Warinner, Christina},
 month = apr,
 year = {2020},
 note = {Publisher: PeerJ Inc.},
 pages = {e9001}
}

Contributors

James A. Fellows Yates

Code Snippets

"""
tar xvzf $ckdb
"""
NextFlow From line 328 of master/main.nf
"""
fastqc -q $reads
"""
"""
mv $genome $outname
"""
NextFlow From line 429 of master/main.nf
"""
mv $genome $outname
"""
NextFlow From line 442 of master/main.nf
"""
mv $genome $outname
"""
NextFlow From line 456 of master/main.nf
"""
AdapterRemoval --basename $name \\
               --file1 ${reads[0]} \\
               --file2 ${reads[1]} \\
               --trimns \\
               --trimqualities \\
               --collapse \\
               --minquality 20 \\
               --minlength 30 \\
               --output1 $out1 \\
               --output2 $out2 \\
               --outputcollapsed $col_out \\
               --threads ${task.cpus} \\
               --qualitybase ${params.phred} \\
               --settings $settings
"""
"""
AdapterRemoval --basename $name \\
               --file1 ${reads[0]} \\
               --file2 ${reads[1]} \\
               --trimns \\
               --trimqualities \\
               --minquality 20 \\
               --minlength 30 \\
               --output1 $out1 \\
               --output2 $out2 \\
               --threads ${task.cpus} \\
               --qualitybase ${params.phred} \\
               --settings $settings
"""
"""
AdapterRemoval --basename $name \\
               --file1 ${reads[0]} \\
               --trimns \\
               --trimqualities \\
               --minquality 20 \\
               --minlength 30 \\
               --output1 $se_out \\
               --threads ${task.cpus} \\
               --qualitybase ${params.phred} \\
               --settings $settings
"""
"""
bowtie2-build $fasta ${bt1_index}
"""
"""
bowtie2 -x $bt1_index -U ${reads[0]} $bowtie_setting --threads ${task.cpus} > $samfile 2> $fstat
samtools view -S -b -F 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile
samtools view -S -b -f 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile_unalign
"""
"""
bowtie2 -x $bt1_index -1 ${reads[0]} -2 ${reads[1]} $bowtie_setting --threads ${task.cpus} > $samfile 2> $fstat
samtools view -S -b -F 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile
samtools view -S -b -f 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile_unalign
"""
"""
samtools fastq -1 $out1 -2 $out2 -0 /dev/null -s /dev/null -n -F 0x900 $bam
"""
"""
samtools fastq $bam > $out
"""
"""
bowtie2-build $fasta ${bt2_index}
"""
"""
bowtie2-build $fasta ${bt3_index}
"""
"""
bowtie2 -x $bt2_index -U ${reads[0]} $bowtie_setting --threads ${task.cpus} > $samfile 2> $fstat
samtools view -S -b -F 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile
samtools view -S -b -f 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile_unalign
"""
"""
bowtie2 -x $bt2_index -1 ${reads[0]} -2 ${reads[1]} $bowtie_setting --threads ${task.cpus} > $samfile 2> $fstat
samtools view -S -b -F 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile
samtools view -S -b -f 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile_unalign
"""
"""
bowtie2 -x $bt3_index -U ${reads[0]} $bowtie_setting --threads ${task.cpus} > $samfile 2> $fstat
samtools view -S -b -F 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile
samtools view -S -b -f 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile_unalign
"""
"""
bowtie2 -x $bt3_index -1 ${reads[0]} -2 ${reads[1]} $bowtie_setting --threads ${task.cpus} > $samfile 2> $fstat
samtools view -S -b -F 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile
samtools view -S -b -f 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile_unalign
"""
"""
samtools view -h -F 4 $bam1 | pmdtools -t ${params.pmdscore} --header $library | samtools view -Sb - > $outfile
"""
"""
samtools view -h -F 4 $bam2 | pmdtools -t ${params.pmdscore} --header $library | samtools view -Sb - > $outfile
"""
"""
samtools view -h -F 4 $bam3 | pmdtools -t ${params.pmdscore} --header $library | samtools view -Sb - > $outfile
"""
"""
kraken2 --db ${krakendb} \\
        --threads ${task.cpus} \\
        --output $out \\
        --report $kreport \\
        --paired ${reads[0]} ${reads[1]}
"""    
"""
kraken2 --db ${krakendb} \\
        --threads ${task.cpus} \\
        --output $out \\
        --report $kreport ${reads[0]}
"""
"""
kraken_parse.py -c ${params.minKraken} $kraken_r
"""    
NextFlow From line 844 of master/main.nf
"""
merge_kraken_res.py -o $out
"""    
NextFlow From line 861 of master/main.nf
"""
sourcepredict -di ${params.sp_dim} \\
              -kne ${params.sp_neighbors} \\
              -me ${params.sp_embed} \\
              -n ${params.sp_norm} \\
              -l ${sp_labels} \\
              -s ${sp_sources} \\
              -t ${task.cpus} \\
              -o $outfile \\
              -e $embed_out $otu_table 
"""
"""
samtools index $bam1
samtools index $bam2
samtools index $abam1
samtools index $abam2
normalizedReadCount -n $name \\
                    -b1 $bam1 \\
                    -ab1 $abam1 \\
                    -b2 $bam2 \\
                    -ab2 $abam2 \\
                    -g1 $genome1 \\
                    -g2 $genome2 \\
                    -r1 $organame1 \\
                    -r2 $organame2 \\
                    -i ${params.identity} \\
                    -o $outfile \\
                    -ob1 $obam1 \\
                    -aob1 $aobam1 \\
                    -ob2 $obam2 \\
                    -aob2 $aobam2 \\
                    -ed1 ${params.endo1} \\
                    -ed2 ${params.endo2} \\
                    -p ${task.cpus}
"""
"""
samtools index $bam1
samtools index $bam2
normalizedReadCount -n $name \\
                    -b1 $bam1 \\
                    -b2 $bam2 \\
                    -g1 $genome1 \\
                    -g2 $genome2 \\
                    -r1 $organame1 \\
                    -r2 $organame2 \\
                    -i ${params.identity} \\
                    -o $outfile \\
                    -ob1 $obam1 \\
                    -ob2 $obam2 \\
                    -ed1 ${params.endo1} \\
                    -ed2 ${params.endo2} \\
                    -p ${task.cpus}
"""
"""
samtools index $bam1
samtools index $bam2
samtools index $bam3
samtools index $abam1
samtools index $abam2
samtools index $abam3
normalizedReadCount -n $name \\
                    -b1 $bam1 \\
                    -ab1 $abam1 \\
                    -b2 $bam2 \\
                    -ab2 $abam2 \\
                    -b3 $bam3 \\
                    -ab3 $abam3 \\
                    -g1 $genome1 \\
                    -g2 $genome2 \\
                    -g3 $genome3 \\
                    -r1 $organame1 \\
                    -r2 $organame2 \\
                    -r3 $organame3 \\
                    -i ${params.identity} \\
                    -o $outfile \\
                    -ob1 $obam1 \\
                    -aob1 $aobam1 \\
                    -ob2 $obam2 \\
                    -aob2 $aobam2 \\
                    -ob3 $obam3 \\
                    -aob3 $aobam3 \\
                    -ed1 ${params.endo1} \\
                    -ed2 ${params.endo2} \\
                    -ed3 ${params.endo3} \\
                    -p ${task.cpus}
"""
"""
samtools index $bam1
samtools index $bam2
samtools index $bam3
normalizedReadCount -n $name \\
                    -b1 $bam1 \\
                    -b2 $bam2 \\
                    -b3 $bam3 \\
                    -g1 $genome1 \\
                    -g2 $genome2 \\
                    -g3 $genome3 \\
                    -r1 $organame1 \\
                    -r2 $organame2 \\
                    -r3 $organame3 \\
                    -i ${params.identity} \\
                    -o $outfile \\
                    -ob1 $obam1 \\
                    -ob2 $obam2 \\
                    -ob3 $obam3 \\
                    -ed1 ${params.endo1} \\
                    -ed2 ${params.endo2} \\
                    -ed3 ${params.endo3} \\
                    -p ${task.cpus}
"""
"""
mv $align $bam_name
damageprofiler -i $bam_name -r $fasta -o tmp
mv tmp/${smp_name}/5pCtoT_freq.txt $fwd_name
mv tmp/${smp_name}/3pGtoA_freq.txt $rev_name
mv tmp/${smp_name}/dmgprof.json ${smp_name}.dmgprof.json
"""
NextFlow From line 1082 of master/main.nf
"""
mv $align $bam_name
damageprofiler -i $bam_name -r $fasta -o tmp
mv tmp/${smp_name}/5pCtoT_freq.txt $fwd_name
mv tmp/${smp_name}/3pGtoA_freq.txt $rev_name
mv tmp/${smp_name}/dmgprof.json ${smp_name}.dmgprof.json
"""
NextFlow From line 1107 of master/main.nf
"""
mv $align $bam_name
damageprofiler -i $bam_name -r $fasta -o tmp
mv tmp/${smp_name}/5pCtoT_freq.txt $fwd_name
mv tmp/${smp_name}/3pGtoA_freq.txt $rev_name
mv tmp/${smp_name}/dmgprof.json ${smp_name}.dmgprof.json
"""
NextFlow From line 1133 of master/main.nf
"""
ls -1 *.bpc.csv | head -1 | xargs head -1 > coproID_bp.csv
tail -q -n +2 *.bpc.csv >> coproID_bp.csv
merge_bp_sp.py -c coproID_bp.csv -s $sp -o $outfile
"""
NextFlow From line 1159 of master/main.nf
"""
echo ${workflow.manifest.version} > version.txt
jupyter nbconvert \\
        --TagRemovePreprocessor.remove_input_tags='{"remove_cell"}' \\
        --TagRemovePreprocessor.remove_all_outputs_tags='{"remove_output"}' \\
        --TemplateExporter.exclude_input_prompt=True \\
        --TemplateExporter.exclude_output_prompt=True \\
        --ExecutePreprocessor.timeout=200 \\
        --execute \\
        --to html_embed $report
"""
NextFlow From line 1184 of master/main.nf
"""
echo ${workflow.manifest.version} > version.txt
jupyter nbconvert \\
        --TagRemovePreprocessor.remove_input_tags='{"remove_cell"}' \\
        --TagRemovePreprocessor.remove_all_outputs_tags='{"remove_output"}' \\
        --TemplateExporter.exclude_input_prompt=True \\
        --TemplateExporter.exclude_output_prompt=True \\
        --ExecutePreprocessor.timeout=200 \\
        --execute \\
        --to html_embed $report
"""
NextFlow From line 1211 of master/main.nf
"""
echo ${workflow.manifest.version} > version.txt
jupyter nbconvert \\
        --TagRemovePreprocessor.remove_input_tags='{"remove_cell"}' \\
        --TagRemovePreprocessor.remove_all_outputs_tags='{"remove_output"}' \\
        --TemplateExporter.exclude_input_prompt=True \\
        --TemplateExporter.exclude_output_prompt=True \\
        --ExecutePreprocessor.timeout=200 \\
        --execute \\
        --to html_embed $report
"""
NextFlow From line 1236 of master/main.nf
"""
echo $workflow.manifest.version > v_pipeline.txt
echo $workflow.nextflow.version > v_nextflow.txt
fastqc --version > v_fastqc.txt
multiqc --version > v_multiqc.txt
sourcepredict -h  > v_sourcepredict.txt
samtools --version > v_samtools.txt
kraken2 --version > v_kraken2.txt
bowtie2 --version > v_bowtie2.txt
python --version > v_python.txt
AdapterRemoval --version 2> v_adapterremoval.txt
scrape_software_versions.py &> software_versions_mqc.yaml
"""
"""
multiqc -f -d adapter_removal alignment fastqc DamageProfiler software_versions software_versions -c $multiqc_conf
"""
"""
markdown_to_html.py $output_docs -o results_description.html
"""
NextFlow From line 1337 of master/main.nf