Exploring the panregulome of grasses with model species Brachypodium distachyon
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
a Snakemake workflow for exploring the panregulome of grasses with model species Brachypodium distachyon
REQUIREMENTS
Snakemake
assembly-stats
BUSCO
R
get_homologues
mafft
Installation
From GitHub
git clone https://github.com/ammarabdalrahem/panregulome-analysis.git
Installing Snakemake
pip3 install snakemake
Note
input your username and password in SnakeFile to login JGI
Usage
snakemake --cores all
WORKFLOW
Rule: all
The main rule that generates all the output files.
Rule: make_directories
Creates the necessary output directories.
Rule: obtain_data
Downloads genome sequence data using the provided script and file list.
Rule: uncompressed_files
Unzips and organizes the downloaded data.
Rule: assembly_assessment
Performs statistical analysis on the assembly files and creates an assessment table.
Rule: BUSCO_run
Runs BUSCO analysis on each assembly file and generates short summary files.
Rule: quality_table
Extracts the BUSCO completeness scores from the summary files and creates a quality table.
Rule: assemblies_evaluation
Generates a boxplot to visualize the quality of the assemblies.
Rule: excluded_poor_quality
Identifies and excludes poor quality samples based on the boxplot results.
Rule: repeat_masking_annotation
Downloads and installs the necessary tools for repeat masking and annotation.
Rule: repeats_length
Computes the total length of repeats for each ecotype and merges it with the total genome length.
Rule: repeats_visualization
Generates a plot to visualize the distribution of repeat lengths.
Rule: annotation_analysis
Performs TE annotation analysis by extracting TE families and their lengths from the annotation files.
Rule: annotation_visualization
Generates a plot to visualize the distribution of TE orders.
Rule: promoters_extraction
Extracts proximal promoter sequences from the genome files.
Rule: get_homologues
Downloads and installs the get_homologues tool.
Rule: gene_clustering
Performs gene clustering using the get_homologues tool.
Rule: header_extraction
Extracts the headers of the gene clusters.
Rule: promoter_clustering
Performs clustering of the promoter sequences.
Rule: global_alignment
Performs global sequence alignment on the gene clusters and promoter sequences.
Rule: local_alignment
Performs local sequence alignment on the gene clusters and promoter sequences.
Rule: trimal_alignment
This rule trims the alignment files using the Trimal software.
Code Snippets
33 34 35 36 | shell: """ mkdir -p tables out figures """ |
50 51 52 53 54 55 | shell: """ cd {output.data} while read file; do bash ../../{input.script} {username} {password} $file; done < ../../{input.ecotypes} """ |
61 62 63 64 65 66 67 68 69 70 71 72 73 74 | shell: """ for file in $(ls {input.data}*gz |rev|cut -d "." -f 2- |rev|cut -d "/" -f 3-); do gunzip -k {input.data}$file.gz ; mv {input.data}$file {output.uncompressed_data} ; done cd out/uncom_data/ mv BdistachyonBd21v2_1_283_v2.0.fa BdistachyonBd21v2_283_v2.fa mv BdistachyonBd21v2_1_283_Bd21v2.1.gene.gff3 BdistachyonBd21v2_283_v2.Bd21.1.gene.gff3 mv BdistachyonBd21v2_1_283_Bd21v2.1.cds_primaryTranscriptOnly.fa BdistachyonBd21v2_283_v2.Bd21.1.cds_primaryTranscriptOnly.fa for f in *Only.fa ; do fnew=`echo $f|cut --complement -d '.' -f 2,3`; mv $f $fnew ; done for f in *.gff3 ; do fnew=`echo $f|cut --complement -d '.' -f 2-4`; mv $f $fnew ; done mkdir genomes mkdir cds mv ./*Only.fa cds/ mv ./*.* genomes/ """ |
86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 | shell: """ #analysis stats for assembly files #assembly-stats get ssembly statistics from FASTA and FASTQ files. assembly-stats -u {input.data}/*_{{v1.fa,v2.fa}}|cut -f 1-2,7,8> stats.csv #convert output to CSV file cat stats.csv |tr "\\t" "," > stats_table.csv #loop for count and cut genes numbers for i in {input.data}/*.gff3; do sed -n '3p' $i | cut -d ' ' -f 3,4 && cat $i| grep -v "#" | cut -f3|sort | uniq -c|grep "gene" |grep -o '[[:digit:]]*' done | paste -sd "\t\n" > genes.csv #cut only gene numbers cut -d" " -f2 genes.csv |awk '{{print $2}}' > genes_table.csv #join two tables paste -d , stats_table.csv genes_table.csv|perl -l -F/ -e 'print "$F[3]"' > {output.asses_table} #clear data mv genes_table.csv genes.csv stats_table.csv stats.csv tables/ """ |
114 115 116 117 118 119 120 121 122 123 124 125 126 | shell: """ #BUSCO v5.2.2 # Perform BUSCO analysis on each assembly file ls -1 {input.data}*_{{v1.fa,v2.fa}}| sort -u | while read i do NAME_base=$(basename $i) echo "$NAME_base" busco -o ${{NAME_base}} -i $i -l poales -m genome --out_path {output.BUSCO_run} -c 20 -f done """ |
136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 | shell: """ # extract the busco score of completeness and create a column ls -1 {input.data}*_{{v1.fa,v2.fa}}/short_summary.*|sort -u| while read i do cat $i |sed -n '3p'| rev |cut -d "/" -f 1|rev|tr -s " " && cat $i| perl -ne 'if(/C:([^\[]+)/){{ print $1; }}'| awk '{{print $0","}}' done |paste -sd "\t\n"|sort -u| awk '{{print $2}}'|sed 's/.$//' |sed 's/.$//'> tables/busco_table.csv #create a table for busco column and stats result paste -d , {input.asses_table} tables/busco_table.csv > tables/final_table.csv #creat headers sed -i -e '1i Name,total_length,Ns_count,Gaps,N_genes,BUSCO_complete %' tables/final_table.csv #clear row names cat tables/final_table.csv | awk '{{gsub(/Bdistachyon/,"",$1)}}1' | awk '{{gsub(/_v1.fa/,"",$1)}}1'| awk '{{gsub(/_v2.fa/,"",$1)}}1'> {output.quality_table} rm tables/busco_table.csv tables/final_table.csv """ |
165 166 | shell: "Rscript {input.script}" |
178 179 180 181 182 183 184 185 186 187 188 189 190 191 | shell: """ #save the R output in txt file Rscript {input.script} > r_output.txt #extract the two final line with ecotypes name | cut it | replace "" with * as regx expression for loop cat r_output.txt|tail -n 2|cut -d " " -f 3-15|sed 's/"/*/g' > {output.excluded} #note: outlier more than two line modify it to be universal?? #take a list from outlier of boxplot in R to sure good extraction echo "$(<{output.excluded})" #clear data #move excluded_ecotypes for file in $(echo $(<out/txt/poor_quality.txt) ); do mv out/uncom_data/genomes/$file {output.excluded_dir}; done rm r_output.txt """ |
203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 | shell: """ #konw your g++ version #g++ -v (for me is 9) and modify it in makefile # creat masking_repeats directory mkdir -p src/scripts/masking_repeats #move to masking_repeats directory cd src/scripts/masking_repeats # download Scripting analyses of genomes in Ensembl Plants by cloning git clone https://github.com/Ensembl/plant-scripts.git #move plant-scripts directory, that creat after cloning cd plant-scripts #installation steps for Repeat masking and annotation tools make install install_repeats cd repeats # run repeat masking and annotation tools for all fasta files ls -1 ../../../../../out/uncom_data/genomes/*.fa| # move to work directory that contain fasta format sort -u | #sort all files by alphabetical arrangement while read i do NAME_base=$(basename $i) echo "$NAME_base" ./Red2Ensembl.py ../../../../../out/uncom_data/genomes/${{NAME_base}} ${{NAME_base}}_file \ --msk_file ${{NAME_base}}.sm.fna --bed_file ${{NAME_base}}.bed --cor 16 && ./AnnotRedRepeats.py \ ../files/nrTEplantsJune2020.fna ${{NAME_base}}_file --bed_file ${{NAME_base}}.nrTEplants.bed --cor 16 done """ |
246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 | shell: """ #count total length of repeats for each ecotypes ls -1 {input.data}*.fa.bed | sort -u | while read i do cat $i |awk '{{SUM += $3-$2}} END {{print SUM}}' # by subtracting two columns of coordinates for length the summation all for total length done | tr "\\t" "," > repeats_length.csv cat repeats_length.csv #count the total length of genome for each ecotypes assembly-stats -u out/uncom_data/genomes/Bdistachyon*.fa|cut -f 1-2|tr "\\t" ","|awk '{{gsub(/Bdistachyon/,"",$1)}}1' | awk '{{gsub(/_v1.fa/,"",$1)}}1'|cut -d "/" -f 4 > stats_40.csv # join genome total length and total length of repeats in CSV file paste -d , stats_40.csv repeats_length.csv > tables/repeats_analysis.csv #add head sed -i '1i Ecotypes,Genome.size,Repeated' tables/repeats_analysis.csv #clean data rm stats_40.csv repeats_length.csv """ |
273 274 | shell: "Rscript {input.script}" |
283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 | shell: """ #go to bed files directory cd src/scripts/masking_repeats/plant-scripts/repeats/ # cut TE families after # from nrTEplants.bed files and add file -ecotype- name to columns ls -1 *.nrTEplants.bed | sort -u | while read i do perl -lane '$f=(split(/#/,$F[3]))[1];print "$f\t$ARGV"' $i done | tr "\\t" ", " > te_ann_0.csv # make comma as separator #rearrange , put ecotype name as first column cat te_ann_0.csv|awk -F, '{{ print $2","$1 }}' > te_ann_1.csv #filter name of ecotype to be easy to read |and replace / by comma as separator| remove third col (superfamily) cat te_ann_1.csv| awk '{{gsub(/Bdistachyon/,"",$1)}}1' | awk '{{gsub(/_v1.fa.nrTEplants.bed/,"",$1)}}1'|tr '/' ',n'|cut -d, -f3 --complement> te_ann_2.csv #measure the length(bp) for each TE order in nrTEplants.bed files save it as column ls -1 *.nrTEplants.bed | sort -u | while read i do cat $i |awk '{{ print $0, $3 - $2 }}'| cut -d " " -f 2 done | tr "\\t" "," > te_ann_3.csv #merge two tables for final table paste -d , te_ann_2.csv te_ann_3.csv > te_ann.csv #cat te_ann_4.csv|sed 's/^,/NULL,/; :a;s/,,/,NA,/g;ta' > te_ann.csv #add header to table sed -i -e '1i Ecotype,Order,Length' te_ann.csv #reomve files rm te_ann_0.csv rm te_ann_1.csv rm te_ann_2.csv rm te_ann_3.csv #move final file to work directory cp te_ann.csv ./../../../../../tables/ """ |
335 336 | shell: "Rscript {input.script}" |
345 346 | shell: "bash {input.script}" |
356 357 358 359 360 361 362 363 364 | shell: """ cd src/scripts/ git clone https://github.com/eead-csic-compbio/get_homologues.git cd get_homologues ./install.pl cd ../../.. {output.get_homologues} -h """ |
372 373 374 375 376 377 | shell: """ {input.script_get_homologues} -d out/uncom_data/cds -M -t 4 -e -m cluster &> out/logs/log.cds.Pfam cp -r cds_est_homologues/ out/ rm -r cds_est_homologues/ """ |
387 388 389 390 | shell: """ {input.script_header} -i {input.data_homologues} -o {output.cds_ids} """ |
399 400 401 402 | shell: """ {input.script_seq_extraction} -id {input.cds_ids} -f {input.promoter_dir} -o {output.promoter_cluster} """ |
411 412 413 414 415 416 417 418 | shell: """ #cds mafft --globalpair --maxiterate 1000 {input.data_homologues} > out/aln_cds/{wildcards.file} 2> out/logs/aln_cds_log.txt #promoter mafft --globalpair --maxiterate 1000 out/uncom_data/promoter_cluster/{wildcards.file} > out/aln_promoter/{wildcards.file} 2> out/logs/aln_pr_log.txt """ |
432 433 434 435 436 437 438 | shell: """ #cds {params.filter_path} -f {input.cds_align} -o {output.cds_align_filtration}/{wildcards.file} >> tables/global_poor_alignment_cds.csv || true #promoter {params.filter_path} -f {input.promoter_align} -o {output.promoter_align_filtration}/{wildcards.file} >> tables/global_poor_alignment_pr.csv || true """ |
453 454 455 456 457 458 459 460 461 462 463 | shell: """ #cds ./src/scripts/get_homologues/annotate_cluster.pl -f {input.data_homologues} -o {output.cds_local_align} >> out/logs/aln_local_cds_log.txt #promoter #promoter ./src/scripts/get_homologues/annotate_cluster.pl -f {input.promoter_cluster} -o {output.promoter_local_align} >> out/logs/aln_local_pr_log.txt #promoter """ |
478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 | shell: """ #trimal version trimAl 1.2rev59 #cds_global_filterated {params.trimal_path} -in {input.cds_align_filtration} -out {output.trimal_cds_global_align} \ -automated1 >> out/logs/trimal_cds_log.txt #promoter_global_filterated {params.trimal_path} -in {input.promoter_align_filtration} -out {output.trimal_pr_global_align} \ -automated1 >> out/logs/trimal_pr_log.txt #cds_local {params.trimal_path} -in {input.cds_local_align} -out {output.trimal_cds_local_align} \ -automated1 >> out/logs/trimal_cds_local_log.txt #promoter_local {params.trimal_path} -in {input.promoter_local_align} -out {output.trimal_pr_local_align} \ -automated1 >> out/logs/trimal_pr_local_log.txt """ |
Support
- Future updates