Exploring the panregulome of grasses with model species Brachypodium distachyon

public public 1yr ago 0 bookmarks

a Snakemake workflow for exploring the panregulome of grasses with model species Brachypodium distachyon

REQUIREMENTS

Snakemake
assembly-stats
BUSCO
R
get_homologues
mafft

Installation

From GitHub

git clone https://github.com/ammarabdalrahem/panregulome-analysis.git

Installing Snakemake

pip3 install snakemake

Note

input your username and password in SnakeFile to login JGI

Usage

snakemake --cores all

WORKFLOW

Rule: all

The main rule that generates all the output files.

Rule: make_directories

Creates the necessary output directories.

Rule: obtain_data

Downloads genome sequence data using the provided script and file list.

Rule: uncompressed_files

Unzips and organizes the downloaded data.

Rule: assembly_assessment

Performs statistical analysis on the assembly files and creates an assessment table.

Rule: BUSCO_run

Runs BUSCO analysis on each assembly file and generates short summary files.

Rule: quality_table

Extracts the BUSCO completeness scores from the summary files and creates a quality table.

Rule: assemblies_evaluation

Generates a boxplot to visualize the quality of the assemblies.

Rule: excluded_poor_quality

Identifies and excludes poor quality samples based on the boxplot results.

Rule: repeat_masking_annotation

Downloads and installs the necessary tools for repeat masking and annotation.

Rule: repeats_length

Computes the total length of repeats for each ecotype and merges it with the total genome length.

Rule: repeats_visualization

Generates a plot to visualize the distribution of repeat lengths.

Rule: annotation_analysis

Performs TE annotation analysis by extracting TE families and their lengths from the annotation files.

Rule: annotation_visualization

Generates a plot to visualize the distribution of TE orders.

Rule: promoters_extraction

Extracts proximal promoter sequences from the genome files.

Rule: get_homologues

Downloads and installs the get_homologues tool.

Rule: gene_clustering

Performs gene clustering using the get_homologues tool.

Rule: header_extraction

Extracts the headers of the gene clusters.

Rule: promoter_clustering

Performs clustering of the promoter sequences.

Rule: global_alignment

Performs global sequence alignment on the gene clusters and promoter sequences.

Rule: local_alignment

Performs local sequence alignment on the gene clusters and promoter sequences.

Rule: trimal_alignment

This rule trims the alignment files using the Trimal software.

Code Snippets

33
34
35
36
shell:
    """
    mkdir -p  tables out figures
    """
SnakeMake From line 33 of main/Snakefile
50
51
52
53
54
55
shell:
  """
  cd {output.data}

  while read file; do bash ../../{input.script} {username} {password} $file; done < ../../{input.ecotypes}
  """
SnakeMake From line 50 of main/Snakefile
61
62
63
64
65
66
67
68
69
70
71
72
73
74
shell:
  """
  for file in $(ls {input.data}*gz  |rev|cut -d "." -f 2- |rev|cut -d "/" -f 3-); do gunzip -k {input.data}$file.gz ; mv {input.data}$file {output.uncompressed_data} ; done
  cd out/uncom_data/
  mv BdistachyonBd21v2_1_283_v2.0.fa BdistachyonBd21v2_283_v2.fa
  mv BdistachyonBd21v2_1_283_Bd21v2.1.gene.gff3 BdistachyonBd21v2_283_v2.Bd21.1.gene.gff3
  mv BdistachyonBd21v2_1_283_Bd21v2.1.cds_primaryTranscriptOnly.fa BdistachyonBd21v2_283_v2.Bd21.1.cds_primaryTranscriptOnly.fa
  for f in *Only.fa ; do fnew=`echo $f|cut --complement -d '.' -f 2,3`; mv $f $fnew ; done 
  for f in *.gff3 ; do fnew=`echo $f|cut --complement -d '.' -f 2-4`; mv $f $fnew ; done
  mkdir genomes
  mkdir cds
  mv ./*Only.fa cds/
  mv ./*.*  genomes/
  """
SnakeMake From line 61 of main/Snakefile
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
shell:
  """
  #analysis stats for assembly files
  #assembly-stats get ssembly statistics from FASTA and FASTQ files.
  assembly-stats -u {input.data}/*_{{v1.fa,v2.fa}}|cut -f 1-2,7,8> stats.csv
  #convert output to CSV file
  cat stats.csv |tr "\\t" "," > stats_table.csv
  #loop for count and cut genes numbers
  for i in {input.data}/*.gff3; do
  sed -n '3p' $i | cut -d ' ' -f 3,4 && cat $i| grep -v "#" | cut -f3|sort | uniq -c|grep "gene" |grep  -o '[[:digit:]]*'
  done | paste -sd "\t\n" > genes.csv
  #cut only gene numbers
  cut -d" " -f2 genes.csv |awk '{{print $2}}' > genes_table.csv
  #join two tables
  paste -d , stats_table.csv genes_table.csv|perl -l -F/ -e 'print "$F[3]"'  > {output.asses_table}
  #clear data
  mv genes_table.csv genes.csv stats_table.csv stats.csv tables/
  """
114
115
116
117
118
119
120
121
122
123
124
125
126
shell:
  """
  #BUSCO v5.2.2
  # Perform BUSCO analysis on each assembly file
  ls -1 {input.data}*_{{v1.fa,v2.fa}}|
  sort -u |
  while read i
  do
      NAME_base=$(basename $i)
      echo "$NAME_base"
      busco -o ${{NAME_base}} -i $i -l poales -m genome --out_path {output.BUSCO_run} -c 20 -f 
  done 
  """
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
shell:
  """
  # extract the busco score of completeness and create a column
  ls -1  {input.data}*_{{v1.fa,v2.fa}}/short_summary.*|sort -u| 
  while read i
  do
  cat $i |sed -n '3p'| rev |cut -d "/" -f 1|rev|tr -s " " && cat $i|
  perl -ne 'if(/C:([^\[]+)/){{ print $1; }}'|
  awk '{{print $0","}}' 
  done |paste -sd "\t\n"|sort -u| awk '{{print $2}}'|sed 's/.$//' |sed 's/.$//'> tables/busco_table.csv
  #create a table for busco column and stats result 
  paste -d , {input.asses_table} tables/busco_table.csv > tables/final_table.csv
  #creat headers
  sed -i -e '1i Name,total_length,Ns_count,Gaps,N_genes,BUSCO_complete %' tables/final_table.csv
  #clear row names 
  cat tables/final_table.csv | awk '{{gsub(/Bdistachyon/,"",$1)}}1' | awk '{{gsub(/_v1.fa/,"",$1)}}1'| 
  awk '{{gsub(/_v2.fa/,"",$1)}}1'> {output.quality_table}
  rm tables/busco_table.csv tables/final_table.csv

  """
SnakeMake From line 136 of main/Snakefile
165
166
shell:
  "Rscript {input.script}"
SnakeMake From line 165 of main/Snakefile
178
179
180
181
182
183
184
185
186
187
188
189
190
191
shell:
  """
  #save the R output in txt file
  Rscript {input.script} > r_output.txt
  #extract the two final line with ecotypes name | cut it | replace "" with * as regx expression for loop
  cat r_output.txt|tail -n 2|cut -d " " -f 3-15|sed 's/"/*/g' > {output.excluded}
  #note: outlier more than two line modify it to be universal??
  #take a list from outlier of boxplot in R to sure good extraction
  echo "$(<{output.excluded})"
  #clear data
  #move excluded_ecotypes
  for file in $(echo $(<out/txt/poor_quality.txt) ); do mv out/uncom_data/genomes/$file {output.excluded_dir}; done
  rm r_output.txt
  """
SnakeMake From line 178 of main/Snakefile
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
shell:
  """
  #konw your g++ version
  #g++ -v (for me is 9) and modify it in makefile

  # creat masking_repeats directory
  mkdir -p src/scripts/masking_repeats

  #move to masking_repeats directory
  cd src/scripts/masking_repeats

  # download Scripting analyses of genomes in Ensembl Plants by cloning
  git clone https://github.com/Ensembl/plant-scripts.git

  #move plant-scripts directory, that creat after cloning
  cd plant-scripts
  #installation steps for Repeat masking and annotation tools
  make install install_repeats
  cd repeats


  # run repeat masking and annotation tools for all fasta files
  ls -1 ../../../../../out/uncom_data/genomes/*.fa| # move to work directory that contain fasta format
      sort -u | #sort all files by alphabetical arrangement
      while read i
      do
          NAME_base=$(basename $i)
          echo "$NAME_base"
          ./Red2Ensembl.py  ../../../../../out/uncom_data/genomes/${{NAME_base}}  ${{NAME_base}}_file \
          --msk_file ${{NAME_base}}.sm.fna --bed_file ${{NAME_base}}.bed --cor 16 && ./AnnotRedRepeats.py \
          ../files/nrTEplantsJune2020.fna ${{NAME_base}}_file --bed_file ${{NAME_base}}.nrTEplants.bed --cor 16

      done

  """
SnakeMake From line 203 of main/Snakefile
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
shell:
  """
  #count total length of repeats for each ecotypes
  ls -1 {input.data}*.fa.bed | 
      sort -u |
      while read i
      do
          cat $i |awk '{{SUM += $3-$2}} END {{print SUM}}' # by subtracting two columns of coordinates for length the summation all for total length
              done | tr "\\t" ","  > repeats_length.csv
  cat repeats_length.csv
  #count the total length of genome for each ecotypes
  assembly-stats -u out/uncom_data/genomes/Bdistachyon*.fa|cut -f 1-2|tr "\\t" ","|awk '{{gsub(/Bdistachyon/,"",$1)}}1' | awk '{{gsub(/_v1.fa/,"",$1)}}1'|cut -d "/" -f 4 > stats_40.csv
  # join genome total length and total length of repeats in CSV file
  paste -d , stats_40.csv repeats_length.csv > tables/repeats_analysis.csv 
  #add head 
  sed -i '1i Ecotypes,Genome.size,Repeated' tables/repeats_analysis.csv
  #clean data
  rm stats_40.csv repeats_length.csv
  """ 
273
274
shell:
  "Rscript {input.script}"
SnakeMake From line 273 of main/Snakefile
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
shell:
  """
  #go to bed files directory
  cd src/scripts/masking_repeats/plant-scripts/repeats/

  # cut TE families after # from nrTEplants.bed files and add file -ecotype- name to columns
  ls -1 *.nrTEplants.bed |
      sort -u |
      while read i
      do
          perl -lane '$f=(split(/#/,$F[3]))[1];print "$f\t$ARGV"' $i
              done | tr "\\t" ", " > te_ann_0.csv   # make comma as separator

  #rearrange , put ecotype name as first column
  cat te_ann_0.csv|awk -F, '{{ print $2","$1 }}'  > te_ann_1.csv

  #filter name of ecotype to be easy to read |and replace /  by comma as separator| remove third col (superfamily)
  cat te_ann_1.csv| awk '{{gsub(/Bdistachyon/,"",$1)}}1' | awk '{{gsub(/_v1.fa.nrTEplants.bed/,"",$1)}}1'|tr '/' ',n'|cut -d, -f3 --complement> te_ann_2.csv

  #measure the length(bp) for each TE order in nrTEplants.bed files save it as column
  ls -1 *.nrTEplants.bed |
      sort -u |
      while read i
      do
          cat $i |awk '{{ print $0, $3 - $2 }}'| cut -d " " -f 2
              done | tr "\\t" ","  > te_ann_3.csv


  #merge two tables for final table
  paste -d , te_ann_2.csv te_ann_3.csv > te_ann.csv
  #cat te_ann_4.csv|sed 's/^,/NULL,/; :a;s/,,/,NA,/g;ta' > te_ann.csv

  #add header to table
  sed -i -e '1i Ecotype,Order,Length' te_ann.csv

  #reomve files

  rm te_ann_0.csv
  rm te_ann_1.csv
  rm te_ann_2.csv
  rm te_ann_3.csv

  #move final file to work directory
  cp te_ann.csv ./../../../../../tables/
  """
SnakeMake From line 283 of main/Snakefile
335
336
shell:
  "Rscript {input.script}"
SnakeMake From line 335 of main/Snakefile
345
346
shell:
  "bash {input.script}"
SnakeMake From line 345 of main/Snakefile
356
357
358
359
360
361
362
363
364
shell:
  """
  cd src/scripts/
  git clone https://github.com/eead-csic-compbio/get_homologues.git
  cd get_homologues
  ./install.pl
  cd ../../..
  {output.get_homologues} -h
  """
SnakeMake From line 356 of main/Snakefile
372
373
374
375
376
377
shell:
  """
  {input.script_get_homologues} -d out/uncom_data/cds -M -t 4 -e -m cluster &> out/logs/log.cds.Pfam
  cp -r cds_est_homologues/ out/
  rm -r cds_est_homologues/
  """
SnakeMake From line 372 of main/Snakefile
387
388
389
390
shell:
  """
  {input.script_header} -i {input.data_homologues} -o {output.cds_ids}
  """
SnakeMake From line 387 of main/Snakefile
399
400
401
402
shell:
  """
  {input.script_seq_extraction} -id {input.cds_ids} -f {input.promoter_dir} -o {output.promoter_cluster}
  """
SnakeMake From line 399 of main/Snakefile
411
412
413
414
415
416
417
418
shell: 
  """
  #cds
  mafft --globalpair --maxiterate 1000 {input.data_homologues} > out/aln_cds/{wildcards.file}  2> out/logs/aln_cds_log.txt  

  #promoter
  mafft --globalpair --maxiterate 1000 out/uncom_data/promoter_cluster/{wildcards.file} > out/aln_promoter/{wildcards.file} 2> out/logs/aln_pr_log.txt
  """
432
433
434
435
436
437
438
shell: 
  """
  #cds
  {params.filter_path} -f {input.cds_align} -o {output.cds_align_filtration}/{wildcards.file}  >> tables/global_poor_alignment_cds.csv || true
  #promoter
  {params.filter_path} -f {input.promoter_align} -o {output.promoter_align_filtration}/{wildcards.file}  >> tables/global_poor_alignment_pr.csv || true
  """
SnakeMake From line 432 of main/Snakefile
453
454
455
456
457
458
459
460
461
462
463
shell: 
  """
  #cds
  ./src/scripts/get_homologues/annotate_cluster.pl -f {input.data_homologues} -o  {output.cds_local_align} >>  out/logs/aln_local_cds_log.txt 
  #promoter

  #promoter
  ./src/scripts/get_homologues/annotate_cluster.pl -f {input.promoter_cluster} -o {output.promoter_local_align} >>  out/logs/aln_local_pr_log.txt 
  #promoter

  """
SnakeMake From line 453 of main/Snakefile
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
shell: 
  """
  #trimal version trimAl 1.2rev59

  #cds_global_filterated
  {params.trimal_path} -in {input.cds_align_filtration} -out {output.trimal_cds_global_align} \
  -automated1 >> out/logs/trimal_cds_log.txt 


  #promoter_global_filterated
  {params.trimal_path} -in {input.promoter_align_filtration} -out {output.trimal_pr_global_align} \
  -automated1 >> out/logs/trimal_pr_log.txt 

  #cds_local
  {params.trimal_path} -in {input.cds_local_align} -out {output.trimal_cds_local_align} \
  -automated1 >> out/logs/trimal_cds_local_log.txt 


  #promoter_local
  {params.trimal_path} -in {input.promoter_local_align} -out {output.trimal_pr_local_align} \
  -automated1 >> out/logs/trimal_pr_local_log.txt 

  """
SnakeMake From line 478 of main/Snakefile
ShowHide 19 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/ammarabdalrahem/panregulome-analysis
Name: panregulome-analysis
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...